Collaborator · Hackathon · Mar 10, 2024 · DBCLS RDF Summit 5, Okinawa, Japan

RDF Summit 5

Overview

RDF Summit is an event organized by the Database Center for Life Science (DBCLS) with the goal of bringing together experts to collaborate, discuss, and ideate on the integration of knowledge graphs and life science data. The fifth edition was held in Okinawa, Japan, and focused on integrating knowledge graphs with genome graphs to advance data science and genomics.

I was invited to participate as a collaborator, working with the InterMine community to explore how InterMine can evolve to be RDF-native: leveraging semantic web technologies to improve data interoperability and to integrate more naturally with linked open data infrastructures and initiatives.

Projects

Toward an RDF-Native InterMine

InterMine is an open-source data warehouse platform widely used in the life sciences for integrating and querying heterogeneous biological datasets. It has been adopted by major model organism databases (such as FlyMine for Drosophila and MouseMine for mice) and by large-scale genomics resources, providing a unified query interface over complex biological data.

I brought prior experience to this collaboration, having previously deployed InterMine as a data warehouse for three distinct research communities:

BovineMine (for Bovine genomics)
HymenopteraMine (for Hymenoptera biology)
MaizeMine (for Maize genetics)

This hands-on experience with InterMine's data model, ETL pipelines, and overall architecture informed my contributions to the project.

The question of how to evolve InterMine to be RDF-native touched on three areas:

Data Model: Rethinking InterMine's internal data model in terms of RDF and OWL, so that the conceptual structure of biological entities and their relationships is expressed semantically rather than relationally. This includes aligning with established biomedical ontologies and vocabularies already used in the semantic web community.
Data Representation: Moving toward native RDF storage and representation of InterMine data, enabling datasets to be exposed as proper linked data and making them directly consumable by RDF-aware tools and pipelines without requiring translation layers.
SPARQL-based services: Building services on top of an RDF-native InterMine that leverage SPARQL as the primary query interface, opening up federated querying capabilities and enabling InterMine databases to participate as first-class citizens in the broader knowledge graph ecosystem.
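To make the three areas above more concrete, here is a minimal Python sketch using only the standard library. It is not how InterMine or the summit project implements anything: the record layout, the `example.org` namespaces, the `Gene` and `organism` terms, and the function names are all hypothetical, chosen only for illustration. It shows a warehouse-style gene record expressed as N-Triples (data model and representation) and a SPARQL query that an RDF-native service could answer.

```python
# Illustrative sketch only: the namespaces, vocabulary terms and record
# layout below are hypothetical and not taken from InterMine itself.
from urllib.parse import quote

BASE = "http://example.org/mine/"    # hypothetical namespace for entities
VOCAB = "http://example.org/vocab/"  # hypothetical vocabulary namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def iri(value: str) -> str:
    """Wrap an absolute IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Encode a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def gene_to_ntriples(gene: dict) -> list[str]:
    """Turn one warehouse-style gene record into N-Triples lines."""
    # Percent-encode the identifier so it is safe inside an IRI path segment.
    subject = iri(BASE + "gene/" + quote(gene["primaryIdentifier"], safe=""))
    organism = iri(BASE + "organism/" + quote(str(gene["taxonId"]), safe=""))
    triples = [
        (subject, iri(RDF_TYPE), iri(VOCAB + "Gene")),
        (subject, iri(RDFS_LABEL), literal(gene["symbol"])),
        (subject, iri(VOCAB + "organism"), organism),
    ]
    return [f"{s} {p} {o} ." for s, p, o in triples]


def genes_by_taxon_query(taxon_id: int) -> str:
    """Build a SPARQL SELECT listing genes and labels for one organism."""
    return (
        f"PREFIX v: <{VOCAB}>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?gene ?label WHERE {\n"
        "  ?gene a v:Gene ;\n"
        "        rdfs:label ?label ;\n"
        f"        v:organism <{BASE}organism/{taxon_id}> .\n"
        "}"
    )


if __name__ == "__main__":
    # A made-up record standing in for a row from a relational warehouse.
    record = {"primaryIdentifier": "GENE:0001", "symbol": "abc1", "taxonId": 7227}
    print("\n".join(gene_to_ntriples(record)))
    print()
    print(genes_by_taxon_query(7227))
```

In practice, the hypothetical `example.org` terms would be replaced by terms from established biomedical ontologies, and the triples would be loaded into a triple store that exposes a SPARQL endpoint rather than printed.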

A key theme throughout was avoiding duplication of effort: the semantic web and biohackathon communities have already produced substantial tooling, ontologies, and infrastructure that InterMine should build on rather than reinvent.

A quick exploration of using RDF to model and represent existing InterMine data is described in a BioHackrXiv preprint: SPARQL services for InterMine databases

SWAT4HCLS 2024 Biohackathon

Hands-on exploration of SPHN Semantic Interoperability Framework for FAIR knowledge graphs