Volume 33, Issue 1 e5814
SPECIAL ISSUE PAPER

IPDS: A semantic mediator-based system using Spark for the integration of heterogeneous proteomics data sources

Chaimaa Messaoudi

Corresponding Author

System and Data Engineering Team, Abdelmalek Essaadi University, Tangier, Morocco

Correspondence

Chaimaa Messaoudi, System and Data Engineering Team, Abdelmalek Essaadi University, BP 1818, Tangier 90000, Morocco.

Email: [email protected]

Rachida Fissoune

System and Data Engineering Team, Abdelmalek Essaadi University, Tangier, Morocco

Hassan Badir

System and Data Engineering Team, Abdelmalek Essaadi University, Tangier, Morocco

First published: 23 May 2020
Citations: 6

Summary

With data volumes rising steadily across many disciplines, new Big data management systems have emerged to provide scalable tools for data integration, processing, and analysis. In this article, we give an overview of biomedical data integration systems, focusing on ontology-based semantic systems and on systems built on Big data technologies such as Apache Spark. We also propose a new semantic data integration system, the Integrated Proteomics Data System (IPDS), which follows a mediator approach. IPDS gives users a unified interface for query processing and data exploration. The system uses the Apache Spark framework to perform the query transformation and execution needed to query the integrated data sources. We develop a domain ontology that lets users formulate their queries using the terms it defines. IPDS serves as a case study in semantic proteomics data integration, linking four data sources: UniProt (protein annotation), STRING (protein-protein interaction), PDB (protein structure), and PubMed (biomedical citations).
