Biological Database Integration
Zina Ben Miled
Indiana University Purdue University Indianapolis, Electrical and Computer Engineering Department, Indianapolis, Indiana
Yang Liu
Indiana University Purdue University Indianapolis, Electrical and Computer Engineering Department, Indianapolis, Indiana
Nianhua Li
Indiana University Purdue University Indianapolis, Computer and Information Science, Indianapolis, Indiana
Omran Bukhres
Indiana University Purdue University Indianapolis, Computer and Information Science, Indianapolis, Indiana
Abstract
Many biomedical databases have rapidly become a central part of knowledge discovery in numerous fields, including genomics, proteomics, and drug design. These databases are distributed, heterogeneous, and autonomous. This article begins with a brief overview of some widely used databases to motivate the need for integrating distributed biological databases. Such integration faces numerous challenges, including various levels of heterogeneity, limited accessibility, and redundancy or conflicts in the data. These challenges are discussed, and different approaches to addressing them are presented. The article also compares several integration systems that have been proposed for biological databases.