Biological Database Integration

Zina Ben Miled

Indiana University Purdue University Indianapolis, Electrical and Computer Engineering Department, Indianapolis, Indiana

Yang Liu

Indiana University Purdue University Indianapolis, Electrical and Computer Engineering Department, Indianapolis, Indiana

Nianhua Li

Indiana University Purdue University Indianapolis, Computer and Information Science, Indianapolis, Indiana

Omran Bukhres

Indiana University Purdue University Indianapolis, Computer and Information Science, Indianapolis, Indiana

First published: 14 April 2006

https://doi.org/10.1002/9780471740360.ebs0368

Abstract

Many biomedical databases have rapidly become central to knowledge discovery in numerous fields, including genomics, proteomics, and drug design. These databases are distributed, heterogeneous, and autonomous. This article begins with a brief overview of some widely used databases to motivate the need for integrating distributed biological databases. Such integration faces numerous challenges, including various levels of heterogeneity, limited accessibility, and redundancy or conflicts in the data. These challenges are discussed, and different approaches to addressing them are presented. The article also compares several integration systems that have been proposed for biological databases.

Wiley Encyclopedia of Biomedical Engineering
