Information Retrieval in Biomedical Research
Zina Ben Miled
Indiana University Purdue University Indianapolis, Electrical and Computer Engineering Department, Indianapolis, Indiana
Nianhua Li
Indiana University Purdue University Indianapolis, Electrical and Computer Engineering Department, Indianapolis, Indiana
Malika Mahoui
Indiana University Purdue University Indianapolis, Computer and Information Science Department, Indianapolis, Indiana
Omran Bukhres
Indiana University Purdue University Indianapolis, Computer and Information Science Department, Indianapolis, Indiana
Abstract
Information retrieval is important in many fields of biomedical research. This article covers the theoretical background, the state of the art, and future trends in biomedical information retrieval. Techniques for literature searches, genomic information retrieval, and database searches are discussed. Literature search techniques cover named entity extraction, document indexing, document clustering, and event extraction. Genomic information retrieval techniques are based on sequence alignment algorithms. This article also briefly describes widely used biological databases and discusses the issues involved in retrieving information from them. Terminology systems are involved in almost every aspect of information retrieval. The various types of terminology systems and their use in supporting information retrieval are reviewed.