Comparison of sequence and structure alignments for protein domains†
Aron Marchler-Bauer
Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland
Anna R. Panchenko
Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland
Naomi Ariel (Corresponding Author)
Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland
Correspondence: Computational Biology Branch, National Center for Biotechnology Information, Building 38A, Room 8N805, National Institutes of Health, Bethesda, MD 20894
Stephen H. Bryant (Corresponding Author)
Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland
Correspondence: Computational Biology Branch, National Center for Biotechnology Information, Building 38A, Room 8N805, National Institutes of Health, Bethesda, MD 20894
This article is a US Government work and, as such, is in the public domain in the United States of America.
Abstract
Profile search methods based on protein domain alignments have proven to be useful tools in comparative sequence analysis. Domain alignments used by currently available search methods have been computed by sequence comparison. With the growth of the protein structure database, however, alignments of many domain pairs have also been computed by structure comparison. Here, we examine the extent to which information from these two sources agrees. We measure agreement with respect to identification of homologous regions in each protein, that is, with respect to the location of domain boundaries. We also measure agreement with respect to identification of homologous residue sites by comparing alignments and assessing the accuracy of the molecular models they predict. We find that domain alignments in publicly available collections based on sequence and structure comparison are largely consistent. However, the homologous regions identified by sequence comparison are often shorter than those identified by 3D structure comparison. In addition, when overall sequence similarity is low, alignments from sequence comparison produce less accurate molecular models, suggesting that they identify homologous sites less accurately. These observations suggest that structure comparison results might be used to improve the overall accuracy of domain alignment collections and the performance of profile search methods based on them. Proteins 2002;48:439–446. © 2002 Wiley-Liss, Inc.