Grouping of amino acid types and extraction of amino acid properties from multiple sequence alignments using variance maximization
James O. Wrabl
Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas
Corresponding Author
Nick V. Grishin
Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas
Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas
Howard Hughes Medical Institute and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd., Dallas, TX 75390-9050
Abstract
Understanding amino acid type co-occurrence in trusted multiple sequence alignments is a prerequisite for improved sequence alignment and remote homology detection algorithms. Two objective approaches were used to investigate co-occurrence, both based on variance maximization of the weighted residue frequencies in columns taken from a large alignment database. The first approach discretely grouped amino acid types, and the second extracted orthogonal properties of amino acids using principal components analysis. The grouping results corresponded to amino acid physical properties such as side chain hydrophobicity, size, or backbone flexibility, and an optimal arrangement of approximately eight groups was observed. Interpretation of the orthogonal properties was more complex: although the principal components accounting for the largest variances exhibited modest correlations with hydrophobicity and conservation of glycine, in general the principal components did not correspond to physical properties of amino acids. Although not intuitive, these mathematical properties of amino acids were demonstrated to be robust and, for a simple test case, to improve local pairwise alignment accuracy relative to the 20 amino acid frequencies alone. Proteins 2005. © 2005 Wiley-Liss, Inc.
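The second approach described in the abstract, extracting orthogonal properties by principal components analysis of per-column residue frequencies, can be illustrated with a minimal sketch. This is not the authors' implementation. The toy columns, the function names `column_frequencies` and `principal_components`, and the use of plain unweighted frequencies are all assumptions made for illustration; the paper uses weighted frequencies drawn from a large alignment database.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def column_frequencies(columns):
    """Return an (n_columns, 20) array of residue frequencies per column.

    Assumption: frequencies are unweighted counts normalized to sum to 1;
    the paper instead applies sequence weighting, which is omitted here.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    freqs = np.zeros((len(columns), len(AMINO_ACIDS)))
    for row, column in enumerate(columns):
        for residue in column:
            if residue in index:  # skip gaps and unknown symbols
                freqs[row, index[residue]] += 1
        total = freqs[row].sum()
        if total > 0:
            freqs[row] /= total
    return freqs


def principal_components(freqs):
    """Eigen-decompose the 20x20 covariance matrix of column frequencies.

    Returns variances (eigenvalues) in descending order and the matching
    orthonormal component vectors as columns of the second array.
    """
    cov = np.cov(freqs, rowvar=False)  # each column of freqs is one variable
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]


# Hypothetical alignment columns, one string per column (illustration only).
toy_columns = ["LLIVM", "GGGGG", "KRKRQ", "DEDEN", "FYWFY", "AILVV", "STSTA"]

freqs = column_frequencies(toy_columns)
variances, components = principal_components(freqs)
explained = variances / variances.sum()

# Project each column onto the first two components: its coordinates in the
# space of the two highest-variance "mathematical properties".
projected = (freqs - freqs.mean(axis=0)) @ components[:, :2]
print(np.round(explained[:3], 3))
```

Each component is a 20-dimensional vector of loadings on the amino acid types, so comparing its entries with a hydrophobicity scale is one way to test the kind of correlation the abstract mentions. With only seven toy columns the result carries no biological meaning.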