Prediction of RNA-binding residues in proteins from primary sequence using an enriched random forest model with a novel hybrid feature
Xin Ma
State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, People's Republic of China
Department of Elementary Courses, Golden Audit College, Nanjing Audit University, Nanjing 210029, People's Republic of China
Jing Guo
State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, People's Republic of China
Jiansheng Wu
Department of Bioinformatics, School of Geography and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210046, People's Republic of China
Hongde Liu
State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, People's Republic of China
Jiafeng Yu
State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, People's Republic of China
Jianming Xie
State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, People's Republic of China
Xiao Sun (Corresponding Author)
State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, People's Republic of China
State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, People's Republic of China
Abstract
The identification of RNA-binding residues in proteins is important in several areas, such as protein function, posttranscriptional regulation, and drug design. We have developed PRBR (Prediction of RNA Binding Residues), a novel method for identifying RNA-binding residues from amino acid sequences. Our method combines a hybrid feature with the enriched random forest (ERF) algorithm. The hybrid feature is composed of predicted secondary structure information and three novel features: evolutionary information combined with conservation information of the physicochemical properties of amino acids, and information about the dependency of amino acids with regard to polarity-charge and hydrophobicity in the protein sequences. Our results demonstrate that the PRBR model achieves a Matthews correlation coefficient (MCC) of 0.5637 and an overall accuracy (ACC) of 88.63%, with 53.70% sensitivity (SE) and 96.97% specificity (SP). By comparing the performance of each feature, we found that all three novel features contribute to the improved predictions. The area under the curve (AUC) from receiver operating characteristic curve analysis was compared between the PRBR model and other models. The results show that PRBR achieves the highest AUC value (0.8675), indicating that PRBR attains excellent performance in predicting RNA-binding residues in proteins. The PRBR web-server implementation is freely available at http://www.cbi.seu.edu.cn/PRBR/. Proteins 2011; © 2011 Wiley-Liss, Inc.
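As a minimal illustration of how the four measures reported above relate to a confusion matrix, the following Python sketch computes ACC, SE, SP and MCC from residue-level counts. The function name `binary_metrics` and the counts are hypothetical and are not taken from the paper; the formulas are the standard definitions, assumed here to match those used for PRBR.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return ACC, SE, SP and MCC for a binary confusion matrix.

    tp/fn: binding residues predicted as binding / non-binding.
    tn/fp: non-binding residues predicted as non-binding / binding.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    se = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall on binding residues)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity (recall on non-binding residues)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SE": se, "SP": sp, "MCC": mcc}


# Hypothetical counts for illustration only; these are NOT the paper's data.
metrics = binary_metrics(tp=6, fp=2, tn=10, fn=2)
```

Because non-binding residues usually far outnumber binding residues, ACC can stay high even when SE is modest, which is one reason MCC is commonly reported alongside it.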
Supporting Information
Additional Supporting Information may be found in the online version of this article.
Filename | Description |
---|---|
PROT_22958_sm_suppinfo.pdf (254.2 KB) | Supporting information. |