RESEARCH ARTICLE

A sequence-based model for identifying proteins undergoing liquid–liquid phase separation/forming fibril aggregates via machine learning

Shaofeng Liao

College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China

Contribution: Methodology, Validation, Visualization, Investigation, Software, Formal analysis, Writing - original draft

Yujun Zhang,

Yujun Zhang

College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China

Contribution: Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization

Xinchen Han,

Xinchen Han

College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China

Contribution: Methodology, Validation, Investigation, Formal analysis, Software, Writing - original draft

Tinglan Wang,

Tinglan Wang

College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China

Contribution: Software, Visualization, Validation, Writing - review & editing

Xi Wang,

Xi Wang

College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China

Contribution: Data curation, Formal analysis, Conceptualization

Qinglin Yan,

Qinglin Yan

College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China

Contribution: Visualization, Validation

Qian Li,

Qian Li

College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China

Contribution: Formal analysis, Data curation

Yifei Qi,

Corresponding Author

Yifei Qi

[email protected]

School of Pharmacy, Fudan University, Shanghai, China

Correspondence

Yifei Qi, School of Pharmacy, Fudan University, Shanghai 201203, China.

Email: [email protected]

Zhuqing Zhang, College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.

Email: [email protected]

Contribution: Methodology, Supervision, Writing - original draft, Writing - review & editing, Project administration

Zhuqing Zhang,

Corresponding Author

Zhuqing Zhang

[email protected]

orcid.org/0000-0003-3514-834X

College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China

Contribution: Conceptualization, Methodology, Supervision, Funding acquisition, Writing - original draft, Resources, Project administration, Writing - review & editing, Formal analysis

First published: 21 February 2024

https://doi.org/10.1002/pro.4927

Citations: 5

Shaofeng Liao and Yujun Zhang contributed equally to this work.

Reviewing Editor: Nir Ben Tal

Abstract

Liquid–liquid phase separation (LLPS) and the formation of solid aggregates (also referred to as amyloid aggregates) by proteins have gained significant attention in recent years because of their associations with various physiological and pathological processes in living organisms. However, the differences and connections between proteins undergoing LLPS and those forming amyloid fibrils have not yet been systematically investigated at the sequence level. In this work, we address this gap by comparing the two types of proteins across 36 features using currently available data. The statistical comparison indicates that 24 of the 36 selected features differ significantly between the two protein groups. An LLPS-Fibrils binary classification model built on these 24 features using random forest identifies the fraction of intrinsically disordered residues (F_IDR) as the most crucial feature. In contrast, in the three-class LLPS-Fibrils-Background classification model built on the same screened features, the compositions of cysteine and leucine contribute more than the other features. Through feature ablation analysis, we finally constructed a model, FLFB (Feature-based LLPS-Fibrils-Background protein predictor), using six refined features, with an average area under the receiver operating characteristic curve of 0.83. This work indicates that proteins undergoing LLPS or forming amyloid fibrils can be identified using sequence features and a machine learning model.

1 INTRODUCTION

Liquid–liquid phase separation (LLPS) has been recognized to underlie the formation of biomolecular condensates and membraneless organelles (MLOs) in cells, and to play a crucial role in various biological processes such as stress responses, RNA metabolism, and chromatin organization (Banani et al., 2017; Boeynaems et al., 2018; Lyon et al., 2021). Experimental evidence suggests that the dysregulation of biomolecular LLPS can contribute to the development of diverse diseases (Shin & Brangwynne, 2017; Zhang et al., 2020). Certain pathogenic mutations in proteins such as FUS, TDP43, and Tau can cause the transformation of liquid condensates into solid-like aggregates, which are strongly associated with several incurable neurodegenerative diseases (Ahmad et al., 2022; King et al., 2012; Patel et al., 2015; Wegmann et al., 2018). However, it is worth noting that the majority of proteins undergoing LLPS do not form solid aggregates. Conversely, solid aggregates can also be formed via oligomeric intermediates in a homogeneous system, not just through LLPS (Michaels et al., 2020; Vendruscolo & Fuxreiter, 2022). Despite the availability of a number of tools for predicting proteins that undergo LLPS (Chu et al., 2022; Raimondi et al., 2021; Saar et al., 2021; Shen et al., 2021; van Mierlo et al., 2021; Vernon & Forman-Kay, 2019) or that can potentially form amyloid fibrils (Prabakaran, Rawat, Thangakani, et al., 2021), only a few of them, such as FuzDrop (Hardenberg et al., 2020), provide information on both droplet-promoting regions and aggregation-promoting regions for a query protein. Therefore, the differences and connections in sequence between these two types of proteins remain unclear.

The underlying driving forces of protein LLPS have been attributed to multivalent inter- and intramolecular interactions, including electrostatic, π–π, cation–π, and hydrophobic interactions (Dignon et al., 2020). Furthermore, experimental studies have highlighted the crucial role of intrinsically disordered proteins (IDPs) and regions, particularly low complexity (LC) regions, in the formation of biomolecular liquid condensates (Alberti et al., 2019; Fonin et al., 2022; Martin & Mittag, 2018). IDPs are proteins that have no stable 3D structure under normal physiological conditions when they are free, but that play non-negligible roles in various cellular events (Dunker et al., 2002). In biomolecular LLPS or the formation of MLOs, they not only can form multivalent interactions, but their intrinsic flexibility can also contribute to the liquid property of biomolecular condensates. The molecular grammar of proteins that undergo LLPS, especially IDPs, has been explored extensively in experimental, theoretical, and simulation studies (Bremer et al., 2022; Dignon et al., 2020; Lin et al., 2016; Martin et al., 2020). These studies indicate that not only the amino acid composition, but also the distribution patterns of amino acids involved in different types of multivalent molecular interactions, determine the phase behavior of a protein. In fact, the investigation of amyloid formation mechanisms predates the study of protein LLPS. The various types of multivalent interactions observed in LLPS of proteins are also essential factors in the formation of amyloids. In this study, our focus is on whether quantitative parameters derived from protein sequences can effectively differentiate proteins involved in LLPS from those associated with fibril formation.

Recent advancements have led to the construction of databases dedicated to proteins undergoing LLPS (Go et al., 2019; Li, Peng, et al., 2020; Li, Wang, et al., 2020; Meszaros et al., 2020; Ning et al., 2020; You et al., 2020; Youn et al., 2019) and those forming amyloid fibrils (Prabakaran, Rawat, Thangakani, et al., 2021), providing a solid foundation of data for analyzing the sequence features of these protein types. In this study, we leverage these valuable resources, specifically the recently released LLPSDB v2.0 (Wang et al., 2022) for proteins involved in LLPS, as well as the databases AmyPro (Varadi et al., 2018) and CPAD2.0 (Rawat et al., 2020) for proteins associated with amyloid fibril formation. We systematically compare selected sequence features between the two protein categories, including sequence characteristics, molecular interactions, secondary structure propensity, the fraction of intrinsically disordered residues (F_IDR), and other relevant features. Based on the screened features with significant differences, we construct classification models capable of distinguishing between liquid–liquid phase separating proteins (hereafter referred to as “LLPS proteins”), fibril-forming proteins (hereafter referred to as “fibrils proteins”), and other proteins. Unlike previous studies, which mostly focused on either “LLPS proteins” or “fibrils proteins,” this study attempts to elucidate the differences and relationships between the sequences of the two protein categories. The screened features that differ significantly and contribute most to the classification models may provide new insights into the mechanisms governing related cellular processes.

2 RESULTS AND DISCUSSION

2.1 Features with significant difference between “LLPS” and “fibrils”

To compare the sequence differences between “LLPS proteins” and “fibrils proteins,” we first constructed an “LLPS” dataset based on LLPSDB v2.0 (Wang et al., 2022) and a “Fibrils” dataset based on AmyPro (Varadi et al., 2018) and CPAD 2.0 (Rawat et al., 2020). The two datasets contain 182 and 136 sequences, respectively (see “Section 4.1” for details).

Drawing on extensive investigations of the mechanisms of LLPS and fibril formation of proteins, we selected 36 sequence features for statistical analysis of the two protein categories: the composition of each single amino acid; the proportions of charged, aromatic, and hydrophobic residues, respectively; net charge per residue (NCPR); sequence charge decoration (SCD; Sawle & Ghosh, 2015), κ (Das & Pappu, 2013), sequence hydropathy decoration (SHD; Zheng et al., 2020), Ω_Cation–π and Ω_π–π (Martin et al., 2020), which represent the distribution patterns of amino acids involved in electrostatic (SCD and κ), hydrophobic, cation–π, and π–π interactions, respectively; a parameter ν (Zheng et al., 2020) that combines SCD and SHD; the propensities for the protein secondary structures α-helix, β-turn, and β-sheet, respectively; protein solubility; F_IDR; and the fraction of the sequence in LC regions. More details about these features can be found in Section 4.2 and Table S1.
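As an illustration of how a few of these sequence features can be computed, the following is a minimal sketch and not the authors' implementation. It assumes a simple charge model (Lys/Arg = +1, Asp/Glu = −1, all other residues neutral), and the function names are hypothetical. SCD follows the definition of Sawle & Ghosh (2015), SCD = (1/N) Σ_{m>n} q_m q_n |m − n|^{1/2}; the fraction of charged residues (FCR) and NCPR are simple per-residue ratios.

```python
import math

# Assumed charge model (a simplification): K/R = +1, D/E = -1, everything else 0.
POSITIVE = set("KR")
NEGATIVE = set("DE")


def residue_charges(seq):
    """Return the per-residue charge list for a protein sequence."""
    return [1 if aa in POSITIVE else -1 if aa in NEGATIVE else 0 for aa in seq]


def composition(seq, aa):
    """Fraction of residues in `seq` that are the amino acid `aa`."""
    return seq.count(aa) / len(seq)


def fcr(seq):
    """Fraction of charged residues."""
    return sum(abs(q) for q in residue_charges(seq)) / len(seq)


def ncpr(seq):
    """Net charge per residue."""
    return sum(residue_charges(seq)) / len(seq)


def scd(seq):
    """Sequence charge decoration: (1/N) * sum over pairs of q_m * q_n * sqrt(m - n)."""
    q = residue_charges(seq)
    total = 0.0
    for m in range(1, len(q)):
        if q[m] == 0:
            continue  # neutral residues contribute nothing
        for n in range(m):
            total += q[m] * q[n] * math.sqrt(m - n)
    return total / len(q)
```

With this definition, a sequence whose opposite charges are segregated into blocks (for example `KKEE`) gives a more negative SCD than one with the same composition but alternating charges (`KEKE`), which is the patterning effect the feature is meant to capture.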

The statistical comparison reveals that 24 of the 36 selected features exhibit a significant difference (p-value <0.05) between the “LLPS” and “Fibrils” datasets, as shown in Table S1 and Figure 1. The three features with the largest distinction are F_IDR, “α-helix” propensity, and LC, with p-values lower than 10⁻¹². LLPS proteins display higher F_IDR and LC, and a lower “α-helix” propensity, compared with fibrils proteins. This suggests that intrinsic disorder and low complexity may be basic attributes of LLPS proteins. Based on these findings, we construct an LLPS-Fibrils protein classification model to examine whether the 24 screened features can classify the two types of protein well, and further build a three-class (LLPS-Fibrils-Background) classification model for prediction.
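The screening step can be sketched as a two-sample Kolmogorov–Smirnov (KS) comparison per feature. The code below is a minimal standard-library illustration under stated assumptions, not the authors' pipeline: the p-value uses the simple asymptotic Kolmogorov series with λ = D·sqrt(nm/(n+m)), which is only an approximation for small samples, and all function names are hypothetical.

```python
import math


def ks_statistic(a, b):
    """Maximum distance between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value (approximation; assumes large samples)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the alternating series converges poorly here; p is ~1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def significance_stars(p):
    """Map a p-value to the star notation used in the Figure 1 caption."""
    for threshold, stars in ((1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (0.05, "*")):
        if p < threshold:
            return stars
    return "ns"
```

A feature would be retained when `ks_pvalue(...)` falls below 0.05 for the “LLPS” versus “Fibrils” samples; in practice a library routine with exact small-sample p-values would likely be preferable to this approximation.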

FIGURE 1. The boxplots of the comparison of 24 features with significant differences between “LLPS” and “Fibrils” datasets. A p-value is calculated through the Kolmogorov–Smirnov test (*p < 0.05; **p < 0.01; ***p < 0.001; ****p < 0.0001). FHR, fraction of hydrophobic residues; LLPS, liquid–liquid phase separation.

2.2 LLPS-Fibrils binary classification model

To gain insight into the importance of the 24 screened features in distinguishing LLPS proteins from fibrils proteins, we first construct an LLPS-Fibrils binary classification model using the Random Forest (RF) algorithm and the training strategy described in Section 4.3. In this model, the samples in the “Fibrils” dataset were considered as positives, while those in the “LLPS” dataset were treated as negatives. The results, including evaluation parameters such as area under the receiver operating characteristic curve (AUROC), Precision, Recall, F1 Score, and confusion matrices, are summarized in Table 1, Figure S1, and Table S2. The small standard deviations observed for these parameters during the 5-fold cross-validation demonstrate the stability of the model. The average AUROC value of 0.837 suggests that the 24 screened features are effective in distinguishing LLPS proteins from fibrils proteins. However, it should be noted that the F1 Score, Recall, and Matthews correlation coefficient (MCC) values are relatively low, which may be attributed to the insufficiency of the model or the limitations of the dataset.

TABLE 1. The average values of evaluation parameters of the two-class LLPS-fibrils identification model training with 5-fold cross-validation.

AUROC	Precision	Recall
0.837 ± 0.058	0.750 ± 0.072	0.642 ± 0.081

F1_score	Accuracy	MCC
0.687 ± 0.058	0.748 ± 0.069	0.487 ± 0.125

Abbreviations: AUROC, area under the receiver operating characteristic curve; LLPS, liquid–liquid phase separation; MCC, Matthews correlation coefficient.
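For reference, the threshold-based metrics reported in Table 1 all derive from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name is hypothetical and the example counts in the usage note are invented for illustration, not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, accuracy, and MCC from confusion-matrix counts.

    Here "positive" follows the convention in the text: Fibrils = positive,
    LLPS = negative.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}
```

For instance, `binary_metrics(tp=6, fp=2, tn=8, fn=4)` gives a precision of 0.75 but a recall of only 0.60, the same qualitative pattern (precision above recall) noted for the binary model.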

We then analyze feature importance using both SHAP and the scikit-learn “Feature_importances” attribute. Both methods yield similar results: as Figure S2 shows, the four features that contribute most to the model are F_IDR, α-helix propensity, LC, and SHD. Figure 2 shows the distribution of the SHAP values of each sequence in the datasets, indicating that F_IDR emerges as the most influential feature, with a higher F_IDR value pushing the prediction toward LLPS proteins. The significance of intrinsic disorder for protein LLPS has been discussed extensively (Alberti et al., 2019; Fonin et al., 2022), because IDPs/IDRs play an important role not only as “stickers,” which provide the multivalent interactions within coacervates, but also as “spacers,” which provide molecular flexibility and keep coacervates liquid. The results here indicate that LLPS proteins are more disordered than fibrils proteins. Somewhat unexpectedly, fibrils proteins show a positive correlation with higher α-helix propensities, even though fibril aggregates usually form a “cross-β” structure. In addition, Figure 2 shows that higher LC and SHD values, as well as a larger arginine content, are more common in LLPS proteins than in fibrils proteins. Overall, the analysis of feature importance using both SHAP and “Feature_importances” highlights the relative weight of the identified features in distinguishing between LLPS proteins and fibrils proteins.

Although a simple LLPS-Fibrils binary identification model cannot handle proteins that belong to both categories or to neither of them, it shows that the screened features may be effective for classification. Based on this, we further construct a multiclass prediction model as follows.

2.3 Three-class LLPS-Fibrils-Background classification model

In fact, several LLPS proteins have been observed experimentally to further form fibrils, which are associated with diseases in some cases (Patel et al., 2015; Wegmann et al., 2018). However, the number of proteins involved in both LLPS and fibril formation is limited: only 30 sequences (after CD-hit with a cutoff of 0.4) were screened from LLPSDB v2.0 with the label “droplet-to-aggregate” (and similar terms), which is not adequate for training them as a separate class. In addition, many proteins may not undergo LLPS or form fibril aggregates under normal physiological conditions, and we call them general proteins in this work. Until now, there has been no strict, experimentally validated dataset of general proteins. To identify “LLPS” and “Fibrils” proteins among the general proteins, we constructed a “Background” dataset using the human proteome, excluding the sequences in the “LLPS” and “Fibrils” datasets. Using the 24 features obtained above, we developed LLPS-Fibrils-Background classification models using RF and the training strategy described in Section 4.3.

The evaluation results presented in Table 2, Table S3, and Figure S3 reveal that the AUROC values for the three classes are all larger than 0.83, with the one for the “Background” class slightly larger than those for the other two groups. The classification for the “Fibrils” class exhibits relatively lower values of Recall, F1 Score, and MCC, but slightly larger values of Precision and Accuracy. This suggests that sequences the model labels as fibrils proteins are usually correct, but that a significant fraction of fibrils proteins may be classified as LLPS or background proteins. Overall, the three-class classification models show reasonable performance in distinguishing between LLPS, fibrils, and general proteins.

TABLE 2. The average values of evaluation parameters for each of the three classes of the LLPS-fibrils-background classification model trained with 5-fold cross-validation.

	Background	LLPS	Fibrils
AUROC	0.856 ± 0.027	0.834 ± 0.023	0.832 ± 0.043
Precision	0.660 ± 0.059	0.656 ± 0.025	0.682 ± 0.035
Recall	0.719 ± 0.039	0.731 ± 0.059	0.485 ± 0.034
F1_score	0.686 ± 0.033	0.690 ± 0.032	0.566 ± 0.026
Accuracy	0.760 ± 0.033	0.762 ± 0.020	0.798 ± 0.011
MCC	0.496 ± 0.060	0.502 ± 0.046	0.451 ± 0.030

Abbreviations: AUROC, area under the receiver operating characteristic curve; LLPS, liquid–liquid phase separation; MCC, Matthews correlation coefficient.
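Per-class AUROC values such as those in Table 2 are commonly obtained in a one-vs-rest fashion: each class in turn is treated as positive and the other two as negative. The sketch below assumes that scheme (the evaluation protocol itself is described in Section 4.3, not here) and computes AUROC through its rank interpretation, the probability that a randomly chosen positive scores higher than a randomly chosen negative. Function names and the toy inputs are hypothetical.

```python
def auroc(scores, is_positive):
    """AUROC as P(score_pos > score_neg), counting ties as one half."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auroc(probabilities, labels, classes):
    """Per-class AUROC; probabilities[i][k] is the score of sample i for classes[k]."""
    return {
        cls: auroc([row[k] for row in probabilities],
                   [label == cls for label in labels])
        for k, cls in enumerate(classes)
    }


# Toy example with invented class probabilities (not data from the paper).
CLASSES = ("Background", "LLPS", "Fibrils")
TOY_PROBS = [
    [0.7, 0.2, 0.1], [0.6, 0.3, 0.1],   # true Background
    [0.2, 0.6, 0.2], [0.3, 0.3, 0.4],   # true LLPS
    [0.1, 0.3, 0.6], [0.5, 0.2, 0.3],   # true Fibrils
]
TOY_LABELS = ["Background", "Background", "LLPS", "LLPS", "Fibrils", "Fibrils"]
TOY_RESULT = one_vs_rest_auroc(TOY_PROBS, TOY_LABELS, CLASSES)
```

The pairwise loop is quadratic in the number of samples, which is fine for a few hundred sequences per class; a sort-based rank computation would scale better for proteome-sized inputs.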

Using SHAP and the scikit-learn “Feature_importances” attribute, we analyzed the contributions of the selected features in the LLPS-Fibrils-Background classification model. Figure 3 indicates that both methods rank the top seven features in the same order: the contents of cysteine, leucine, and asparagine, followed by SHD, the content of tryptophan, F_IDR, and LC. This ranking differs clearly from that obtained with the LLPS-Fibrils classification model. For the top two features, the contents of cysteine and leucine, Figure S4 compares their distributions among the “LLPS,” “Fibrils,” and “Background” datasets. It reveals that, while the differences in cysteine and leucine content between the “LLPS” and “Fibrils” datasets are less pronounced than those of other features, these two attributes differ markedly between general proteins in the background dataset and LLPS/fibrils proteins, which makes them the most significant features in the three-class identification model. Conversely, F_IDR, the most prominent feature in the two-class identification model, is only moderately important in the LLPS-Fibrils-Background classification model. Overall, the feature contributions in the LLPS-Fibrils-Background classification model are distinct from those in the binary classification model.

Because the features used in the models were selected based on prior understanding and screened by statistical difference (KS-test), there may be some redundancy among them. To optimize these features and reduce their number while achieving comparable performance, we conducted an ablation analysis. Starting with the initial set of 24 features, we removed one feature at a time while retaining the remaining 23 features to train the model. The results in Table S4 indicate that removing a single feature does not significantly impact the classification performance, even when the most important feature, the content of cysteine, is removed. This suggests that the remaining features are capable of compensating for the removed feature. Since there are numerous possible combinations of the remaining features, testing all combinations would be time-consuming. To simplify the analysis, we trained the model by progressively removing features based on their importance, as determined by the mean absolute SHAP values in Figure 3a. We started by removing the feature with the lowest importance, the content of tyrosine (Y), resulting in a model trained using the remaining 23 features with an average AUROC value of 0.84 ± 0.036 (the third line of Table 3). We continued this process, cumulatively removing additional features in order of increasing importance (least important first), such as Y + V, Y + V + F, and so forth. The average AUROC values of these models are listed in Table 3, revealing that the model's performance remains almost stable (an average AUROC value of 0.83) through the removal of the feature LC. This means that a model consisting of six features (the contents of C, L, N, and W, together with SHD and F_IDR) can achieve identification performance comparable to models with a larger number of features. Based on the ablation analysis, the final LLPS-Fibrils-Background three-class model (FLFB) was trained using these six features. This approach reduces the complexity of the model while maintaining its classification performance.
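The cumulative ablation procedure described above can be sketched generically as follows. This is a hedged illustration rather than the authors' code: `evaluate` stands in for whatever scoring routine is used (in the paper, the cross-validated AUROC of a random forest), the toy scorer and its weights are invented, and the `tolerance` used to pick the reduced feature set is an assumed criterion.

```python
def cumulative_ablation(features_least_to_most_important, evaluate):
    """Drop features cumulatively, least important first, scoring each subset.

    Returns a list of (retained_features, score) pairs, starting with the
    full feature set and ending with the single most important feature.
    """
    remaining = list(features_least_to_most_important)
    history = [(tuple(remaining), evaluate(remaining))]
    for feature in features_least_to_most_important[:-1]:
        remaining = [f for f in remaining if f != feature]
        history.append((tuple(remaining), evaluate(remaining)))
    return history


def smallest_adequate_subset(history, tolerance):
    """Smallest subset whose score is within `tolerance` of the full model."""
    baseline = history[0][1]
    adequate = [subset for subset, score in history
                if baseline - score <= tolerance]
    return min(adequate, key=len)


# Toy scorer with invented weights (hypothetical; a real run would retrain
# and cross-validate the classifier on each feature subset).
TOY_WEIGHTS = {"Y": 0.0, "V": 0.0, "N": 0.03, "L": 0.10, "C": 0.20}


def toy_evaluate(features):
    return 0.5 + sum(TOY_WEIGHTS[f] for f in features)


TOY_HISTORY = cumulative_ablation(["Y", "V", "N", "L", "C"], toy_evaluate)
TOY_SUBSET = smallest_adequate_subset(TOY_HISTORY, tolerance=0.01)
```

This greedy ordering evaluates only as many subsets as there are features, instead of every possible combination, at the cost of never revisiting the importance ranking after features are removed.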

TABLE 3. The average AUROC values of three-class LLPS-fibrils-background classification models, constructed by cumulatively removing features in reverse order of importance according to the SHAP analysis (Figure 3a).

Feature (retained)	Background	LLPS	Fibrils	3class_ave
Y	0.856 ± 0.027	0.834 ± 0.023	0.832 ± 0.043	0.84 ± 0.034
V	0.855 ± 0.027	0.831 ± 0.023	0.832 ± 0.047	0.84 ± 0.036
F	0.856 ± 0.027	0.829 ± 0.020	0.827 ± 0.050	0.84 ± 0.037
I	0.856 ± 0.026	0.830 ± 0.025	0.830 ± 0.045	0.84 ± 0.036
NCPR	0.858 ± 0.028	0.830 ± 0.026	0.833 ± 0.050	0.84 ± 0.038
FAR	0.855 ± 0.028	0.830 ± 0.023	0.831 ± 0.051	0.84 ± 0.038
Ω_π–π	0.855 ± 0.025	0.831 ± 0.025	0.822 ± 0.051	0.84 ± 0.039
T	0.853 ± 0.024	0.827 ± 0.023	0.818 ± 0.046	0.83 ± 0.036
FCR	0.856 ± 0.028	0.832 ± 0.025	0.822 ± 0.045	0.84 ± 0.037
P	0.856 ± 0.028	0.832 ± 0.026	0.825 ± 0.055	0.84 ± 0.041
G	0.856 ± 0.028	0.832 ± 0.025	0.823 ± 0.052	0.84 ± 0.040
ν	0.805 ± 0.031	0.829 ± 0.030	0.814 ± 0.059	0.83 ± 0.045
FHR	0.858 ± 0.029	0.827 ± 0.027	0.824 ± 0.050	0.84 ± 0.040
R	0.858 ± 0.026	0.829 ± 0.021	0.823 ± 0.053	0.84 ± 0.039
P_sol	0.852 ± 0.030	0.829 ± 0.024	0.813 ± 0.050	0.83 ± 0.040
β-turn	0.858 ± 0.024	0.829 ± 0.031	0.815 ± 0.044	0.83 ± 0.039
α-helix	0.852 ± 0.029	0.832 ± 0.029	0.812 ± 0.036	0.83 ± 0.036
LC	0.856 ± 0.026	0.826 ± 0.033	0.814 ± 0.046	0.83 ± 0.040
F_IDR	0.860 ± 0.023	0.830 ± 0.031	0.812 ± 0.053	0.83 ± 0.043
W	0.861 ± 0.032	0.807 ± 0.030	0.791 ± 0.046	0.82 ± 0.047
SHD	0.859 ± 0.033	0.803 ± 0.030	0.793 ± 0.055	0.82 ± 0.050
N	0.853 ± 0.031	0.777 ± 0.038	0.703 ± 0.063	0.78 ± 0.076
L	0.844 ± 0.032	0.786 ± 0.036	0.695 ± 0.060	0.77 ± 0.076
C	0.783 ± 0.024	0.723 ± 0.020	0.684 ± 0.040	0.73 ± 0.050

Note: In the final model FLFB, the kept six features include F_IDR, SHD, as well as the component of W, N, L and C, respectively. Their feature names and the AUROC values of model are shown in bold.
Abbreviations: AUROC, area under the receiver operating characteristic curve; FAR, fraction of aromatic residues; FCR, fraction of charged residues; FHR, fraction of hydrophobic residues; FLFB, final LLPS-Fibrils-Background three-class model; LC, low complexity; LLPS, liquid–liquid phase separation; NCPR, net charge per residue; SHAP, SHapley Additive exPlanations.

2.4 Proteomes analysis based on FLFB

To assess the consistency of prediction results between the FLFB model and existing predictors for LLPS proteins and fibrils proteins, we compared the prediction results using protein sequences from the human proteome (excluding sequences in the “LLPS,” “Fibrils,” and “Background” datasets). The sequences classified by the FLFB model were compared with predictions from PSPredictor (Chu et al., 2022) for LLPS proteins and from TAPASS, proposed by Falgarone et al. (2022), for fibrils proteins. PSPredictor can efficiently perform large-scale computations and exhibited excellent performance in distinguishing LLPS proteins from the human proteome in a recent evaluation (Liao et al., 2023). TAPASS improves the prediction of amyloids by incorporating structural information. Out of the 19,269 proteins analyzed, ~27.20% were predicted to be LLPS proteins by PSPredictor, and ~24.70% were predicted to form fibrils by TAPASS. There were 2647 sequences (~13.74%) predicted to be both LLPS proteins and fibrils proteins by the two predictors. The FLFB model classified 14.19% of the sequences as LLPS proteins, 5.85% as fibrils proteins, and the remaining 79.96% as general proteins, meaning neither LLPS proteins nor fibrils proteins. The numbers of sequences in the overlaps among the three predictors are summarized in Table 4, with the values in parentheses representing the average FLFB scores of the sequences in the corresponding group.

TABLE 4. The cross-overlapped number of predicted sequences through PSPredictor (for LLPS proteins), TAPASS (for fibrils proteins), and FLFB (for LLPS, fibrils and general proteins) on human proteome.

Predicted by	FLFB: LLPS	FLFB: Fibrils	FLFB: Background
Neither PSPredictor nor TAPASS	671 (0.48)	940 (0.50)	10,305 (0.63)
Only PSPredictor	802 (0.56)	63 (0.51)	1729 (0.59)
Only TAPASS	266 (0.50)	72 (0.50)	1774 (0.62)
Both PSPredictor and TAPASS	996 (0.55)	52 (0.48)	1599 (0.58)

Note: The value in parentheses denotes the average FLFB score of the sequences within the corresponding group. “Neither PSPredictor nor TAPASS” represents the group of sequences predicted neither by PSPredictor nor by TAPASS; “Both PSPredictor and TAPASS” represents the group of sequences predicted by both PSPredictor and TAPASS.
Abbreviations: FLFB, Feature-based LLPS-Fibrils-Background protein predictor; LLPS, liquid–liquid phase separation.

Before discussing the results presented in Table 4, it is important to note two points. First, in the TAPASS pipeline, three predictors of protein aggregation, ArchCandy2.0 (Ahmed et al., 2015), Pasta2.0 (Walsh et al., 2014), and TANGO (Fernandez-Escamilla et al., 2004), were used to identify aggregation regions in protein sequences; only the regions with a minimum of 80% disordered residues were then identified as exposed aggregation regions (EARs), which excludes regions hidden within the 3D structure of folded proteins. In this study, a protein is recognized as a fibrils protein if it contains at least one EAR as predicted by any of the three predictors. Second, it is important to acknowledge that proteins classified as “background” are not definitively non-“LLPS” or non-“Fibrils.” This uncertainty arises from the lack of experimental validation for a substantial fraction of proteins in the human proteome. Consequently, the “background” dataset may contain false negatives, which can influence the performance of the three-class classification model. The columns in Table 4 present the number of sequences classified into each group by the FLFB model. For the “LLPS” group, ~65.74% of the sequences were also predicted as LLPS proteins by PSPredictor. Notably, the 802 sequences identified by PSPredictor only (not by TAPASS) exhibited an average probability score of 0.56, slightly higher than those in the other groups. For the “Fibrils” group, in contrast, only around 11.00% of the sequences were also predicted as fibrils proteins by TAPASS, and only 72 sequences were identified by TAPASS exclusively (not by PSPredictor). For the “background” group, 66.88% of the sequences were predicted neither as LLPS proteins by PSPredictor nor as fibrils proteins by TAPASS, with an average probability score of 0.63, slightly higher than those of the other groups. These results indicate a certain level of consistency between the sequences identified by the FLFB model and PSPredictor for LLPS proteins, as well as between the sequences identified by the FLFB model and TAPASS for fibrils proteins.
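The percentages quoted in this section follow directly from the counts in Table 4. The short sketch below recomputes them as a check on the arithmetic; the dictionary keys and helper names are hypothetical, while the counts are copied from Table 4.

```python
# Counts from Table 4: rows are PSPredictor/TAPASS outcomes, columns are FLFB classes.
TABLE4 = {
    "neither":          {"LLPS": 671, "Fibrils": 940, "Background": 10305},
    "only_pspredictor": {"LLPS": 802, "Fibrils": 63,  "Background": 1729},
    "only_tapass":      {"LLPS": 266, "Fibrils": 72,  "Background": 1774},
    "both":             {"LLPS": 996, "Fibrils": 52,  "Background": 1599},
}


def column_total(flfb_class):
    """Number of sequences FLFB assigned to one class."""
    return sum(row[flfb_class] for row in TABLE4.values())


def row_total(group):
    """Number of sequences in one PSPredictor/TAPASS group."""
    return sum(TABLE4[group].values())


def percent(part, whole):
    return 100.0 * part / whole


TOTAL = sum(row_total(group) for group in TABLE4)

# Share of each FLFB class in the analyzed proteome.
FLFB_SHARE = {cls: percent(column_total(cls), TOTAL)
              for cls in ("LLPS", "Fibrils", "Background")}

# Agreement of each FLFB class with the matching external prediction.
LLPS_AGREE = percent(TABLE4["only_pspredictor"]["LLPS"] + TABLE4["both"]["LLPS"],
                     column_total("LLPS"))
FIBRILS_AGREE = percent(TABLE4["only_tapass"]["Fibrils"] + TABLE4["both"]["Fibrils"],
                        column_total("Fibrils"))
BACKGROUND_AGREE = percent(TABLE4["neither"]["Background"],
                           column_total("Background"))
```

The recomputed values agree with the quoted figures to within rounding (for example, 1798 of 2735 FLFB “LLPS” sequences, about 65.74%, are also PSPredictor positives, against 124 of 1127, about 11.00%, for the “Fibrils” column and TAPASS).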

A number of tools have been developed to predict proteins forming fibrils or undergoing LLPS. Among the predictors for fibrils proteins, such as AGGRESCAN (Conchillo-Sole et al., 2007), AgMata (Orlando et al., 2020), AmyloGram (Burdukiewicz et al., 2017), and ANuPP (Prabakaran, Rawat, Kumar, & Michael Gromiha, 2021), we chose TAPASS because it integrates different predictors and screens EARs within disordered regions. For LLPS proteins, apart from PSPredictor, predictors include DeepPhase (Saar et al., 2021), FuzDrop (Hardenberg et al., 2020), LLPhysScore (Cai et al., 2022), dSCOPE (Yu et al., 2023), and PhaSePred (Chen et al., 2022). They leverage engineered features and/or natural language models to encode sequence information. A recent evaluation indicates that, besides PSPredictor, FuzDrop also performs well in identifying LLPS proteins against different types of negative test sets (Liao et al., 2023); in addition, it can label potential aggregation regions (Vendruscolo & Fuxreiter, 2022). Therefore, in this study, we also compared the predictions of FLFB and FuzDrop. In addition, the recently developed predictor PhaSePred, which was trained on the newest dataset, was also chosen for comparison with FLFB.

The predicted data from FuzDrop and PhaSePred were taken from Hardenberg et al. (2020) and Chen et al. (2022) (http://predict.phasep.pro/download/), respectively. Because of UniProt ID inconsistencies (which may arise from different sequence versions), 266 of the 19,269 human proteome sequences used in this study have no results from FuzDrop, and 247 have no results from PhaSePred. The statistical results are summarized in Tables S5 and S6. Comparing the predicted results in Tables 4, S5, and S6 shows that, although there are obvious distinctions between the models, the relative ratios of sequence numbers between any two groups in each column do not deviate appreciably. On the whole, most of the “background” proteins classified by FLFB are also predicted neither to undergo LLPS nor to form fibrils, and most of the “LLPS” proteins classified by FLFB are also identified by the three LLPS protein predictors. However, the consistency between the “Fibrils” proteins classified by FLFB and those predicted by TAPASS is not as good as for the other two types. In addition, Table S5 shows that most proteins in the human proteome (89%) have aggregation-promoting regions longer than 6 aa as predicted by FuzDrop (Vendruscolo & Fuxreiter, 2022). The accuracy of these predictors requires experimental evaluation in the future.

2.5 Analysis of droplet-to-aggregate proteins based on FLFB

The FLFB model was developed to investigate the distinction between proteins undergoing LLPS and those forming fibrils, given the limited available data. However, it is worth noting that certain proteins exhibit experimentally observed liquid-to-solid phase transitions, that is, they belong to both LLPS and fibril-forming types. To explore how the FLFB model recognizes proteins undergoing liquid-to-solid phase transitions, we gathered sequences labeled as “droplet-to-fiber/solid/aggregates” from the LLPSDB v2.0 database, then applied CD-hit with a threshold of 0.4, and finally obtained a “droplet-to-aggregate” dataset including 30 sequences. The classification results obtained by applying the FLFB model to these sequences are presented in Table S7. Remarkably, most of the 30 proteins in this dataset (18/30) are identified as LLPS proteins, while only 4 are classified as fibril-forming proteins, and 8 are labeled as background proteins. These findings suggest that proteins undergoing liquid-to-solid transitions share sequence characteristics more similar to those in the “LLPS” dataset. By accumulating additional data on proteins prone to transitioning from liquid condensates to solid-like states in the future, it may be possible to discern their distinct sequence features compared with both “LLPS” and “Fibril” proteins. This knowledge will contribute to the development of a more robust and effective classification model.

3 CONCLUSION

In this study, we compared proteins undergoing LLPS with those forming fibrils at the sequence level by analyzing 36 selected feature parameters, 24 of which exhibit significant differences. Based on these 24 features, we developed an LLPS-Fibrils binary classification model to examine whether they could be used to distinguish the two groups of proteins, and further a three-class LLPS-Fibrils-Background classification model for rational prediction. Feature analysis reveals that the F_IDR of proteins makes a substantial contribution to the two-class model, surpassing the other features. In contrast, cysteine and leucine compositions play prominent roles in the three-class LLPS-Fibrils-Background classification model. Through a systematic feature ablation process, we refined the model to six essential features and built the final three-class identification model FLFB, whose classification performance is comparable to that of models including more features (an average AUROC value of 0.83). FLFB represents a valuable and innovative tool for assigning proteins to their respective groups, and potentially enables efficient identification of their functional characteristics.

4 MATERIALS AND METHODS

4.1 Datasets construction

In this study, we utilized sequences from several databases to analyze the distinctions between LLPS proteins and fibrils proteins. Subsequently, classification models were constructed based on these analyses. We selected sequences with a length ranging from 20 to 3000 amino acids for calculating all feature parameters used in this work.

4.1.1 Dataset of proteins undergoing LLPS

Proteins labeled as forming liquid droplet condensates in “one-protein” systems (where only one type of protein is present in the solution) deposited in the LLPSDB v2.0 database (Wang et al., 2022) were used to build the dataset. The entries recorded in LLPSDB v2.0 have been validated through in vitro experiments, ensuring that the identified proteins form liquid condensates via self-assembly (sequences marked as “droplet to solid-like/fiber/gel-like/aggregate/aggregation” were excluded). Due to the absence of posttranslational modification (PTM) information in the sequences, proteins with PTMs were not included in the dataset. Ultimately, a total of 841 sequences were obtained. To reduce redundancy, we applied CD-hit (Fu et al., 2012) clustering with a threshold of 0.4, and selected representative sequences from each cluster. Consequently, our LLPS dataset consists of 182 sequences, denoted as the “LLPS” dataset.

4.1.2 Dataset of proteins forming amyloid fibrils

The dataset of proteins forming amyloid fibrils was compiled from two databases: AmyPro (Varadi et al., 2018) and CPAD2.0 (Rawat et al., 2020), which provide annotated experimental evidence. From the AmyPro database, we collected 137 protein sequences that contain amyloid-forming regions. Additionally, we obtained 326 peptides and protein regions from the CPAD2.0 database that have been experimentally observed to form aggregates, comprised of 130 amyloid-forming peptides, 182 aggregation-prone regions in amyloidogenic proteins, and 124 sequences with experimental structures determined. To ensure data integrity, any sequences that were duplicated with those in the “LLPS” dataset were excluded. Through CD-hit clustering with a threshold of 0.4, we obtained a final set of 136 sequences, which constitutes the “Fibrils” dataset.

4.1.3 Dataset of background proteins

For the construction of the three-class classification model, background proteins were selected from the human proteome stored in the UniProt database (https://www.uniprot.org/proteomes/UP000005640). We randomly chose 910 non-redundant sequences, which is five times the number of sequences in the “LLPS” dataset. These background sequences were carefully curated to exclude any proteins with abnormal amino acids. This collection of sequences is referred to as the “Background” dataset and serves as the background for training and testing the three-class classification model.

4.2 Selected features

To compare the differences between the “LLPS” and “Fibrils” datasets, we employed a total of 36 potential features, which can be classified into the following five groups:

Composition of 20 amino acids, as well as related features such as FCR (fraction of charged residues), which quantifies the proportion of charged residues (R, K, D, E, and H); NCPR, which is the net charge per residue; FAR, the fraction of aromatic residues (F, Y, and W); and FHR, the fraction of hydrophobic residues (A, V, I, L, M, F, Y, and W).
Amino acid distribution pattern: Parameters SCD (proposed by Sawle & Ghosh, 2015) and κ (proposed by Das & Pappu, 2013) were utilized to characterize the distribution of charged residues in protein sequences. Additionally, we employed parameters Ω_Cation–π and Ω_π–π (Martin et al., 2020) to describe sequence patterns involved in cation–π and π–π interactions, respectively. These parameters were calculated similarly to κ, but with different residue groups (aromatic residues were considered as the origin of π interaction). For sequence patterns associated with hydrophobic interactions, we employed the parameter SHD (Zheng et al., 2020), which was calculated using a formula similar to SCD. Furthermore, the parameter ν (Zheng et al., 2020), which combines SCD and SHD, was also selected as a feature in this study. The formulas for these parameters are as follows:

κ, Ω_Cation–π and Ω_π–π:
$\sigma =\frac{{\left({F}_{\mathrm{group}1}-{F}_{\mathrm{group}2}\right)}^2}{{\left({F}_{\mathrm{group}1}+{F}_{\mathrm{group}2}\right)}^2}$

$\delta =\frac{\sum_{i=1}^{N_{\mathrm{blob}}}{\left({\sigma}_i-\sigma \right)}^2}{N_{\mathrm{blob}}}$

$\kappa, {\Omega}_{\mathrm{Cation}-\pi},{\Omega}_{\pi -\pi}=\frac{\delta }{\delta_{\mathrm{max}}}$
where F_group1 represents the fraction of positive residues (R and K), F_group2 represents the fraction of negative residues (D and E), and the length of the blob is 5 or 6 residues. δ_max is the maximum value of δ, which is found by rearranging the residues of the sequence. When calculating cation–π interactions, group2 pertains to aromatic residues (F, Y, and W), and when calculating π–π interactions, the two groups are aromatic residues and all other residues.

SCD:
$\mathrm{SCD}=\frac{1}{N}\left[\sum \limits_{m=2}^N\sum \limits_{n=1}^{m-1}{q}_m{q}_n{\left(m-n\right)}^{1/2}\right]$
where N is the total length of the sequence, m and n represent the positions of residues within the sequence, and q_m is the charge of the mth residue. In this study, we assigned a charge (q) of 1 for positively charged residues (R and K), −1 for negatively charged residues (D and E), and 0 for all other residues.

SHD:
$\mathrm{SHD}=\frac{1}{N}\sum \limits_{n}\sum \limits_{m>n}\left({\lambda}_n+{\lambda}_m\right){\left|m-n\right|}^{\beta }$
in which λ represents the hydropathy value of residues, and β is set to −1 to account for the contribution of sequence separation.

ν:
$\nu =-0.0423\times \mathrm{SHD}+0.0074\times \mathrm{SCD}+0.701$

All the aforementioned features were calculated using the localCIDER package (Holehouse et al., 2017), except for SCD, SHD, and ν, which were computed based on their respective formulas (Sawle & Ghosh, 2015; Zheng et al., 2020).
Protein secondary structure features: This group consists of propensities of α-helix, β-turn, and β-sheet, which were calculated using the “Protein Analysis” module from the Biopython package (Cock et al., 2009).
Protein solubility (P_sol): This physicochemical feature, which is closely associated with both LLPS and fibril formation, was chosen as a parameter in this study. P_sol was calculated using the online server Protein-sol (https://protein-sol.manchester.ac.uk/), which is a linear model that combines 10 sequence-based features to predict protein solubility (Hebditch et al., 2017).
Other features: This group includes the consideration of LC regions and the F_IDR in each protein sequence. The LC regions were determined using the SEG algorithm (Wootton & Federhen, 1993), and the parameter LC represents the proportion of the length of these regions to the overall sequence length. In addition, there are many tools available for evaluating protein predisposition to intrinsic disorder, including well-known predictors such as PONDR (Xue et al., 2010), IUPred3 (Erdos et al., 2021), Espritz (Walsh et al., 2012), and D²P² (Oates et al., 2013), and each of them may provide rather different outputs. In this work, we chose IUPred3 for calculating the F_IDR because it processes large numbers of sequences efficiently.
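As an illustration of the first two feature groups, the composition-based fractions and the SCD, SHD, and ν formulas given above can be sketched in plain Python. This is a minimal sketch rather than the code used in the study: the function names are hypothetical, and the hydropathy values λ are taken here from the Kyte-Doolittle scale rescaled to [0, 1], which is an assumption of the example and not a value stated in the text.

```python
import math

# Kyte-Doolittle hydropathy rescaled to [0, 1]. The choice of scale is an
# assumption of this sketch, not something specified in the article.
_KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}
HYDROPATHY = {aa: (value + 4.5) / 9.0 for aa, value in _KD.items()}

# Charges as assigned in the text: +1 for R/K, -1 for D/E, 0 otherwise.
CHARGE = {"R": 1, "K": 1, "D": -1, "E": -1}


def fraction(seq, residues):
    """Fraction of positions in `seq` occupied by any of `residues`."""
    return sum(aa in residues for aa in seq) / len(seq)


def fcr(seq):
    return fraction(seq, "RKDEH")      # charged residues as listed in the text


def far(seq):
    return fraction(seq, "FYW")        # aromatic residues


def fhr(seq):
    return fraction(seq, "AVILMFYW")   # hydrophobic residues


def ncpr(seq):
    return sum(CHARGE.get(aa, 0) for aa in seq) / len(seq)


def scd(seq):
    """Sequence charge decoration: (1/N) * sum_{m>n} q_m q_n (m-n)^(1/2)."""
    q = [CHARGE.get(aa, 0) for aa in seq]
    total = 0.0
    for m in range(1, len(q)):
        for n in range(m):
            total += q[m] * q[n] * math.sqrt(m - n)
    return total / len(q)


def shd(seq, beta=-1.0):
    """Sequence hydropathy decoration with exponent beta (set to -1)."""
    lam = [HYDROPATHY[aa] for aa in seq]
    total = 0.0
    for m in range(1, len(lam)):
        for n in range(m):
            total += (lam[n] + lam[m]) * (m - n) ** beta
    return total / len(lam)


def nu(seq):
    """Combine SHD and SCD with the coefficients given in the text."""
    return -0.0423 * shd(seq) + 0.0074 * scd(seq) + 0.701
```

The double loops are quadratic in sequence length, which is acceptable for a sketch over sequences of 20 to 3000 residues but would benefit from vectorization in a proteome-scale run.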

To assess the differences between the “LLPS” and “Fibrils” datasets for each feature, the Kolmogorov–Smirnov test (Smirnov, 1939) was employed. Features with a p-value <0.05 were selected for the subsequent construction of classification models.
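The feature screening step can be sketched with the two-sample Kolmogorov–Smirnov test available in SciPy. The helper name `select_features` and the feature values below are made up for illustration; only the p-value threshold of 0.05 comes from the text.

```python
from scipy.stats import ks_2samp


def select_features(llps, fibrils, alpha=0.05):
    """Keep features whose two-sample KS test gives p < alpha.

    `llps` and `fibrils` map a feature name to the list of values of that
    feature over the sequences of the respective dataset.
    """
    selected = []
    for name in llps:
        result = ks_2samp(llps[name], fibrils[name])
        if result.pvalue < alpha:
            selected.append(name)
    return selected


# Toy values (not data from the study): F_IDR differs between the two
# groups, whereas LC has identical distributions.
llps_features = {
    "F_IDR": [0.60 + 0.01 * i for i in range(30)],
    "LC": [0.01 * i for i in range(30)],
}
fibril_features = {
    "F_IDR": [0.01 * i for i in range(30)],
    "LC": [0.01 * i for i in range(30)],
}
selected = select_features(llps_features, fibril_features)
```

With these toy inputs only F_IDR passes the threshold. No multiple-testing correction is applied in this sketch, in line with the plain p < 0.05 criterion stated above.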

4.3 Classification model and evaluation

4.3.1 RF classification model

The RF model, a tree-based machine-learning algorithm known for its efficiency and excellent performance in handling tabular data, was employed in this study (Breiman, 2001). The model was constructed using the scikit-learn package in Python (Pedregosa et al., 2011), with the following parameters: n_estimators = 100, max_depth = 3, seed = 42, and default values for other parameters.

4.3.2 Model training

To train the LLPS-Fibrils binary classification model, a 5-fold cross-validation strategy was employed to evaluate the RF model's performance in distinguishing between LLPS proteins and fibrils proteins.
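A configuration of this kind might look as follows in scikit-learn. This is a sketch under stated assumptions, not the authors' script: the data are synthetic (generated with `make_classification` to stand in for the 24 screened features), the seed of 42 is assumed to correspond to scikit-learn's `random_state` argument, and the fold assignment is left to the default behavior of `cross_val_score`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a table of 24 features over LLPS/fibrils sequences.
X, y = make_classification(
    n_samples=300, n_features=24, n_informative=6, random_state=0
)

# Parameters as listed in the text; `random_state` is assumed to be the
# scikit-learn counterpart of "seed = 42".
model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)

# 5-fold cross-validation scored by AUROC.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
mean_auroc = scores.mean()
std_auroc = scores.std()
```

Reporting the mean together with the standard deviation over folds mirrors how performance is summarized for the three-class model.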

For the LLPS-Fibrils-Background classification model, it was crucial to balance the samples from the “LLPS,” “Fibrils,” and “Background” datasets to ensure reliable results. In each of the five training rounds, 182 sequences from the “Background” dataset (equal to the number of sequences in the “LLPS” dataset) were randomly sampled without replacement. Of these, 4/5 were used for training and 1/5 for testing. The sampling strategy for training and testing from the “LLPS” and “Fibrils” datasets was the same as that used in the LLPS-Fibrils binary classification model. The average score and corresponding standard deviation of the evaluation parameters over the five rounds were recorded as the classification performance. For the final three-class classification model FLFB, all the sequences were used for training. Since the “Background” dataset was divided into five parts, five classifiers were obtained. The score of the final model was determined as the averaged result of the five classifiers.
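One reading of this scheme, in which the 910 background sequences are split into five disjoint parts of 182 and the five resulting classifiers are averaged, can be sketched as below. The function names, the shuffling seed, and the example probabilities are hypothetical; the classifiers themselves are omitted and replaced by made-up class-probability vectors.

```python
import random


def split_background(n_background=910, n_parts=5, seed=42):
    """Shuffle background indices and cut them into equal, disjoint parts."""
    indices = list(range(n_background))
    random.Random(seed).shuffle(indices)
    size = n_background // n_parts
    return [indices[i * size:(i + 1) * size] for i in range(n_parts)]


def average_probabilities(per_classifier):
    """Average class-probability vectors produced by several classifiers."""
    n_classifiers = len(per_classifier)
    return [sum(column) / n_classifiers for column in zip(*per_classifier)]


parts = split_background()

# Made-up outputs of five classifiers for one sequence, in the order
# (LLPS, Fibrils, Background).
classes = ("LLPS", "Fibrils", "Background")
probabilities = [
    [0.6, 0.1, 0.3],
    [0.5, 0.2, 0.3],
    [0.7, 0.1, 0.2],
    [0.6, 0.2, 0.2],
    [0.6, 0.1, 0.3],
]
final = average_probabilities(probabilities)
predicted = classes[final.index(max(final))]
```

Each part would be combined with the full “LLPS” and “Fibrils” datasets to train one classifier, so that every classifier sees a class-balanced background sample.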

4.3.3 Model evaluation parameters

The AUROC was calculated for evaluation. In addition, the accuracy of prediction, MCC, Sensitivity, Specificity, and F1 score were also calculated.

$\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$

$\mathrm{MCC}=\frac{\mathrm{TP}\times \mathrm{TN}-\mathrm{FP}\times \mathrm{FN}}{\sqrt{\left(\mathrm{TP}+\mathrm{FP}\right)\times \left(\mathrm{TP}+\mathrm{FN}\right)\times \left(\mathrm{TN}+\mathrm{FP}\right)\times \left(\mathrm{TN}+\mathrm{FN}\right)}}$

$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$

$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$

$\mathrm{F}1\ \mathrm{Score}=2\times \frac{\mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$

In the above equations, TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively.
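These definitions translate directly into code. The sketch below uses a hypothetical helper name and made-up counts; specificity is not defined by an equation in the text, so the standard definition TN/(TN + FP) is assumed, and sensitivity is taken to be identical to recall.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation parameters from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)       # standard definition, assumed here
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

The sketch does not guard against empty classes; a zero denominator (for example, no predicted positives) would raise `ZeroDivisionError` and needs separate handling in practice.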

4.3.4 Feature analysis via SHAP and “Feature_importance”

In order to explain the classification models, we utilized the SHAP (SHapley Additive exPlanations) approach, which is a game-theoretic approach for explaining the output of machine-learning models (Lundberg & Lee, 2017). SHAP assigns an importance value to each feature for a particular prediction. The predicted score is obtained by combining the importance values linearly. Consequently, the importance values reflect the contribution of selected features to the overall prediction, and the average absolute SHAP value represents the importance of each feature for the model's decision-making process.

Additionally, we used the “Feature_importances” attribute (feature_importances_) of the RF model in the scikit-learn package (Pedregosa et al., 2011). It sums the gains obtained at each branch of the decision trees for every feature, thereby indicating the feature importance, and provides insight into how much each feature contributes to the overall performance of the model.

To explain the classification models in this study, we calculated both the SHAP values and “Feature_importances” for all the features included in the two-class and three-class classification models.
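The impurity-based importance can be read from a fitted scikit-learn forest as sketched below. The data are synthetic, with only the first of four features carrying signal, so the example shows the mechanics rather than any result of the study. SHAP values would be obtained with the separate `shap` package, which is not shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: four features, of which only feature 0 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
model.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = model.feature_importances_

# Feature indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]
```

Because these importances are derived from the training data alone, comparing them with SHAP values, as done in this study, provides a useful cross-check.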

AUTHOR CONTRIBUTIONS

Zhuqing Zhang: Conceptualization; methodology; supervision; funding acquisition; writing—original draft; resources; project administration; writing—review and editing; formal analysis. Shaofeng Liao: Methodology; validation; visualization; investigation; software; formal analysis; writing—original draft. Yujun Zhang: Data curation; formal analysis; investigation; methodology; validation; visualization. Xinchen Han: Methodology; validation; investigation; formal analysis; software; writing—original draft. Tinglan Wang: Software; visualization; validation; writing—review and editing. Xi Wang: Data curation; formal analysis; conceptualization. Qinglin Yan: Visualization; validation. Qian Li: Formal analysis; data curation. Yifei Qi: Methodology; supervision; writing—original draft; writing—review and editing; project administration.

ACKNOWLEDGMENTS

We thank Prof. Minghua Deng at Peking University for his helpful discussion, and Dr. Théo Falgarone for help in calculating fibrils proteins on the human proteome using TAPASS.

FUNDING INFORMATION

This work was supported by the National Natural Science Foundation of China [Grant Number: 32071250, 31870718, 22033001, and 21633001] and by the Fundamental Research Funds for the Central Universities.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflict of interest.

Open Research

DATA AVAILABILITY STATEMENT

The source codes of FLFB are available on GitHub (https://github.com/ucaszqzhang/FLFB), and the corresponding online webserver can be accessed at http://bio-comp.ucas.ac.cn/onlineserver/FLFB/.

Supporting Information

REFERENCES

Ahmad A, Uversky VN, Khan RH. Aberrant liquid-liquid phase separation and amyloid aggregation of proteins related to neurodegenerative diseases. Int J Biol Macromol. 2022; 220: 703–720.
10.1016/j.ijbiomac.2022.08.132
Ahmed AB, Znassi N, Château MT, Kajava AV. A structure-based approach to predict predisposition to amyloidosis. Alzheimers Dement. 2015; 11(6): 681–690.
10.1016/j.jalz.2014.06.007
Alberti S, Gladfelter A, Mittag T. Considerations and challenges in studying liquid-liquid phase separation and biomolecular condensates. Cell. 2019; 176(3): 419–434.
10.1016/j.cell.2018.12.035
Banani SF, Lee HO, Hyman AA, Rosen MK. Biomolecular condensates: organizers of cellular biochemistry. Nat Rev Mol Cell Biol. 2017; 18: 285–298.
10.1038/nrm.2017.7
Boeynaems S, Alberti S, Fawzi NL, Mittag T, Polymenidou M, Rousseau F, et al. Protein phase separation: a new phase in cell biology. Trends Cell Biol. 2018; 28: 420–435.
10.1016/j.tcb.2018.02.004
Breiman L. Random forests. Mach Learn. 2001; 45(1): 5–32.
10.1023/A:1010933404324
Bremer A, Farag M, Borcherds WM, Peran I, Martin EW, Pappu RV, et al. Deciphering how naturally occurring sequence features impact the phase behaviours of disordered prion-like domains. Nat Chem. 2022; 14(2): 196–207.
10.1038/s41557-021-00840-w
Burdukiewicz M, Sobczyk P, Rödiger S, Duda-Madej A, Mackiewicz P, Kotulska M. Amyloidogenic motifs revealed by n-gram analysis. Sci Rep. 2017; 7(1): 12961.
10.1038/s41598-017-13210-9
Cai H, Vernon RM, Forman-Kay JD. An interpretable machine-learning algorithm to predict disordered protein phase separation based on biophysical interactions. Biomolecules. 2022; 12(8): 1131.
10.3390/biom12081131
Chen Z, Hou C, Wang L, Yu C, Chen T, Shen B, et al. Screening membraneless organelle participants with machine-learning models that integrate multimodal features. Proc Natl Acad Sci U S A. 2022; 119(24):e2115369119.
10.1073/pnas.2115369119
Chu XQ, Sun T, Li Q, Xu Y, Zhang Z, Lai L, et al. Prediction of liquid-liquid phase separating proteins using machine learning. Bmc Bioinformatics. 2022; 23(1): 72.
10.1186/s12859-022-04599-w
Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25(11): 1422–1423.
10.1093/bioinformatics/btp163
Conchillo-Sole O, de Groot NS, Avilés FX, Vendrell J, Daura X, Ventura S. AGGRESCAN: a server for the prediction and evaluation of "hot spots" of aggregation in polypeptides. BMC Bioinformatics. 2007; 8: 1–17.
10.1186/1471-2105-8-65
Das RK, Pappu RV. Conformations of intrinsically disordered proteins are influenced by linear sequence distributions of oppositely charged residues. Proc Natl Acad Sci U S A. 2013; 110(33): 13392–13397.
10.1073/pnas.1304749110
Dignon GL, Best RB, Mittal J. Biomolecular phase separation: from molecular driving forces to macroscopic properties. Annu Rev Phys Chem. 2020; 71: 53–75.
10.1146/annurev-physchem-071819-113553
Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradović Z. Intrinsic disorder and protein function. Biochemistry. 2002; 41(21): 6573–6582.
10.1021/bi012159+
Erdos G, Pajkos M, Dosztanyi Z. IUPred3: prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res. 2021; 49(W1): W297–W303.
10.1093/nar/gkab408
Falgarone T, Villain É, Guettaf A, Leclercq J, Kajava AV. TAPASS: tool for annotation of protein amyloidogenicity in the context of other structural states. J Struct Biol. 2022; 214(1):107840.
10.1016/j.jsb.2022.107840
Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L. Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotech. 2004; 22(10): 1302–1306.
10.1038/nbt1012
Fonin AV, Antifeeva IA, Kuznetsova IM, Turoverov KK, Zaslavsky BY, Kulkarni P, et al. Biological soft matter: intrinsically disordered proteins in liquid-liquid phase separation and biomolecular condensates. Essays Biochem. 2022; 66: 831–847.
10.1042/EBC20220052
Fu LM, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23): 3150–3152.
10.1093/bioinformatics/bts565
Go C, Knight JD, Rajasekharan A, Rathod B, Hesketh GG, Abe KT, et al. A proximity biotinylation map of a human cell. Mol Cell Proteomics. 2019; 18(8):S24.
Hardenberg M, Horvath A, Ambrus V, Fuxreiter M, Vendruscolo M. Widespread occurrence of the droplet state of proteins in the human proteome. Proc Natl Acad Sci U S A. 2020; 117(52): 33254–33262.
10.1073/pnas.2007670117
Hebditch M, Carballo-Amador MA, Charonis S, Curtis R, Warwicker J. Protein-sol: a web tool for predicting protein solubility from sequence. Bioinformatics. 2017; 33(19): 3098–3100.
10.1093/bioinformatics/btx345
Holehouse AS, Das RK, Ahad JN, Richardson MOG, Pappu RV. CIDER: resources to analyze sequence-ensemble relationships of intrinsically disordered proteins. Biophys J. 2017; 112(1): 16–21.
10.1016/j.bpj.2016.11.3200
King OD, Gitler AD, Shorter J. The tip of the iceberg: RNA-binding proteins with prion-like domains in neurodegenerative disease. Brain Res. 2012; 1462: 61–80.
10.1016/j.brainres.2012.01.016
Li Q, Peng X, Li Y, Tang W, Zhu J, Huang J, et al. LLPSDB: a database of proteins undergoing liquid-liquid phase separation in vitro. Nucleic Acids Res. 2020; 48(D1): D320–D327.
10.1093/nar/gkz778
Li Q, Wang X, Dou Z, Yang W, Huang B, Lou J, et al. Protein databases related to liquid-liquid phase separation. Int J Mol Sci. 2020; 21(18): 6796.
10.3390/ijms21186796
Liao SF, Zhang Y, Qi Y, Zhang Z. Evaluation of sequence-based predictors for phase-separating protein. Brief Bioinform. 2023; 24: bbad213.
10.1093/bib/bbad213
Lin Y, Forman-Kay JD, Chan HS. Sequence-specific Polyampholyte phase separation in Membraneless organelles. Phys Rev Lett. 2016; 117(17):178101.
10.1103/PhysRevLett.117.178101
Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neur in. 2017; 30: 4765–4776.
Lyon AS, Peeples WB, Rosen MK. A framework for understanding the functions of biomolecular condensates across scales. Nat Rev Mol Cell Biol. 2021; 22(3): 215–235.
10.1038/s41580-020-00303-z
Martin EW, Holehouse AS, Peran I, Farag M, Incicco JJ, Bremer A, et al. Valence and patterning of aromatic residues determine the phase behavior of prion-like domains. Science. 2020; 367(6478): 694–699.
10.1126/science.aaw8653
Martin EW, Mittag T. Relationship of sequence and phase separation in protein low-complexity regions. Biochemistry. 2018; 57(17): 2478–2487.
10.1021/acs.biochem.8b00008
Meszaros B, Erdős G, Szabó B, Schád É, Tantos Á, Abukhairan R, et al. PhaSePro: the database of proteins driving liquid-liquid phase separation. Nucleic Acids Res. 2020; 48(D1): D360–D367.
Michaels TCT, Šarić A, Curk S, Bernfur K, Arosio P, Meisl G, et al. Dynamics of oligomer populations formed during the aggregation of Alzheimer's Abeta42 peptide. Nat Chem. 2020; 12(5): 445–451.
10.1038/s41557-020-0452-1
Ning WS, Guo Y, Lin S, Mei B, Wu Y, Jiang P, et al. DrLLPS: a data resource of liquid-liquid phase separation in eukaryotes. Nucleic Acids Res. 2020; 48(D1): D288–D295.
10.1093/nar/gkz1027
Oates ME, Romero P, Ishida T, Ghalwash M, Mizianty MJ, Xue B, et al. D²P²: database of disordered protein predictions. Nucleic Acids Res. 2013; 41(D1): D508–D516.
10.1093/nar/gks1226
Orlando G, Silva A, Macedo-Ribeiro S, Raimondi D, Vranken W. Accurate prediction of protein beta-aggregation with generalized statistical potentials. Bioinformatics. 2020; 36(7): 2076–2081.
10.1093/bioinformatics/btz912
Patel A, Lee HO, Jawerth L, Maharana S, Jahnel M, Hein MY, et al. A liquid-to-solid phase transition of the ALS protein FUS accelerated by disease mutation. Cell. 2015; 162(5): 1066–1077.
10.1016/j.cell.2015.07.047
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011; 12: 2825–2830.
Prabakaran R, Rawat P, Kumar S, Michael Gromiha M. ANuPP: a versatile tool to predict aggregation nucleating regions in peptides and proteins. J Mol Biol. 2021; 433(11):166707.
10.1016/j.jmb.2020.11.006
Prabakaran R, Rawat P, Thangakani AM, Kumar S, Gromiha MM. Protein aggregation: in silico algorithms and applications. Biophys Rev. 2021; 13(1): 71–89.
10.1007/s12551-021-00778-w
Raimondi D, Orlando G, Michiels E, Pakravan D, Bratek-Skicki A, van den Bosch L, et al. In silico prediction of in vitro protein liquid-liquid phase separation experiments outcomes with multi-head neural attention. Bioinformatics. 2021; 37(20): 3473–3479.
10.1093/bioinformatics/btab350
Rawat P, Prabakaran R, Sakthivel R, Mary Thangakani A, Kumar S, Gromiha MM. CPAD 2.0: a repository of curated experimental data on aggregating proteins and peptides. Amyloid. 2020; 27(2): 128–133.
10.1080/13506129.2020.1715363
Saar KL, Morgunov AS, Qi R, Arter WE, Krainer G, Lee AA, et al. Learning the molecular grammar of protein condensates from sequence determinants and embeddings. Proc Natl Acad Sci U S A. 2021; 118(15):e2019053118.
10.1073/pnas.2019053118
Sawle L, Ghosh K. A theoretical method to compute sequence dependent configurational properties in charged polymers and proteins. J Chem Phys. 2015; 143(8):085101.
10.1063/1.4929391
Shen BY, Chen Z, Yu C, Chen T, Shi M, Li T. Computational screening of phase-separating proteins. Genom Proteom Bioinf. 2021; 19(1): 13–24.
10.1016/j.gpb.2020.11.003
Shin Y, Brangwynne CP. Liquid phase condensation in cell physiology and disease. Science. 2017; 357(6357):eaaf4382.
10.1126/science.aaf4382
Smirnov N. On estimates of divergence of two empirical distribution curves for two independent samples. Bull Moskovsk Universiteta Matematika. 1939; 2: 3–14.
van Mierlo G, Jansen JRG, Wang J, Poser I, van Heeringen SJ, Vermeulen M. Predicting protein condensate formation using machine learning. Cell Rep. 2021; 34(5):108705.
10.1016/j.celrep.2021.108705
Varadi M, de Baets G, Vranken WF, Tompa P, Pancsa R. AmyPro: a database of proteins with validated amyloidogenic regions. Nucleic Acids Res. 2018; 46(D1): D387–D392.
10.1093/nar/gkx950
Vendruscolo M, Fuxreiter M. Sequence determinants of the aggregation of proteins within condensates generated by liquid-liquid phase separation. J Mol Biol. 2022; 434(1):167201.
10.1016/j.jmb.2021.167201
Vernon RM, Forman-Kay JD. First-generation predictors of biological protein phase separation. Curr Opin Struct Biol. 2019; 58: 88–96.
10.1016/j.sbi.2019.05.016
Walsh I, Martin AJM, di Domenico T, Tosatto SCE. ESpritz: accurate and fast prediction of protein disorder. Bioinformatics. 2012; 28(4): 503–509.
10.1093/bioinformatics/btr682
Walsh I, Seno F, Tosatto SCE, Trovato A. PASTA 2.0: an improved server for protein aggregation prediction. Nucleic Acids Res. 2014; 42(W1): W301–W307.
10.1093/nar/gku399
Wang X, Zhou X, Yan Q, Liao S, Tang W, Xu P, et al. LLPSDB v2.0: an updated database of proteins undergoing liquid-liquid phase separation in vitro. Bioinformatics. 2022; 38(7): 2010–2014.
10.1093/bioinformatics/btac026
Wegmann S, Eftekharzadeh B, Tepper K, Zoltowska KM, Bennett RE, Dujardin S, et al. Tau protein liquid-liquid phase separation can initiate tau aggregation. EMBO J. 2018; 37(7): e98049.
10.15252/embj.201798049
Wootton JC, Federhen S. Statistics of local complexity in amino-acid-sequences and sequence databases. Comput Chem. 1993; 17(2): 149–163.
10.1016/0097-8485(93)85006-X
Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN. PONDR-FIT: a meta-predictor of intrinsically disordered amino acids. Biochim Biophys Acta. 2010; 1804(4): 996–1010.
10.1016/j.bbapap.2010.01.011
You KQ, Huang Q, Yu C, Shen B, Sevilla C, Shi M, et al. PhaSepDB: a database of liquid-liquid phase separation related proteins. Nucleic Acids Res. 2020; 48(D1): D354–D359.
10.1093/nar/gkz847
Youn JY, Dyakov BJA, Zhang J, Knight JDR, Vernon RM, Forman-Kay JD, et al. Properties of stress granule and P-body proteomes. Mol Cell. 2019; 76(2): 286–294.
10.1016/j.molcel.2019.09.014
Yu K, Liu Z, Cheng H, Li S, Zhang Q, Liu J, et al. dSCOPE: a software to detect sequences critical for liquid-liquid phase separation. Brief Bioinform. 2023; 24(1): bbac550.
10.1093/bib/bbac550
PubMed Web of Science® Google Scholar
Zhang H, Ji X, Li P, Liu C, Lou J, Wang Z, et al. Liquid-liquid phase separation in biology: mechanisms, physiological functions and human diseases. Sci China Life Sci. 2020; 63(7): 953–985.
10.1007/s11427-020-1702-x
PubMed Web of Science® Google Scholar
Zheng WW, Dignon G, Brown M, Kim YC, Mittal J. Hydropathy patterning complements charge patterning to describe conformational preferences of disordered proteins. J Phys Chem Lett. 2020; 11(9): 3408–3415.
10.1021/acs.jpclett.0c00288
CAS PubMed Web of Science® Google Scholar


Volume 33, Issue 3, March 2024, e4927

