A sequence-based model for identifying proteins undergoing liquid–liquid phase separation/forming fibril aggregates via machine learning
Shaofeng Liao and Yujun Zhang contributed equally to this work.
Reviewing Editor: Nir Ben Tal
Abstract
Liquid–liquid phase separation (LLPS) and the formation of solid aggregates (also referred to as amyloid aggregates) by proteins have gained significant attention in recent years due to their associations with various physiological and pathological processes in living organisms. However, the differences and connections between proteins undergoing LLPS and those forming amyloid fibrils have not yet been systematically investigated at the sequence level. In this research, we aim to address this gap by comparing the two types of proteins across 36 features using currently available data. The statistical comparison indicates that 24 of the 36 selected features exhibit a significant difference between the two protein groups. An LLPS-Fibrils binary classification model built on these 24 features using random forest identifies the fraction of intrinsically disordered residues (FIDR) as the most crucial feature. In contrast, in the three-class LLPS-Fibrils-Background classification model built on the same screened features, the compositions of cysteine and leucine contribute more than the other features. Through feature ablation analysis, we finally constructed a model, FLFB (Feature-based LLPS-Fibrils-Background protein predictor), using six refined features, with an average area under the receiver operating characteristic curve of 0.83. This work indicates that proteins undergoing LLPS or forming amyloid fibrils can be identified using sequence features and a machine learning model.
1 INTRODUCTION
Liquid–liquid phase separation (LLPS) has been recognized to underlie the formation of biomolecular condensates and membraneless organelles (MLOs) in cells, and to play a crucial role in various biological processes such as stress responses, RNA metabolism, and chromatin organization (Banani et al., 2017; Boeynaems et al., 2018; Lyon et al., 2021). Experimental evidence suggests that the dysregulation of biomolecular LLPS can contribute to the development of diverse diseases (Shin & Brangwynne, 2017; Zhang et al., 2020). Certain pathogenic mutations in proteins such as FUS, TDP43, and Tau can cause the transformation of liquid condensates into solid-like aggregates, which are strongly associated with several incurable neurodegenerative diseases (Ahmad et al., 2022; King et al., 2012; Patel et al., 2015; Wegmann et al., 2018). However, it is worth noting that the majority of proteins undergoing LLPS do not form solid aggregates. Conversely, solid aggregates can also be formed via oligomeric intermediates in a homogeneous system, not just through LLPS (Michaels et al., 2020; Vendruscolo & Fuxreiter, 2022). Despite the availability of a number of tools for predicting proteins that undergo LLPS (Chu et al., 2022; Raimondi et al., 2021; Saar et al., 2021; Shen et al., 2021; van Mierlo et al., 2021; Vernon & Forman-Kay, 2019) or that can potentially form amyloid fibrils (Prabakaran, Rawat, Thangakani, et al., 2021), only a few of them, such as FuzDrop (Hardenberg et al., 2020), provide information on both droplet-promoting regions and aggregation-promoting regions for a query protein. Therefore, the differences and connections in sequence between these two types of proteins remain unclear.
The underlying driving forces of protein LLPS have been attributed to multivalent inter- and intramolecular interactions, including electrostatic, π–π, cation–π, and hydrophobic interactions (Dignon et al., 2020). Furthermore, experimental studies have highlighted the crucial role of intrinsically disordered proteins (IDPs) and regions, particularly low complexity (LC) regions, in the formation of biomolecular liquid condensates (Alberti et al., 2019; Fonin et al., 2022; Martin & Mittag, 2018). IDPs are proteins that have no stable 3D structure under normal physiological conditions when they are free, yet have a non-negligible influence on various cellular events (Dunker et al., 2002). In biomolecular LLPS or the formation of MLOs, they can not only form multivalent interactions, but their intrinsic flexibility also contributes to the liquid property of biomolecular condensates. The molecular grammar of proteins that undergo LLPS, especially IDPs, has been explored extensively in experimental, theoretical, and simulation studies (Bremer et al., 2022; Dignon et al., 2020; Lin et al., 2016; Martin et al., 2020). These studies indicate that not only the amino acid composition, but also the distribution patterns of amino acids involved in different types of multivalent molecular interactions determine the phase behavior of a protein. In fact, the investigation of amyloid formation mechanisms predates the study of protein LLPS. The various types of multivalent interactions observed in LLPS of proteins are also essential factors in the formation of amyloids. In this study, our focus is on whether quantitative parameters derived from protein sequences can effectively differentiate proteins involved in LLPS from those associated with fibril formation.
Recent advancements have led to the construction of databases dedicated to proteins undergoing LLPS (Go et al., 2019; Li, Peng, et al., 2020; Li, Wang, et al., 2020; Meszaros et al., 2020; Ning et al., 2020; You et al., 2020; Youn et al., 2019) and those forming amyloid fibrils (Prabakaran, Rawat, Thangakani, et al., 2021), providing a solid foundation of data for analyzing the sequence features of these protein types. In this study, we leverage these valuable resources, specifically the recently released LLPSDB v2.0 (Wang et al., 2022) for proteins involved in LLPS, as well as the databases AmyPro (Varadi et al., 2018) and CPAD2.0 (Rawat et al., 2020) for proteins associated with amyloid fibril formation, and systematically compare selected sequence features, including sequence characteristics, molecular interactions, secondary structure propensity, fraction of intrinsically disordered residues (FIDR), and other relevant features, between the two protein categories. Based on the screened features with significant differences, we construct classification models capable of distinguishing between liquid–liquid phase separating proteins (hereafter referred to as “LLPS proteins”), fibril-forming proteins (hereafter referred to as “fibrils proteins”), and other proteins. Unlike previous studies, which mostly focused on either “LLPS proteins” or “fibrils proteins,” this study attempts to elucidate the differences and relationships in sequence between the two protein categories. The screened features that exhibit significant differences and contribute most to the classification models may provide new insights into the mechanisms governing related cellular processes.
2 RESULTS AND DISCUSSION
2.1 Features with significant difference between “LLPS” and “fibrils”
To compare the sequences of “LLPS proteins” and “fibrils proteins,” we first constructed an “LLPS” dataset based on LLPSDB v2.0 (Wang et al., 2022) and a “Fibrils” dataset based on AmyPro (Varadi et al., 2018) and CPAD 2.0 (Rawat et al., 2020). The two datasets contain 182 and 136 sequences, respectively (see “Section 4.1” for details).
Drawing on extensive investigations of the mechanisms of protein LLPS and fibril formation, we selected 36 sequence features for statistical analysis of the two protein categories. They include the composition of each single amino acid; the proportions of charged, aromatic, and hydrophobic residues, respectively; net charge per residue (NCPR); sequence charge decoration (SCD; Sawle & Ghosh, 2015), κ (Das & Pappu, 2013), sequence hydropathy decoration (SHD; Zheng et al., 2020), ΩCation–π and Ωπ–π (Martin et al., 2020), which represent the distribution patterns of amino acids involved in electrostatic (SCD and κ), hydrophobic, cation–π, and π–π interactions, respectively; a parameter ν (Zheng et al., 2020), which combines SCD and SHD; the propensities for the protein secondary structures α-helix, β-turn, and β-sheet, respectively; protein solubility; FIDR; and the fraction of the sequence in LC regions. More details about these features can be found in Section 4.2 and Table S1.
The statistical comparison reveals that 24 of the 36 selected features exhibit a significant difference (p-value < 0.05) between the “LLPS” and “Fibrils” datasets, as shown in Table S1 and Figure 1. The three features with the largest distinction are FIDR, “α-helix” propensity, and LC, with p-values lower than 10⁻¹². LLPS proteins display higher FIDR and LC, and a lower “α-helix” propensity, compared with fibrils proteins. This suggests that intrinsic disorder and LC may be basic attributes of LLPS proteins. Based on these findings, we construct an LLPS-Fibrils protein classification model to examine whether the 24 screened features can classify the two types of protein well, and further build a three-class (LLPS-Fibrils-Background) classification model for prediction.
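As an illustration of the composition-based features listed above, the following minimal sketch computes the 20 single amino acid compositions together with FCR, FAR, FHR, and NCPR for one sequence. The residue groups follow Section 4.2. The function name `composition_features` is hypothetical, and the charge convention used for NCPR (R and K as +1, D and E as −1, H as neutral) is an assumption, since the convention is not stated here.

```python
from collections import Counter

# Residue groups as listed in Section 4.2.
CHARGED = set("RKDEH")
AROMATIC = set("FYW")
HYDROPHOBIC = set("AVILMFYW")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return composition-based features for one protein sequence.

    Hypothetical helper for illustration only.
    """
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)

    # Fraction of each of the 20 standard amino acids.
    features = {aa: counts[aa] / n for aa in AMINO_ACIDS}

    # Fractions of charged, aromatic, and hydrophobic residues.
    features["FCR"] = sum(counts[aa] for aa in CHARGED) / n
    features["FAR"] = sum(counts[aa] for aa in AROMATIC) / n
    features["FHR"] = sum(counts[aa] for aa in HYDROPHOBIC) / n

    # ASSUMPTION: R and K carry +1, D and E carry -1, H is treated as neutral.
    positive = counts["R"] + counts["K"]
    negative = counts["D"] + counts["E"]
    features["NCPR"] = (positive - negative) / n
    return features
```

Pattern parameters such as SCD, κ, and SHD depend on residue positions as well as counts and are therefore not covered by this sketch.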
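The screening step can be sketched as follows, assuming a two-sample Kolmogorov–Smirnov test applied to each feature independently with a threshold of 0.05. The p-value here uses the standard asymptotic Kolmogorov series with a small-sample correction, which is an approximation and not necessarily the implementation used in this work. The function names are hypothetical, and no multiple-testing correction is applied, which is also an assumption.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic p-value (approximation)."""
    xs, ys = sorted(a), sorted(b)
    n, m = len(xs), len(ys)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    # D is the largest gap between the two empirical CDFs,
    # which is always attained at one of the sample points.
    d = max(
        abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
        for v in xs + ys
    )

    # Effective sample size and small-sample correction.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The p-value is numerically indistinguishable from 1 in this range.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


def screen_features(group_a, group_b, alpha=0.05):
    """Keep the features whose distributions differ between the two groups.

    group_a, group_b: dicts mapping feature name -> list of per-sequence values.
    """
    kept = []
    for name in group_a:
        _, p = ks_two_sample(group_a[name], group_b[name])
        if p < alpha:
            kept.append(name)
    return kept
```

Applied to two feature tables, `screen_features` returns only the feature names that pass the threshold, which mirrors the reduction from 36 to 24 features described above.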

2.2 LLPS-Fibrils binary classification model
To gain insight into the importance of the 24 screened features in distinguishing LLPS proteins from fibrils proteins, we first construct an LLPS-Fibrils binary classification model using the Random Forest (RF) algorithm and the training strategy described in Section 4.3. In this model, the samples in the “Fibrils” dataset were considered as positives, while those in the “LLPS” dataset were treated as negatives. The results, including evaluation parameters such as area under the receiver operating characteristic curve (AUROC), Precision, Recall, F1 Score, and confusion matrices, are summarized in Table 1, Figure S1, and Table S2. The standard deviations observed for these parameters during the 5-fold cross-validation demonstrate the stability of the model. The average AUROC value of 0.837 suggests that the 24 screened features are effective in distinguishing LLPS proteins from fibrils proteins. However, it should be noted that the F1 Score, Recall, and Matthews correlation coefficient (MCC) values are relatively low, which may be attributed to the insufficiency of the model or the limitations of the dataset.
AUROC | Precision | Recall | F1_score | Accuracy | MCC |
---|---|---|---|---|---|
0.837 ± 0.058 | 0.750 ± 0.072 | 0.642 ± 0.081 | 0.687 ± 0.058 | 0.748 ± 0.069 | 0.487 ± 0.125 |
- Abbreviations: AUROC, area under the receiver operating characteristics; LLPS, liquid–liquid phase separation; MCC, Matthew's correlation coefficient.
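For reference, the threshold-dependent metrics reported above can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions, with “Fibrils” as the positive class as described in the text. The function name `binary_metrics` is hypothetical and the counts in the usage note are illustrative only, not taken from Table S2.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Precision, Recall, F1, Accuracy, and MCC from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0

    # MCC stays informative when the two classes differ in size.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For example, `binary_metrics(tp=8, fp=2, fn=2, tn=8)` describes a toy classifier that is right 80% of the time in each class. Because recall depends only on the positive class, a model can reach moderate accuracy while still missing a substantial share of fibrils proteins, as observed here.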
We then analyze feature contributions using both SHAP and the “Feature_importances” measure provided by scikit-learn. Remarkably, both methods yield similar results: Figure S2 shows that the top four features contributing to the model are FIDR, α-helix propensity, LC, and SHD. Figure 2 shows the distribution of the SHAP values of each sequence in the datasets, indicating that FIDR is the most influential feature, with a higher FIDR value associated with a more positive prediction for LLPS proteins. The significance of intrinsic disorder for protein LLPS has been discussed widely (Alberti et al., 2019; Fonin et al., 2022), because IDPs/IDRs play an important role not only as “stickers,” from which the multivalent interactions within coacervates arise, but also as “spacers,” which provide molecular flexibility and keep coacervates liquid. The results here indicate that LLPS proteins are more disordered than fibrils proteins. Meanwhile, it is somewhat unexpected that fibrils proteins show a positive correlation with higher α-helix propensities, since fibril aggregates usually form a “cross-β” structure. In addition, Figure 2 shows that higher LC and SHD values, as well as a larger arginine content, are more common in LLPS proteins than in fibrils proteins. Overall, the analysis of feature importance using both SHAP and “Feature_importances” highlights the relative weight of the identified features in distinguishing between LLPS proteins and fibrils proteins.

Although a simple LLPS-Fibrils binary model cannot classify proteins that belong to both categories or to neither of them, it indicates that the screened features may be effective for classification. Based on this, we further construct a multiclass prediction model, as follows.
2.3 Three-class LLPS-Fibrils-Background classification model
In fact, several LLPS proteins have been observed experimentally to further form fibrils, which in some cases are associated with diseases (Patel et al., 2015; Wegmann et al., 2018). However, the number of proteins involved in both LLPS and fibril formation is limited: only 30 sequences (after CD-hit with a cutoff of 0.4) labeled “droplet-to-aggregate” (or a similar term) were screened from LLPSDB v2.0, which is not adequate for training as a separate class. In addition, many proteins may neither undergo LLPS nor form fibril aggregates under normal physiological conditions; we call them general proteins in this work. Until now, there has been no strict, experimentally validated dataset of general proteins. To distinguish “LLPS” and “Fibrils” proteins from general proteins, we constructed a “Background” dataset using the human proteome, excluding the sequences in the “LLPS” and “Fibrils” datasets. Using the 24 features obtained above, we developed LLPS-Fibrils-Background classification models using RF and the training strategy described in Section 4.3.
The evaluation results are presented in Table 2, Table S3, and Figure S3. The AUROC values for the three classes are all larger than 0.83, with the one for the “Background” class slightly larger than those for the other two groups. The classification for the “Fibrils” class exhibits relatively lower values of Recall, F1 Score, and MCC, but slightly larger values of Precision and Accuracy. This suggests that a prediction of “Fibrils” by the model is comparatively reliable, but that a significant fraction of fibrils proteins may be classified as LLPS or background proteins. Overall, the three-class classification models show reasonable performance in distinguishing between LLPS, fibrils, and general proteins.
Metric | Background | LLPS | Fibrils |
---|---|---|---|
AUROC | 0.856 ± 0.027 | 0.834 ± 0.023 | 0.832 ± 0.043 |
Precision | 0.660 ± 0.059 | 0.656 ± 0.025 | 0.682 ± 0.035 |
Recall | 0.719 ± 0.039 | 0.731 ± 0.059 | 0.485 ± 0.034 |
F1_score | 0.686 ± 0.033 | 0.690 ± 0.032 | 0.566 ± 0.026 |
Accuracy | 0.760 ± 0.033 | 0.762 ± 0.020 | 0.798 ± 0.011 |
MCC | 0.496 ± 0.060 | 0.502 ± 0.046 | 0.451 ± 0.030 |
- Abbreviations: AUROC, area under the receiver operating characteristics; LLPS, liquid–liquid phase separation; MCC, Matthew's correlation coefficient.
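Per-class AUROC values such as those in the table above are commonly obtained with a one-vs-rest scheme, in which each class in turn is treated as positive and the other two as negative. The sketch below assumes that scheme, which is not confirmed by the text, and computes AUROC as the probability that a positive sample receives a higher score than a negative one. The function names and the input layout are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auroc(true_classes, class_probs):
    """Per-class AUROC under a one-vs-rest scheme (an assumed scheme).

    true_classes: list of class names, one per sequence.
    class_probs: list of dicts mapping class name -> predicted probability.
    """
    result = {}
    for cls in sorted(set(true_classes)):
        labels = [1 if t == cls else 0 for t in true_classes]
        scores = [probs[cls] for probs in class_probs]
        result[cls] = auroc(labels, scores)
    return result
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for datasets of the size used here but would need a rank-based implementation for proteome-scale evaluation.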
Using SHAP and the “Feature_importances” measure provided by scikit-learn, we analyzed the contributions of the selected features in the LLPS-Fibrils-Background classification model. Figure 3 indicates that both methods rank the top seven features in the same order: the compositions of cysteine, leucine, and asparagine, followed by SHD, the composition of tryptophan, FIDR, and LC. This ranking clearly differs from that obtained in the LLPS-Fibrils classification model. For the top two features, the compositions of cysteine and leucine, Figure S4 compares their distributions in the “LLPS,” “Fibrils,” and “Background” datasets. It reveals that while the differences in cysteine and leucine content between the “LLPS” and “Fibrils” datasets are less pronounced than those of other features, these two attributes differ markedly between general proteins in the background dataset and LLPS/fibrils proteins, making them the most significant features in the three-class identification model. Conversely, FIDR, the most prominent feature in the two-class identification model, shows only moderate importance in the LLPS-Fibrils-Background classification model. Overall, the feature contributions in the LLPS-Fibrils-Background classification model are distinct from those in the binary classification model.

Because the features used in the models were selected based on prior knowledge and screened by statistical difference (KS-test), there may be some redundancy among them. To reduce the number of features while achieving comparable performance, we conducted an ablation analysis. Starting with the initial set of 24 features, we first removed one feature at a time while retaining the remaining 23 features to train the model. The results in Table S4 indicate that removing a single feature does not significantly impact the classification performance, even when the most important feature, the composition of cysteine, is removed. This suggests that the remaining features are capable of compensating for the removed feature. Since there are numerous possible combinations of the remaining features, testing all combinations would be time-consuming. To simplify the analysis, we trained the model by progressively removing features based on their importance, as determined by the mean absolute SHAP values in Figure 3a. We started by removing the feature with the lowest importance, the composition of tyrosine (Y), resulting in a model trained on the remaining 23 features with an averaged AUROC value of 0.84 ± 0.036 (line 3 of Table 3). We continued this process, removing additional features in order of increasing importance, that is, Y + V, Y + V + F, and so forth. The averaged AUROC values of these models are listed in Table 3, revealing that the model's performance remains almost stable (an averaged AUROC value of 0.83) up to and including the removal of the feature LC. This means that a model consisting of six features (the compositions of C, L, N, and W, together with SHD and FIDR) can achieve identification performance comparable to that of models with a larger number of features. Based on the ablation analysis, the final LLPS-Fibrils-Background three-class model (FLFB) was trained using these six features. This approach helps to reduce the complexity of the model while maintaining its classification performance.
Feature (retained) | Background | LLPS | Fibrils | 3class_ave |
---|---|---|---|---|
Y | 0.856 ± 0.027 | 0.834 ± 0.023 | 0.832 ± 0.043 | 0.84 ± 0.034 |
V | 0.855 ± 0.027 | 0.831 ± 0.023 | 0.832 ± 0.047 | 0.84 ± 0.036 |
F | 0.856 ± 0.027 | 0.829 ± 0.020 | 0.827 ± 0.050 | 0.84 ± 0.037 |
I | 0.856 ± 0.026 | 0.830 ± 0.025 | 0.830 ± 0.045 | 0.84 ± 0.036 |
NCPR | 0.858 ± 0.028 | 0.830 ± 0.026 | 0.833 ± 0.050 | 0.84 ± 0.038 |
FAR | 0.855 ± 0.028 | 0.830 ± 0.023 | 0.831 ± 0.051 | 0.84 ± 0.038 |
Ωπ–π | 0.855 ± 0.025 | 0.831 ± 0.025 | 0.822 ± 0.051 | 0.84 ± 0.039 |
T | 0.853 ± 0.024 | 0.827 ± 0.023 | 0.818 ± 0.046 | 0.83 ± 0.036 |
FCR | 0.856 ± 0.028 | 0.832 ± 0.025 | 0.822 ± 0.045 | 0.84 ± 0.037 |
P | 0.856 ± 0.028 | 0.832 ± 0.026 | 0.825 ± 0.055 | 0.84 ± 0.041 |
G | 0.856 ± 0.028 | 0.832 ± 0.025 | 0.823 ± 0.052 | 0.84 ± 0.040 |
ν | 0.805 ± 0.031 | 0.829 ± 0.030 | 0.814 ± 0.059 | 0.83 ± 0.045 |
FHR | 0.858 ± 0.029 | 0.827 ± 0.027 | 0.824 ± 0.050 | 0.84 ± 0.040 |
R | 0.858 ± 0.026 | 0.829 ± 0.021 | 0.823 ± 0.053 | 0.84 ± 0.039 |
P_sol | 0.852 ± 0.030 | 0.829 ± 0.024 | 0.813 ± 0.050 | 0.83 ± 0.040 |
β-turn | 0.858 ± 0.024 | 0.829 ± 0.031 | 0.815 ± 0.044 | 0.83 ± 0.039 |
α-helix | 0.852 ± 0.029 | 0.832 ± 0.029 | 0.812 ± 0.036 | 0.83 ± 0.036 |
LC | 0.856 ± 0.026 | 0.826 ± 0.033 | 0.814 ± 0.046 | 0.83 ± 0.040 |
FIDR | 0.860 ± 0.023 | 0.830 ± 0.031 | 0.812 ± 0.053 | 0.83 ± 0.043 |
W | 0.861 ± 0.032 | 0.807 ± 0.030 | 0.791 ± 0.046 | 0.82 ± 0.047 |
SHD | 0.859 ± 0.033 | 0.803 ± 0.030 | 0.793 ± 0.055 | 0.82 ± 0.050 |
N | 0.853 ± 0.031 | 0.777 ± 0.038 | 0.703 ± 0.063 | 0.78 ± 0.076 |
L | 0.844 ± 0.032 | 0.786 ± 0.036 | 0.695 ± 0.060 | 0.77 ± 0.076 |
C | 0.783 ± 0.024 | 0.723 ± 0.020 | 0.684 ± 0.040 | 0.73 ± 0.050 |
- Note: In the final model FLFB, the kept six features include FIDR, SHD, as well as the component of W, N, L and C, respectively. Their feature names and the AUROC values of model are shown in bold.
- Abbreviations: AUROC, area under the receiver operating characteristics; FAR, fraction of aromatic residues; FCR, fraction of charged residues; FHR, fraction of hydrophobic residues; FLFB, final LLPS-Fibrils-Background three-class model; LC, low complexity; LLPS, liquid–liquid phase separation; NCPR, net charge per residue; SHAP, SHapley Additive exPlanations.
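The progressive ablation summarized in the table above can be sketched as a simple loop that repeatedly drops the least important remaining feature and re-evaluates the model. In the sketch below, `evaluate` is a hypothetical callback that would train the classifier on the retained features and return its cross-validated averaged AUROC; both function names and the `tolerance` parameter are assumptions introduced for illustration.

```python
def ablate_by_importance(ranked_features, evaluate):
    """Drop features one at a time, least important first.

    ranked_features: feature names sorted from most to least important.
    evaluate: hypothetical callback, list of features -> score.
    Returns a list of (retained_features, score), starting with the full set.
    """
    retained = list(ranked_features)
    results = [(tuple(retained), evaluate(retained))]
    while len(retained) > 1:
        retained = retained[:-1]  # remove the least important remaining feature
        results.append((tuple(retained), evaluate(retained)))
    return results


def smallest_stable_set(results, tolerance):
    """Last retained set reached before the score drops by more than tolerance."""
    full_score = results[0][1]
    best = results[0][0]
    for features, score in results[1:]:
        if full_score - score > tolerance:
            break
        best = features
    return best
```

With the feature ranking from Figure 3a and a tolerance of about 0.01 on the averaged AUROC, this kind of loop would stop at the six-feature set reported here, although the exact stopping criterion used in this work is not specified.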
2.4 Proteomes analysis based on FLFB
To assess the consistency between the FLFB model and existing predictors for LLPS proteins and fibrils proteins, we compared their predictions on protein sequences from the human proteome (excluding sequences in the “LLPS,” “Fibrils,” and “Background” datasets). The sequences classified by the FLFB model were compared with predictions from PSPredictor (Chu et al., 2022) for LLPS proteins and from TAPASS (Falgarone et al., 2022) for fibrils proteins. PSPredictor can efficiently perform large-scale computations and exhibited excellent performance in distinguishing LLPS proteins from the human proteome in a recent evaluation (Liao et al., 2023). TAPASS effectively improved the prediction of amyloids by incorporating structural information. Out of the 19,269 proteins analyzed, ~27.20% were predicted to be LLPS proteins by PSPredictor and around 24.70% were predicted to form fibrils by TAPASS. There were 2647 sequences (~13.74%) predicted to be both LLPS proteins and fibrils proteins by the two predictors. The FLFB model classified 14.19% of the sequences as LLPS proteins, 5.85% as fibrils proteins, and the remaining 79.96% as general proteins, meaning neither LLPS proteins nor fibrils proteins. The number of sequences in each overlap between the three predictors is summarized in Table 4, with the values in parentheses representing the averaged FLFB scores of the sequences in the corresponding group.
PSPredictor/TAPASS prediction | FLFB: LLPS | FLFB: Fibrils | FLFB: Background |
---|---|---|---|
Neither PSPredictor nor TAPASS | 671 (0.48) | 940 (0.50) | 10,305 (0.63) |
Only PSPredictor | 802 (0.56) | 63 (0.51) | 1729 (0.59) |
Only TAPASS | 266 (0.50) | 72 (0.50) | 1774 (0.62) |
Both PSPredictor and TAPASS | 996 (0.55) | 52 (0.48) | 1599 (0.58) |
- Note: The value within bracket denotes the average FLFB score of the sequences within the corresponding group. “Neither PSPredictor nor TAPASS” represents the group within which the sequences were predicted neither by PSPredictor nor by TAPASS; and “Both PSPredictor and TAPASS” means the group within which the sequences were predicted by both PSPredictor and TAPASS.
- Abbreviations: FLFB, Feature-based LLPS-Fibrils-Background protein predictor; LLPS, liquid–liquid phase separation.
Before discussing the results presented in Table 4, it is important to note two points. First, in the TAPASS pipeline, three predictors of protein aggregation, ArchCandy2.0 (Ahmed et al., 2015), Pasta2.0 (Walsh et al., 2014), and TANGO (Fernandez-Escamilla et al., 2004), were used to identify aggregation regions in protein sequences; only the regions with a minimum of 80% disordered residues were then identified as exposed aggregation regions (EARs), which excludes regions hidden within the 3D structure of folded proteins. In this study, a protein is recognized as a fibrils protein if it contains at least one EAR as predicted by any of the three predictors. Second, it is important to acknowledge that proteins classified as “background” are not definitively non-“LLPS” or non-“Fibrils.” This uncertainty arises from the lack of experimental validation for a substantial fraction of proteins in the human proteome. Consequently, the “Background” dataset may contain false negatives, which can influence the performance of the three-class classification model. The columns in Table 4 present the number of sequences classified into each group by the FLFB model. For the “LLPS” group, ~65.74% of the sequences were also predicted as LLPS proteins by PSPredictor. Notably, the 802 sequences identified exclusively by PSPredictor (not by TAPASS) exhibited an average probability score of 0.56, slightly higher than those in the other groups. For the “Fibrils” group, in contrast, only around 11.00% of the sequences were also predicted as fibrils proteins by TAPASS, and only 72 sequences were identified exclusively by it (not by PSPredictor). For the “Background” group, 66.88% of the sequences were predicted neither as LLPS proteins by PSPredictor nor as fibrils proteins by TAPASS, with an average probability score of 0.63, slightly higher than those of the other groups. These results indicate a certain level of consistency between the sequences identified by the FLFB model and PSPredictor for LLPS proteins, and a more limited one between the sequences identified by the FLFB model and TAPASS for fibrils proteins.
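The percentages quoted in this section follow directly from the counts in Table 4. The short script below, given only as a worked check, recomputes the share of each FLFB class and the agreement of each class with the corresponding external predictor. The dictionary layout and variable names are hypothetical; the counts are copied from Table 4.

```python
# Counts from Table 4: rows are PSPredictor/TAPASS outcomes,
# columns are the classes assigned by FLFB.
TABLE4 = {
    "neither": {"LLPS": 671, "Fibrils": 940, "Background": 10305},
    "only_PSPredictor": {"LLPS": 802, "Fibrils": 63, "Background": 1729},
    "only_TAPASS": {"LLPS": 266, "Fibrils": 72, "Background": 1774},
    "both": {"LLPS": 996, "Fibrils": 52, "Background": 1599},
}
CLASSES = ("LLPS", "Fibrils", "Background")


def column_total(cls):
    """Number of sequences that FLFB assigned to one class."""
    return sum(row[cls] for row in TABLE4.values())


def percent(part, whole):
    """Percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


total = sum(column_total(cls) for cls in CLASSES)

# Share of the analyzed proteome assigned to each FLFB class.
flfb_share = {cls: percent(column_total(cls), total) for cls in CLASSES}

# FLFB "LLPS" sequences that PSPredictor also flags (alone or with TAPASS).
llps_agree = percent(
    TABLE4["only_PSPredictor"]["LLPS"] + TABLE4["both"]["LLPS"],
    column_total("LLPS"),
)

# FLFB "Fibrils" sequences that TAPASS also flags (alone or with PSPredictor).
fibrils_agree = percent(
    TABLE4["only_TAPASS"]["Fibrils"] + TABLE4["both"]["Fibrils"],
    column_total("Fibrils"),
)

# FLFB "Background" sequences flagged by neither external predictor.
background_agree = percent(
    TABLE4["neither"]["Background"], column_total("Background")
)

print(total, flfb_share, llps_agree, fibrils_agree, background_agree)
```

The recomputed values agree with the percentages given in the text to within rounding, and the total matches the 19,269 sequences analyzed.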
A number of tools have been developed to predict proteins forming fibrils or undergoing LLPS. Among the predictors for fibrils proteins, such as AGGRESCAN (Conchillo-Sole et al., 2007), AgMata (Orlando et al., 2020), AmyloGram (Burdukiewicz et al., 2017), and ANuPP (Prabakaran, Rawat, Kumar, & Michael Gromiha, 2021), we chose TAPASS because it integrates different predictors and screens EARs within disordered regions. For LLPS proteins, apart from PSPredictor, available predictors include DeepPhase (Saar et al., 2021), FuzDrop (Hardenberg et al., 2020), LLPhysScore (Cai et al., 2022), dSCOPE (Yu et al., 2023), and PhaSePred (Chen et al., 2022). They leverage engineered features and/or natural language models to encode sequence information. A recent evaluation indicates that, besides PSPredictor, FuzDrop also performs well in identifying LLPS proteins for different types of negative test sets (Liao et al., 2023); in addition, it can label potential regions of aggregation (Vendruscolo & Fuxreiter, 2022). Therefore, in this study, we also compared the predictions of FLFB and FuzDrop. In addition, the recently developed predictor PhaSePred, which was trained on the newest dataset, was also chosen for comparison with FLFB.
The predicted data from FuzDrop and PhaSePred came from Hardenberg et al. (2020) and Chen et al. (2022) (http://predict.phasep.pro/download/), respectively. Due to UniProt ID inconsistencies (which may arise from different sequence versions), of the 19,269 sequences in the human proteome used in this study, 266 have no results from FuzDrop and 247 have no results from PhaSePred. The statistical results are summarized in Tables S5 and S6. A comparison of the predicted results in Tables 4, S5, and S6 shows that although there are obvious distinctions between different models, the relative ratios of sequence numbers between any two groups in each column do not deviate noticeably. On the whole, most of the “background” proteins classified by FLFB are also predicted neither to undergo LLPS nor to form fibrils, and most of the “LLPS” proteins classified by FLFB are also identified by the three LLPS protein predictors. However, the consistency between the “Fibrils” proteins classified by FLFB and those predicted by TAPASS is not as good as for the other two types. In addition, Table S5 shows that most proteins in the human proteome (89%) have aggregation-promoting regions longer than 6 aa as predicted by FuzDrop (Vendruscolo & Fuxreiter, 2022). The accuracy of these predictors requires experimental evaluation in the future.
2.5 Analysis of droplet-to-aggregate proteins based on FLFB
The FLFB model was developed to investigate the distinction between proteins undergoing LLPS and those forming fibrils, given the limited available data. However, it is worth noting that certain proteins exhibit experimentally observed liquid-to-solid phase transitions, that is, they belong to both LLPS and fibril-forming types. To explore how the FLFB model recognizes proteins undergoing liquid-to-solid phase transitions, we gathered sequences labeled as “droplet-to-fiber/solid/aggregates” from the LLPSDB v2.0 database, then applied CD-hit with a threshold of 0.4, and finally obtained a “droplet-to-aggregate” dataset including 30 sequences. The classification results obtained by applying the FLFB model to these sequences are presented in Table S7. Remarkably, most of the 30 proteins in this dataset (18/30) are identified as LLPS proteins, while only 4 are classified as fibril-forming proteins, and 8 are labeled as background proteins. These findings suggest that proteins undergoing liquid-to-solid transitions share sequence characteristics more similar to those in the “LLPS” dataset. By accumulating additional data on proteins prone to transitioning from liquid condensates to solid-like states in the future, it may be possible to discern their distinct sequence features compared with both “LLPS” and “Fibril” proteins. This knowledge will contribute to the development of a more robust and effective classification model.
3 CONCLUSION
In this study, we compared proteins undergoing LLPS with those forming fibrils at the sequence level by analyzing 36 selected feature parameters, 24 of which exhibit significant differences. Based on the 24 screened features, we developed an LLPS-Fibrils binary classification model to examine whether they could be used to distinguish the two groups of proteins, and further a three-class LLPS-Fibrils-Background classification model for practical prediction. Feature analysis reveals that FIDR makes the most substantial contribution to the two-class model, surpassing the other features. In contrast, cysteine and leucine compositions play the most prominent roles in the three-class LLPS-Fibrils-Background classification model. Through a systematic feature ablation process, we refined the model to six essential features and built the final three-class identification model FLFB, whose classification performance is comparable to that of models including more features (an average AUROC value of 0.83). FLFB represents a valuable tool for assigning proteins to their respective groups, and may enable efficient identification of their functional characteristics.
4 MATERIALS AND METHODS
4.1 Datasets construction
In this study, we utilized sequences from several databases to analyze the distinctions between LLPS proteins and fibrils proteins. Subsequently, classification models were constructed based on these analyses. We selected sequences with a length ranging from 20 to 3000 amino acids for calculating all feature parameters used in this work.
4.1.1 Dataset of proteins undergoing LLPS
Proteins labeled as forming liquid droplet condensates in “one-protein” systems (where only one type of protein is present in the solution) deposited in the LLPSDB v2.0 database (Wang et al., 2022) were used to build the dataset. The entries recorded in LLPSDB v2.0 have been validated through in vitro experiments, ensuring that the identified proteins form liquid condensates via self-assembly (sequences marked as “droplet to solid-like/fiber/gel-like/aggregate/aggregation” were excluded). Due to the absence of posttranslational modification (PTM) information in the sequences, proteins with PTMs were not included in the dataset. Ultimately, a total of 841 sequences were obtained. To reduce redundancy, we applied CD-hit (Fu et al., 2012) clustering with a threshold of 0.4, and selected representative sequences from each cluster. Consequently, our LLPS dataset consists of 182 sequences, denoted as the “LLPS” dataset.
4.1.2 Dataset of proteins forming amyloid fibrils
The dataset of proteins forming amyloid fibrils was compiled from two databases: AmyPro (Varadi et al., 2018) and CPAD2.0 (Rawat et al., 2020), which provide annotated experimental evidence. From the AmyPro database, we collected 137 protein sequences that contain amyloid-forming regions. Additionally, we obtained 326 peptides and protein regions from the CPAD2.0 database that have been experimentally observed to form aggregates, comprising 130 amyloid-forming peptides, 182 aggregation-prone regions in amyloidogenic proteins, and 124 sequences with experimental structures determined. To ensure data integrity, any sequences duplicating those in the “LLPS” dataset were excluded. Through CD-hit clustering with a threshold of 0.4, we obtained a final set of 136 sequences, which constitutes the “Fibrils” dataset.
4.1.3 Dataset of background proteins
For the construction of the three-class classification model, background proteins were selected from the human proteome stored in the UniProt database (https://www.uniprot.org/proteomes/UP000005640). We randomly chose 910 non-redundant sequences, which is five times the number of sequences in the “LLPS” dataset. These background sequences were carefully curated to exclude any proteins with abnormal amino acids. This collection of sequences is referred to as the “Background” dataset and serves as the background for training and testing the three-class classification model.
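The filtering and sampling steps described in this section can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names and the `seed` argument are hypothetical, and the redundancy-removal step (CD-hit clustering) is assumed to have been applied beforehand.

```python
import random

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def is_usable(seq, min_len=20, max_len=3000):
    """Keep sequences of 20-3000 residues built only from the 20 standard amino acids."""
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA

def sample_background(candidates, n_llps=182, ratio=5, seed=0):
    """Draw ratio * n_llps usable sequences at random, without replacement.

    With the defaults this yields 5 * 182 = 910 background sequences.
    """
    usable = [s for s in candidates if is_usable(s)]
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    return rng.sample(usable, ratio * n_llps)
```

`random.Random.sample` raises a `ValueError` if fewer usable sequences are available than requested, which makes an undersized candidate pool visible immediately.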
4.2 Selected features
- Composition of the 20 amino acids, as well as related features: FCR (fraction of charged residues), the proportion of charged residues (R, K, D, E, and H); NCPR, the net charge per residue; FAR (fraction of aromatic residues; F, Y, and W); and FHR (fraction of hydrophobic residues; A, V, I, L, M, F, Y, and W).
- Amino acid distribution pattern: Parameters SCD (proposed by Sawle & Ghosh, 2015) and κ (proposed by Das & Pappu, 2013) were utilized to characterize the distribution of charged residues in protein sequences. Additionally, we employed parameters ΩCation–π and Ωπ–π (Martin et al., 2020) to describe sequence patterns involved in cation–π and π–π interactions, respectively. These parameters were calculated similarly to κ, but with different residue groups (aromatic residues were considered as the origin of π interaction). For sequence patterns associated with hydrophobic interactions, we employed the parameter SHD (Zheng et al., 2020), which was calculated using a formula similar to that of SCD. Furthermore, the parameter ν (Zheng et al., 2020), which combines SCD and SHD, was also selected as a feature in this study. The formulas of the parameters described above are as follows:
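κ, ΩCation–π and Ωπ–π:
$$\sigma = \frac{(F_{\mathrm{group1}} - F_{\mathrm{group2}})^{2}}{F_{\mathrm{group1}} + F_{\mathrm{group2}}}$$
$$\delta = \frac{1}{N_{\mathrm{blob}}} \sum_{i=1}^{N_{\mathrm{blob}}} (\sigma_{i} - \sigma)^{2}$$
$$\kappa = \frac{\delta}{\delta_{\max}}$$
where Fgroup1 represents the fraction of positive residues (R and K), Fgroup2 represents the fraction of negative residues (D and E), σ is evaluated over the whole sequence and σi within the ith of the Nblob blobs, and the length of the blob is 5 or 6 residues. δmax is the maximum value of δ, which is obtained by rearranging the sequence. When calculating cation–π interactions, group2 pertains to aromatic residues (F, Y, and W), and when calculating π–π interactions, the two groups comprise aromatic residues and other residues.
SCD:
$$\mathrm{SCD} = \frac{1}{N} \sum_{m=2}^{N} \sum_{n=1}^{m-1} q_{m} q_{n} (m-n)^{1/2}$$
where N is the total length of the sequence, m and n represent the positions of residues within the sequence, and qm is the charge of the mth residue. In this study, we assigned a charge (q) of 1 for positively charged residues (R and K), −1 for negatively charged residues (D and E), and 0 for all other residues.
SHD:
$$\mathrm{SHD} = \frac{1}{N} \sum_{m=2}^{N} \sum_{n=1}^{m-1} (\lambda_{m} + \lambda_{n}) (m-n)^{\beta}$$
in which λ represents the hydropathy value of residues, and β is set to −1 to account for the contribution of sequence separation.
ν:
$$\nu = -0.0423 \times \mathrm{SHD} + 0.0074 \times \mathrm{SCD} + 0.701$$
All the aforementioned features were calculated using the localCIDER package (Holehouse et al., 2017), except for SCD, SHD, and ν, which were computed based on their respective formulas (Sawle & Ghosh, 2015; Zheng et al., 2020).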
- Protein secondary structure features: This group consists of propensities of α-helix, β-turn, and β-sheet, which were calculated using the “Protein Analysis” module from the Biopython package (Cock et al., 2009).
- Protein solubility (P_sol): This physicochemical feature, which is closely associated with both LLPS and fibril formation, was chosen as a parameter in this study. P_sol was calculated using the online server Protein-sol (https://protein-sol.manchester.ac.uk/), which is a linear model that combines 10 sequence-based features to predict protein solubility (Hebditch et al., 2017).
- Other features: This group includes the LC regions and the FIDR of each protein sequence. The LC regions were determined using the SEG algorithm (Wootton & Federhen, 1993), and the parameter LC represents the ratio of the total length of these regions to the overall sequence length. Many tools are available for evaluating the predisposition of proteins to intrinsic disorder, including well-known predictors such as PONDR (Xue et al., 2010), IUPred3 (Erdos et al., 2021), Espritz (Walsh et al., 2012), and D2P2 (Oates et al., 2013), and their outputs may differ considerably. In this work, we chose IUPred3 for calculating the FIDR because it processes large numbers of sequences efficiently.
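Several of the features listed above follow directly from the sequence. The sketch below illustrates the composition-derived fractions together with SCD and SHD, where each is a double sum over residue pairs normalised by sequence length. It rests on stated assumptions: the function names are hypothetical, the Kyte-Doolittle scale is only a stand-in for the hydropathy values λ (the scale actually used is not specified here), and NCPR is taken as the fraction of R and K minus the fraction of D and E.

```python
from collections import Counter

POSITIVE, NEGATIVE = set("RK"), set("DE")
CHARGED = set("RKDEH")        # residue set given in the text for FCR
AROMATIC = set("FYW")
HYDROPHOBIC = set("AVILMFYW")

# Kyte-Doolittle hydropathy values; an assumed stand-in scale for lambda.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def composition_features(seq):
    """Return FCR, NCPR, FAR and FHR as fractions of the sequence length."""
    counts = Counter(seq)

    def frac(group):
        return sum(counts[a] for a in group) / len(seq)

    return {"FCR": frac(CHARGED),
            "NCPR": frac(POSITIVE) - frac(NEGATIVE),
            "FAR": frac(AROMATIC),
            "FHR": frac(HYDROPHOBIC)}

def scd(seq):
    """Sequence charge decoration: sum of q_m * q_n * (m - n)^(1/2) over pairs, divided by N."""
    q = [1 if a in POSITIVE else -1 if a in NEGATIVE else 0 for a in seq]
    total = sum(q[m] * q[n] * (m - n) ** 0.5
                for m in range(1, len(seq)) for n in range(m))
    return total / len(seq)

def shd(seq, hydropathy=KD, beta=-1.0):
    """Sequence hydropathy decoration: sum of (l_m + l_n) * (m - n)^beta over pairs, divided by N."""
    lam = [hydropathy[a] for a in seq]
    total = sum((lam[m] + lam[n]) * (m - n) ** beta
                for m in range(1, len(seq)) for n in range(m))
    return total / len(seq)
```

The double loops are quadratic in sequence length, which remains manageable for the 20 to 3000 residue range used in this work.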
To assess the differences between the “LLPS” and “Fibrils” datasets for each feature, the Kolmogorov–Smirnov test (Smirnov, 1939) was employed. Features with a p-value <0.05 were selected for the subsequent construction of classification models.
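The screening step can be illustrated with the two-sample Kolmogorov–Smirnov test in SciPy. This is a sketch under assumptions: `select_features` and its arguments are hypothetical names, and each dataset is assumed to be stored as an array with one row per protein and one column per feature.

```python
import numpy as np
from scipy.stats import ks_2samp

def select_features(llps, fibrils, names, alpha=0.05):
    """Return the names of features whose distributions differ (KS p-value < alpha).

    llps, fibrils: arrays of shape (n_proteins, n_features) with matching columns.
    """
    kept = []
    for j, name in enumerate(names):
        result = ks_2samp(llps[:, j], fibrils[:, j])
        if result.pvalue < alpha:
            kept.append(name)
    return kept
```

Each feature is tested independently here, without any correction for multiple comparisons, which mirrors the per-feature threshold stated above.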
4.3 Classification model and evaluation
4.3.1 RF classification model
The RF model, a tree-based machine-learning algorithm known for its efficiency and excellent performance on tabular data (Breiman, 2001), was employed in this study. The model was constructed using the scikit-learn package in Python (Pedregosa et al., 2011), with the following parameters: n_estimators = 100, max_depth = 3, seed = 42 (the random_state argument in scikit-learn), and default values for other parameters.
4.3.2 Model training
For the LLPS-Fibrils binary classification model, a 5-fold cross-validation strategy was employed to train the RF model and to evaluate its performance in distinguishing between LLPS proteins and fibrils proteins.
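A minimal sketch of this evaluation with scikit-learn is given below, using the RF parameters stated in Section 4.3.1. The function name is hypothetical, and two choices are assumptions not stated in the text: the folds are stratified and shuffled, and the reported score is the AUROC.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cross_validate_rf(X, y, n_splits=5, seed=42):
    """Return mean and standard deviation of the AUROC over the folds.

    X: feature matrix (n_proteins, n_features); y: 0 for LLPS, 1 for Fibrils.
    """
    clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=seed)
    # Stratified folds keep the class ratio similar in every split (assumption).
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    return scores.mean(), scores.std()
```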
For the LLPS-Fibrils-Background classification model, it was crucial to balance the samples from the “LLPS” dataset, “Fibrils” dataset, and “Background” dataset to ensure reliable results. In each of the five training rounds, 182 sequences from the “Background” dataset (equivalent to the number of sequences in the “LLPS” dataset) were randomly sampled without replacement. Of these, 4/5 of the sequences were used for training and 1/5 for testing. The sampling strategy for training and testing from the “LLPS” and “Fibrils” datasets was the same as that used in the LLPS-Fibrils identification model. The average score and corresponding standard deviation of the evaluation parameters over the five rounds were recorded as the classification performance. For the final three-class classification model FLFB, all the sequences were used for training. Since the “Background” dataset was divided into five parts, five classifiers were obtained. The score of the final model was determined as the averaged result of the five classifiers.
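The final-model procedure described above can be sketched as a small ensemble: the Background set is split into five disjoint parts, one RF is trained per part together with all LLPS and Fibrils samples, and the outputs are averaged. The function names and label encoding are hypothetical, and averaging the predicted class probabilities is an assumption about what "averaged result" means.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_ensemble(X_llps, X_fib, X_bg, n_parts=5, seed=42):
    """Train one RF per disjoint Background part (labels: 0 LLPS, 1 Fibrils, 2 Background)."""
    rng = np.random.default_rng(seed)
    # Shuffle Background indices, then cut them into n_parts disjoint chunks.
    parts = np.array_split(rng.permutation(len(X_bg)), n_parts)
    models = []
    for idx in parts:
        X = np.vstack([X_llps, X_fib, X_bg[idx]])
        y = np.repeat([0, 1, 2], [len(X_llps), len(X_fib), len(idx)])
        clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=seed)
        models.append(clf.fit(X, y))
    return models

def predict_proba_ensemble(models, X):
    """Average the class probabilities of all classifiers (assumed aggregation rule)."""
    return np.mean([m.predict_proba(X) for m in models], axis=0)
```

With 910 Background sequences and five parts, each classifier sees 182 Background samples, which matches the size of the “LLPS” dataset.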
4.3.3 Model evaluation parameters
In the above equations, TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively.
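The equations for the evaluation parameters are not reproduced in this text. As an illustration only, the sketch below computes several metrics that are commonly defined from TP, TN, FP, and FN; which of them were actually used in this work is an assumption, and the function name is hypothetical. For a three-class model, such counts would typically be obtained per class in a one-vs-rest manner.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Common confusion-matrix metrics; undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": (tp + tn) / (tp + tn + fp + fn),
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "mcc": mcc}
```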
4.3.4 Feature analysis via SHAP and “Feature_importance”
In order to explain the classification models, we utilized the SHAP (SHapley Additive exPlanations) approach, which is a game-theoretic approach for explaining the output of machine-learning models (Lundberg & Lee, 2017). SHAP assigns an importance value to each feature for a particular prediction. The predicted score is obtained by combining the importance values linearly. Consequently, the importance values reflect the contribution of selected features to the overall prediction, and the average absolute SHAP value represents the importance of each feature for the model's decision-making process.
Additionally, we used the “Feature_importances” output of the scikit-learn package (Pedregosa et al., 2011), which is exposed as the feature_importances_ attribute of a fitted model. It accumulates the gain (impurity decrease) of every split in the decision trees that uses a given feature, thereby indicating the importance of that feature. It provides insights into how much each feature contributes to the overall performance of the model.
To explain the classification models in this study, we calculated both the SHAP values and “Feature_importances” for all the features included in the two-class and three-class classification models.
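The impurity-based ranking can be sketched as follows. The function name and feature names are hypothetical; SHAP values would additionally require the separate shap package and are not shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_features(X, y, names, seed=42):
    """Fit an RF and return (name, importance) pairs, most important first."""
    clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=seed)
    clf.fit(X, y)
    importances = clf.feature_importances_  # impurity-based, normalised to sum to 1
    order = np.argsort(importances)[::-1]
    return [(names[i], float(importances[i])) for i in order]
```

Impurity-based importances are computed on the training data only, so comparing them with a second measure such as SHAP, as done in this study, helps guard against artifacts of a single method.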
AUTHOR CONTRIBUTIONS
Zhuqing Zhang: Conceptualization; methodology; supervision; funding acquisition; writing—original draft; resources; project administration; writing—review and editing; formal analysis. Shaofeng Liao: Methodology; validation; visualization; investigation; software; formal analysis; writing—original draft. Yujun Zhang: Data curation; formal analysis; investigation; methodology; validation; visualization. Xinchen Han: Methodology; validation; investigation; formal analysis; software; writing—original draft. Tinglan Wang: Software; visualization; validation; writing—review and editing. Xi Wang: Data curation; formal analysis; conceptualization. Qinglin Yan: Visualization; validation. Qian Li: Formal analysis; data curation. Yifei Qi: Methodology; supervision; writing—original draft; writing—review and editing; project administration.
ACKNOWLEDGMENTS
We thank Prof. Minghua Deng at Peking University for helpful discussions, and Dr. Théo Falgarone for help in calculating fibrils proteins on the human proteome using TAPASS.
FUNDING INFORMATION
This work was supported by the National Natural Science Foundation of China [Grant Numbers: 32071250, 31870718, 22033001, and 21633001] and by the Fundamental Research Funds for the Central Universities.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflict of interest.
Open Research
DATA AVAILABILITY STATEMENT
The source codes of FLFB are available on GitHub (https://github.com/ucaszqzhang/FLFB), and the corresponding online webserver can be accessed at http://bio-comp.ucas.ac.cn/onlineserver/FLFB/.