Volume 32, Issue 9 e4756
TOOLS FOR PROTEIN SCIENCE
Open Access

ParSe 2.0: A web tool to identify drivers of protein phase separation at the proteome level

Colorado Wilson

Department of Chemistry and Biochemistry, Texas State University, San Marcos, Texas, USA

Karen A. Lewis

Department of Chemistry and Biochemistry, Texas State University, San Marcos, Texas, USA

Nicholas C. Fitzkee

Department of Chemistry, Mississippi State University, Mississippi State, Mississippi, USA

Loren E. Hough

Department of Physics, University of Colorado Boulder, Boulder, Colorado, USA

BioFrontiers Institute, University of Colorado Boulder, Boulder, Colorado, USA

Steven T. Whitten

Corresponding Author

Department of Chemistry and Biochemistry, Texas State University, San Marcos, Texas, USA

Correspondence

Steven T. Whitten, Department of Chemistry and Biochemistry, Texas State University, 601 University Drive, San Marcos, TX 78666, USA.

Email: [email protected]

First published: 13 August 2023
Review Editor: Nir Ben-Tal

Abstract

We have developed an algorithm, ParSe, which accurately identifies from the primary sequence those protein regions likely to exhibit physiological phase separation behavior. Originally, ParSe was designed to test the hypothesis that, for flexible proteins, phase separation potential is correlated with hydrodynamic size. While our results were consistent with that idea, we also found that many different descriptors could successfully differentiate between three classes of protein regions: folded, intrinsically disordered, and phase-separating intrinsically disordered. Consequently, numerous combinations of amino acid property scales can be used to make robust predictions of protein phase separation. Built from that finding, ParSe 2.0 uses an optimal set of property scales to predict domain-level organization and compute a sequence-based prediction of phase separation potential. The algorithm is fast enough to scan the whole of the human proteome in minutes on a single computer and is as accurate as, or more accurate than, other published predictors in identifying proteins and regions within proteins that drive phase separation. Here, we describe a web application for ParSe 2.0 that may be accessed through a browser by visiting https://stevewhitten.github.io/Parse_v2_FASTA to quickly identify phase-separating proteins within large sequence sets, or by visiting https://stevewhitten.github.io/Parse_v2_web to evaluate individual protein sequences.

1 INTRODUCTION

Protein-mediated macromolecular phase separation, through which membrane-free coacervates form spontaneously from the cellular milieu (Brady et al., 2017; Brangwynne et al., 2009; Lafontaine et al., 2021; Molliex et al., 2015), is increasingly recognized as an important organizing phenomenon in cells (Chong & Forman-Kay, 2016; Gomes & Shorter, 2019; Mitrea & Kriwacki, 2016). By forming specific compartments and micro-environments, protein-mediated macromolecular phase separation, or, more generally, protein phase separation, exerts control over the biochemical reactivity within cells (Jacobs et al., 2021; Li et al., 2018; Zhang et al., 2021). Biological coacervates can form in response to environmental stress (Dao et al., 2018; Rabouille & Alberti, 2017), at specific points in the cell cycle (Gibson et al., 2019; Yamazaki et al., 2022), or exist constitutively (Lafontaine et al., 2021), and have been found to facilitate key cellular processes, including transcription, translation, RNA processing, DNA damage repair, signaling, and metabolism (Chong & Forman-Kay, 2016; Kang et al., 2022; Liu et al., 2021; Lu et al., 2018; Oshidari et al., 2020; Prouteau & Loewith, 2018). Moreover, dysregulation of protein phase separation has been associated with several human diseases (Aguzzi & Altmeyer, 2016; Alberti & Dormann, 2019; Tsang et al., 2020), for example neurodegeneration (Molliex et al., 2015; Prasad et al., 2019) and cancer (Bouchard et al., 2018).

While phase separation can be driven by multivalent interactions between many types of protein domains, including ordered domains (Bouchard et al., 2018; Su et al., 2016), many proteins that drive phase separation have intrinsically disordered regions (IDRs) that are necessary and sufficient for phase separation to occur (Martin et al., 2020; Mitrea & Kriwacki, 2016; Murthy et al., 2019; Uversky et al., 2015; Vernon et al., 2018). Accurate identification of IDRs that drive phase separation is important for testing the underlying mechanisms of phase separation, identifying biological processes that rely on phase separation, and designing sequences that modulate phase separation. To this end, we created the ParSe algorithm (Partition Sequence; voiced as “parse”). ParSe identifies phase-separating (PS) IDRs starting from predictions of hydrodynamic size (Paiz et al., 2021). The correlation between PS IDR potential and hydrodynamic size assumes that the same forces that drive compaction in monomeric proteins also drive protein phase separation (Dignon et al., 2018; Lin et al., 2020; Lin & Chan, 2017; Zeng et al., 2020). Our results were consistent with that idea (Paiz et al., 2021). However, we also found robust property differences between folded, ID, and PS ID protein regions (Ibrahim et al., 2023). In ParSe 2.0, an optimal set of property scales allows facile predictions of domain-level structure and provides a simple, quantitative metric for the sequence-calculated phase separation potential. Notably, the ParSe-computed PS potential can be modified to account for interactions between amino acids and trained to reproduce effects of mutations on phase separation behavior (Ibrahim et al., 2023).

A benefit to using ParSe 2.0, compared to the many other available protein phase separation predictors (Chu et al., 2022; Hardenberg et al., 2020; Klus et al., 2014; Lancaster et al., 2014; Pancsa et al., 2021; Vernon et al., 2018), is that it can be broadly applied for analyses on very large scales, even to entire proteomes. The algorithm is computationally simple and fast enough to scan tens of thousands of sequences in minutes using a single computer. Moreover, its algorithmic simplicity does not diminish its accuracy; we have found ParSe 2.0 to be as, or more, accurate than other published predictors in identifying proteins and regions within proteins that drive phase separation (Ibrahim et al., 2023). We have created a web application that enables researchers to utilize ParSe 2.0 for proteome-scale searches for sequences that drive protein phase separation. Herein, we describe the ParSe 2.0 algorithm and show how this application can quickly search large sets of sequences for proteins and the regions within proteins that are predicted to drive phase separation. Furthermore, we show how the ParSe-computed PS potential can be used to predict mutant phase separation behavior, finding that it reasonably reproduced a newly published dataset (Rekhi et al., 2023) of mutation effects on the saturation concentration (csat) associated with protein phase separation.

2 RESULTS

The ParSe 2.0 algorithm performs three tasks to resolve the regions within a protein that are ID, and which subset of those are likely to drive phase separation in a biological context. The tasks are:
  1. Calculate local properties in the sequence using an optimal set of property scales.
  2. Determine where local properties match the folded, ID, or PS ID classes.
  3. Identify regions of uniform class to predict domain-level organization.

2.1 Calculate local sequence properties

The algorithm predicts the modular organization within a protein from its regional variations in intrinsic sequence properties. ParSe 2.0 continuously determines the average properties within a 25-residue segment, or window, that advances through the whole sequence, as shown in Figure 1a. This approach avoids averaging properties between distant regions that may have different characteristics.
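The sliding-window averaging described above can be sketched in a few lines of Python. This is a minimal illustration and not the ParSe 2.0 source code: the function name `window_means` and the two-letter `TOY_SCALE` are assumptions made for the example, and the published property scales are not reproduced here.

```python
from typing import Dict, List

WINDOW = 25  # window length stated in the text

# Hypothetical two-residue scale used only to exercise the function;
# it is NOT one of the property scales used by ParSe 2.0.
TOY_SCALE: Dict[str, float] = {"A": 1.0, "G": 0.0}


def window_means(seq: str, scale: Dict[str, float], window: int = WINDOW) -> List[float]:
    """Return the mean scale value of every contiguous window in seq.

    Element i is the mean over residues i .. i + window - 1 (0-based).
    A running sum keeps the cost linear in the sequence length.
    """
    values = [scale[res] for res in seq]
    if len(values) < window:
        return []  # sequence too short to hold a single window
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one residue
        means.append(total / window)
    return means


if __name__ == "__main__":
    profile = window_means("A" * 25 + "G" * 25, TOY_SCALE)
    print(len(profile), profile[0], profile[-1])
```

A sequence of length N yields N - 24 windows, so each property becomes a profile along the sequence rather than a single whole-protein average.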

FIGURE 1. The ParSe 2.0 algorithm. (a) A sliding window approach is used to identify regions within a protein that match the folded, ID, and PS ID classes. Hydrophobicity (ϕ), α-helix propensity (α), and vmodel are calculated for each contiguous stretch of 25 residues, or “window,” in the primary sequence. (b) Each window is assigned a label, F, P, or D, depending on the values of ϕ, α, and vmodel. Small circles show the values calculated for the 25-residue windows in the human SSBP4 sequence; ϕ in the top plot, α and vmodel in the bottom plot, and compared to the distributions calculated in the folded (black), ID (red), and PS ID training sets (blue). (c) Window labels are assigned to the central residue of the window. Terminal residues are assigned labels according to the first and last windows. (d) Contiguous regions of at least 20 residues that are 90% of only one label P, D, or F are colored blue, red, or black, respectively, to represent predicted PS, ID, or folded regions. (e) Classifier distance of each window, assigned to the central residue of the window and colored according to its label P (blue), D (red), or F (black).

Several properties are calculated from the sequence in each 25-residue segment: (1) the average hydrophobicity, ϕ, using an amino acid scale from Bastolla et al. (2005), which was derived from contact matrices of globular protein structures, (2) the average intrinsic propensity for α-helix, α, using an amino acid scale from Tanaka and Scheraga (1977) calculated from x-ray data on native proteins, and (3) vmodel, which is a sequence-based model of the polymer scaling exponent (Paiz et al., 2021) that is based on hydrodynamic size and originally was developed from polymer theories to extract information on the balance of self and solvent interactions in long homopolymers (Flory, 1949). Experimental v has been used widely as a measure of chain compaction in biological proteins (Borgia et al., 2016; Hofmann et al., 2012; Kohn et al., 2004; Marsh & Forman-Kay, 2010; Müller-Späth et al., 2010; Riback et al., 2017; Wilkins et al., 1999; Wuttke et al., 2014). The results from this algorithm are relatively independent of the size of the segment or window used (Paiz et al., 2021).

2.2 Match local sequence properties to protein class

Previously, we established that the three classes of protein regions (folded, ID, and PS ID) exhibit robust property differences (Ibrahim et al., 2023). This was shown using curated datasets of folded, ID, and PS ID sequences to examine how broadly existing amino acid property scales can be used to distinguish between the three classes of protein regions. We found that ~95% of the 566 scales of amino acid properties in the Amino Acid Index Database (Kawashima & Kanehisa, 2000), a curated set of numerical indices representing various physicochemical and biochemical properties of the amino acids, produced statistically significant differences between the means of the folded and ID sets. The largest statistical separation in the means, determined by t-test (Welch, 1947), was obtained when using the Bastolla scale for hydrophobicity, ϕ. Based on that finding, ParSe 2.0 uses ϕ to identify those 25-residue windows in a sequence that are likely to map to a folded protein region (Figure 1b).
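For readers unfamiliar with the unequal-variance t-test, the statistic used to rank scales by separation of set means can be written out directly. The sketch below uses only the Python standard library; the function name `welch_t` and the input lists are illustrative assumptions, and the numbers in the usage lines are toy values, not data from the training sets.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    a and b hold one value per sequence (for example, the mean of a
    property scale over each sequence in two different sets).
    """
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    # Toy values only; not taken from the folded, ID, or PS ID datasets.
    t, df = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
    print(round(t, 3), round(df, 3))
```

A larger absolute value of t indicates a wider separation between the two set means relative to their spread, which is the sense in which one scale can be said to separate two classes better than another.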

Similarly, the largest statistical separation in the means when comparing the ID and PS ID sequence sets was obtained when using the Tanaka and Scheraga propensity scale, α (Figure 1b); however, there was considerable overlap in the performance of different predictors (Ibrahim et al., 2023). Principal component analysis (PCA) of the variance in the combined sequence sets demonstrated that the variance arising from vmodel was orthogonal to the variance from α, and thus α and vmodel could be combined without significant redundancy when comparing protein sequences (Ibrahim et al., 2023). PCA also was used to reduce the dimensionality in the dataset (Pearson, 1901), finding that most of the variability within the sequence sets measured by high-performing scales can be captured by 2–3 parameters (Ibrahim et al., 2023). Based on that finding, ParSe 2.0 first uses ϕ to identify ID regions (PS ID or ID) as compared to folded regions. Then, it uses α and vmodel to identify the 25-residue windows that are likely to show phase separation behavior.

Window labels are used by ParSe 2.0 to record the results of this decision tree. Windows that match the folded class in ϕ are labeled F; all others are labeled P or D. D is assigned to windows with high α and high vmodel (matching the ID class), while P is assigned to windows with low α and low vmodel (matching the PS ID class). The P/D boundary was determined by the line that bisects the overlapping distributions of α and vmodel in the ID and PS ID training sets. Next, the window label is assigned to the central residue in the window (Figure 1c). N- and C-terminal residues not belonging to a central window position are assigned the label of the central residue in the first and last window, respectively, of the whole sequence.
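The decision tree and the mapping from window labels to residue labels can be sketched as follows. The cutoff `PHI_CUT` and the line coefficients `A`, `B`, `C` are placeholder assumptions chosen only so the example runs; the published boundary values are not reproduced here, and the function names are hypothetical.

```python
from typing import List

# Placeholder cutoffs for illustration only (NOT the published ParSe 2.0 values).
PHI_CUT = 0.0
# Assumed P/D boundary written as A*alpha + B*v + C = 0;
# the positive side is taken as the D sector (high alpha, high vmodel).
A, B, C = 1.0, 1.0, -2.0


def label_window(phi: float, alpha: float, v: float) -> str:
    """Label one window F, D, or P following the two-step decision tree."""
    if phi >= PHI_CUT:
        return "F"  # matches the folded class in hydrophobicity
    return "D" if A * alpha + B * v + C > 0 else "P"


def residue_labels(window_labels: List[str], seq_len: int, window: int = 25) -> str:
    """Assign each window label to the window's central residue.

    Terminal residues that are never a window center inherit the label
    of the first (N-terminal) or last (C-terminal) window.
    """
    if len(window_labels) != seq_len - window + 1 or not window_labels:
        raise ValueError("expected one label per window of the sequence")
    half = window // 2  # 12 residues on each side of the center for window = 25
    return window_labels[0] * half + "".join(window_labels) + window_labels[-1] * half


if __name__ == "__main__":
    print(label_window(-0.5, 0.5, 0.5))
    print(residue_labels(list("FFPPD"), 29))
```

The result is one label per residue, which is the input needed for the region-finding step.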

2.3 Identify regions of uniform protein class

Protein regions predicted by ParSe 2.0 to be folded, ID, or PS ID are determined by finding contiguous residue positions of length ≥ 20 that are ≥90% of only one label F, D, or P, respectively. When an overlap occurs between adjacent predicted regions, owing to the up to 10% label mixing allowed, this overlap is split evenly between the two adjacent regions. Figure 1d shows the application of this ruleset to human single-stranded DNA-binding protein 4 (SSBP4), which is not reported to phase separate in the current literature. For SSBP4, protein regions predicted by ParSe 2.0 to be folded, ID, or PS ID have been colored black, red, or blue, respectively; white corresponds to regions with a mixture of F, D, or P labels.
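One possible reading of this ruleset is a greedy left-to-right scan that, for a given label, takes the longest span that starts and ends on that label and meets the length and purity thresholds. The text does not specify the search order, so the scan strategy below is an assumption, as are the function names `find_regions` and `split_overlap`; only the thresholds (20 residues, 90%) and the even split of overlaps come from the description above.

```python
from typing import List, Tuple

MIN_LEN = 20      # minimum region length (from the text)
NUM, DEN = 9, 10  # at least 90% of one label (from the text), kept as integers

Span = Tuple[int, int]  # half-open (start, end), 0-based


def find_regions(labels: str, target: str) -> List[Span]:
    """Greedy scan for spans that are >= 90% `target` and >= MIN_LEN long."""
    n = len(labels)
    prefix = [0]  # prefix[i] = number of target labels in labels[:i]
    for ch in labels:
        prefix.append(prefix[-1] + (ch == target))
    regions: List[Span] = []
    start = 0
    while start + MIN_LEN <= n:
        end = None
        if labels[start] == target:
            # Try the longest candidate first and shrink until the rule holds.
            for stop in range(n, start + MIN_LEN - 1, -1):
                hits = prefix[stop] - prefix[start]
                if labels[stop - 1] == target and DEN * hits >= NUM * (stop - start):
                    end = stop
                    break
        if end is None:
            start += 1
        else:
            regions.append((start, end))
            start = end
    return regions


def split_overlap(left: Span, right: Span) -> Tuple[Span, Span]:
    """Split an overlap between two adjacent spans evenly between them."""
    if right[0] >= left[1]:
        return left, right  # no overlap
    mid = (right[0] + left[1]) // 2
    return (left[0], mid), (mid, right[1])


if __name__ == "__main__":
    labels = "P" * 30 + "D" * 30
    print(find_regions(labels, "P"), find_regions(labels, "D"))
```

Running `find_regions` once per label (F, D, P) and then applying `split_overlap` to neighboring spans of different labels gives a domain-level map; residues left uncovered correspond to the mixed regions.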

The classifier distance was developed to assess confidence in the P, D, and F label assignments (Ibrahim et al., 2023). Here, the algorithm calculates the linear distance of a window into its classifier sector, relative to the cutoff boundary and normalized by the distance separating the boundary and the training set mean. For example, the classifier distance for a P-labeled window would be the shortest distance of the window to the P/D boundary divided by the shortest distance of the P/D boundary to the mean of the PS ID training set (see Figure 1b). Thus, for a P-labeled window, values greater than 1 in the classifier distance indicate a window located at a distance further from the P/D boundary than that of the PS ID set mean, whereas values less than 1 indicate a window closer to the cutoff boundary than the PS ID set mean and, as such, possibly with some uncertainty for its classifier label (i.e., a classifier distance that resides within the overlap in the ID and PS ID training set distributions). For D- and F-labeled windows, identically structured calculations are performed using cutoff boundaries and training set means, for D-labeled windows in the D sector of the α versus vmodel plot and for F-labeled windows in the high ϕ region. Position-specific classifier distances calculated from the SSBP4 sequence are shown in Figure 1e.
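The normalization described for P- and D-labeled windows amounts to a ratio of two point-to-line distances in the α versus vmodel plane. The sketch below assumes the boundary is supplied as line coefficients (a, b, c) for a·x + b·y + c = 0; that representation, the function names, and the numbers in the usage lines are illustrative assumptions rather than values from ParSe 2.0.

```python
import math
from typing import Tuple

Point = Tuple[float, float]        # (alpha, vmodel) of a window or a set mean
Line = Tuple[float, float, float]  # coefficients (a, b, c) of a*x + b*y + c = 0


def distance_to_line(point: Point, line: Line) -> float:
    """Shortest (perpendicular) distance from a point to a line."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)


def classifier_distance(window_point: Point, set_mean: Point, line: Line) -> float:
    """Window-to-boundary distance normalized by mean-to-boundary distance.

    Values above 1 place the window farther from the boundary than the
    training set mean; values below 1 place it closer to the boundary.
    """
    return distance_to_line(window_point, line) / distance_to_line(set_mean, line)


if __name__ == "__main__":
    boundary = (1.0, 1.0, -1.0)  # toy boundary, not the published P/D line
    toy_mean = (0.0, 0.0)        # toy set mean, not a training set value
    print(classifier_distance((-1.0, 0.0), toy_mean, boundary))
    print(classifier_distance((0.5, 0.0), toy_mean, boundary))
```

For F-labeled windows the same idea reduces to one dimension: the distance of ϕ from its cutoff divided by the distance of the folded set mean from that cutoff.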

2.4 Proteome-scale searches using ParSe 2.0

One advantage of ParSe 2.0 is that it is very fast and can be applied to very large datasets. To facilitate searches on a proteomic scale, we have developed the ParSe 2.0 algorithm into a web application that takes a user-supplied input in FASTA format. This application may be accessed through a browser by visiting https://stevewhitten.github.io/Parse_v2_FASTA. The required computation time increases linearly with the number of sequences in the input FASTA file. All calculations are performed on the user's local system through the JavaScript interface. We have found that the application can process, on average, ~14,000 primary sequences per minute using a standard desktop computer. Figure 2 shows the computational expense in minutes for FASTA files containing various-sized sequence sets. The largest proteome used for this figure is the human proteome with splice variants obtained from UniProt (UniProt Consortium, 2021) and representing 75,776 primary sequences. The second largest proteome in the figure, also obtained from UniProt, is the human proteome represented by one sequence per gene and containing 20,594 primary sequences. The computation rate for the ParSe 2.0 web application is compared in the figure to the rates we measured for other available tools used to predict phase separation behavior and that can process multiple input sequences (Chu et al., 2022; Klus et al., 2014; Lancaster et al., 2014; Vernon et al., 2018). ParSe 2.0 is compared in more detail to other predictors below.
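Because the run time scales linearly with the number of sequences, the expected wait for a given FASTA file can be estimated from the sequence count alone. The sketch below is a stand-alone Python illustration (the web application itself runs in JavaScript); `read_fasta` and `estimated_minutes` are hypothetical helper names, and the rate is the approximate figure quoted above, which will vary with hardware.

```python
from typing import List, Tuple

RATE_PER_MIN = 14_000  # approximate rate quoted in the text; hardware dependent


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (description line, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def estimated_minutes(n_sequences: int, rate: float = RATE_PER_MIN) -> float:
    """Linear estimate of run time in minutes for n_sequences."""
    return n_sequences / rate


if __name__ == "__main__":
    # Sequence counts for the two human proteome files described in the text.
    for n in (75_776, 20_594):
        print(n, round(estimated_minutes(n), 2))
```

At the quoted rate, the 75,776-sequence proteome corresponds to roughly 5.4 minutes and the 20,594-sequence proteome to roughly 1.5 minutes.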

FIGURE 2. The ParSe 2.0 web application evaluates sequences at a rate of ~14,000 primary sequences per minute. Proteome size is the number of primary sequences. The arrow highlights the linear regression slope of the ParSe trend. PSPredictor, PScore, and catGranule are shown in blue, green, and orange, respectively, and overlay in the plot owing to the scale used. The average computation rates for PSPredictor, PScore, catGranule, and PLAAC were 99, 136, 34, and 31,600 primary sequences per minute, respectively. Computation times were determined by using the appropriate web application or Python script (see Section 4) on a 2015 model iMac.

Upon completion, the application outputs datasets that allow the user to quickly identify those proteins within the input file that have regions predicted to drive phase separation behavior. We demonstrate how to use this application and read its output by example below. We also made a second application (accessed at https://stevewhitten.github.io/Parse_v2_web) that can be used to evaluate individual protein sequences, which produces output in the format shown in Figures 1c–e when provided a single, primary sequence as input.

2.5 Characteristics of biological proteins that drive physiological phase separation

ParSe 2.0 was designed to identify PS IDRs from the primary sequence (Ibrahim et al., 2023). First, to demonstrate the output expected from proteins that phase separate, we tested a FASTA file of 43 proteins confirmed to exhibit homotypic phase separation behavior that was curated by Vernon et al. (2018). The UniProt Knowledgebase accession ID (UniProtKB ID), gene name, and primary sequence for the proteins in this set are given in Table S1.

The output includes two sets of plots and four datasets that can be downloaded. The plots are intended to quickly show if the uploaded dataset is enriched or depleted in phase separation potential relative to a reference; here, the reference is the human proteome containing splice variants. Figure 3 reproduces these plots for the Vernon set and shows that this dataset is highly enriched in proteins with long (N ≥ 50) predicted PS IDRs relative to the human proteome (Figure 3a). In addition to length of the predicted PS IDR, the summed classifier distance for every window labeled P has been used as a proxy to estimate the PS potential in a sequence (Ibrahim et al., 2023). We find that the Vernon set is heavily enriched in proteins with computed PS potentials ≥100, relative to the human proteome (Figure 3b). Furthermore, these data are used to create recall plots from which the area under the curve (AUC) is calculated. AUC values >0.5 in either metric indicate a set of sequences enriched in phase separation potential relative to the reference human proteome.
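The two quantities behind these plots can be sketched as follows. The PS potential is taken directly from the description above (the sum of classifier distances over P-labeled windows). The recall curve is implemented under an assumption about its construction: for each threshold, the fraction of the test set at or above the threshold is plotted against the same fraction for the reference set, and the area is obtained by the trapezoid rule. Function names and the toy values are hypothetical.

```python
from typing import List, Sequence


def ps_potential(labels: str, distances: Sequence[float]) -> float:
    """Sum of classifier distances over positions labeled P."""
    return sum(d for lab, d in zip(labels, distances) if lab == "P")


def fraction_at_least(values: Sequence[float], threshold: float) -> float:
    """Fraction of a set with a value equal to or greater than threshold."""
    return sum(v >= threshold for v in values) / len(values)


def recall_auc(test_values: Sequence[float], reference_values: Sequence[float]) -> float:
    """Area under the test-versus-reference curve (assumed construction).

    The curve runs from (0, 0) at a threshold above every value to (1, 1)
    at the smallest value; a set compared with itself gives 0.5.
    """
    thresholds = sorted(set(test_values) | set(reference_values), reverse=True)
    points: List[tuple] = [(0.0, 0.0)]
    for t in thresholds:
        points.append((fraction_at_least(reference_values, t),
                       fraction_at_least(test_values, t)))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid rule
    return area


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0, 4.0]  # toy values, not proteome data
    print(recall_auc(reference, reference))
    print(recall_auc([10.0, 11.0], [1.0, 2.0]))
```

Under this construction, a set enriched in high values relative to the reference bows the curve above the diagonal, which is why AUC values above 0.5 indicate enrichment.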

FIGURE 3. Homotypic PS proteins have long predicted PS IDRs and relatively high computed PS potentials. (a) ParSe 2.0 finds regions within proteins that are ≥90% labeled P, which are predicted to be PS IDRs. Left plot: the y-axis is the percent of proteins in a set with PS IDRs at least as long as the length indicated by the x-axis; the human proteome (black) is compared to the Vernon set (blue). (b) The summed classifier distance of windows labeled P is used to estimate the PS potential in a sequence. Left plot: the y-axis is the percent of proteins in a set with a PS potential equal to or greater than the value indicated by the x-axis. Right plots: in both (a) and (b), the recall plot compares the percent of set for the human proteome against itself (black) and against the Vernon set (blue). AUC is the area under the curve for the Vernon set.

The results can be analyzed using any of four tables, linked after the plots described above. The first is a summary table of sequence-calculated values that can be sorted within the application (or downloaded and sorted separately). Sorting this table by the third column ranks the input file sequences by their computed PS potential (i.e., the classifier distance sum of windows labeled P); sorting by the fourth column ranks them by the length of the longest predicted PS IDR (Figure 4). Thus, proteins within the input file predicted to have PS IDRs can be quickly identified by simply sorting the third or fourth columns of this table. This table for the Vernon set is reproduced in Table S2. If the description line for each sequence is formatted according to UniProt, where the line preceding each primary sequence lists the UniProtKB ID followed by the gene name and protein name, these identifying labels will be listed in the last three columns of the summary table.
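For a downloaded copy of the summary table, the same ranking and label extraction can be done offline. The sketch below assumes the standard UniProt FASTA description line (database, accession, entry name, protein name, then `OS=`, `OX=`, `GN=` fields); the dictionary keys used for the table rows and the example header are hypothetical and do not reflect the application's actual column names or a real UniProt entry.

```python
import re
from typing import Dict, List, Optional

# Accession, entry name, and protein name from a UniProt-style description line.
HEADER = re.compile(r"^>?(?:sp|tr)\|([^|]+)\|(\S+)\s+(.*?)(?:\s+OS=|$)")
GENE = re.compile(r"\bGN=(\S+)")


def parse_uniprot_header(header: str) -> Dict[str, Optional[str]]:
    """Extract accession, protein name, and gene name; None when absent."""
    m = HEADER.match(header)
    g = GENE.search(header)
    return {
        "accession": m.group(1) if m else None,
        "protein": m.group(3) if m else None,
        "gene": g.group(1) if g else None,
    }


def rank_rows(rows: List[Dict], key: str) -> List[Dict]:
    """Sort table rows from highest to lowest value of the chosen column."""
    return sorted(rows, key=lambda row: row[key], reverse=True)


if __name__ == "__main__":
    # Made-up header for illustration; not a real UniProt record.
    line = ">sp|X00000|EXMP_HUMAN Example protein OS=Homo sapiens OX=9606 GN=EXMP PE=1 SV=1"
    print(parse_uniprot_header(line))
    rows = [
        {"id": "a", "ps_potential": 5.0, "longest_ps_idr": 10},
        {"id": "b", "ps_potential": 120.0, "longest_ps_idr": 80},
    ]
    print([row["id"] for row in rank_rows(rows, "ps_potential")])
```

Sorting by `ps_potential` or `longest_ps_idr` mirrors sorting by the third or fourth column of the table.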

FIGURE 4. ParSe 2.0 output Summary Table. A key output of the web application is a summary table that lists the longest PS ID, ID, and folded regions predicted to reside within each primary sequence of the user-supplied input file. Sorting this table by the third or fourth columns can be used to quickly identify proteins predicted by the algorithm to exhibit phase separation behavior.

Located below the summary table in the application are tables containing the data used to make the plots described above, allowing for their reproduction outside of the application. Finally, the application also outputs a FASTA file containing the predicted PS IDRs with length ≥ 50 that were found within the original input file.
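The final output step, collecting predicted PS IDRs of at least 50 residues into a FASTA file, can be sketched as below. The 50-residue threshold comes from the text; the function name `ps_idr_fasta` and the record naming scheme (`name_start-end`, 1-based inclusive) are assumptions for illustration, not the application's actual header format.

```python
from typing import List, Tuple

MIN_EXPORT_LEN = 50  # minimum PS IDR length written to the output (from the text)


def ps_idr_fasta(name: str, seq: str, regions: List[Tuple[int, int]]) -> str:
    """Format predicted PS IDRs of sufficient length as FASTA text.

    regions are half-open, 0-based (start, end) spans within seq.
    Headers use 1-based inclusive coordinates (an assumed convention).
    """
    lines: List[str] = []
    for start, end in regions:
        if end - start >= MIN_EXPORT_LEN:
            lines.append(f">{name}_{start + 1}-{end}")
            lines.append(seq[start:end])
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    # Toy sequence and spans; the second span is too short to be exported.
    print(ps_idr_fasta("x", "S" * 100, [(0, 60), (70, 100)]), end="")
```

The resulting file can be passed to other predictors or alignment tools without re-running the proteome scan.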

2.6 Comparing ParSe 2.0 to other sequence-based predictors

Previously, we found that ParSe 2.0 is at least as accurate as other published phase separation predictors in identifying proteins and regions within proteins that drive phase separation, when evaluated using publicly available datasets (Ibrahim et al., 2023). To demonstrate such predictor comparisons here, we used ParSe 2.0 to generate three separate sequence sets derived from the human proteome. Each set contains protein regions of at least 50 residues identified by our algorithm; the sets differ in the distribution of anticipated protein classes. Predicted PS IDRs comprise the first set, predicted IDRs (non-PS) the second set, and predicted folded regions the third.

Of the protein regions that ParSe 2.0 predicts to be PS IDRs or IDRs, Figure 5a shows that metapredict (Emenecker et al., 2021) and flDPnn (Hu et al., 2021) also predict those regions as ID for >95% of the sequences found in either set. Metapredict was trained on consensus disorder data from eight different disorder predictors, whereas flDPnn was one of the top five predictors of disorder, which were not statistically different from each other in performance, in the recently completed Critical Assessment of protein Intrinsic Disorder (CAID) prediction experiment (Necci et al., 2021). These data suggest that IDR predictions by ParSe 2.0 are likely to be regions predicted as ID by metapredict and flDPnn. For the set not expected to be ID, only a few (<15%) of the ParSe-predicted folded regions were classified as ID by either of these two ID predictors. Thus, ParSe 2.0, despite predicting IDRs from a single metric, ϕ, shows good overall agreement with these two ID predictors when applied to the human proteome.

FIGURE 5. Comparing sequence-based predictors. ParSe 2.0 was used to search the human proteome for protein regions predicted to be PS ID, ID, or folded. Percent of sequences in each set that were (a) predicted to be ID by metapredict (blue) and flDPnn (green), or (b) predicted to potentially phase separate by PSPredictor (blue), FuzDrop (green), PScore (orange), catGranule (red), and PLAAC (black). Percent of set values for PSPredictor, FuzDrop, catGranule, and flDPnn represent the average from random samples totaling 500 primary sequences, owing to their relatively low input limits (100, 1, 100, and 20 primary sequences, respectively). (c) Sequences in the PS IDR set were separated by protein name, creating subsets based on whether the name included “mucin,” “collagen,” “RNA-binding,” “ribonucleo,” “transcription factor” (TF), “homeobox,” “zinc finger” (ZF), or “kinase.” RNA-binding and ribonucleo subsets were combined; the TF, homeobox, zinc finger, and kinase subsets also were combined; “others” refers to PS IDR sequences belonging to all other proteins. The percent of each subset predicted to potentially phase separate is shown using the same coloring scheme and predictors as used in panel b. (d) Overall average percent composition for sequences in the human PS IDR set compared to the subsets for serine and threonine (blue), proline and glycine (green), and arginine and tyrosine (orange).

Figure 5b shows that the predictors PSPredictor (Chu et al., 2022), FuzDrop (Hardenberg et al., 2020), PScore (Vernon et al., 2018), catGranule (Klus et al., 2014), and PLAAC (Lancaster et al., 2014) predict phase separation behavior primarily, though not exclusively, in sequences found in the PS IDR set. PSPredictor was developed from machine learning tools that were trained using sequence data of proteins known to phase separate. FuzDrop was developed using sequence-based estimates of the probability for both disorder and disordered binding to find droplet-promoting protein regions. PScore was developed based on a specific molecular mechanism thought to drive phase separation: the propensity of π–π interactions to form cohesive protein interactions. Originally, PLAAC was developed to identify prions, and catGranule targets ID and RNA-binding ability; however, these two algorithms have been widely used as proxies for potential phase separation behavior (Chiu et al., 2022; Pancsa et al., 2021; Shen et al., 2021). Interestingly, PSPredictor, FuzDrop, and catGranule each find substantial proportions (30%–80%) of the predicted non-PS IDR set as possibly showing phase separation behavior.

A key difference among this set of predictors is that ParSe 2.0, PSPredictor, and FuzDrop find PS IDRs within mucins that are mostly missed by PScore, catGranule, and PLAAC (Figure 5c). Mucins are heavily glycosylated proteins that are known to form gel-like assemblies (Demouveaux et al., 2018). On average, a ParSe-predicted mucin PS IDR is enriched in serine and threonine content and depleted in proline, glycine, arginine, and tyrosine, when its composition is compared to the typical PS IDR predicted within a human protein (Figure 5d). Whether or not ParSe-predicted mucin PS IDRs indeed drive physiological phase separation has not been tested. Interestingly, ParSe-predicted PS IDRs within human RNA-binding and ribonucleoproteins are enriched in arginine and tyrosine content, when compared to the human PS IDR composition average, whereas transcription factors, zinc finger proteins, kinases, and proteins with “homeobox” in their name are found to generally match the average predicted PS IDR composition in humans. Others have found that phase-separating RNA-binding proteins often have tyrosine-rich, low sequence complexity, prion-like domains and arginine-rich RNA-binding domains (Wang et al., 2018). ParSe-predicted PS IDRs within collagen proteins are highly enriched in proline and glycine, which is consistent with the atypical amino acid composition of this protein type.

2.7 Modeling mutation effects on phase separation potential

Extensive mutagenesis studies involving several proteins have been used to understand the sequence features that drive phase separation (Brady et al., 2017; Bremer et al., 2022; Martin et al., 2020; Schuster et al., 2020; Vernon et al., 2018; Wang et al., 2018). The results of those studies implicate specific interactions between amino acids in the formation of phase-separated droplets, for example, cation–anion, cation–π, and π–π. Overall, the main result of many studies is that multiple, redundant molecular mechanisms contribute to the formation of phase-separated droplets from IDRs (Cai et al., 2022; Ibrahim et al., 2023; Rekhi et al., 2023).

Because the PS potential computed by ParSe 2.0 does not include the effects of pairwise interactions involving combinations of amino acid types, the calculation was expanded to contain both the classifier distance sum of P-labeled positions and terms quantifying the effects of interactions between amino acids, termed Uπ for π–π and cation–π interactions and Uq for charge-based effects (Ibrahim et al., 2023). We trained Uπ and Uq against existing data on mutant sequences from Ddx4, LAF-1, and A1 (Brady et al., 2017; Bremer et al., 2022; Martin et al., 2020; Schuster et al., 2020). However, the different studies used different metrics to quantify phase separation potential. We used the saturation concentration (csat) at 4°C and thermodynamic properties associated with phase separation behavior (standard molar ∆h°, ∆s°, and ∆g°) to separately train the calculation. We found that the summed P classifier distance was only moderately able to predict the effects of mutations designed to perturb phase separation behavior. In contrast, the expanded PS potential including Uπ and Uq obtained reasonable predictive power, highlighting the importance of pairwise interactions in modulating phase separation behavior (Ibrahim et al., 2023).

The ParSe 2.0 application targeted at individual protein sequences (accessed at https://stevewhitten.github.io/Parse_v2_web) outputs the computed PS potential both with and without the Uπ and Uq extensions. We used this modified algorithm to assess a newly published mutant dataset that was not included in the training of Uπ and Uq. Mittal and coworkers measured the effects on csat at 37°C from mutation in an artificial IDP consisting of 25 repeats of GRGDSPYS (Rekhi et al., 2023). Figure 6 shows that including Uπ and Uq in the calculation increased the correlation between experimental csat and predicted PS potential (from r = 0.24 to r = 0.59). If we used Uπ and Uq trained previously by ∆h°, rather than trained by csat, the predictive power for capturing sequence effects on csat in the new mutant dataset was not as good (r = 0.46, Figure S1). This result is consistent with the observation that mutant rank order in csat (at a given temperature) does not necessarily agree with mutant rank order in the measured thermodynamic properties associated with protein phase separation (Bremer et al., 2022).
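The r values quoted here are correlation coefficients between experimental and predicted quantities. Assuming the Pearson coefficient is meant (the text does not state which coefficient was used), it can be computed for any paired dataset with the short function below. The function name and the toy numbers are illustrative and are not values from the mutant dataset.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two paired samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # co-deviation sum
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy paired values only; not experimental csat or computed PS potentials.
    print(pearson_r([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]))
```

Pairing each mutant's measured csat ratio with its computed PS potential, once with and once without the Uπ and Uq terms, would give the two coefficients being compared.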

FIGURE 6. Mutation effects on experimental csat compared to the sequence-calculated PS potential. (a) PS potential calculated as the summed P classifier distance. (b) Expanded PS potential that includes the summed P classifier distance plus Uπ and Uq trained previously using csat from a mutant dataset. In both plots, experimental csat ratio (x-axis) was digitally extracted from fig. S18A in Rekhi et al. (2023). WT, wild type.

3 DISCUSSION

ParSe 2.0 was developed with a particular focus on predicting which IDRs in a protein sequence can lead to phase separation. Our approach for identifying potential PS IDRs is based primarily on sequence composition and not on sequence patterns or combinations of amino acids. This approach was inspired by the finding that a wide variety of amino acid scales show statistically significant differences between curated ID and PS ID datasets, indicating that PS IDRs are a robustly different class of protein region than conventional, non-PS IDRs (Ibrahim et al., 2023). Similarly, we and others (Dunker et al., 2000; Meng et al., 2017; Romero et al., 2001; Uversky, 2002) have shown that ID (both conventional and PS) and folded protein regions are robustly different in their intrinsic properties, which enables the sequence-based prediction of the modular organization within a protein with respect to ID, PS ID, and folded regions (Figure 1d).

However, to yield reasonable predictive power for mutations that have been designed to test the role of specific amino acid types in driving protein phase separation, the PS potential as computed by ParSe 2.0 has been modified to account for interactions between amino acids. With this modification (i.e., including both Uπ and Uq), we have been able to match existing data on mutant sequences (Ibrahim et al., 2023). The original training of Uπ and Uq used protein constructs that were based on sequences from Ddx4, LAF-1, and A1 (Brady et al., 2017; Bremer et al., 2022; Martin et al., 2020; Schuster et al., 2020). The mutant set in Figure 6 is from an artificial protein construct (Rekhi et al., 2023) and shows surprisingly good agreement between changes in the computed PS potential and changes in the measured csat at 37°C, especially when the PS potential was modified to include Uπ and Uq.

Though, in general, the different phase separation predictors exhibit similar performance when applied to publicly available datasets of PS and non-PS protein sequences (Ibrahim et al., 2023), the percent of the human proteome predicted to phase separate can vary substantially by predictor. For example, ParSe 2.0 (Figure 3), PScore (Vernon et al., 2018), and catGranule (Klus et al., 2014) each identify a relatively small subset of the human proteome (~10%–20%) as exhibiting high potential for phase separation when compared to other predictors, including FuzDrop (Hardenberg et al., 2020), which reports ~40%. This could be owing to a narrower focus of some predictors (for example, ParSe for IDR drivers and PScore for a specific mechanism), while other predictors consider both phase separation drivers and clients (Chen et al., 2022; Hardenberg et al., 2020) or a broad set of physical interaction mechanisms (Cai et al., 2022). Indeed, incorporating Uπ and Uq into ParSe 2.0 to account for potential protein–protein interactions increases the number of proteins in the human proteome identified as driving phase separation (Ibrahim et al., 2023). However, whether this is a result of correctly classifying more human proteins as driving phase separation or whether the false positive rate has increased remains to be seen.

We have built web application versions of ParSe 2.0 for the scientific community. Because of its speed, the ParSe 2.0 algorithm can be applied to datasets of large size (Figure 2). The strong performance of ParSe 2.0 on existing datasets, the robust nature of the differences between PS IDRs and conventional IDRs, and the high correlation between ParSe 2.0 and other predictors on databases of PS proteins (Ibrahim et al., 2023) all give confidence that the algorithm can identify PS IDRs with significant accuracy.

4 METHODS

4.1 Window calculation of ϕ, α, and vmodel

ϕ was calculated as the sequence sum divided by the length, N, using the hydrophobicity scale from Bastolla et al. (2005). For a window of 25 residues, N = 25. Similarly, α was calculated as the sequence sum divided by N using the α-helix propensity scale from Tanaka and Scheraga (1977). vmodel, introduced previously (Paiz et al., 2021), was calculated by,
$$ \nu_{model} = \log\left(R_h/R_o\right)/\log(N), $$ (1)
where Ro was a constant set to 2.16 Å, and the hydrodynamic radius, Rh, was calculated from sequence using an equation found to be accurate for monomeric IDPs (English et al., 2017, 2019; Langridge et al., 2014; Perez et al., 2014; Tomasso et al., 2016). The equation to calculate Rh for a disordered sequence is,
$$ R_h = 2.16\,\mathring{\mathrm{A}} \cdot N^{\left(0.503 - 0.11 \cdot \ln\left(f_{PPII}\right)\right)} + 0.26 \cdot \left|Q_{net}\right| - 0.29 \cdot N^{0.5}, $$ (2)
where fPPII is the fractional number of residues in the PPII conformation, and Qnet is the net charge. fPPII was estimated from ∑ PPPII,i/N, where PPPII,i is the experimental PPII propensity determined for amino acid type i in unfolded peptides by Elam et al. (2013). Qnet was determined from the number of lysine and arginine residues minus the number of glutamic acid and aspartic acid.
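The window quantities described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`window_mean`, `net_charge`, `nu_model`) are hypothetical, the per-residue scale is passed in as a dictionary because the published scale values are not reproduced here, and Rh is taken as an input that would be computed separately from Equation 2.

```python
import math

R_O = 2.16  # Å, the constant R_o given in the text


def window_mean(seq: str, scale: dict) -> float:
    """Sequence sum of a per-residue scale divided by the length N.

    The same averaging applies to phi (hydrophobicity), alpha (helix
    propensity), and f_PPII (PPII propensity); only the scale differs.
    """
    return sum(scale[res] for res in seq) / len(seq)


def net_charge(seq: str) -> int:
    """Q_net: number of Lys and Arg minus number of Glu and Asp."""
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("E") + seq.count("D")
    return positive - negative


def nu_model(r_h: float, n: int) -> float:
    """Equation 1: log(R_h / R_o) / log(N), with R_h in Å."""
    return math.log(r_h / R_O) / math.log(n)
```

For a 25-residue window, an Rh equal to 2.16 Å · 25^0.5 gives vmodel = 0.5, as expected from the definition; larger Rh at fixed N gives larger vmodel.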

4.2 ParSe 2.0 algorithm

For an arbitrary sequence, whereby the amino acids are restricted to the 20 common types, ParSe 2.0 first reads the sequence to determine its length, N. Next, the algorithm uses a sliding window scheme (Figure 1a) to calculate vmodel, α-helix propensity, and ϕ for every 25-residue segment of the primary sequence. This window scheme can be applied to proteins with N > 25. A window is labeled F if ϕ > 0.08 (Figure 1b). If ϕ < 0.08, a window is labeled P or D depending on the values of vmodel and α-helix propensity. Windows with high α-helix propensity and high vmodel are labeled D, while those with low α-helix propensity and low vmodel are labeled P. The P/D boundary is given by vmodel = −0.244·α-helix propensity +0.789. The window label is assigned to the central residue in that window. N- and C-terminal residues not belonging to a central window position are assigned the label of the central residue in the first and last window, respectively, of the whole sequence. Protein regions predicted to be PS, ID, or folded are determined by finding contiguous residue positions of length ≥20 that are ≥90% of only one label P, D, or F, respectively.
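The labeling rules above can be illustrated with a short Python sketch. It assumes the per-window values of ϕ, α-helix propensity, and vmodel have already been computed; the function names are hypothetical, and the handling of values that fall exactly on a cutoff is an assumption, since the text specifies only strict inequalities. The region-finding step (length ≥20, ≥90% one label) is not covered.

```python
WINDOW = 25          # residues per window
PHI_CUTOFF = 0.08    # F/non-F cutoff on phi
SLOPE, INTERCEPT = -0.244, 0.789  # P/D boundary: nu = SLOPE * alpha + INTERCEPT


def label_window(phi: float, alpha: float, nu: float) -> str:
    """Return 'F', 'D', or 'P' for a single window."""
    if phi > PHI_CUTOFF:
        return "F"
    boundary = SLOPE * alpha + INTERCEPT
    # High alpha and high nu lie above the boundary line (label D).
    # Assumption: a point exactly on the boundary is treated as P.
    return "D" if nu > boundary else "P"


def residue_labels(window_labels: list, n: int) -> list:
    """Assign window labels to residues of a sequence of length n.

    Each window labels its central residue; the N- and C-terminal residues
    that are never central inherit the first and last window label.
    """
    if len(window_labels) != n - WINDOW + 1:
        raise ValueError("expected one label per 25-residue window")
    half = WINDOW // 2  # 12 residues on each side of a central position
    return ([window_labels[0]] * half
            + list(window_labels)
            + [window_labels[-1]] * half)
```

A 27-residue sequence has three windows, so `residue_labels(["P", "D", "F"], 27)` labels residues 1–13 as P, residue 14 as D, and residues 15–27 as F.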

4.3 Classifier distance calculation

The classifier distance is the normalized distance of a ParSe 2.0 generated window into its classifier sector (i.e., F, D, or P sector), measured relative to the cutoff boundary (Figure 1b). For F-labeled windows, the classifier distance is ϕ (of the window) minus the cutoff value of 0.08, normalized to the distance from the folded training set mean ϕ (0.1164) to the cutoff. Specifically, this is (ϕ − 0.08)/(0.1164 − 0.08). For P- or D-labeled windows, we first find the point (x, y) on the P/D boundary (defined above) that lies at the foot of the perpendicular drawn from the point defined by the window values of α-helix propensity and vmodel. Then the distance between these two points is determined. Specifically, this distance is sqrt((α − x)·(α − x) + (vmodel − y)·(vmodel − y)), where α and vmodel are the window values, x is (α/0.244 + 0.789 − vmodel)/(0.244 + 1/0.244), and y is (x − α)/0.244 + vmodel. This distance is normalized by dividing by 0.019, the distance from the boundary to either of the training set means.
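As a check on the arithmetic, the two distance formulas in the paragraph above can be transcribed directly. The sketch below uses hypothetical function names and only the constants stated in the text; it is an illustration, not the application's source code.

```python
import math

PHI_CUTOFF = 0.08         # F/non-F cutoff
PHI_FOLDED_MEAN = 0.1164  # mean phi of the folded training set
PD_NORM = 0.019           # boundary-to-training-set-mean distance


def folded_distance(phi: float) -> float:
    """Classifier distance for an F-labeled window."""
    return (phi - PHI_CUTOFF) / (PHI_FOLDED_MEAN - PHI_CUTOFF)


def pd_distance(alpha: float, nu: float) -> float:
    """Classifier distance for a P- or D-labeled window.

    (x, y) is the point on the boundary nu = -0.244 * alpha + 0.789 that is
    closest to the window point (alpha, nu).
    """
    x = (alpha / 0.244 + 0.789 - nu) / (0.244 + 1 / 0.244)
    y = (x - alpha) / 0.244 + nu
    return math.hypot(alpha - x, nu - y) / PD_NORM
```

A window with ϕ equal to the folded training set mean has a distance of 1, and a window lying exactly on the P/D boundary has a distance of 0. The P/D result agrees with the standard point-to-line distance |0.244·α + vmodel − 0.789|/sqrt(0.244² + 1), divided by 0.019.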

4.4 Computed PS potential

The PS potential for a sequence is the summed classifier distance for every window labeled P. This potential can be expanded to include contributions of aromatic and cation-π interactions (Uπ) and charge-based interactions (Uq) as described below.

4.5 Contribution of aromatic and cation–π interactions to the PS potential

The contributions of aromatic and cation–π interactions to protein phase separation follow the rank order observed by Wang et al. (2018): Tyr–Arg > Tyr–Lys ~ Phe–Arg > Phe–Lys. To mimic this ranking, we assumed 3:2:1 weighting and, also, that Phe–Tyr interactions would contribute comparably to Phe–Lys interactions,
$$ U_{\pi} = a \cdot \Big( 3 \cdot \left( \#\mathrm{Y} \times \#\mathrm{R} / \left(\#\mathrm{Y} - \#\mathrm{R}\right)_{\#\mathrm{Y} \ne \#\mathrm{R}} \right) + 2 \cdot \left( \#\mathrm{Y} \times \#\mathrm{K} / \left(\#\mathrm{Y} - \#\mathrm{K}\right)_{\#\mathrm{Y} \ne \#\mathrm{K}} \right) + 2 \cdot \left( \#\mathrm{F} \times \#\mathrm{R} / \left(\#\mathrm{F} - \#\mathrm{R}\right)_{\#\mathrm{F} \ne \#\mathrm{R}} \right) + 1 \cdot \left( \#\mathrm{F} \times \#\mathrm{K} / \left(\#\mathrm{F} - \#\mathrm{K}\right)_{\#\mathrm{F} \ne \#\mathrm{K}} \right) + 1 \cdot \left( \#\mathrm{F} \times \#\mathrm{Y} / \left(\#\mathrm{F} - \#\mathrm{Y}\right)_{\#\mathrm{F} \ne \#\mathrm{Y}} \right) \Big). $$ (3)

In Equation 3, #Y, #R, #F, and #K represent the number of Tyr, Arg, Phe, and Lys residues, respectively, calculated on a per-window basis. Thus, Uπ increases with increasing Tyr, Arg, Phe, and Lys content and more so when interaction partners are present at similar levels. When the divisor is zero (e.g., when #Y = #R), it is changed to 1 to avoid infinite potentials. The fitting parameter a was determined previously (Ibrahim et al., 2023) by finding the optimal correlation of the expanded PS potential to experimental ∆ (finding a = 0.14), ∆ (finding a = 0.08), ∆ (finding a = 0.11), or csat at 4°C (finding a = 0.28). Window-specific Uπ is added to the classifier distance at windows labeled P. Uπ also is calculated at D-labeled windows, allowing for the possibility of labels changing from D to P. This would occur when the value for Uπ was larger than the classifier distance at a D-labeled window. Thus, protein regions that otherwise have characteristics more like the ID set, in vmodel and α-helix propensity, could be labeled P if Uπ was large enough. Here, the given classifier distance was determined by the difference between Uπ and the original classifier distance of the window formerly labeled D.
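A per-window Uπ calculation and the D-to-P relabeling rule might look like the following Python sketch. Two points are assumptions rather than statements from the text: the divisor is taken as the absolute difference in residue counts, so that each term stays positive and grows as partners approach similar levels, and the function names are hypothetical. The fitting parameter a is supplied by the caller.

```python
def pair_term(n1: int, n2: int) -> float:
    """Count product divided by the count difference for one residue pair.

    Assumption: the difference is taken as an absolute value. A zero
    divisor is changed to 1, as described in the text.
    """
    diff = abs(n1 - n2)
    return n1 * n2 / (diff if diff != 0 else 1)


def u_pi(window: str, a: float) -> float:
    """Equation 3 for one window, using the 3:2:2:1:1 weighting."""
    y, r, f, k = (window.count(res) for res in "YRFK")
    return a * (3 * pair_term(y, r)
                + 2 * pair_term(y, k)
                + 2 * pair_term(f, r)
                + 1 * pair_term(f, k)
                + 1 * pair_term(f, y))


def apply_potential(label: str, distance: float, u: float) -> tuple:
    """Add a window potential to the classifier distance.

    P windows gain the potential. A D window switches to P when the
    potential exceeds its classifier distance, and its new distance is the
    difference between the two. Other windows are unchanged.
    """
    if label == "P":
        return "P", distance + u
    if label == "D" and u > distance:
        return "P", u - distance
    return label, distance
```

For example, with a = 1 a window containing two Tyr and two Arg (and no Phe or Lys) gives Uπ = 3 · (2 × 2/1) = 12, whereas three Tyr and one Arg gives 3 · (3 × 1/2) = 4.5, reflecting the larger contribution when partners are present at similar levels.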

4.6 Contribution of charge-based interactions to the PS potential

The contributions of charge-based interactions to protein phase separation follow the observations by Schuster et al. (2020) and Bremer et al. (2022) that changes in the sequence charge decoration, SCD, and net charge per residue, NCPR, respectively, can affect phase separation potential. Thus, a simple charge-based potential was defined,
$$ U_q = b \cdot SCD + c \cdot \left|NCPR\right|, $$ (4)
where b and c are fitting parameters, and Uq is calculated on a per-window basis. Uq is added to the classifier distance at each window labeled P and is applied to windows labeled D, following the scheme described above for Uπ, again allowing for the possibility of labels changing from D to P. The parameters b and c were determined previously (Ibrahim et al., 2023) by finding the optimal correlation of the expanded PS potential and Δ (finding 8.4 and 5.6, respectively), Δ (finding 4.6 and 7.0, respectively), Δ (finding 5.2 and 5.4, respectively), or csat at 4°C (finding −16.0 and 33, respectively). NCPR is the number of Lys and Arg residues minus the number of Glu and Asp residues, divided by N. SCD is calculated by $$ {N}^{-1}\sum_{i}\sum_{j>i}{q}_i{q}_j{\left|j-i\right|}^{1/2} $$, where q is the amino acid-specific charge (Sawle & Ghosh, 2015).
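The charge-based potential can be sketched in Python as follows. The charge assignment (+1 for Lys and Arg, −1 for Glu and Asp, 0 otherwise) is an assumption chosen to match the NCPR definition above, and the function names are hypothetical. The parameters b and c are supplied by the caller.

```python
# Assumption: unit charges on K, R, D, E; every other residue is neutral.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def ncpr(seq: str) -> float:
    """Net charge per residue: (Lys + Arg - Glu - Asp) / N."""
    return sum(CHARGE.get(res, 0) for res in seq) / len(seq)


def scd(seq: str) -> float:
    """Sequence charge decoration: (1/N) * sum over j > i of q_i q_j |j - i|^0.5."""
    q = [CHARGE.get(res, 0) for res in seq]
    n = len(q)
    total = 0.0
    for i in range(n):
        if q[i] == 0:
            continue  # neutral residues contribute nothing to the sum
        for j in range(i + 1, n):
            total += q[i] * q[j] * (j - i) ** 0.5
    return total / n


def u_q(window: str, b: float, c: float) -> float:
    """Equation 4 for one window: b * SCD + c * |NCPR|."""
    return b * scd(window) + c * abs(ncpr(window))
```

As a small worked case, the two-residue sequence KE has SCD = (1 × −1 × 1)/2 = −0.5 and NCPR = 0, so with b = −16.0 and c = 33 (the csat-trained values quoted above) Uq = 8.0.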

4.7 Metapredict calculation

Metapredict score (Emenecker et al., 2021), which predicts the presence of ID in a sequence, was calculated by computer algorithm using the Python script available at http://metapredict.net. The per-residue average metapredict score, when >0.5, was used to classify a protein region as predicted to be ID.

4.8 flDPnn calculation

flDPnn score (Hu et al., 2021), which predicts the presence of ID in a sequence, was calculated by using the webserver available at http://biomine.cs.vcu.edu/servers/flDPnn. The per-residue average flDPnn binary score, when >0.5, was used to classify a protein region as predicted to be ID.

4.9 PSPredictor calculation

PSPredictor score was calculated by using the PSPredictor (Chu et al., 2022) webtool available at http://www.pkumdl.cn:8000/PSPredictor. PSPredictor score, when >0.5, was used to classify a protein region as predicted to exhibit phase separation behavior.

4.10 FuzDrop calculation

FuzDrop calculations (Hardenberg et al., 2020) used the webtool available at https://fuzdrop.bio.unipd.it/predictor. The residue-based droplet-promoting probability (pDP) was used to classify a protein region as predicted to exhibit phase separation behavior when >90% of its residues had pDP > 0.6.

4.11 PSCORE calculation

PSCORE, which is a phase separation propensity predictor (Vernon et al., 2018), was calculated by computer algorithm using the Python script and associated database files available at https://doi.org/10.7554/eLife.31486.022. The overall PScore, when >4, was used to classify a protein region as predicted to exhibit phase separation behavior.

4.12 Granule propensity calculation

Granule propensity was calculated by using the catGranule (Klus et al., 2014) webtool available at http://www.tartaglialab.com. Granule propensity, when >0, was used to classify a protein region as predicted to exhibit phase separation behavior.

4.13 PLAAC LLR calculation

LLR score, which identifies prion-containing sequences (Lancaster et al., 2014), was calculated by using the webtool available at http://plaac.wi.mit.edu. The LLR score, when >0, was used to classify a protein region as predicted to exhibit phase separation behavior.

AUTHOR CONTRIBUTIONS

Loren E. Hough and Steven T. Whitten: conceptualization; Colorado Wilson and Steven T. Whitten: programming; Karen A. Lewis, Nicholas C. Fitzkee, Loren E. Hough, and Steven T. Whitten: methodology; Karen A. Lewis, Nicholas C. Fitzkee, Loren E. Hough, and Steven T. Whitten: formal analysis; Steven T. Whitten: writing—original draft; Karen A. Lewis, Nicholas C. Fitzkee, Loren E. Hough, and Steven T. Whitten: writing—review and editing.

FUNDING INFORMATION

This work was supported by the National Institutes of Health under grants R35GM119755 (Loren E. Hough) and R01AI139479 (Nicholas C. Fitzkee), the National Science Foundation under grants 1818090 (Nicholas C. Fitzkee) and 1943488 (Loren E. Hough), and Texas State University Office of Research and Sponsored Projects through the Research Enhancement Program (Steven T. Whitten and Karen A. Lewis). No nongovernmental sources were used to fund this project. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF or NIH.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT

A web application of ParSe 2.0 that evaluates individual protein sequences can be accessed at https://stevewhitten.github.io/Parse_v2_web. A web application of ParSe 2.0 that can be used to quickly find phase-separating proteins within large sequence sets can be accessed at https://stevewhitten.github.io/Parse_v2_FASTA. The source code for both applications can be accessed at https://github.com/stevewhitten.
