Volume 33, Issue 9 e5140

TOOLS FOR PROTEIN SCIENCE

Open Access

Clustering protein functional families at large scale with hierarchical approaches

Nicola Bordin,

Corresponding Author

Nicola Bordin

[email protected]

orcid.org/0000-0002-6568-9035

Institute of Structural and Molecular Biology, University College London, London, UK

Correspondence

Nicola Bordin, Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK.

Email: [email protected]

Contribution: Conceptualization, Investigation, Writing - original draft, Methodology, Validation, Visualization, Writing - review & editing, Software


Harry Scholes,

Harry Scholes

Institute of Structural and Molecular Biology, University College London, London, UK

Contribution: Methodology, Software, Conceptualization, Investigation, Validation


Clemens Rauer,

Clemens Rauer

Institute of Structural and Molecular Biology, University College London, London, UK

Universidad Autonoma de Madrid, Ciudad Universitaria de Cantoblanco, Madrid, Spain

Contribution: Conceptualization, Investigation, Methodology, Validation, Software


Joel Roca-Martínez,

Joel Roca-Martínez

orcid.org/0000-0002-4313-3845

Institute of Structural and Molecular Biology, University College London, London, UK

Contribution: Software, Validation


Ian Sillitoe,

Ian Sillitoe

Institute of Structural and Molecular Biology, University College London, London, UK

Contribution: Methodology, Software


Christine Orengo,

Christine Orengo

Institute of Structural and Molecular Biology, University College London, London, UK

Contribution: Conceptualization, Investigation, Funding acquisition, Writing - review & editing, Project administration, Supervision, Resources



First published: 15 August 2024

https://doi.org/10.1002/pro.5140

Citations: 3

Review Editor: Nir Ben-Tal


Abstract

Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.

1 INTRODUCTION

Evolutionary changes in sequence and structure over millions of years led to the emergence of a variety of functions in proteins. These changes are often conserved if their impact does not affect survivability and can lead to an improvement in cellular fitness by increasing the performance of existing proteins, broadening the specificity, or by creating novel functionalities. Originating from a common evolutionary ancestor, the emergence of different functions is caused by a variety of events, both genetic and environmental. Genetic changes may include whole genome duplications, transposable element insertions, splicing effects, and single nucleotide polymorphisms, among others, while environmental perturbations may effect changes in germinal lines that are transferred to the next generations.

The location of these mutations and their selection is evident from the variability encountered in protein families, where the core of a globular protein domain is mostly conserved across time and the Tree of Life, with further embellishments and variability in noncore regions, active sites, or allosteric sites (Bordin et al., 2021; Chothia, 1992). From a sequence standpoint, these mutations are allowed only in regions that are not fundamentally affecting the folding stability of the protein. Using a combination of sequence and structural data with experimental and in silico approaches, various resources have classified proteins into homologous superfamilies containing members related by evolution, including CATH, SCOP2, SCOPe, and ECOD (Andreeva et al., 2014; Cheng et al., 2014; Fox et al., 2014; Sillitoe et al., 2021). The CATH protein structure classification database classifies in its 4.3 release over half a million protein domain structures in 6631 superfamilies, where each superfamily contains domains that are descendants of a common evolutionary ancestor. The largest 200 superfamilies in CATH, comprising over 62% of the domains in the database, are very diverse in function, having an average of 27 unique EC4 terms, 690 unique GO terms, as well as 11 structural clusters (in which relatives superimpose within 5 Å) per superfamily.

Several approaches have been developed over the years to capture and classify relatives having different functions across superfamilies of proteins. These can be based on multiple sequence alignments (MSA), network analyses, pairwise sequence analyses, coevolving residues in MSAs, and relationships between Hidden Markov Models (HMM). Early strategies inferred a phylogenetic tree from a multiple sequence alignment for each superfamily and subsequently split the tree to generate functional subgroups of proteins (Del Sol et al., 2003; Lichtarge et al., 1996; Sahraeian et al., 2015). Pairwise sequence alignments are used to generate networks of functional relationships, and analyses of signals from coevolutionary residues coupled with statistical analyses are used by numerous approaches (Mihaljević & Urban, 2020; Narayanan et al., 2017; Neuwald et al., 2018; Rivoire et al., 2016; Salinas & Ranganathan, 2018). For a comprehensive review of methods for the functional classification of proteins, refer to Rauer et al. (2021).

In order to capture and exploit this wealth of functional diversity, an additional level in CATH was created: Functional Families, a subgrouping of homologous proteins in which relatives are likely to perform the same function (Das, Lee, et al., 2015). Functional Families, or FunFams for short, are generated by the GeMMA algorithm (genome modeling and model annotation). GeMMA first groups the sequences assigned to a given CATH superfamily by HMM hits into clusters of relatives having at least 90% sequence identity, and discards clusters that do not have at least one protein with an experimentally derived GO term in UniProt. These “starting clusters” are aligned and converted to Hidden Markov Models that are compared in an all-versus-all fashion using HH-align. Clusters are iteratively merged, resulting in a tree of relationships between HMMs (Lee et al., 2010). The tree is subsequently repopulated with its original multiple alignments, traversed, and pruned in order to obtain the largest coherent subsets possible in which a putative function is conserved. The algorithm that performs this last step, FunFHMMER, looks for the presence of differentially conserved sets of residues between MSAs, likely to be related to specific functions differing between the alignment subsets (Das, Sillitoe, et al., 2015).

FunFHMMer determines the optimal partitioning of the GeMMA tree into subsets, each of which is a FunFam, by operating on the MSAs of leaves and internal nodes (Figure 1).

FIGURE 1. Genome modeling and model annotation (GeMMA) and FunFHMMER algorithms.

This original approach led to the partitioning of 2620 CATH SuperFamilies (42%) in CATH version 4.2 into 67,598 FunFams comprising 8.6 million domain sequences (Sillitoe et al., 2021).

During the process of generating Functional Families for CATH version 4.2, we encountered the upper bound of the original GeMMA/FunFHMMer approach at 5000 starting clusters, and with an ever-growing number of sequences in UniProt being mapped to CATH superfamilies, we explored various avenues to improve the outcome of functional clustering for protein domains. This manuscript explores novel approaches in protein functional clustering, including data prepartitioning from CATH v4.3, algorithm optimization, and a search for a more modern and scalable approach beyond Hidden Markov Models.

2 RESULTS

2.1 CATH-MARC (multi-domain architecture clustering)

The composition and order of domains in a protein are often more conserved than their structures (Apic et al., 2001; Björklund et al., 2005). For example, particular arrangements such as a beta-propeller followed by a spaH domain can be found in eukaryotic nuclear pore complexes and in proto-membranes in the Planctomycetes-Verrucomicrobia-Chlamydiae superphylum, with as little as 4% sequence identity conserved between these SuperKingdoms (Devos et al., 2004; Santarella-Mellwig et al., 2010). These arrangements can be described as the Multi-Domain Architecture (MDA) of a protein (Figure 2a). CATH-MARC (multi-domain architecture clustering), an improved version of the GeMMA/FunFHMMer protocol used to generate the v4.3 release of CATH-FunFams, is based on the principle that proteins having different MDAs are likely to have rather different functions; therefore, data can be prepartitioned more effectively by segregating them ab initio (Bashton & Chothia, 2007; Dessailly et al., 2009; Lees et al., 2014; Yu et al., 2019). CATH domain assignments on UniProt sequences are available via the ancillary Gene3D resource (Lewis et al., 2018), and their order in the protein sequence is provided as “MDA strings” (e.g., a protein with a P-loop domain at the N-terminus, followed by an immunoglobulin domain and a “High-signature, UspA, PP-ATPase” (HUP) domain at the C-terminus will have the MDA string 3.40.50.300–2.40.60.10–3.40.50.620). The Gene3D assignments are therefore first grouped by MDA, provided there are at least a million sequences, before the initial clustering and GO-based selection. Smaller sets of sequences sharing the same MDA are collected in a single set. The algorithm differs from the original implementation of GeMMA/FunFHMMer in that the FunFams generated from each MDA are then pooled and used as starting clusters for a final iteration of the algorithm (Figure 2b).
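The MDA-based prepartitioning described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CATH-MARC implementation: the function name `partition_by_mda`, the input format (pairs of sequence ID and MDA string, written here with ASCII hyphens), the `min_size` threshold, and the partition label `pooled_small_mdas` are all hypothetical.

```python
from collections import defaultdict


def partition_by_mda(assignments, min_size):
    """Group sequence IDs by MDA string and pool the small groups.

    assignments: iterable of (sequence_id, mda_string) pairs
                 (hypothetical input format, not the Gene3D file layout).
    min_size:    smallest group that is kept as its own partition.
    Returns a dict mapping a partition name to a list of sequence IDs.
    """
    by_mda = defaultdict(list)
    for seq_id, mda in assignments:
        by_mda[mda].append(seq_id)

    partitions = {}
    pooled = []
    for mda, seq_ids in by_mda.items():
        if len(seq_ids) >= min_size:
            partitions[mda] = seq_ids      # large MDA: processed on its own
        else:
            pooled.extend(seq_ids)         # small MDA: collected in one set
    if pooled:
        partitions["pooled_small_mdas"] = pooled
    return partitions


# Toy data: four domain sequences from three MDA contexts.
toy = [
    ("P1", "3.40.50.300-3.40.50.620"),
    ("P2", "3.40.50.300-3.40.50.620"),
    ("P3", "3.40.50.620"),
    ("P4", "2.40.60.10-3.40.50.620"),
]
parts = partition_by_mda(toy, min_size=2)
```

Each resulting partition could then be clustered and processed independently, which is what makes the first iteration straightforward to run in parallel.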

This approach, when applied to the release v4.3 of CATH, led to an overall expansion in the number of FunFams from 68,065 in v4.2 to 212,872, covering an additional 1645 superfamilies (a 61% increase in coverage by superfamily and a 300% increase in total sequences classified). This increase in coverage did not come at the expense of EC purity: both the overall FunFam purity level and the diversity of positions scores (DOPS) increased between the previous release and version 4.3 (Sillitoe et al., 2021; Valdar, 2002). Another key advantage of this approach is that it can be parallelized: each MDA can be processed separately, which lowers the memory footprint and reduced the processing time from 6 months (CATH v4.2) to 6 weeks (CATH v4.3) despite a significant increase in the number of sequences classified in CATH FunFams.

2.2 Functional families generation by RANdom splitting (FRAN)

The prepartitioning of domain sequence data introduced with CATH-MARC improves functional coherence and allows each multi-domain architecture to be treated as an individual project for the first iteration of tree building, reducing the computational footprint by a factor of 4 (from 6 months to 6 weeks). Although a significant improvement, a dozen superfamilies, including the P loops and immunoglobulins, have individual MDA sets containing over 2 million sequences and tens of thousands of starting clusters, and these numbers will increase further when metagenomic sequences are classified into superfamilies for functional studies. The Alpha/Beta Hydrolase superfamily, containing various enzymes with environmental and industrial applications, is represented in CATH with 3809 structures and over 800,000 domain sequences. These enzymes present an extreme variety in their structure and substrates, with close to 2000 unique Functional Families in CATH v4.3. By scanning MGnify sequences (Mitchell et al., 2019) against the HMMs built from each S95 representative, the initial number of starting clusters for the Alpha/Beta Hydrolase is close to 1.5 million, three orders of magnitude larger than the current limitation of MARC. A successful strategy for clustering functional groups in these large metagenomic datasets relies on randomly partitioning the initial sequence set, building trees from these randomly assigned sets, and relying on a second iteration of tree building and cutting to merge outliers correctly. Functional Families generation by RANdom splitting (FRAN) generates starting clusters at 90% sequence identity, generates projects, and fills each with a predetermined number of clusters selected randomly from the pool of S90. Each project is then processed using the MARC protocol, with the Functional Families resulting from the first iteration being used as the starting clusters of the second iteration of the tree-building and cutting algorithm (Figure 3).

The randomization involved in the first round of the algorithm only slightly affects the final purity of the resulting Functional Families, because the second iteration of the algorithm processes most of the outliers that the randomization introduces in each starting cluster.
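The random partitioning step of FRAN can be illustrated with a short Python sketch. It is an assumption-laden example rather than the published code: the function name `random_projects`, the `project_size` and `seed` parameters, and the `S90_` identifiers are hypothetical, and the subsequent tree building and cutting of each project is not shown.

```python
import random


def random_projects(cluster_ids, project_size, seed=0):
    """Shuffle starting clusters and deal them into projects.

    Every project holds at most project_size clusters; the last project
    takes whatever remains. A fixed seed keeps the split reproducible.
    """
    if project_size < 1:
        raise ValueError("project_size must be at least 1")
    shuffled = list(cluster_ids)
    random.Random(seed).shuffle(shuffled)
    return [
        shuffled[i:i + project_size]
        for i in range(0, len(shuffled), project_size)
    ]


# Ten hypothetical S90 starting clusters split into projects of four.
starting_clusters = [f"S90_{i:04d}" for i in range(10)]
projects = random_projects(starting_clusters, project_size=4, seed=42)
```

Capping the project size bounds the cost of each first-iteration tree, at the price of separating related clusters that only the second iteration can bring back together.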

2.3 CATH-eMMA

Recent advances in protein language models (Elnaggar et al., 2020; Heinzinger et al., 2022; Heinzinger et al., 2023; Rives et al., 2021; Yu et al., 2023) and ultra-fast protein aligners based on the encoding of protein structure conformations (van Kempen et al., 2023) enabled the use of embeddings and structural alphabets as features for determining evolutionary distances as well as sequence and structural similarities. Proximity in hyperdimensional space between the embeddings of two proteins can be used to infer common characteristics, including structure, function, and evolutionary history (Bordin et al., 2023; Durairaj et al., 2023; Gligorijević et al., 2021; Kilinc et al., 2023; Nallapareddy et al., 2023). As GeMMA relies on a very large similarity matrix based on Hidden Markov Model distances, we examined whether it would be effective to replace the core engine of the algorithm with a source-agnostic half-matrix of distances based on embedding (e.g., cosine, Euclidean, and Manhattan) or structural distances (1/bitscore, RMSD), as well as modernizing the pipeline environment from Perl to Python. The resulting algorithm, eMMA (a portmanteau of embeddings and GeMMA), is a Python-based CLI pipeline that interfaces with local and distributed platforms (e.g., SGE) (https://github.com/UCLOrengoGroup/eMMA). CATH-eMMA runs modified versions of GeMMA and FunFHMMER, with various advantages over its predecessors (MARC, FRAN, and the original implementation of GeMMA/FunFHMMER). The first step in the eMMA algorithm substitutes CD-HIT (Li & Godzik, 2006) with MMseqs2 (Steinegger & Söding, 2017), a more modern, faster, and scalable aligner, while retaining the same clustering behavior as the former (Figure 4a: Clustering). A significant reduction in memory consumption comes from the second step, which generates embeddings or structure-based distance half-matrices only for the cluster representatives, removing the need to store the entire alignment in memory (Figure 4b: Distance matrix generation). A modified version of GeMMA then stores the matrix in memory, removing the need to align, convert, and search HMMs, thus reducing the time needed to create the relationship tree (Figure 4c: Distances relationship tree). Ultimately, each starting cluster is refilled from its representative, and a version of FunFHMMer modified to process distances instead of E-values traverses the GeMMA tree to generate Functional Families (Figure 4d: Tree filling and cutting). The advantages of the eMMA approach are various, as it requires less memory, less computing time, and can be fed diverse sources of data.
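The idea of building a relationship tree from a half-matrix of distances between cluster representatives can be sketched as follows. This is a toy example in plain Python, not the eMMA code: the function names, the two-dimensional "embeddings", and in particular the average-linkage merging rule are assumptions made for illustration, since the merging criterion used by the modified GeMMA is not detailed here. For N representatives the half-matrix holds N(N-1)/2 values.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def half_matrix(embeddings):
    """Distances for each unordered pair only (no diagonal, no duplicates)."""
    return {
        (a, b): cosine_distance(embeddings[a], embeddings[b])
        for a, b in combinations(sorted(embeddings), 2)
    }


def build_tree(dist):
    """Greedy agglomerative merging with average linkage (assumed rule).

    Returns the merges as (left, right, distance), closest pair first.
    """
    dist = dict(dist)
    sizes = {name: 1 for pair in dist for name in pair}
    merges = []
    while dist:
        (a, b), d = min(dist.items(), key=lambda item: item[1])
        merged = f"({a},{b})"
        merges.append((a, b, d))
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        new_dist = {}
        for other in sizes:
            d_a = dist[tuple(sorted((a, other)))]
            d_b = dist[tuple(sorted((b, other)))]
            # Size-weighted mean keeps this a true average linkage.
            new_dist[tuple(sorted((merged, other)))] = (
                (size_a * d_a + size_b * d_b) / (size_a + size_b)
            )
        # Drop all entries involving the merged children, add the new node.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        sizes[merged] = size_a + size_b
    return merges


# Toy "embeddings" for three cluster representatives.
embeddings = {"c1": [1.0, 0.0], "c2": [0.9, 0.1], "c3": [0.0, 1.0]}
distances = half_matrix(embeddings)
merges = build_tree(distances)
```

Because only the distance dictionary is consulted, the same `build_tree` function would accept distances from any source, for example values derived from structural comparisons, which mirrors the source-agnostic design described above.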

2.4 Benchmarks

To benchmark the efficacy and behavior of each approach (GeMMA/FunFHMMER, MARC, FRAN, and eMMA), we clustered the HUPs superfamily in CATH (3.40.50.620). We selected this superfamily as it is very diverse from a structural and functional standpoint, and most of its sequences have extensive functional annotations on UniProt. The HUP domain superfamily has 1685 structures in CATH and over 650 thousand domain relatives in UniProt, as predicted by Gene3D, and exhibits extreme diversity with over 21 structural clusters (relatives superpose within 9 Å), 55 different EC terms, over 600 GO terms, and relatives found in 62 MDA contexts in over 26 thousand organisms. After clustering relatives at 90% sequence identity, we proceeded only with clusters containing experimentally derived EC annotations. Out of 650 thousand HUP sequences, only 40,226 have experimental EC characterization.

MARC and FRAN starting cluster data were prepartitioned according to MDA or randomly split into an equivalent number of partitions, while starting clusters for GeMMA/FunFHMMER and eMMA were processed in bulk and classified into functional families (Table 1). Using the algorithms as described above, we generated FunFams and benchmarked their quality using the EC purity and diversity of positions (DOPs) Score (Valdar, 2002) (Figure 5). The DOPs values provide information on the sequence diversity of the clusters. These are important metrics to consider as FunFam multiple sequence alignments (MSAs) are used to identify conserved residues likely to be linked to functional sites. Clustering too narrowly may increase purity, but at the expense of losing information on conserved sites, as conservation can only be measured in diverse (information rich) MSAs.

TABLE 1. Starting clusters and resulting FunFams for each algorithm applied to the HUPs SuperFamily.

Algorithm	Starting clusters	FunFams 1st iteration	FunFams 2nd iteration
GeMMA/FunFHMMER	3693	1200	1200
MARC	3723	1221	921
FRAN	3693	2789	1099
eMMA	3697	1262	1262

Abbreviations: FRAN, functional families generation by RANdom splitting; FunFams, functional families; GeMMA, Genome modeling and model annotation; MARC, multi-domain architecture clustering.
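To make the EC purity metric used in this benchmark concrete, the sketch below computes one plausible version of it. The exact definition is not given in this section, so the formula here is an assumption: purity is taken as the percentage of annotated members of a FunFam that carry the most frequent EC number. The function name `ec_purity` and the toy annotations are hypothetical.

```python
from collections import Counter


def ec_purity(ec_terms):
    """Percentage of annotated members sharing the most frequent EC number.

    ec_terms: one EC string per annotated member; members without an
    annotation should be left out by the caller.
    Returns None when no member is annotated.
    """
    if not ec_terms:
        return None
    counts = Counter(ec_terms)
    top_count = counts.most_common(1)[0][1]
    return 100.0 * top_count / len(ec_terms)


# Toy FunFam: three members share one EC number, one member differs.
funfam = ["6.1.1.1", "6.1.1.1", "6.1.1.1", "6.1.1.2"]
purity = ec_purity(funfam)
```

Under this definition a FunFam with a single annotated member is trivially 100% pure, which is one reason purity is read together with an alignment diversity measure such as DOPs.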

All methods have comparable performance and purity with the original implementation of the GeMMA/FunFHMMer algorithm while improving in certain areas. By exploiting Multi-Domain Architecture information, MARC improved slightly on the EC purity of GeMMA/FunFHMMer (Figure 5b,d), while drastically reducing the time and computing requirements of the first iteration of GeMMA (Figure 5a). The first iteration of FunFHMMer is consistent across these two methods, with a comparable computing time to traverse and cut each tree. The second iteration of GeMMA and FunFHMMer in MARC takes significantly less time, as the sequences of each Functional Family are already in close vicinity and the pooling process significantly reduces the search space in comparison with the single iteration of GeMMA/FunFHMMer in the original algorithm.

FRAN has a very fast first iteration of GeMMA and FunFHMMER due to the size constraint on their starting cluster numbers, although the resulting FunFams are almost twice the number of the other methods, suggesting that the random splitting is creating a very diverse tree, which is particularly difficult for FunFHMMER to traverse and merge individual functions (Figure 5a). The second iteration of FunFHMMER is the longest across all algorithms, due to the effort involved in comparing and aligning very diverse sets of sequences.

FunFams generated by eMMA have a slight degradation in EC purity due to the algorithm relying on the signal from comparing single sequence representatives for the clusters rather than HMM-HMM comparisons of the multiple sequence alignments from the clusters. Ultimately the tradeoff between memory/speed and purity is favorable, as potentially these EC impurities could be addressed by additional iterations of the algorithm, and future work will investigate additional strategies for comparing the clusters to improve purity.

The diversity of positions score retrieved from scorecons (Valdar, 2002) shows comparable alignment diversity for FunFams generated by GeMMA/FunFHMMER and eMMA, and for FunFams generated by MARC and FRAN (Figure 5c). FunFams generated by eMMA show slightly lower values than those from MARC and FRAN, suggesting that more work is needed to optimize merging of the clusters in order to improve diversity and information content whilst ensuring high levels of purity are maintained.

3 DISCUSSION

The classification of protein sequences into sets having coherent functions is valuable for a variety of downstream applications, including the detection of variants, the identification of residues that are differentially conserved across functional families, and the assignment of function to uncharacterized proteins. Functional families have been used successfully in the past to assign putative functions to targets in various rounds of the CAFA functional annotation evaluation (Jiang et al., 2016; Radivojac et al., 2013; Zhou et al., 2019), have identified putative variants in cancer (Sillitoe et al., 2021), and have classified kinases according to their specificity-determining positions (Adeyelu et al., 2023). They are often used as a proxy for assigning functions to whole proteins, despite being domain-based annotations only. The conserved residues across a FunFam reflect the domain's environment in a multi-domain architecture, so a FunFam assignment can often be used to extrapolate a higher-level function to the whole protein (Das, Lee, et al., 2015; Rentzsch & Orengo, 2013). Functional impurities in FunFams can be ascribed to either the presence of uncertain EC4 annotations (e.g., EC:3.4.12.-) in proteins within each alignment or the incorrect merging by FunFHMMer of two subsets of sequences with different functions.

With the emergence of extremely large datasets from metagenomes (Mitchell et al., 2019), and the availability of structures for the majority of proteins in UniProt via the AlphaFold database (Varadi et al., 2022), this treasure trove of evolutionary information is both a boon and a bane for these algorithms: larger sequence sets contain many unexplored portions of functional space, but they are becoming intractable for methods that can cope with hundreds of thousands or a couple of million sequences at best. Recent works (Barrio-Hernandez et al., 2023; Durairaj et al., 2023) have exploited novel structural clustering capabilities by Foldseek (van Kempen et al., 2023) to reduce the search space for the entirety of the structure space in UniProt, but the number of cluster representatives is still in the tens of millions, and these clusters are based on chain-level groupings, with domain-level annotations still missing from the general picture. Annotations in “dark” regions of the protein structure space are very sparse, even though these regions represent large swaths of it, particularly in bacterial genomes (Perdigão et al., 2015). Furthermore, embedding-based annotation tools are rapidly overtaking HMMs as the state of the art for functional inference (Kilinc et al., 2023), and BLAST-like embedding-based tools will enable a rapid and accurate way to search these large regions of the “unknownome” (Hamamsy et al., 2023). The main pitfall of embedding-based methods is the lack of embedding databases; therefore, these new search tools have a limited search space. Future improvements are likely to involve exploitation of both homology and embedding-based information in a more efficient manner to cluster the protein function space using the structure space as a scaffold to focus our functional annotation efforts.

4 METHODS

4.1 Sequence data

Protein sequences for the HUPs Superfamily in CATH v4.3 (http://www.cathdb.info/version/latest/superfamily/3.40.50.620) were retrieved from the latest release of Gene3D (v21) and their GO and EC annotations downloaded from UniProt using the Retrieve/ID Mapping service (https://www.uniprot.org/id-mapping).

4.2 Clustering

Protein domain sequences were clustered at 90% sequence identity using CD-HIT (Li & Godzik, 2006) for GeMMA/FunFHMMER, MARC, and FRAN. Equivalent behavior was achieved using MMseqs2 (Steinegger & Söding, 2017) by following the documentation to replicate CD-HIT clustering in MMseqs2 at https://github.com/soedinglab/mmseqs2/wiki#how-do-parameters-of-cd-hit-relate-to-mmseqs2.

4.3 Tree generation

Trees for GeMMA/FunFHMMER, MARC, and FRAN were generated using GeMMA (https://github.com/UCL/cath-gemma) (Lee et al., 2010). A modified version of GeMMA used to generate trees from source agnostic matrices is part of the new eMMA Github repository (https://github.com/UCLOrengoGroup/eMMA).

4.4 Tree traversing and cutting

Trees for GeMMA/FunFHMMER, MARC, and FRAN were traversed and cut into individual Functional Families using FunFHMMER version 2.2 (https://github.com/UCL/cath-funfhmmer/tree/funfhmmer_v2.2) (Das, Lee, et al., 2015). A version of FunFHMMER modified to process non-E-value-based distance trees is available at https://github.com/UCLOrengoGroup/eMMA-FunFHMMER.

4.5 Embeddings from protein language models

Embeddings were generated using the dedicated CLI module in eMMA (https://github.com/UCLOrengoGroup/eMMA), which relies on the embedding generation capability of the ESM2 protein language model (esm2_t33_650M_UR50D) developed by MetaAI and available at https://github.com/facebookresearch/esm (Rives et al., 2021).

4.6 Alignment evaluation

Alignment quality metrics such as diversity of positions scores (DOPs) were calculated for each set of FunFams. DOPs rank alignment diversity by measuring the proportion of conserved positions via scorecons (conservation scores) and return a value from 0 (no variability, uninformative alignment) to 100 (highly diverse, informative alignment). DOPs are calculated as part of the implementation of scorecons (Valdar, 2002) included in the cathpy Python package available at https://cathpy.readthedocs.io/en/latest/.

AUTHOR CONTRIBUTIONS

Nicola Bordin: Conceptualization; investigation; writing – original draft; methodology; validation; visualization; writing – review and editing; software. Harry Scholes: Methodology; software; conceptualization; investigation; validation. Clemens Rauer: Conceptualization; investigation; methodology; validation; software. Joel Roca-Martínez: Software; validation. Ian Sillitoe: Methodology; software. Christine Orengo: Conceptualization; investigation; funding acquisition; writing – review and editing; project administration; supervision; resources.

Open Research

DATA AVAILABILITY STATEMENT

The eMMA Python pipeline is freely available on GitHub at https://github.com/UCLOrengoGroup/eMMA. The eMMA version of FunFHMMER is available at https://github.com/UCLOrengoGroup/eMMA-FunFHMMER. Benchmark code and data, including the embeddings for the HUPs superfamily, are available on Zenodo at https://zenodo.org/doi/10.5281/zenodo.8425747.

REFERENCES

Adeyelu T, Bordin N, Waman VP, Sadlej M, Sillitoe I, Moya-Garcia AA, et al. KinFams: de-novo classification of protein kinases using CATH functional units. Biomolecules. 2023; 13: 277.
10.3390/biom13020277
CAS PubMed Google Scholar
Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG. SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res. 2014; 42: D310–D314.
10.1093/nar/gkt1242
CAS PubMed Web of Science® Google Scholar
Apic G, Gough J, Teichmann SA. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol. 2001; 310: 311–325.
10.1006/jmbi.2001.4776
CAS PubMed Web of Science® Google Scholar
Barrio-Hernandez I, Yeo J, Jänes J, Mirdita M, Gilchrist CLM, Wein T, et al. Clustering predicted structures at the scale of the known protein universe. Nature. 2023; 622(7983): 637–645.
10.1038/s41586-023-06510-w
CAS PubMed Web of Science® Google Scholar
Bashton M, Chothia C. The generation of new protein functions by the combination of domains. Structure. 2007; 15: 85–99.
10.1016/j.str.2006.11.009
CAS PubMed Web of Science® Google Scholar
Björklund AK, Ekman D, Light S, Frey-Skött J, Elofsson A. Domain rearrangements in protein evolution. J Mol Biol. 2005; 353: 911–923.
10.1016/j.jmb.2005.08.067
CAS PubMed Web of Science® Google Scholar
Bordin N, Dallago C, Heinzinger M, Kim S, Littmann M, Rauer C, et al. Novel machine learning approaches revolutionize protein knowledge. Trends Biochem Sci. 2023; 48: 345–359.
10.1016/j.tibs.2022.11.001
CAS PubMed Web of Science® Google Scholar
Bordin N, Sillitoe I, Lees JG, Orengo C. Tracing evolution through protein structures: nature captured in a few thousand folds. Front Mol Biosci. 2021; 8:668184.
10.3389/fmolb.2021.668184
CAS PubMed Web of Science® Google Scholar
Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, et al. ECOD: an evolutionary classification of protein domains. PLoS Comput Biol. 2014; 10:e1003926.
10.1371/journal.pcbi.1003926
PubMed Web of Science® Google Scholar
Chothia C. One thousand families for the molecular biologist. Nature. 1992; 357: 543–544.
10.1038/357543a0
CAS PubMed Web of Science® Google Scholar
Das S, Lee D, Sillitoe I, Dawson NL, Lees JG, Orengo CA. Functional classification of CATH superfamilies: a domain-based approach for protein function annotation. Bioinformatics. 2015; 31: 3460–3467.
10.1093/bioinformatics/btv398
CAS PubMed Web of Science® Google Scholar
Das S, Sillitoe I, Lee D, Lees JG, Dawson NL, Ward J, et al. CATH FunFHMMer web server: protein functional annotations using functional family assignments. Nucleic Acids Res. 2015; 43: W148–W153.
10.1093/nar/gkv488
Del Sol MA, Pazos F, Valencia A. Automatic methods for predicting functionally important residues. J Mol Biol. 2003; 326: 1289–1302.
10.1016/S0022-2836(02)01451-1
Dessailly BH, Redfern OC, Cuff A, Orengo CA. Exploiting structural classifications for function prediction: towards a domain grammar for protein function. Curr Opin Struct Biol. 2009; 19: 349–356.
10.1016/j.sbi.2009.03.009
Devos D, Dokudovskaya S, Alber F, Williams R, Chait BT, Sali A, et al. Components of coated vesicles and nuclear pore complexes share a common molecular architecture. PLoS Biol. 2004; 2:e380.
10.1371/journal.pbio.0020380
Durairaj J, Waterhouse AM, Mets T, Brodiazhenko T, Abdullah M, Studer G, et al. Uncovering new families and folds in the natural protein universe. Nature. 2023; 622: 646–653.
10.1038/s41586-023-06622-3
Elnaggar A, Heinzinger M, Dallago C, Rihawi G, Wang Y, Jones L, et al. ProtTrans: towards cracking the language of Life's code through self-supervised deep learning and high performance computing. 2020 bioRxiv:2020.07.12.199554.
Fox NK, Brenner SE, Chandonia J-M. SCOPe: structural classification of proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 2014; 42: D304–D309.
10.1093/nar/gkt1240
Gligorijević V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun. 2021; 12: 3168.
10.1038/s41467-021-23303-9
Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, et al. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol. 2023; 42(6): 975–985.
10.1038/s41587-023-01917-2
Heinzinger M, Littmann M, Sillitoe I, Bordin N, Orengo C, Rost B. Contrastive learning on protein embeddings enlightens midnight zone. NAR Genom Bioinform. 2022; 4:lqac043.
10.1093/nargab/lqac043
Heinzinger M, Weissenow K, Sanchez JG, Henkel A, Steinegger M, Rost B. ProstT5: bilingual language model for protein sequence and structure. bioRxiv 2023.07.23.550085.
10.1101/2023.07.23.550085
Jiang Y, Oron TR, Clark WT, Bankapur AR, D'Andrea D, Lepore R, et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 2016; 17: 184.
10.1186/s13059-016-1037-6
Kilinc M, Jia K, Jernigan RL. Improved global protein homolog detection with major gains in function identification. Proc Natl Acad Sci USA. 2023; 120:e2211823120.
10.1073/pnas.2211823120
Lee DA, Rentzsch R, Orengo C. GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains. Nucleic Acids Res. 2010; 38: 720–737.
10.1093/nar/gkp1049
Lees JG, Lee D, Studer RA, Dawson NL, Sillitoe I, Das S, et al. Gene3D: multi-domain annotations for protein sequence and comparative genome analysis. Nucleic Acids Res. 2014; 42: D240–D245.
10.1093/nar/gkt1205
Lewis TE, Sillitoe I, Dawson N, Lam SD, Clarke T, Lee D, et al. Gene3D: extensive prediction of globular domains in proteins. Nucleic Acids Res. 2018; 46: D435–D439.
10.1093/nar/gkx1069
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006; 22: 1658–1659.
10.1093/bioinformatics/btl158
Lichtarge O, Bourne HR, Cohen FE. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol. 1996; 257: 342–358.
10.1006/jmbi.1996.0167
Mihaljević L, Urban S. Decoding the functional evolution of an intramembrane protease superfamily by statistical coupling analysis. Structure. 2020; 28: 1329–1336.e4.
10.1016/j.str.2020.07.015
Mitchell AL, Almeida A, Beracochea M, Boland M, Burgin J, Cochrane G, et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 2019; 48: D570–D578.
Nallapareddy V, Bordin N, Sillitoe I, Heinzinger M, Littmann M, Waman VP, et al. CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models. Bioinformatics. 2023; 39:btad029.
10.1093/bioinformatics/btad029
Narayanan C, Gagné D, Reynolds KA, Doucet N. Conserved amino acid networks modulate discrete functional properties in an enzyme superfamily. Sci Rep. 2017; 7: 3207.
10.1038/s41598-017-03298-4
Neuwald AF, Aravind L, Altschul SF. Inferring joint sequence-structural determinants of protein functional specificity. Elife. 2018; 7:e29880.
10.7554/eLife.29880
Perdigão N, Heinrich J, Stolte C, Sabir KS, Buckley MJ, Tabor B, et al. Unexpected features of the dark proteome. Proc Natl Acad Sci U S A. 2015; 112: 15898–15903.
10.1073/pnas.1508380112
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, et al. A large-scale evaluation of computational protein function prediction. Nat Methods. 2013; 10: 221–227.
10.1038/nmeth.2340
Rauer C, Sen N, Waman VP, Abbasian M, Orengo CA. Computational approaches to predict protein functional families and functional sites. Curr Opin Struct Biol. 2021; 70: 108–122.
10.1016/j.sbi.2021.05.012
Rentzsch R, Orengo CA. Protein function prediction using domain families. BMC Bioinformatics. 2013; 14: S5.
10.1186/1471-2105-14-S3-S5
Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021; 118:e2016239118.
10.1073/pnas.2016239118
Rivoire O, Reynolds KA, Ranganathan R. Evolution-based functional decomposition of proteins. PLoS Comput Biol. 2016; 12:e1004817.
10.1371/journal.pcbi.1004817
Sahraeian SM, Luo KR, Brenner SE. SIFTER search: a web server for accurate phylogeny-based protein function prediction. Nucleic Acids Res. 2015; 43: W141–W147.
10.1093/nar/gkv461
Salinas VH, Ranganathan R. Coevolution-based inference of amino acid interactions underlying protein function. Elife. 2018; 7:e34300.
10.7554/eLife.34300
Santarella-Mellwig R, Franke J, Jaedicke A, Gorjanacz M, Bauer U, Budd A, et al. The compartmentalized bacteria of the planctomycetes-verrucomicrobia-chlamydiae superphylum have membrane coat-like proteins. PLoS Biol. 2010; 8:e1000281.
10.1371/journal.pbio.1000281
Sillitoe I, Bordin N, Dawson N, Waman VP, Ashford P, Scholes HM, et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 2021; 49: D266–D273.
10.1093/nar/gkaa1079
Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017; 35: 1026–1028.
10.1038/nbt.3988
Valdar WSJ. Scoring residue conservation. Proteins. 2002; 48: 227–241.
10.1002/prot.10146
van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2023; 42(2): 243–246.
10.1038/s41587-023-01773-0
Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022; 50: D439–D444.
10.1093/nar/gkab1061
Yu L, Tanwar DK, Penha EDS, Wolf YI, Koonin EV, Basu MK. Grammar of protein domain architectures. Proc Natl Acad Sci U S A. 2019; 116: 3636–3645.
10.1073/pnas.1814684116
Yu T, Cui H, Li JC, Luo Y, Jiang G, Zhao H. Enzyme function prediction using contrastive learning. Science. 2023; 379: 1358–1363.
10.1126/science.adf2465
Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 2019; 20: 244.
10.1186/s13059-019-1835-8

Volume 33, Issue 9

September 2024

e5140

This article also appears in:

Tools for Protein Science 2024
