Volume 32, Issue 12 e4820
TOOLS FOR PROTEIN SCIENCE
Open Access

KEGG tools for classification and analysis of viral proteins

Zhao Jin

Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan

Pathway Solutions Inc., Tokyo, Japan

Yoko Sato

Pathway Solutions Inc., Tokyo, Japan

Masayuki Kawashima

Network Support Co. Ltd., Fukuoka, Japan

Minoru Kanehisa

Corresponding Author

Institute for Chemical Research, Kyoto University, Uji, Kyoto, Japan

Correspondence

Minoru Kanehisa, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan.

Email: [email protected]

First published: 26 October 2023
Citations: 1

Review Editor: Nir Ben-Tal

Abstract

The KEGG database and analysis tools (https://www.kegg.jp) have been developed mostly for understanding genes and genomes of cellular organisms. The KO (KEGG Orthology) dataset, which is a collection of functional orthologs, links genes in the genome to pathways and other molecular networks, enabling KEGG mapping to uncover hidden features in the genome. Although viruses have been part of KEGG for some time, they have not been fully integrated into the KEGG analysis tools because the KO assignment rate is very low for viral genes. To supplement KOs, a new dataset named virus ortholog clusters (VOCs) is computationally generated, covering 90% of viral proteins in KEGG. VOCs can be used in place of KOs for taxonomy mapping, which uncovers relationships between sequence similarity groups and taxonomic groups, and for identifying conserved gene orders in virus genomes. Furthermore, selected VOCs are used to define tentative KOs for characterizing protein functions. Here an overview of KEGG tools is presented, focusing on these extensions for viral protein analysis.

1 INTRODUCTION

The KEGG (Kyoto Encyclopedia of Genes and Genomes) database (Kanehisa et al., 2023) has been developed as a computer representation of biological information systems in the cell and the organism, which is manually created from published experimental data and represented in terms of molecular networks and molecular building blocks. The KEGG database has been used with KEGG mapping tools for uncovering hidden features in biological data (Kanehisa et al., 2022; Kanehisa and Sato, 2020), such as for reconstructing metabolic pathways from genome sequence data. The KEGG database is a generic database that can be applied to any cellular organism, for it is based on functional orthologs called KO (KEGG Orthology) groups, each of which represents manually defined gene/protein function by generalizing experimental knowledge in specific organisms to other organisms.

The KEGG database has also been used for understanding human diseases and drugs with the concept of perturbations. For example, diseases are associated with perturbed molecular networks caused by human gene variants, viruses, other pathogens, and environmental factors. Drugs are treated as different types of perturbants affecting perturbed molecular networks. In view of the increasing importance of viruses, efforts have been initiated to make KEGG a more useful resource for exploring viral proteins and viral perturbations. Unfortunately, due to the lack of experimental evidence, less than 5% of viral proteins are currently associated with KOs, compared with over 50% of proteins in cellular organisms. We have thus developed a new dataset named virus ortholog cluster (VOC) to supplement KOs.

Various methods for inferring orthologs have been reported and the resulting ortholog databases are made available (Nevers et al., 2022). However, only a small number of databases contain viral proteins, and most of these are not frequently updated (Andrade-Martínez et al., 2022). The KEGG GENES database currently contains 50 million genes including 700 thousand viral genes, from which the SSDB database is generated by SSEARCH computation (Pearson, 1998) of sequence similarity scores and best hit relations for all genome pairs. KOs are defined manually and expanded computationally using SSDB, which may be viewed as a huge graph of sequence similarity relations, while VOCs are computationally generated from a subgraph consisting of only viral proteins.

We report here extensions of the KEGG mapping tools published in the Tools 2022 Issue of Protein Science (Kanehisa et al., 2022) and the KEGG genome browser (Kanehisa et al., 2023) released in January 2022, focusing on the analysis of viral proteins. VOCs are used in two ways. One is to expand the repertoire of viral KOs with known functions, but the increase of such KOs will be limited. The other is to enable ortholog-based virus comparison and analysis by integrating VOCs in the KEGG tools.

2 OVERVIEW OF KEGG

KEGG (https://www.kegg.jp) is a collection of manually curated databases for various biological objects summarized in Table 1. Each object (database entry) has a unique identifier called the KEGG identifier (kid) and can be retrieved by appending /entry/kid to the URL (Kanehisa et al., 2023). KEGG is a computer representation of biological information systems, consisting of molecular networks in the systems information category, molecular building blocks in the genomic and chemical information categories, and perturbed molecular networks in the health information category (Table 1). The basic architecture of the KEGG database is illustrated in Figure 1 together with the concept of KEGG mapping. The KO database plays the central role of linking genes and molecules to molecular networks. The molecular network objects of KEGG pathway maps and BRITE hierarchies, as well as KEGG modules, are created with KO identifiers (K numbers) as network nodes. Once genes in the genome are assigned K numbers by the KEGG annotation procedure, they can be mapped to KEGG molecular networks to generate organism-specific versions, uncovering metabolic and other features hidden in the genome.
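The retrieval rule can be illustrated with a minimal Python sketch. The function name `kegg_entry_url` is hypothetical; only the base URL and the /entry/ path are taken from the text, and the check on whitespace is an added assumption rather than a documented KEGG rule.

```python
KEGG_BASE = "https://www.kegg.jp"  # base URL given in the text


def kegg_entry_url(kid: str) -> str:
    """Return the entry page URL for a single KEGG identifier (kid)."""
    kid = kid.strip()
    # Assumption: a kid never contains internal whitespace.
    if not kid or any(ch.isspace() for ch in kid):
        raise ValueError(f"not a single KEGG identifier: {kid!r}")
    return f"{KEGG_BASE}/entry/{kid}"


print(kegg_entry_url("vg:1486428"))
```

The same helper applies to any identifier type listed in Table 1, for example a K number or a vg entry.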

TABLE 1. KEGG database contents.
Category              Biological objects              Database  KEGG identifier
Systems information   KEGG pathway maps               PATHWAY   map number
                      BRITE hierarchies and tables    BRITE     br/ko number
                      KEGG modules                    MODULE    M number
                      Reaction modules                          RM number
Genomic information   KO functional orthologs         KO        K number
                      Genes and proteins              GENES     <org>:<entry>, [ag|vg|vp]:<entry>
                      Cellular organisms and viruses  GENOME    T number, gn:<org>, gn:<vtax>
Chemical information  Small molecules                 COMPOUND  C number
                      Glycans                         GLYCAN    G number
                      Biochemical reactions           REACTION  R number
                      Reaction class                            RC number
                      Enzyme nomenclature             ENZYME    ec:<entry>
Health information    Network variation maps          NETWORK   nt number
                      Network elements                          N number
                      Human gene variants             VARIANT   hsa_var:<entry>
                      Human diseases                  DISEASE   H number
                      Drugs                           DRUG      D number
                      Drug groups                               DG number
Note: <entry>, entry identifier; <org>, KEGG organism code; <vtax>, virus taxonomy ID.
FIGURE 1. In a simplified view, the KEGG database consists of molecular networks (PATHWAY, BRITE), molecular building blocks (GENES, COMPOUND, GLYCAN), and the linkage between them (KO). Cellular and organism-level functions can be revealed from the genome, metagenome, and metabolome through KEGG mapping against the KEGG database.

There are other types of KEGG mapping. For metabolome data, chemical substances identified by mass spectrometry, for example, may be given C number identifiers of the COMPOUND database and mapped to KEGG metabolic pathways. For metagenome data, taxonomy mapping may be applied to uncover the organism groups and virus groups involved. Table 2 summarizes protein analysis tools in KEGG, including KEGG mapping tools (Kanehisa et al., 2022; Kanehisa and Sato, 2020), automatic KO assignment servers (Kanehisa et al., 2016), and KEGG Genome Browser for synteny analysis (Kanehisa et al., 2023).

TABLE 2. Protein analysis tools in KEGG.
Page             URL                                    Tool
KEGG Virus       www.kegg.jp/kegg/genome/virus.html     Interface to virus data in KEGG including VOC search tool
KEGG Annotation  www.kegg.jp/kegg/annotation/           Introduction to KO annotation and ortholog table and module table tools
KEGG Taxonomy    www.kegg.jp/kegg/genome/taxonomy.html  Taxonomy mapping using Brite hierarchy viewer
KEGG Synteny     www.kegg.jp/kegg/genome/synteny.html   Conserved gene order (synteny) analysis using KEGG genome browser
KEGG Mapper      www.kegg.jp/kegg/mapper/               A suite of KEGG mapping tools
BlastKOALA       www.kegg.jp/blastkoala/                Automatic KO assignment by BLAST search
GhostKOALA       www.kegg.jp/ghostkoala/                Automatic KO assignment by GHOSTX search

3 VIRUS DATA IN KEGG

Virus data are part of many databases in KEGG. Viral genomes and gene sets are taken from the NCBI RefSeq database (O'Leary et al., 2016) together with virus taxonomy data from the NCBI Taxonomy database (Schoch et al., 2020), which is based on the ICTV (International Committee on Taxonomy of Viruses) classification system (Lefkowitz et al., 2018). All viral genes are stored in the vg category of the GENES database without distinguishing viral genomes, and each is identified by the NCBI gene ID, such as vg:43740568 for SARS-CoV-2 spike protein. Certain mature peptides processed from gene products are defined for use in KEGG pathway maps and stored in the vp category of the GENES database, such as vp:43740568-1 for SARS-CoV-2 spike protein S1 peptide. In addition, individual viral genomes can now be distinguished by the NCBI taxonomy ID, called the vtax identifier of the GENOME database, such as gn:2697049 for SARS-CoV-2. Virus taxonomy in KEGG uses the NCBI (ICTV) taxonomy supplemented with the Baltimore classes (Baltimore, 1971; Koonin et al., 2021) to distinguish virus genome types, and is stored as BRITE hierarchy files, such as br08621 for virus taxonomy with fixed levels of taxonomic ranks.
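The three identifier forms quoted above can be told apart with a small parser. The following Python sketch is illustrative only: the names `VirusId` and `parse_virus_id` are hypothetical, the pattern is inferred from the three examples in the text, and it assumes that a numeric suffix after a hyphen appears on vp entries and nowhere else. It does not handle gn:<org> identifiers of cellular organisms.

```python
import re
from typing import NamedTuple, Optional


class VirusId(NamedTuple):
    category: str           # "vg", "vp", or "gn"
    ncbi_id: int            # NCBI gene ID (vg, vp) or NCBI taxonomy ID (gn)
    peptide: Optional[int]  # peptide index for vp entries, otherwise None


# Assumed pattern, inferred from vg:43740568, vp:43740568-1 and gn:2697049.
_PATTERN = re.compile(r"^(vg|vp|gn):(\d+)(?:-(\d+))?$")


def parse_virus_id(kid: str) -> VirusId:
    """Split a virus-related KEGG identifier into its parts."""
    match = _PATTERN.match(kid)
    if match is None:
        raise ValueError(f"unrecognized virus identifier: {kid!r}")
    category, ncbi_id, peptide = match.groups()
    # Assumption: only vp entries carry the "-n" peptide suffix.
    if (category == "vp") != (peptide is not None):
        raise ValueError(f"unexpected peptide suffix usage in {kid!r}")
    return VirusId(category, int(ncbi_id), int(peptide) if peptide else None)


for kid in ("vg:43740568", "vp:43740568-1", "gn:2697049"):
    print(parse_virus_id(kid))
```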

Other virus related data include functional orthologs of viral proteins in the KO database, viral KO classifications in the BRITE database, KEGG pathway maps for virus infections in the PATHWAY database, virus perturbations of human signaling networks in the NETWORK database, viral infectious diseases in the DISEASE database, and antiviral drugs in the DRUG database. The KEGG Virus page (Table 2) is an interface to these different types of virus data in KEGG.

4 VIRUS ORTHOLOG CLUSTER

To supplement KOs, a method has been developed to computationally generate virus ortholog clusters using the same resource already established for KO annotation, namely the SSDB database. For each sequence in the GENES database, an organism-based list of similarity neighbors is generated from SSDB and displayed in tabular form, called the GFIT table. It shows the best hit sequence in each matching organism and can be viewed from the GFIT button in the GENES entry page. Using the collection of GFIT tables for viral proteins, VOCs are generated by the following simple procedure.
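  1. The measure of similarity between two sequences is defined by a modified identity score, identity × min(1, 2 × overlap/(aalen1 + aalen2)), where identity is the identity score of the aligned (overlap) region given by SSEARCH, overlap is the length of that region, and aalen1 and aalen2 are the amino acid lengths of the two sequences. The factor reduces the identity score when large parts of the sequences are not aligned.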
  2. For a given protein the GFIT table is used to collect similar proteins above a modified identity threshold.
  3. For each of the similar proteins the GFIT table is used to collect additional similar proteins. This process is repeated until no addition is made.

In practice, the GFIT tables are processed in order of decreasing table size and similar tables are merged, which effectively performs single-linkage clustering.
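The procedure above can be sketched in Python as single-linkage clustering with a union-find structure. This is a toy illustration, not the KEGG implementation: the function names and the flat list of pairwise hits (standing in for GFIT tables) are hypothetical, the score assumes that the SSEARCH identity is multiplied by the length factor min(1, 2 × overlap/(aalen1 + aalen2)), and dropping single-member clusters is an assumption.

```python
from collections import defaultdict


def modified_identity(identity, overlap, aalen1, aalen2):
    """Scale the identity of the aligned region by how much of both sequences is aligned."""
    coverage = min(1.0, 2.0 * overlap / (aalen1 + aalen2))
    return identity * coverage


def single_linkage(proteins, hits, threshold):
    """Cluster proteins; hits are (p, q, identity, overlap, len_p, len_q) tuples."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving keeps trees shallow
            p = parent[p]
        return p

    for p, q, identity, overlap, len_p, len_q in hits:
        if modified_identity(identity, overlap, len_p, len_q) >= threshold:
            parent[find(p)] = find(q)  # one link above threshold joins two clusters

    clusters = defaultdict(set)
    for p in proteins:
        clusters[find(p)].add(p)
    # Assumption: proteins left alone are not counted as a cluster.
    return [members for members in clusters.values() if len(members) > 1]


# Hypothetical toy data: identity as a fraction, lengths in residues.
proteins = ["a", "b", "c", "d"]
hits = [
    ("a", "b", 0.60, 100, 100, 100),  # score 0.60
    ("b", "c", 0.80, 50, 100, 100),   # score 0.40, only half of each sequence aligned
    ("c", "d", 0.20, 100, 100, 100),  # score 0.20
]
for threshold in (0.30, 0.50):
    print(threshold, sorted(sorted(c) for c in single_linkage(proteins, hits, threshold)))
```

In the toy data, "a" and "c" are never compared directly but end up in the same 30% cluster through "b", which is the chaining behavior characteristic of single-linkage clustering.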

Table 3a shows the clustering results with modified identity thresholds of 30%, 50%, and 70% for the vg dataset generated from RefSeq release 218 of May 2023. Each VOC is given a six-digit numeric identifier starting with 3, 5, or 7 for the threshold of 30%, 50%, or 70%, respectively. This number is not a stable identifier and will change every time a new RefSeq release is processed. Table 3a also shows the fraction of viral protein clusters that share similarity with proteins in cellular organisms. When the similarity threshold is 30%, roughly one third of viral protein clusters are shared with cellular organisms. When the clusters are split into the seven Baltimore classes (Table 3b), the highest and lowest fractions of organism-shared clusters are found in ssDNA and dsRNA viruses, respectively.
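The numbering convention can be read back with a few lines of Python. This is a sketch under the stated convention only: the function name `voc_threshold` and the example numbers are hypothetical, and because VOC numbers are not stable, such a helper is meaningful only within a single release.

```python
THRESHOLD_BY_PREFIX = {"3": 30, "5": 50, "7": 70}  # leading digit -> threshold in percent


def voc_threshold(voc_id: str) -> int:
    """Return the modified identity threshold (percent) encoded in a VOC number."""
    if len(voc_id) != 6 or not (voc_id.isascii() and voc_id.isdigit()):
        raise ValueError(f"expected a six-digit number, got {voc_id!r}")
    try:
        return THRESHOLD_BY_PREFIX[voc_id[0]]
    except KeyError:
        raise ValueError(f"leading digit of {voc_id!r} is not 3, 5, or 7") from None


print(voc_threshold("300001"))  # hypothetical VOC number
```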

TABLE 3. Statistics of virus ortholog clusters.
(a)
Threshold                                     30%      50%      70%
Number of clusters                            48,127   71,444   81,428
Number of proteins in clusters                581,692  529,394  476,275
Percentage of proteins in clusters            90%      82%      73%
Number of proteins in the largest cluster     40,870   1,716    495
Largest number of GFIT table merges           41       17       10
Number of clusters shared with organisms      15,385   7,536    4,750
Percentage of clusters shared with organisms  32%      11%      6%
(b)
Baltimore class  Proteins  30% clusters  Shared 30% clusters
I dsDNA          632,739   46,007        15,031 (32%)
II ssDNA         6,957     409           187 (45%)
III dsRNA        1,875     294           11 (3%)
IV +ssRNA        8,926     554           34 (6%)
V −ssRNA         3,990     398           30 (7%)
VI ssRNA-RT      338       41            15 (36%)
VII dsDNA-RT     515       42            11 (26%)
Other            5,573     742           216 (29%)
Total            660,913   48,487        15,535 (32%)
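The comparison of Baltimore classes can be recomputed from the cluster counts in Table 3b. The short Python sketch below copies the two cluster columns for the seven classes (the "Other" row is left out because it is not a Baltimore class); the variable and function names are hypothetical.

```python
# (class, number of 30% clusters, clusters shared with cellular organisms) from Table 3b
TABLE_3B = [
    ("I dsDNA", 46007, 15031),
    ("II ssDNA", 409, 187),
    ("III dsRNA", 294, 11),
    ("IV +ssRNA", 554, 34),
    ("V -ssRNA", 398, 30),
    ("VI ssRNA-RT", 41, 15),
    ("VII dsDNA-RT", 42, 11),
]


def shared_fraction(clusters: int, shared: int) -> float:
    """Fraction of clusters that also contain similarity to cellular organisms."""
    return shared / clusters


fractions = {name: shared_fraction(c, s) for name, c, s in TABLE_3B}
highest = max(fractions, key=fractions.get)
lowest = min(fractions, key=fractions.get)
print("highest:", highest)
print("lowest:", lowest)
```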

The VOC dataset can be searched in the KEGG Virus page (Table 2) by keywords including the protein name given by RefSeq, virus name, virus family name, Baltimore class, and K number. The result is a list of cluster numbers, each associated with the ratio of the number of hits to the total number of proteins in the cluster. The list of proteins in each cluster can be viewed by selecting a cluster number. It is still rare that clusters can be associated with KOs, but the need for linking clusters to functional features is apparent. Thus, attempts have been initiated to define tentative KOs from selected VOCs using annotations given by RefSeq and by the original authors who submitted the genome sequences. These KOs are not based on published experimental evidence and are placed in the Unclassified category of the KO system.
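The ratio reported for each cluster in the search result can be illustrated as follows. This Python sketch does not reflect how the KEGG search tool is implemented; the function name `hit_ratios`, the protein labels, and the cluster labels are all hypothetical.

```python
from collections import Counter


def hit_ratios(cluster_of, hits):
    """Return {cluster: (number of hits, cluster size)} for clusters with at least one hit.

    cluster_of maps each protein to its cluster; hits is the set of proteins
    matching a keyword query.
    """
    sizes = Counter(cluster_of.values())
    matched = Counter(cluster_of[p] for p in hits if p in cluster_of)
    return {cluster: (matched[cluster], sizes[cluster]) for cluster in matched}


# Hypothetical toy data: four proteins in two clusters, three of them match the query.
cluster_of = {"p1": "c1", "p2": "c1", "p3": "c1", "p4": "c2"}
hits = {"p1", "p2", "p4"}
for cluster, (n_hits, size) in sorted(hit_ratios(cluster_of, hits).items()):
    print(f"{cluster}: {n_hits}/{size}")
```

A ratio close to one suggests that the keyword describes the cluster as a whole, while a small ratio suggests that the match applies to only a few members.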

5 TAXONOMY MAPPING

The KEGG database uses the NCBI taxonomy for classification of cellular organisms and viruses. There are different versions of taxonomy files (see KEGG Taxonomy in Table 2), all of which are represented as BRITE hierarchy files and can be handled with the BRITE hierarchy viewer. The default taxonomy files are br08611 for cellular organisms and br08621 for viruses. Both consist of fixed levels of taxonomic ranks (phylum, class, order, family, genus, and species), with realm and kingdom added as higher levels for viruses. In addition, the Baltimore classes are associated with the virus taxonomy for viewing genome type differences and the corresponding taxonomic branches.

Taxonomy mapping is a procedure to map KEGG organism codes or NCBI taxonomy IDs to a BRITE taxonomy file using the Join capability of the BRITE hierarchy viewer (Kanehisa et al., 2022). In practice, it has been used to examine taxonomic distributions of KOs and modules as can be seen from the Taxonomy button of the KO and module entry pages or from the query interfaces of KEGG Annotation and KEGG Taxonomy (Table 2). Viral proteins are now classified into VOCs, and taxonomy mapping has been extended accordingly. Each virus gene (vg) entry page is linked through the Voc button to a VOC summary page, which displays all members of the three threshold levels of clusters and also allows taxonomy mapping. An example is shown in Figure 2 for vg:1486428, where taxonomic distributions of VOCs and the assigned KO (K23381) may be compared. The same result can be obtained by entering “vg:1486428 K23381” in the query interface of KEGG Taxonomy (Table 2). Note that VOCs can only be specified indirectly by the vg identifier, because the VOC number identifiers are not stable.
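The idea behind taxonomy mapping, placing a set of taxonomy IDs onto a fixed-rank hierarchy and counting how many fall under each node, can be sketched in Python. This is not the Join operation of the BRITE hierarchy viewer itself; the function name, the numeric IDs, and the rank names below are hypothetical placeholders.

```python
from collections import Counter


def map_to_taxonomy(lineages, tax_ids):
    """Count query IDs under every node of a taxonomy.

    lineages maps a taxonomy ID to its path of rank names (a tuple, from the
    highest rank down); tax_ids are the IDs to be mapped.
    """
    counts = Counter()
    for tax_id in tax_ids:
        path = lineages.get(tax_id)
        if path is None:
            continue  # ID not present in this taxonomy file
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += 1  # credit the node and all of its ancestors
    return counts


# Hypothetical two-level taxonomy with three genomes.
lineages = {
    101: ("RealmA", "FamilyX"),
    102: ("RealmA", "FamilyX"),
    103: ("RealmA", "FamilyY"),
}
counts = map_to_taxonomy(lineages, [101, 102, 103, 999])
for node, n in sorted(counts.items()):
    print(" > ".join(node), n)
```

Comparing such counts for the members of a VOC and for the members of a KO gives the kind of side-by-side taxonomic distribution described for vg:1486428.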

FIGURE 2. An example of virus taxonomy mapping, where taxonomic distributions of three types of VOC clusters and the assigned KO are shown for a viral gene, vg:1486428. This can be reproduced as follows: access https://www.kegg.jp/entry/vg:1486428, click on the Voc button and select the Taxonomy link.

6 KEGG GENOME BROWSER

The KEGG genome browser (Kanehisa et al., 2023) is a new tool for viewing and analyzing chromosomal positions of genes in cellular organisms, especially for identifying conserved gene orders, or conserved synteny, among organisms and organism groups under the taxonomic tree (see KEGG Synteny shown in Table 2). Here the genome is viewed as a sequence of genes identified by KOs, or a sequence of K numbers, and conserved gene orders can be found by aligning sequences of matching K numbers. For a given genome and a given location, multiple genomes can thus be aligned according to the conserved gene orders found.
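Treating a genome as a sequence of ortholog identifiers turns the search for conserved gene orders into a search for shared runs of identical identifiers. The Python sketch below finds maximal contiguous runs shared by two genomes with a simple dynamic programming table. It is a simplified stand-in, not the alignment method of the KEGG genome browser: the function name and the identifiers are hypothetical, gaps are not allowed, and only the same orientation is considered.

```python
def conserved_runs(genome_a, genome_b, min_len=2):
    """Return (start_a, start_b, length) for maximal shared runs of ortholog IDs.

    Genes without an assigned ortholog are given as None and never match.
    """
    n, m = len(genome_a), len(genome_b)
    # length[i][j] = length of the matching run ending at genome_a[i-1], genome_b[j-1]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gene = genome_a[i - 1]
            if gene is not None and gene == genome_b[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1

    runs = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            run = length[i][j]
            extends = i < n and j < m and length[i + 1][j + 1] > 0
            if run >= min_len and not extends:  # report each run once, at its end
                runs.append((i - run, j - run, run))
    return runs


# Hypothetical genomes written as sequences of ortholog labels.
genome_a = ["o1", "o2", "o3", None, "o4"]
genome_b = ["o9", "o1", "o2", "o3", "o4", None]
print(conserved_runs(genome_a, genome_b))
```

Because unassigned genes break every run, a genome in which few genes carry identifiers yields few and short runs, which is why a denser identifier set helps when KO assignments are sparse.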

The KEGG genome browser now allows viral genomes to be treated in the same way, and gene order alignment can be performed using VOCs in addition to KOs. Figure 3 is an example, where the genome of human cytomegalovirus (human betaherpesvirus 5) is examined around the envelope glycoprotein B gene using VOCs at the 30% similarity threshold. VOC sequences capture conserved gene orders far more comprehensively than KO sequences, simply because the KO assignment rate is very low in viruses. Like bacterial genomes, viral genomes are known to contain operon structures. Positional links of genes may suggest functional links of proteins.

FIGURE 3. An example of finding conserved gene orders in virus genomes. Human cytomegalovirus (human betaherpesvirus 5) genome is compared with other genomes around envelope glycoprotein B (vg:3077424) using the 30% VOC cluster. This can be reproduced as follows: access https://www.kegg.jp/genome/10359, locate the gene 3077424 using the search box, select a set of genes and perform gene cluster search. When conserved gene orders are found, the taxonomic tree can be used to select which genomes to display.

7 METAGENOME ANNOTATION

Two servers are available at the KEGG website for automatic KO assignment: BlastKOALA, suitable for annotation of high-quality genomes, and GhostKOALA, suitable for analyzing metagenomes (Kanehisa et al., 2016). Since metagenomes of environmental samples usually contain virus data, the GhostKOALA server is briefly described here. GhostKOALA uses a dataset named the non-redundant set of pangenome sequences, which is generated from the GENES database by removing redundant (highly similar) sequences for each KO within a family for eukaryotes or within a genus for prokaryotes (Kanehisa et al., 2016). Furthermore, sequences without assigned KOs are selected using CD-HIT clusters (Li and Godzik, 2006) and added to the pangenome sequence dataset. For viruses, all RefSeq sequences are used without making a non-redundant set or CD-HIT clusters.

Figure 4 shows the result of GhostKOALA annotation for an environmental sample (T30798) from the Tara Oceans project (Sunagawa et al., 2015). The first pie chart (Figure 4a) is a summary of KO assignments shown in KEGG functional categories. The KO assignment data are linked to KEGG Mapper (Kanehisa et al., 2022) for more detailed analysis. The other pie charts show taxonomic distributions of proteins in the sample for cellular organisms (Figure 4b) and viruses (Figure 4c). The virus pie chart is a new addition and is classified according to the Baltimore class and the virus family. Using the BRITE hierarchy viewer for KEGG virus taxonomy, these families can be associated with host data taken from the virus–host database (Mihara et al., 2016). In this case, the hosts of the two dominant virus families are photosynthetic organisms: Kyanoviridae is associated with cyanobacteria, and Phycodnaviridae is associated with green algae and the protist groups Haptophyta and Stramenopiles.

FIGURE 4. An example of using the automatic KO assignment server GhostKOALA (https://www.kegg.jp/ghostkoala/) for a marine environmental sample (KEGG identifier T30798). The result page contains three pie charts: (a) summary of KO assignments with color coding of KEGG functional categories, (b) taxonomic distributions of proteins in the sample and (c) taxonomic distributions of viral proteins in the sample.

8 CONCLUDING REMARK

The procedure to define KOs generally involves three steps. First, experimentally characterized proteins are identified. Second, they are used as seed sequences to manually define KO groups with appropriate, and thus different, threshold levels of sequence similarity. Third, the groups are computationally expanded to cover the entire set of KEGG organisms. To link VOCs to functional features, tentative KOs are being defined for viral proteins by relaxing the condition of the first step. Instead of relying on published experimental evidence, annotations given by RefSeq and the original authors are used as functional information. The second and third steps remain the same. Appropriate threshold levels in the second step are usually determined by the specificity of functions depicted in the pathway maps, but for the tentative KOs the taxonomic grouping of viruses is used. The daily updated statistics of virus KO assignments are shown in the KEGG Virus page.

AUTHOR CONTRIBUTIONS

Zhao Jin: Data curation (lead); resources (supporting). Yoko Sato: Software (equal); resources (supporting). Masayuki Kawashima: Software (equal); resources (supporting). Minoru Kanehisa: Conceptualization (lead); project administration (lead); resources (lead); writing - original draft preparation (lead).

ACKNOWLEDGMENTS

The KEGG project is partially supported by the NBDC Database Integration Coordination Program JPMJND2203 of the Japan Science and Technology Agency. Computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University.

CONFLICT OF INTEREST STATEMENT

None.
