The KEGG database and analysis tools (https://www.kegg.jp) have been developed mostly for understanding genes and genomes of cellular organisms. The KO (KEGG Orthology) dataset, which is a collection of functional orthologs, plays the role of linking genes in the genome to pathways and other molecular networks, enabling KEGG mapping to uncover hidden features in the genome. Although viruses have been part of KEGG for some time, they were not fully integrated in the KEGG analysis tools, because the KO assignment rate is very low for virus genes. To supplement KOs, a new dataset named virus ortholog clusters (VOCs) is computationally generated, covering 90% of viral proteins in KEGG. VOCs can be used, in place of KOs, for taxonomy mapping to uncover relationships between sequence similarity groups and taxonomic groups, and for identifying conserved gene orders in virus genomes. Furthermore, selected VOCs are used to define tentative KOs for characterizing protein functions. Here an overview of KEGG tools is presented, focusing on these extensions for viral protein analysis.

1 INTRODUCTION

The KEGG (Kyoto Encyclopedia of Genes and Genomes) database (Kanehisa et al., 2023) has been developed as a computer representation of biological information systems in the cell and the organism, which is manually created from published experimental data and represented in terms of molecular networks and molecular building blocks. The KEGG database has been used with KEGG mapping tools for uncovering hidden features in biological data (Kanehisa et al., 2022; Kanehisa and Sato, 2020), such as for reconstructing metabolic pathways from genome sequence data. The KEGG database is a generic database that can be applied to any cellular organism, for it is based on functional orthologs called KO (KEGG Orthology) groups, each of which represents manually defined gene/protein function by generalizing experimental knowledge in specific organisms to other organisms.

The KEGG database has also been used for understanding human diseases and drugs with the concept of perturbations. For example, diseases are associated with perturbed molecular networks caused by human gene variants, viruses, other pathogens, and environmental factors. Drugs are treated as different types of perturbants affecting perturbed molecular networks. In view of the increasing importance of viruses, efforts have been initiated to make KEGG a more useful resource for exploring viral proteins and viral perturbations. Unfortunately, due to the lack of experimental evidence, less than 5% of viral proteins are currently associated with KOs, in comparison to over 50% of proteins in cellular organisms. We have thus developed a new dataset named virus ortholog clusters (VOCs) for the purpose of supplementing KOs.

Various methods for inferring orthologs have been reported and the resulting ortholog databases are made available (Nevers et al., 2022). However, there are only a small number of databases that contain viral proteins, most of which are not frequently updated (Andrade-Martínez et al., 2022). The KEGG GENES database currently contains 50 million genes including 700 thousand viral genes, from which the SSDB database is generated by SSEARCH computation (Pearson, 1998) of sequence similarity scores and best hit relations for all genome pairs. KOs are defined manually and expanded computationally using SSDB, which may be viewed as a huge graph of sequence similarity relations, while VOCs are computationally generated from a subgraph consisting of only viral proteins.

We report here extensions of the KEGG mapping tools published in the Tools 2022 Issue of Protein Science (Kanehisa et al., 2022) and the KEGG genome browser (Kanehisa et al., 2023) released in January 2022, focusing on the analysis of viral proteins. VOCs are used in two ways. One is to expand the repertoire of viral KOs with known functions, but the increase of such KOs will be limited. The other is to enable ortholog-based virus comparison and analysis by integrating VOCs in the KEGG tools.

2 OVERVIEW OF KEGG

KEGG (https://www.kegg.jp) is a collection of manually curated databases for various biological objects summarized in Table 1. Each object (database entry) has a unique identifier called the KEGG identifier (kid) and can be retrieved by appending /entry/kid to the URL (Kanehisa et al., 2023). KEGG is a computer representation of biological information systems, consisting of molecular networks in the systems information category, molecular building blocks in the genomic and chemical information categories, and perturbed molecular networks in the health information category (Table 1). The basic architecture of the KEGG database is illustrated in Figure 1 together with the concept of KEGG mapping. The KO database plays the central role of linking genes and molecules to molecular networks. The molecular network objects of KEGG pathway maps and BRITE hierarchies, as well as KEGG modules, are created with KO identifiers (K numbers) as network nodes. Once genes in the genome are assigned K numbers by the KEGG annotation procedure, they can be mapped to KEGG molecular networks to generate organism-specific versions, uncovering metabolic and other features hidden in the genome.
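As a minimal sketch of the retrieval convention described above, the following Python helper builds an entry page URL by appending /entry/ and the KEGG identifier to the site address. The function name entry_url is hypothetical and chosen only for illustration; the example identifiers are ones quoted elsewhere in this article, and no network request is made.

```python
from urllib.parse import quote

KEGG_BASE = "https://www.kegg.jp"  # site address given in the text


def entry_url(kid: str) -> str:
    """Return the entry page URL for a KEGG identifier (kid)."""
    kid = kid.strip()
    if not kid:
        raise ValueError("empty KEGG identifier")
    # Keep ':' unescaped so prefixed identifiers such as vg:43740568 stay readable.
    return f"{KEGG_BASE}/entry/{quote(kid, safe=':')}"


print(entry_url("K23381"))
print(entry_url("vg:43740568"))
```

The same pattern should apply to any identifier type listed in Table 1, since the text describes a single /entry/kid convention for all database entries.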

TABLE 1. KEGG database contents.

Category	Biological objects	Database	KEGG identifier
Systems information	KEGG pathway maps	PATHWAY	map number
	BRITE hierarchies and tables	BRITE	br/ko number
	KEGG modules	MODULE	M number
	Reaction modules	MODULE	RM number
Genomic information	KO functional orthologs	KO	K number
	Genes and proteins	GENES	<org>:<entry>, [ag\|vg\|vp]:<entry>
	Cellular organisms and viruses	GENOME	T number, gn:<org>, gn:<vtax>
Chemical information	Small molecules	COMPOUND	C number
	Glycans	GLYCAN	G number
	Biochemical reactions	REACTION	R number
	Reaction class	REACTION	RC number
	Enzyme nomenclature	ENZYME	ec:<entry>
Health information	Network variation maps	NETWORK	nt number
	Network elements	NETWORK	N number
	Human gene variants	VARIANT	hsa_var:<entry>
	Human diseases	DISEASE	H number
	Drugs	DRUG	D number
	Drug groups	DRUG	DG number

Note: <entry>, entry identifier; <org>, KEGG organism code; <vtax>, virus taxonomy ID.

FIGURE 1. In a simplified view, the KEGG database consists of molecular networks (PATHWAY, BRITE), molecular building blocks (GENES, COMPOUND, GLYCAN), and the linkage between them (KO). Cellular and organism-level functions can be revealed from the genome, metagenome, and metabolome through KEGG mapping against the KEGG database.

There are other types of KEGG mapping. For metabolome data, chemical substances identified by mass spectrometry, for example, may be given C number identifiers of the COMPOUND database and mapped to KEGG metabolic pathways. For metagenome data, taxonomy mapping may be applied to uncover the organism groups and virus groups involved. Table 2 summarizes protein analysis tools in KEGG, including KEGG mapping tools (Kanehisa et al., 2022; Kanehisa and Sato, 2020), automatic KO assignment servers (Kanehisa et al., 2016), and the KEGG Genome Browser for synteny analysis (Kanehisa et al., 2023).

TABLE 2. Protein analysis tools in KEGG.

Page	URL	Tool
KEGG Virus	www.kegg.jp/kegg/genome/virus.html	Interface to virus data in KEGG including VOC search tool
KEGG Annotation	www.kegg.jp/kegg/annotation/	Introduction to KO annotation and ortholog table and module table tools
KEGG Taxonomy	www.kegg.jp/kegg/genome/taxonomy.html	Taxonomy mapping using Brite hierarchy viewer
KEGG Synteny	www.kegg.jp/kegg/genome/synteny.html	Conserved gene order (synteny) analysis using KEGG genome browser
KEGG Mapper	www.kegg.jp/kegg/mapper/	A suite of KEGG mapping tools
BlastKOALA	www.kegg.jp/blastkoala/	Automatic KO assignment by BLAST search
GhostKOALA	www.kegg.jp/ghostkoala/	Automatic KO assignment by GHOSTX search

3 VIRUS DATA IN KEGG

Virus data are part of many databases in KEGG. Viral genomes and gene sets are taken from the NCBI RefSeq database (O'Leary et al., 2016) together with virus taxonomy data from the NCBI Taxonomy database (Schoch et al., 2020), which is based on the ICTV (International Committee on Taxonomy of Viruses) classification system (Lefkowitz et al., 2018). All viral genes are stored in the vg category of the GENES database without distinguishing viral genomes, and each is identified by its NCBI gene ID, such as vg:43740568 for the SARS-CoV-2 spike protein. Certain mature peptides processed from gene products are defined for use in KEGG pathway maps and stored in the vp category of the GENES database, such as vp:43740568-1 for the SARS-CoV-2 spike protein S1 peptide. In addition, individual viral genomes can now be distinguished by the NCBI taxonomy ID, called the vtax identifier in the GENOME database, such as gn:2697049 for SARS-CoV-2. Virus taxonomy in KEGG uses the NCBI (ICTV) taxonomy supplemented with the Baltimore classes (Baltimore, 1971; Koonin et al., 2021) to distinguish virus genome types, and is stored as BRITE hierarchy files, such as br08621 for virus taxonomy with fixed levels of taxonomic ranks.

Other virus related data include functional orthologs of viral proteins in the KO database, viral KO classifications in the BRITE database, KEGG pathway maps for virus infections in the PATHWAY database, virus perturbations of human signaling networks in the NETWORK database, viral infectious diseases in the DISEASE database, and antiviral drugs in the DRUG database. The KEGG Virus page (Table 2) is an interface to these different types of virus data in KEGG.

4 VIRUS ORTHOLOG CLUSTER

In order to supplement KOs, a method has been developed to computationally generate virus ortholog clusters using the same resource, namely, the SSDB database, already established for KO annotation. For each sequence in the GENES database, an organism-based list of similarity neighbors is generated from SSDB and displayed in tabular form, called the GFIT table. It shows the best hit sequence in each matching organism and can be viewed from the GFIT button in the GENES entry page. Using the collection of GFIT tables for viral proteins, VOCs are generated by the following simple procedure.

(1) The measure of similarity between two sequences is defined by a modified identity score, identity × min(1, overlap × 2/(aalen1 + aalen2)), where the identity score of the aligned (overlap) region given by SSEARCH is reduced by considering the non-aligned regions.
(2) For a given protein, the GFIT table is used to collect similar proteins above a modified identity threshold.
(3) For each of the similar proteins, the GFIT table is used to collect additional similar proteins. This process is repeated until no addition is made.

In practice, the GFIT tables are processed in the order of the decreasing table size to merge similar tables, effectively performing single-linkage clustering.
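The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the KEGG implementation: it assumes the modified identity is the SSEARCH identity multiplied by the length factor min(1, overlap × 2/(aalen1 + aalen2)), it treats a score equal to the threshold as passing, and it replaces the GFIT table merging with a union-find pass over pairwise hits, which yields the same kind of single-linkage grouping. The function names, the tuple layout of a hit, and the toy proteins vg:A to vg:D are all hypothetical.

```python
from collections import defaultdict


def modified_identity(identity, overlap, aalen1, aalen2):
    """Identity of the aligned region, scaled down by the unaligned fraction."""
    return identity * min(1.0, overlap * 2 / (aalen1 + aalen2))


def single_linkage(hits, threshold):
    """Group proteins connected by hits scoring at or above the threshold.

    hits: iterable of (protein_a, protein_b, identity, overlap, aalen_a, aalen_b).
    Returns sorted clusters (each a sorted list) with at least two members.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, identity, overlap, len_a, len_b in hits:
        if modified_identity(identity, overlap, len_a, len_b) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = defaultdict(list)
    for protein in list(parent):
        groups[find(protein)].append(protein)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Toy hits (invented): identity as a fraction, lengths in amino acids.
toy_hits = [
    ("vg:A", "vg:B", 0.80, 90, 100, 100),   # 0.80 * 0.90 = 0.72
    ("vg:B", "vg:C", 0.40, 100, 100, 100),  # 0.40 * 1.00 = 0.40
    ("vg:C", "vg:D", 0.90, 50, 100, 300),   # 0.90 * 0.25 = 0.225
]

for threshold in (0.30, 0.50, 0.70):
    print(threshold, single_linkage(toy_hits, threshold))
```

In this toy input, vg:A and vg:C end up in the same 30% cluster only through vg:B, which is the chaining behavior characteristic of single-linkage clustering. The short alignment between vg:C and vg:D is penalized by the length factor despite its high identity.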

Table 3a shows the result of clustering with modified identity thresholds of 30%, 50%, and 70% for the vg dataset generated from RefSeq release 218 of May 2023. Each VOC is given a six-digit numeric identifier starting with 3, 5, or 7 for the threshold of 30%, 50%, or 70%, respectively. This number is not a stable identifier and will change every time a new RefSeq release is processed. Table 3a also shows the fraction of viral protein clusters that share similarity with proteins in cellular organisms. When the similarity threshold is 30%, roughly one third of viral protein clusters are shared with cellular organisms. When the clusters are split into the seven Baltimore classes (Table 3b), the highest and lowest fractions of clusters shared with organisms are found in ssDNA and dsRNA viruses, respectively.

TABLE 3. Statistics of virus ortholog clusters.

(a)
Threshold	30%	50%	70%
Number of clusters	48,127	71,444	81,428
Number of proteins in clusters	581,692	529,394	476,275
Percentage of proteins in clusters	90%	82%	73%
Number of proteins in the largest cluster	40,870	1716	495
Largest number of GFIT table merges	41	17	10
Number of clusters shared with organisms	15,385	7536	4750
Percentage of clusters shared with organisms	32%	11%	6%

(b)
Baltimore class	Genome type	Proteins	30% clusters	Shared 30% clusters	Shared (%)
I	dsDNA	632,739	46,007	15,031	(32%)
II	ssDNA	6957	409	187	(45%)
III	dsRNA	1875	294	11	(3%)
IV	+ssRNA	8926	554	34	(6%)
V	−ssRNA	3990	398	30	(7%)
VI	ssRNA-RT	338	41	15	(36%)
VII	dsDNA-RT	515	42	11	(26%)
–	Other	5573	742	216	(29%)
	Total	660,913	48,487	15,535	(32%)

The VOC dataset can be searched by keywords including the protein name given by RefSeq, virus name, virus family name, Baltimore class, and K number in the KEGG Virus page (Table 2). The result is a list of cluster numbers, each associated with the ratio of the number of hits to the total number of proteins in the cluster. The list of proteins in each cluster can be viewed by selecting a cluster number. It is still rare that clusters can be associated with KOs, but the need for linking clusters to some functional feature is apparent. Thus, attempts have been initiated to define tentative KOs from selected VOCs using annotations given by RefSeq and by the original authors who submitted the genome sequences. These KOs are not based on published experimental evidence and are placed in the Unclassified category of the KO system.

5 TAXONOMY MAPPING

The KEGG database uses the NCBI taxonomy for classification of cellular organisms and viruses. There are different versions of taxonomy files (see KEGG Taxonomy shown in Table 2), which are all represented by BRITE hierarchy files and can be handled with the BRITE hierarchy viewer. Default taxonomy files are br08611 for cellular organisms and br08621 for viruses. Both consist of fixed levels of taxonomic ranks (phylum, class, order, family, genus, and species), with realm and kingdom added as higher levels in viruses. In addition, the Baltimore classes are associated with the virus taxonomy for viewing genome type differences and the corresponding taxonomic branches.

Taxonomy mapping is a procedure to map KEGG organism codes or NCBI taxonomy IDs to a BRITE taxonomy file using the Join capability of the BRITE hierarchy viewer (Kanehisa et al., 2022). In practice, it has been used to examine taxonomic distributions of KOs and modules as can be seen from the Taxonomy button of the KO and module entry pages or from the query interfaces of KEGG Annotation and KEGG Taxonomy (Table 2). Viral proteins are now classified into VOCs, and taxonomy mapping has been extended accordingly. Each virus gene (vg) entry page is linked through the Voc button to a VOC summary page, which displays all members of the three threshold levels of clusters and also allows taxonomy mapping. An example is shown in Figure 2 for vg:1486428, where taxonomic distributions of VOCs and the assigned KO (K23381) may be compared. The same result can be obtained by entering “vg:1486428 K23381” in the query interface of KEGG Taxonomy (Table 2). Note that VOCs can only be specified indirectly by the vg identifier, because the VOC number identifiers are not stable.
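The idea of taxonomy mapping as a join can be illustrated with a short Python sketch. It is not the BRITE hierarchy viewer and does not read BRITE files: the taxonomy is a toy dictionary, the taxonomy IDs t1 to t9 and the family and genus names are invented, and the function name join_to_taxonomy is hypothetical. The sketch only shows how a list of identifiers, for example the viruses whose proteins belong to one cluster, can be counted per taxon at a chosen rank.

```python
from collections import Counter

# Toy stand-in for a taxonomy file: taxonomy ID -> (family, genus).
# All IDs and names below are invented for illustration.
TOY_TAXONOMY = {
    "t1": ("FamilyX", "GenusX1"),
    "t2": ("FamilyX", "GenusX2"),
    "t3": ("FamilyY", "GenusY1"),
}


def join_to_taxonomy(tax_ids, taxonomy, rank=0):
    """Count query IDs per taxon at the chosen rank (0 = family, 1 = genus).

    IDs absent from the taxonomy are returned separately rather than dropped.
    """
    counts = Counter()
    unmapped = []
    for tax_id in tax_ids:
        lineage = taxonomy.get(tax_id)
        if lineage is None:
            unmapped.append(tax_id)
        else:
            counts[lineage[rank]] += 1
    return counts, unmapped


# One ID per cluster member; repeated IDs mean several proteins from one virus.
members = ["t1", "t2", "t2", "t3", "t9"]
family_counts, unmapped = join_to_taxonomy(members, TOY_TAXONOMY)
print(dict(family_counts), unmapped)
```

Comparing such counts for a cluster and for a KO gives the kind of side-by-side taxonomic distribution that the text describes for vg:1486428 and K23381.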

6 KEGG GENOME BROWSER

The KEGG genome browser (Kanehisa et al., 2023) is a new tool for viewing and analyzing chromosomal positions of genes in cellular organisms, especially for identifying conserved gene orders, or conserved synteny, among organisms and organism groups under the taxonomic tree (see KEGG Synteny shown in Table 2). Here the genome is viewed as a sequence of genes identified by KOs, or a sequence of K numbers, and conserved gene orders can be found by aligning sequences of matching K numbers. For a given genome and a given location, multiple genomes can thus be aligned according to the conserved gene orders found.
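Treating a genome as a sequence of ortholog identifiers makes conserved gene orders a string-matching problem. The sketch below finds maximal runs of identical identifiers that occur in the same order in two genomes, using a simple dynamic programming table. The alignment method used by the KEGG genome browser is not described here, so this is only one plausible illustration. The function name conserved_blocks, the min_len parameter, and the labels Ka to Ky are hypothetical placeholders, not real K numbers.

```python
def conserved_blocks(genome_a, genome_b, min_len=2):
    """Return maximal runs of identical ortholog IDs shared in the same order.

    genome_a, genome_b: lists of ortholog IDs in chromosomal order; None marks
    a gene without an assignment and never matches.
    Each result is (start_a, start_b, length) with 0-based start positions.
    """
    n, m = len(genome_a), len(genome_b)
    # run[i][j] = length of the matching run ending at genome_a[i-1], genome_b[j-1]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = genome_a[i - 1], genome_b[j - 1]
            if a is not None and a == b:
                run[i][j] = run[i - 1][j - 1] + 1

    blocks = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            length = run[i][j]
            # A run is maximal when the next gene pair does not continue it.
            extends = i < n and j < m and run[i + 1][j + 1] > 0
            if length >= min_len and not extends:
                blocks.append((i - length, j - length, length))
    return blocks


# Placeholder labels, not real K numbers; None is an unassigned gene.
genome_1 = ["Ka", "Kb", "Kc", None, "Kd", "Ke"]
genome_2 = ["Kx", "Ka", "Kb", "Kc", "Ky", "Kd", "Ke"]
print(conserved_blocks(genome_1, genome_2))
```

The sketch handles only genes on the same relative orientation and does not tolerate gaps or inversions. It also shows why unassigned genes matter: every None breaks a run, so a low assignment rate fragments the detectable conserved blocks.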

The KEGG genome browser now allows viral genomes to be treated in the same way, with an extension to perform the gene order alignment using VOCs in addition to KOs. Figure 3 is an example, where the genome of human cytomegalovirus (human betaherpesvirus 5) is examined around the envelope glycoprotein B gene using VOCs at the 30% similarity threshold. It is apparent that VOC sequences can capture conserved gene orders far more comprehensively than KO sequences, simply because the KO assignment rate is very low in viruses. Like bacterial genomes, viral genomes are known to contain operon structures. Positional links of genes may suggest functional links of proteins.

7 METAGENOME ANNOTATION

There are two servers at the KEGG website available for automatic KO assignment: BlastKOALA suitable for annotation of high-quality genomes and GhostKOALA suitable for analyzing metagenomes (Kanehisa et al., 2016). Since metagenomes of environmental samples usually contain virus data, the GhostKOALA server is briefly described here. GhostKOALA uses the dataset named non-redundant set of pangenome sequences, which is generated from the GENES database by removing redundant (highly similar) sequences for each KO within a family for eukaryotes or within a genus for prokaryotes (Kanehisa et al., 2016). Furthermore, sequences without assigned KOs are selected using CD-hit clusters (Li and Godzik, 2006) and added to the pangenome sequence dataset. For viruses, all RefSeq sequences are used without making a non-redundant set or CD-hit clusters.

Figure 4 shows the result of GhostKOALA annotation for an environmental sample (T30798) from the Tara Oceans project (Sunagawa et al., 2015). The first pie chart (Figure 4a) is a summary of KO assignment shown in KEGG functional categories. The KO assignment data are linked to KEGG Mapper (Kanehisa et al., 2022) for more detailed analysis. The other pie charts show taxonomic distributions of proteins in the sample for cellular organisms (Figure 4b) and viruses (Figure 4c). The virus pie chart is a new addition and is classified according to the Baltimore class and the virus family. Using the BRITE hierarchy viewer for KEGG virus taxonomy, these families can be associated with host data taken from the virus–host database (Mihara et al., 2016). In this case, the hosts of the two dominant virus families are photosynthetic organisms: Kyanoviridae is associated with cyanobacteria, and Phycodnaviridae is associated with green algae and the protist groups Haptophyta and Stramenopiles.

8 CONCLUDING REMARK

The procedure to define KOs generally involves three steps. First, experimentally characterized proteins are identified. Second, they are used as seed sequences to manually define KO groups with appropriate, and thus different, threshold levels of sequence similarity. Third, the groups are computationally expanded to cover the entire set of KEGG organisms. In order to link VOCs to functional features, tentative KOs are being defined for viral proteins by relaxing the condition of the first step. Instead of relying on published experimental evidence, annotations given by RefSeq and the original authors are used as functional information. The second and third steps remain the same. Appropriate threshold levels in the second step are usually determined by the specificity of functions depicted in the pathway maps, but for the tentative KOs the taxonomic grouping of viruses is used. The daily updated statistics of virus KO assignments are shown in the KEGG Virus page.

AUTHOR CONTRIBUTIONS

Zhao Jin: Data curation (lead); resources (supporting). Yoko Sato: Software (equal); resources (supporting). Masayuki Kawashima: Software (equal); resources (supporting). Minoru Kanehisa: Conceptualization (lead); project administration (lead); resources (lead); writing - original draft preparation (lead).

ACKNOWLEDGMENTS

The KEGG project is partially supported by the NBDC Database Integration Coordination Program JPMJND2203 of the Japan Science and Technology Agency. Computational resources were provided by the Bioinformatics Center, Institute for Chemical Research, Kyoto University.

CONFLICT OF INTEREST STATEMENT

None.

REFERENCES

Andrade-Martínez JS, Camelo Valera LC, Chica Cárdenas LA, Forero-Junco L, López-Leal G, Moreno-Gallego JL, et al. Computational tools for the analysis of uncultivated phage genomes. Microbiol Mol Biol Rev. 2022; 86: e0000421. https://doi.org/10.1128/mmbr.00004-21
Baltimore D. Expression of animal virus genomes. Bacteriol Rev. 1971; 35: 235–241. https://doi.org/10.1128/br.35.3.235-241.1971
Kanehisa M, Furumichi M, Sato Y, Kawashima M, Ishiguro-Watanabe M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023; 51: D587–D592. https://doi.org/10.1093/nar/gkac963
Kanehisa M, Sato Y. KEGG mapper for inferring cellular functions from protein sequences. Protein Sci. 2020; 29: 28–35. https://doi.org/10.1002/pro.3711
Kanehisa M, Sato Y, Kawashima M. KEGG mapping tools for uncovering hidden features in biological data. Protein Sci. 2022; 31: 47–53. https://doi.org/10.1002/pro.4172
Kanehisa M, Sato Y, Morishima K. BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J Mol Biol. 2016; 428: 726–731. https://doi.org/10.1016/j.jmb.2015.11.006
Koonin EV, Krupovic M, Agol VI. The Baltimore classification of viruses 50 years later: how does it stand in the light of virus evolution? Microbiol Mol Biol Rev. 2021; 85: e0005321. https://doi.org/10.1128/MMBR.00053-21
Lefkowitz EJ, Dempsey DM, Hendrickson RC, Orton RJ, Siddell SG, Smith DB. Virus taxonomy: the database of the international committee on taxonomy of viruses (ICTV). Nucleic Acids Res. 2018; 46: D708–D717. https://doi.org/10.1093/nar/gkx932
Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006; 22: 1658–1659. https://doi.org/10.1093/bioinformatics/btl158
Mihara T, Nishimura Y, Shimizu Y, Nishiyama H, Yoshikawa G, Uehara H, et al. Linking virus genomes with host taxonomy. Viruses. 2016; 8: 66. https://doi.org/10.3390/v8030066
Nevers Y, Jones TEM, Jyothi D, Yates B, Ferret M, Portell-Silva L, et al. The quest for Orthologs orthology benchmark service in 2022. Nucleic Acids Res. 2022; 50: W623–W632. https://doi.org/10.1093/nar/gkac330
O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44: D733–D745. https://doi.org/10.1093/nar/gkv1189
Pearson WR. Empirical statistical estimates for sequence similarity searches. J Mol Biol. 1998; 276: 71–84. https://doi.org/10.1006/jmbi.1997.1525
Schoch CL, Ciufo S, Domrachev M, Hotton CL, Kannan S, Khovanskaya R, et al. NCBI taxonomy: a comprehensive update on curation, resources and tools. Database. 2020; 2020: baaa062. https://doi.org/10.1093/database/baaa062
Sunagawa S, Coelho LP, Chaffron S, Kultima JR, Labadie K, Salazar G, et al. Ocean plankton. Structure and function of the global ocean microbiome. Science. 2015; 348: 1261359. https://doi.org/10.1126/science.1261359


Volume 32, Issue 12, December 2023, e4820

This article also appears in:

Tools for Protein Science 2024

KEGG tools for classification and analysis of viral proteins
