The mutation features and geographical distributions of the surface glycoprotein (S gene) in SARS-CoV-2 strains: A comparative analysis of the early and current strains
Abstract
The surface glycoprotein (S protein) of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was used to develop coronavirus disease 2019 (COVID-19) vaccines. However, SARS-CoV-2, especially the S protein, has undergone rapid evolution and mutation, the extent of which remains to be determined. Here, we analyzed and compared the early (12 237) and the current (more than 10 million) SARS-CoV-2 strains to identify the mutation features and geographical distribution of the S gene and S protein. Results showed that in the early strains, most of the loci had a relatively low mutation frequency except S: 23403 (4486 strains), while in the current strains, there was a surge in the number and frequency of mutant strains, with S: 23403 remaining the locus with the most mutant strains, at approximately 1050 times the early count. Furthermore, D614G (S: 23403) was one of the most frequent mutations in the S protein of Omicron as of March 2022, and most of the mutant strains were still from the United States and the United Kingdom. Further analysis demonstrated that in the receptor-binding domain, most of the loci had a low mutation frequency in the early strains, while S: 22995 was the most prevalent locus in the current strains, with 3 122 491 strains. Overall, we compared the mutation features of the S region in SARS-CoV-2 strains between the early and the current strains, providing insight for further studies of COVID-19 vaccines in concert with emerging SARS-CoV-2 variants.
1 INTRODUCTION
The outbreak of coronavirus disease 2019 (COVID-19), caused by a new coronavirus named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2),1 has become a global pandemic and has severely impacted public health systems around the world. As of April 8, 2022, the total number of confirmed cases had exceeded 494 million, including 6 170 283 deaths globally.2 The scale of the humanitarian and economic impact of the pandemic has driven the evaluation of COVID-19 vaccines through novel platforms to accelerate development. Currently, 24 COVID-19 vaccines are authorized and in use.3 As of April 4, 2022, more than 11 billion doses had been administered.2
Evidence has shown that SARS-CoV-2 shares a similar sequence and uses the same cell entry receptor as SARS-CoV.1, 4 In coronavirus infection, the surface spike glycoprotein (S protein) on the virion surface mediates receptor recognition with angiotensin-converting enzyme II (ACE2) and membrane fusion conformation with host cells.5-7 Notably, both S and N proteins were proposed as potential vaccine candidates for immunogenicity and T-cell immune response in Middle East respiratory syndrome coronavirus (MERS-CoV).8, 9 However, only the S protein was demonstrated to induce neutralizing antibodies and T-cell immune responses.9
Since the first COVID-19 vaccine candidate (messenger RNA [mRNA]-1273), a novel lipid nanoparticle-encapsulated mRNA vaccine that encodes the S protein, entered Phase I clinical testing (NCT04283461) on March 16, 2020, 114 COVID-19 vaccine candidates have reached clinical trials, with 75 candidates evaluated in the preclinical stage and 48 in the final phase of testing.10 Most vaccine candidates aimed to induce neutralizing antibodies against the viral S protein,11 which prevented the uptake by the ACE2 receptor. Compared to whole virus vaccines, S protein-based vaccines are better tolerated and relatively safer.10 In the very early stage of vaccine development, the adenovirus type-5 (Ad5)-vectored COVID-19 vaccine (NCT04313127) expressing the full-length spike protein (S protein) showed tolerability and immunogenic T-cell responses in Phase I clinical testing.12 However, because SARS-CoV-2, especially the S protein, has undergone rapid evolution and mutation, several dominant variants have emerged worldwide in less than 2.5 years.13 In the future, COVID-19 might be treated like “flu,” and an efficient vaccine for the whole population is considered one of the most crucial countermeasures; achieving it remains a major concern for researchers.3
The spike mutation D614G was reported in May 2020 and increased in frequency globally; this mutation was correlated with increased viral infection.14 Here we define strains collected up to April 2020 as the early strains. In this study, we comprehensively compare the early and the current prevalent SARS-CoV-2 strains in the 2019 Novel Coronavirus Resource (2019nCoVR) database to identify the spatiotemporal features of the genome mutations of the S gene in these strains over time and across countries. Moreover, we integrated the ACE2-binding region of the S protein to characterize the mutations of the functional region in SARS-CoV-2 strains.
2 MATERIALS AND METHODS
2.1 Genome sequences of early SARS-CoV-2 strains in the database
The genome mutation data of 12 237 early SARS-CoV-2 strains with complete genomes, collected between December 2019 and April 2020, were obtained from the 2019 Novel Coronavirus Resource (2019nCoVR) database (https://bigd.big.ac.cn/ncov).15 The SARS-CoV-2 genome sequences were integrated by the database, which is supported by the National Genomics Data Center of China National Center for Bioinformation/Beijing Institute of Genomics. The genome sequences were from the China National GeneBank DataBase (CNGBdb) (https://db.cngb.org/), GenBank, Genome Warehouse, GISAID (https://www.gisaid.org/), and the National Microbiology Data Center (NMDC) (http://www.nmdc.cn/) (Supporting Information: Table 1).
2.2 Genetic mutations of S proteins in SARS-CoV-2 strains in the database
Genetic mutations of S protein in the early and the current SARS-CoV-2 strains were integrated from the 2019nCoVR data set15 (Supporting Information: Tables 2 and 3). The reference of the SARS-CoV-2 genome was NC_045512 (NCBI: txid2697049). The mutation types were single-nucleotide polymorphism (SNP), deletion, insertion, and indel. One site could contain more than one type of mutation.
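The four mutation types and the note that one site can carry more than one type can be illustrated with a short Python sketch. This is not the 2019nCoVR annotation pipeline: the VCF-style REF/ALT convention, the fallback rule for "indel," and the names `classify_mutation` and `types_per_site` are assumptions made for illustration, and the variants listed are toy examples.

```python
from collections import defaultdict

def classify_mutation(ref: str, alt: str) -> str:
    """Classify a variant from VCF-style REF/ALT alleles (assumed convention)."""
    if ref == alt:
        raise ValueError("REF and ALT are identical; not a mutation")
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"   # ALT is REF with trailing bases removed
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"  # ALT is REF with extra bases appended
    return "indel"          # assumed fallback: any other complex change

def types_per_site(variants):
    """Collect the set of mutation types observed at each genomic position."""
    site_types = defaultdict(set)
    for pos, ref, alt in variants:
        site_types[pos].add(classify_mutation(ref, alt))
    return dict(site_types)

# Toy variants (hypothetical alleles, for illustration only).
toy_variants = [
    (23403, "A", "G"),      # substitution
    (21990, "TTTA", "T"),   # deletion
    (21990, "T", "C"),      # substitution at the same site
]
print(types_per_site(toy_variants))
```

In this sketch, position 21990 ends up with two types, which mirrors the statement that a single site could contain more than one type of mutation.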
2.3 Heatmaps of S gene mutation in SARS-CoV-2 strains
Heatmaps of S gene mutations in SARS-CoV-2 strains were generated from the 2019nCoVR data set with default settings (https://bigd.big.ac.cn/ncov).15 The heatmaps showed the sites with a mutation frequency above 0.0001, 0.1, and 0.5. The color bar ranged from 0 to 1. S gene mutations in SARS-CoV-2 strains from Cambodia, Iran, and Poland were not shown in the heatmaps but are presented in the histogram and line chart (Supporting Information: Figure 1).
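The thresholding behind the three heatmap levels can be sketched in a few lines of Python. This is a minimal illustration rather than the 2019nCoVR implementation: the function name `loci_above` and the per-locus frequencies are hypothetical.

```python
# Frequency cutoffs used for the heatmaps in this section.
THRESHOLDS = (0.0001, 0.1, 0.5)

def loci_above(freq_by_locus, threshold):
    """Return the loci whose mutation frequency exceeds the threshold, sorted by position."""
    return sorted(locus for locus, freq in freq_by_locus.items() if freq > threshold)

# Toy per-country frequencies (hypothetical values, for illustration only).
toy_freqs = {21575: 0.004, 23403: 0.62, 24034: 0.12, 22995: 0.00005}

for threshold in THRESHOLDS:
    print(threshold, loci_above(toy_freqs, threshold))
```

Raising the cutoff from 0.0001 to 0.5 shrinks the set of displayed loci, which is why the higher-threshold heatmaps contain far fewer sites.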
2.4 Amino acid mutations of SARS-CoV-2 strains
Amino acid mutation annotation of the S protein of SARS-CoV-2 strains was based on the NCBI reference sequence NC_045512 (GeneID:43740568, QHD43416.1), including coding sequence variation, frameshift variation, inframe variation, missense variation, stop gained variation, and synonymous variation. All analyses were performed with default settings.
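Because loci are reported as genome coordinates (for example, S: 23403) while amino acid changes are reported as residue numbers (for example, p.614), it can help to see how the two relate. The sketch below assumes the S coding sequence spans positions 21563 to 25384 of NC_045512.2, as in the GenBank annotation; the helper name `s_codon` is hypothetical and not part of any database API.

```python
# Coordinates of the S coding sequence in NC_045512.2 (1-based, inclusive).
S_CDS_START = 21563
S_CDS_END = 25384

def s_codon(genome_pos: int):
    """Map a genome position inside the S gene to (residue number, position within codon)."""
    if not S_CDS_START <= genome_pos <= S_CDS_END:
        raise ValueError(f"position {genome_pos} is outside the S coding sequence")
    offset = genome_pos - S_CDS_START   # 0-based offset into the coding sequence
    return offset // 3 + 1, offset % 3 + 1

for locus in (23403, 22995, 23010):
    residue, codon_pos = s_codon(locus)
    print(f"S: {locus} -> p.{residue} (codon position {codon_pos})")
```

Under this mapping, the loci discussed in this study fall on the residues quoted alongside them: 23403 on p.614, 22995 on p.478, and 23010 on p.483.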
2.5 Mutant positions of S protein in the prevalent variants
The composition of the prevalent lineage in every timeframe was integrated from the 2019nCoVR data set with default settings (https://bigd.big.ac.cn/ncov).15 The heatmaps of different mutant positions of S protein in every prevalent variant were also integrated. Mutations with frequency >0.5 were shown.
2.6 Variation dynamic curve
The variation dynamic curve was based on the genetic mutation of SARS-CoV-2 strains over time and countries. The information regarding SARS-CoV-2 strains was integrated with the 2019nCoVR database,15 including the sample collection date, submitting lab, sample host, and location of the strains. Each curve indicated the genome variation of the mutation site of the S region over time and countries.
2.7 Three-dimensional (3D) structure of the ACE2 binding region in the S protein
A 3D structure of the receptor-binding domain (RBD), which is also the ACE2-binding region, was built with the SWISS-MODEL server16-18 and displayed as a rope model in the NGL viewer.19, 20 The S protein is composed of three chains (A, B, and C), and the structure PDB_ID: 6VSB was used.21 Residues 336 to 516 formed the binding region of the S protein with its human receptor ACE2. Some of the sites with mutant strains were shown in the binding region of the S protein.
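Whether a reported genome locus falls inside the 336 to 516 binding region can be checked with a small helper. This is a sketch under stated assumptions: the S coding sequence is taken to start at position 21563 of NC_045512.2, and the function names are hypothetical.

```python
S_CDS_START = 21563        # first base of the S coding sequence in NC_045512.2
RBD_FIRST_RESIDUE = 336    # binding-region boundaries as defined in this section
RBD_LAST_RESIDUE = 516

def residue_of(genome_pos: int) -> int:
    """Residue number of the codon that contains a genome position in the S gene."""
    return (genome_pos - S_CDS_START) // 3 + 1

def in_binding_region(genome_pos: int) -> bool:
    """True if the position lies in a codon for residues 336 to 516 of the S protein."""
    return RBD_FIRST_RESIDUE <= residue_of(genome_pos) <= RBD_LAST_RESIDUE

for locus in (22995, 23010, 23403):
    print(locus, residue_of(locus), in_binding_region(locus))
```

With these boundaries, loci 22995 and 23010 lie inside the binding region, whereas 23403 (p.614) lies outside it.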
2.8 Time and area frequencies
The time frequency of each mutation site of the S gene in the SARS-CoV-2 strains was calculated by the frequency of strains containing the mutation over each time point. The area frequency of each mutation site of the S gene in the SARS-CoV-2 strains was calculated by the frequency of strains containing the mutation over each country. The frequency and the number of strains were indicated in the graph.
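The two frequencies differ only in how strains are grouped, which the following Python sketch makes explicit. The record layout (`country`, `month`, `mutations`) and the function name `frequency_by_group` are assumptions for illustration; the actual 2019nCoVR schema is not described here, and the strain records are toy data.

```python
from collections import Counter

def frequency_by_group(strains, site, key):
    """Map each group (e.g. month or country) to (mutant strain count, frequency) for one site."""
    totals = Counter()
    mutants = Counter()
    for strain in strains:
        group = strain[key]
        totals[group] += 1
        if site in strain["mutations"]:
            mutants[group] += 1
    return {group: (mutants[group], mutants[group] / totals[group]) for group in totals}

# Toy strain records (hypothetical, for illustration only).
toy_strains = [
    {"country": "A", "month": "2020-03", "mutations": {23403}},
    {"country": "A", "month": "2020-04", "mutations": {23403, 24034}},
    {"country": "A", "month": "2020-04", "mutations": set()},
    {"country": "B", "month": "2020-04", "mutations": set()},
]

time_freq = frequency_by_group(toy_strains, 23403, "month")    # time frequency
area_freq = frequency_by_group(toy_strains, 23403, "country")  # area frequency
print(time_freq)
print(area_freq)
```

Returning the count together with the frequency matters when reading the results, because a high frequency in a group with very few strains carries less weight than a moderate frequency in a large group.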
2.9 Statistics of time variance and area variance
Time variance was calculated by the population frequency of each mutant site over time and was evaluated by the dispersion of the site via calculating the variance of population frequency at each time. Area variance was calculated by the population frequency of each mutation site, with country, province, and city as region units, and was evaluated by the variance dispersion of the site via calculating the variance of population frequency in each region (Figure 4A and Supporting Information: Figure 2).
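As a minimal sketch of this dispersion measure, the variance can be computed directly from the per-time-point or per-region frequencies of one site. The text does not state whether a population or sample variance was used; the population variance is assumed here, and the function name `frequency_variance` and the input values are hypothetical.

```python
from statistics import pvariance

def frequency_variance(frequencies):
    """Population variance of one site's frequencies across time points or regions."""
    values = list(frequencies)
    if not values:
        return 0.0  # assumed convention: no observations means no measurable dispersion
    return pvariance(values)

# Toy frequencies of one site (hypothetical values, for illustration only).
rising_over_time = [0.0, 0.5, 1.0]     # frequency changes strongly over time
uniform_across_regions = [0.4, 0.4]    # same frequency in every region

print(frequency_variance(rising_over_time))
print(frequency_variance(uniform_across_regions))
```

A site whose frequency is identical in every group has zero variance, whereas a site that sweeps from absent to fixed has a large time variance.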
3 RESULTS
3.1 Genome mutation heatmaps of S gene in early SARS-CoV-2 strains by countries
We first analyzed the genome mutations of the S gene from 12 237 early SARS-CoV-2 strains across 60 countries and identified 499 loci (frequency > 0.0001, Figure 1A), 32 loci (frequency > 0.1, Figure 1B), and 6 loci (frequency > 0.5, Supporting Information: Figure 1). Notably, most of the mutation loci showed a low mutation frequency by country, and the mutation loci differed among countries except for locus 23403 (Figure 1A). Among the strains from different countries, 60% of countries contained only one mutant hot locus (frequency > 0.1) and 18% contained more than three mutant hot loci (frequency > 0.1; Figure 1B).

Although countries with more than 1000 strains, such as Australia, China, the United Kingdom, and the United States, showed more mutant loci, their mutation frequencies were low (Figure 1A). In addition, mutation frequencies in Slovenia and South Africa were high, with two loci (frequency > 0.5), but only 4 and 7 strains were available from these two countries, respectively (Supporting Information: Figure 1).
3.2 Mutation characteristics of S gene in early SARS-CoV-2 strains
Next, we integrated all the mutant strains in the S gene (NC_045512.2) to determine the mutation characteristics of the early SARS-CoV-2 strains. Among the 579 mutant loci, most (90%) were found in fewer than five strains and only 2.94% were found in more than 20 strains, notably S: 23403 (4486 strains) and S: 24034 (108 strains) (Figure 2A,B). In addition, 98% of the mutant loci were SNPs and 2% were deletions (Figure 2C).

Also, the prevalent types of amino acid mutation were missense (53%) and coding sequence variant (13%) (Figure 2D). Moreover, 410 mutant loci showed nonsynonymous variations and 189 mutant loci showed synonymous variations (Figure 2E,F). Most (91%) of the nonsynonymous variations were found in fewer than five strains, and only 2% of them were found in more than 20 strains, notably S: 23403 (4486 strains) and S: 21575 (42 strains) (Figure 2G,H).
3.3 Current lineage and the present mutation characteristics of the S gene
We further integrated the current epidemic lineages and the mutation features of the S protein in SARS-CoV-2 strains as of March 23, 2022. At the end of 2021, the prevalent variant shifted from Delta to Omicron, including the BA.1 and BA.2 variants (Figure 3A). Among all the amino acid changes (frequency > 0.5), D614G was the only mutant position with a frequency > 0.9 in every lineage. In addition, Omicron had more mutant amino acid positions than any other variant. Many amino acid changes appeared for the first time in the Omicron variant, such as S371P, S373P, N764K, N856K, and N969K (Figure 3B).

Compared to the early SARS-CoV-2 strains, S: 23403 was still the locus with the greatest number of mutant strains, but the number had increased tremendously to 4 713 032 strains, approximately 1050 times the number as of April 2020. The other top loci with a high number of strains include S: 23604 (3 970 507 strains), S: 22995 (3 122 491 strains), and S: 21618 (2 432 835 strains) (Figure 3C,D). From March 2020, the mutant frequency of these four loci increased with time, among which S: 23403 was the fastest to reach a frequency > 0.9 (Figure 3D).
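The fold change quoted for S: 23403 follows directly from the two strain counts, as the short Python check below shows. It uses only numbers reported in this study; the variable names are illustrative. The early frequency is computed as a simple ratio over all 12 237 early strains, and no current frequency is computed because the current total is given only as "more than 10 million."

```python
early_total = 12_237        # early strains analyzed
early_mutant = 4_486        # early strains mutated at S: 23403
current_mutant = 4_713_032  # current strains mutated at S: 23403

fold_change = current_mutant / early_mutant
early_frequency = early_mutant / early_total

print(f"fold change in mutant strain count: {fold_change:.1f}")
print(f"early frequency of S: 23403 mutants: {early_frequency:.4f}")
```

The ratio compares raw strain counts, so it reflects the growth of the sequenced data set as well as the spread of the mutation.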
3.4 Geographical distribution of the mutant strains
To further analyze the geographical distribution of the nonsynonymous mutant strains, we integrated the laboratory information and detected data of mutant strains from the 2019nCoVR database. These 4486 strains showed nonsynonymous mutation at locus 23403 (S: 23403_QHD43416.1: p.614-; QHD43416.1: p.614D > G), distributed in 53 countries. Although some countries showed high mutation frequency with a low number of strains, most of the mutant strains were from the United States, the United Kingdom, and Iceland, with medium mutation frequency (Figure 4A,B).

Compared to the early strains, the total number of current strains showing a nonsynonymous mutation at locus 23403 was 4 713 032. The number of mutant strains in countries such as Germany and Japan was now rising significantly. Most of the mutant strains were still from the United States and the United Kingdom, as well as Denmark and Germany. The mutant frequency was lowest in Niger, at 0.4615 (Figure 4C,D).
3.5 Mutation of ACE2-binding region of the S gene in SARS-CoV-2 strains
It was reported that the ACE2-binding region of SARS-CoV-2 mediated receptor recognition with human ACE2 in viral infection.5, 7 We then focused on the ACE2-binding region of the S protein, from residue 336 to residue 516, to display the distribution of mutant strains along the linear sequence. Although there were 68 genome mutations, only one locus (S: 23010) was found in 20 strains and 71% of the loci were found in only one strain (Figure 5A).

Among the 51 nonsynonymous mutant loci, only amino acid mutation at 483 (p.483V > A) contained 20 strains and 75% of the loci contained only one strain (Figure 5B). Besides, 19 loci were synonymous mutations (Figure 5C).
In contrast to the early strains, in the current strains locus 22995 (p.478) had the highest number of mutant strains in the ACE2-binding region (3 122 491 strains), followed by locus 22917 and locus 23063 (Figure 5D). Ninety percent of the mutant loci in the ACE2-binding region were nonsynonymous mutations (Figure 5E).
3.6 3D structure of the ACE2-binding region in S protein
We integrated the 3D structure of the ACE2-binding region in the S protein. Amino acid mutation positions p., p.478, p.476, and p.414 were close to one another in the S protein model (Figure 6).

4 DISCUSSION
COVID-19, a novel respiratory disease caused by SARS-CoV-2, has become a global pandemic. Although the Food and Drug Administration approved Veklury (remdesivir) as the first and only antiviral drug for treating COVID-19 on October 22, 2020,22 an early randomized, double-blinded, controlled clinical trial showed no difference in time to clinical improvement between remdesivir and placebo.23 Results from SARS-CoV vaccines, an inactivated virus vaccine or a spike-based DNA vaccine, indicated encouraging performance in terms of safety and neutralizing antibody titers.24, 25
As previous findings have suggested that vaccines targeting the S protein could induce immune responses and protective efficacy against SARS-CoV and MERS,26-28 an Ad5-nCo vaccine (a recombinant Ad5-vectored COVID-19 vaccine) targeting the full-length S protein in Phase I clinical testing (NCT04313127) also demonstrated immunogenicity against SARS-CoV-2.12 However, it was reported that vaccines expressing the full-length S protein could not only induce nonneutralizing antibodies in the host29, 30 but also facilitate virus entry into host cells via the FcγR-dependent pathway.31 Moreover, SARS-CoV-2 has been evolving rapidly and undergoing mutations since its outbreak,13 making it necessary to determine the mutation landscape of the S gene and S protein. In general, we integrated and comparatively analyzed the early and current mutation heatmaps and dynamic variation curves of the S gene and the amino acid mutations of the S protein, demonstrating that, although the epidemic strains have evolved tremendously compared with the early period, the nonsynonymous mutant locus S: 23403 still has the highest number of strains among all mutations in the S protein. Meanwhile, the United States and the United Kingdom are the two top-ranking countries for the number of strains.
The spike mutation D614G, reported globally since May 2020, was considered to be correlated with increased viral infection14 and higher RNA loads in the nasopharyngeal tract of COVID-19 patients. It was demonstrated that the spike D614G substitution increased the infectivity of SARS-CoV-2 in human lung epithelial cell lines and in a primary airway tissue model. The stability of the G614 virus was reported to be increased. However, G614 SARS-CoV-2 was more susceptible to neutralization by sera from hamsters infected with the D614 virus, which eased the concern that the D614G mutation would cause resistance to earlier COVID-19 vaccines based on the D614 sequence.32 The current mutation heatmap indicated that D614G was also a mutant amino acid position with high frequency in Omicron. Since Omicron transmitted at a significantly higher rate than the previous variants and was reported to escape almost all approved antibody treatments, countermeasures including vaccines received unprecedented attention. However, a booster was believed to be effective in protecting against Omicron infection, because a three-dose regimen could increase the proportion of broad-spectrum antibodies produced by memory B cells.33 In general, the emergence of SARS-CoV-2 variants may affect COVID-19 vaccine development and antibody treatment;34 thus, the immunogenicity of novel COVID-19 vaccines for diverse S protein variants needs to be further analyzed.
Accumulating evidence demonstrated that subunit vaccines targeting the RBD of the S protein showed a higher safety profile among current vaccines despite their low immunogenicity in the host.35, 36 Unlike the full-length S protein, the RBD comprises the critical neutralizing regions but lacks the nonneutralizing domains, which restricts the immune responses induced by RBD-based vaccines to neutralizing ones.36, 37 Moreover, RBD-specific vaccines of MERS-CoV were reported to produce neutralizing antibodies against multiple strains with a single mutation in epitopes, because the RBD contains several conformational neutralizing epitopes.37, 38 In our work, we found that in the early S protein there were fewer mutant strains within the RBD region than outside it, especially compared with the locus S: 23403 (p.614-/p.614D > G), which was previously identified in European countries, such as the Netherlands, Switzerland, and France.14 However, compared to the early period, the current difference in the number of mutated strains between the RBD region and the region outside it appears to have narrowed, since S: 22995 (p.478) in the RBD region is mutated in more than 3 million strains and 90% of the mutations in this region were nonsynonymous.
Still, our analyses only focused on the mutant features of S protein in SARS-CoV-2 strains, lacking functional analyses of diverse variants. Besides, further experimental assays in T and B cells are necessary to identify the potential epitopes for inducing the neutralizing response against SARS-CoV-2.
5 CONCLUSIONS
Overall, we comprehensively analyzed the early and the current prevalent SARS-CoV-2 strains to identify the spatiotemporal features of the genome mutations of the S gene and S protein over time and across countries, describing the mutation landscape relevant to vaccine development against SARS-CoV-2. More generally, the surge in the number and frequency of mutant strains highlights the urgency of further studies on effective vaccine development, especially S protein-based vaccines, in concert with emerging SARS-CoV-2 variants of the COVID-19 epidemic.
AUTHOR CONTRIBUTIONS
Writing—original draft: Rang Liu and Canhui Cao. Writing—review and editing: Canhui Cao, Yu Shi, and Xi Xia. Conceptualization: Canhui Cao. Data curation: Xinran Lin, Bing Chen, Zhenhui Hou, and Qiuju Zhang. Formal analysis and methodology: Rang Liu and Canhui Cao. Visualization: Shouren Lin, Lan Geng, and Zhongyi Sun. Project administration: Canhui Cao, Yu Shi, and Xi Xia. Supervision: Yu Shi and Xi Xia.
ACKNOWLEDGMENTS
We thank all workers involved in fighting against COVID-19. We would like to acknowledge the platform 2019nCoVR, CNGBdb, GenBank, GISAID, GWH, NMDC and SWISS-MODEL server. This work was supported by the Research Team of Female Reproductive Health and Fertility Preservation (SZSM201612065), the National Natural Science Foundation (No. 81471508), China Postdoctoral Science Foundation (2021M702223), and Shenzhen Science and Technology Innovation Committee (JCYJ20210324105808022, RCBS20210706092345027), and the Shenzhen High-level Hospital Construction Fund.
CONFLICT OF INTEREST
The authors declare no conflict of interest.
Open Research
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in the 2019 Novel Coronavirus Resource (2019nCoVR) database (https://bigd.big.ac.cn/ncov).15