Whole-genome sequencing of SARS-CoV-2: Comparison of target capture and amplicon single molecule real-time sequencing protocols
Abstract
Fast, accurate sequencing methods are needed to identify new variants and genetic mutations of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome. Single-molecule real-time (SMRT) Pacific Biosciences (PacBio) provides long, highly accurate sequences by circular consensus reads. This study compares the performance of a target capture SMRT PacBio protocol for whole-genome sequencing (WGS) of SARS-CoV-2 to that of an amplicon PacBio SMRT sequencing protocol. The median genome coverage was higher (p < 0.05) with the target capture protocol (99.3% [interquartile range, IQR: 96.3–99.5]) than with the amplicon protocol (99.3% [IQR: 69.9–99.3]). The clades of 65 samples determined with both protocols were 100% concordant. After adjusting for Ct values, S gene coverage was higher with the target capture protocol than with the amplicon protocol. After stratification on Ct values, higher S gene coverage with the target capture protocol was observed only for samples with Ct > 17 (p < 0.01). PacBio SMRT sequencing protocols appear to be suitable for WGS, genotyping, and detecting mutations of SARS-CoV-2.
1 INTRODUCTION
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a member of the Coronaviridae family that caused the COVID-19 pandemic,1, 2 has a single-stranded positive-sense RNA genome of approximately 29.9 kb that encodes several proteins, including the spike (S) structural protein.3 The 1273 amino-acid long S protein attaches the virus to the host cell receptor via its receptor binding domain (RBD, residues 319–529).4 The virus genome accumulates mutations that are associated with transmissibility, escape to neutralizing monoclonal antibodies (mAbS), and virulence.5 The SARS-CoV-2 genome has diverged during the pandemic to produce several variants (clades and lineages) that differ in their biology and/or geographical distribution. Five of these variants have been classified as variants of concern6: Alpha (B.1.1.7),7 Beta (B.1.351),8 Gamma (P.1),9 and more recently Delta (B.1.617.2), and Omicron (B.1.1.529).10, 11
The rapid identification of variants that are transmitted most efficiently or that escape the host immune response is essential for genomic surveillance and clinical management. Next-generation sequencing (NGS) protocols, mostly based on Illumina and Oxford Nanopore Technologies (ONT) platforms,12-14 have been developed to study the genomic diversity of SARS-CoV-2 worldwide. Metagenomic NGS was used to sequence the complete SARS-CoV-2 genome early in the pandemic.15 Since then, multiplex amplicon or hybrid capture-based methods have been developed to run on the Illumina16-19 and ONT platforms.16, 20 Illumina sequencing devices are commonly used; they generate accurate, high-quality sequences. ONT is based on long-read sequencing and offers real-time sequencing, but the sequences may contain substantial errors.21 The Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing system provides long, highly accurate sequences by using circular consensus sequencing reads; this feature could provide accurate whole-genome sequences (WGS). PacBio SMRT sequencing has already been used to sequence the S gene22-25 but no data are available for the whole SARS-CoV-2 genome.
This study assesses the performance of a newly-available complete end-to-end kit of a target capture SMRT sequencing protocol for genotyping of SARS-CoV-2 and detecting mutations. Results were compared with those obtained with a 1.2 kb amplicon SMRT sequencing protocol.
2 MATERIALS AND METHODS
2.1 Samples
We sequenced 84 nasopharyngeal samples from patients who were SARS-CoV-2 RNA positive (N gene cycle thresholds, Ct 8–28) and stored them at −80°C in the Virology Laboratory at Toulouse University Hospital. The Ct values were obtained with the QuantStudioTM 5 Real-Time PCR system (Applied Biosystems). A total of 19 were taken between January 19 and March 19, 2021, during the Alpha wave in France, and 65 were taken between July 18 and August 8, 2021, at the beginning of the Delta wave in France. Two synthetic positive RNA controls (Twist Bioscience Control 14 [TB14] England/205041766/2020 [lineage B.1.1.7] and Twist Bioscience Control 17 [TB17] Japan/IC-0564/2021 [lineage P.1]), one strain of SARS-CoV-2 B.1.1.254/20B (EPI_ISL_804374 from a culture supernatant), a negative template control (NTC), and one SARS-CoV-2 negative sample were also analyzed.
2.2 SARS-CoV-2 RNA extraction
Virus RNA was extracted from a 180 µl transport medium with the MGIEasy Nucleic Acid Extraction Kit on the MGI SP 960 system (Beijing Genome Institute) according to the manufacturer's instructions.
2.3 Target capture PacBio SMRT sequencing
We used the SARS-CoV-2 Enrichment Early Access (PacBio) Kit following the PacBio HiFiViral for SARS-CoV-2 workflow, High-Throughput Multiplexing, 800 bp Target Capture for Full-Viral Genome Sequencing of SARS-CoV-2 (hereafter referred to as « target capture »). This protocol uses molecular inversion probes with a 675 bp target insert. The first step was simultaneous complementary DNA (cDNA) synthesis and probe hybridization. The 8 µl RT-hybridization reaction mixture contained 6 µl RNA, 1.6 µl RT mix, and 0.4 µl probe mix. The cycle steps were: 10 min/25°C, 50 min/50°C, 1 min/95°C, and 16 h/55°C. Fill-in mix (2 µl) was then added to each sample (55°C/60 min), followed by clean-up mix (2 µl) for 60 min/45°C, 3 min/95°C, and hold 4°C. The second step was circularization. The cDNA amplification mixture, 24 µl containing 9.6 µl clean-up reaction, 12 µl polymerase chain reaction (PCR) mix, and 2.4 µl asymmetric barcoded M13 Primer mix were subjected to 3 min/95°C, 26 cycles of 98°C/15 s, 55°C/15 s, and 72°C/90 s. Samples were pooled for library construction: aliquots (5 µl) of each sample were pooled in DNA LoBind tubes and purified with 1.3X AMPure PacBio beads (Pacific Bioscience), and the cDNA was quantified with the Quantifluor DSDNA system running on a Roche LC480 instrument. A SMRT bell library was prepared and sequenced with the SMRTbell Express Template Prep 2.0 Kit according to the manufacturer's instructions running on a Sequel IIe platform (Genotoul platform; GeTPlaGe) loading 200 pM.
Bioinformatic analysis was done with SMRT Link software using the HiFiViral SARS-CoV-2 Analysis Application with the default parameters (https://www.pacb.com/wp-content/uploads/SMRT_Link_User_Guide_v10.2.pdf), except for the minimum base coverage, which was changed to 10 reads (ECDC recommendations for SARS-CoV-2 sequencing26). Consensus sequences were then analyzed with Pangolin lineages (v3.1.20, https://cov-lineages.org/resources/pangolin.html) and Nextstrain clades (v1.11.0, https://nextstrain.org/sars-cov-2). Finally, a customized python script (v3.8.8) analysis was used to generate a user-friendly report, including the number of mapped reads, median coverage, list of S gene mutations (substitutions, insertions, and deletions), S gene missing positions, and previously found clades and lineages. Target capture protocol steps are summarized in Figure 1.

2.4 1.2 kb amplicons PacBio SMRT sequencing
We used the PacBio HiFiViral for SARS-CoV-2 workflow, High-Throughput Multiplexing 1.2 kb Amplicons for Full-Viral Genome Sequencing of SARS-CoV-2, based on 29 1.2 kb amplicons across the SARS-CoV-2 genome (hereafter referred to as « amplicon »). We first generated cDNAs. Aliquots (20 µl) of RT reaction mixture containing 10 µl RNA, 1.9 µl nuclease-free water, 2 µl Superscript IV VILO (Life Technologies), and 0.1 µl of random hexamer oligo(dT) primers (100 µM) were incubated. Cycle steps: 10 min/23°C, 60 min/50°C, and 10 min/80°C. The 29 amplicons were amplified by two multiplex PCR (with two primer pools, Pool 1 and Pool 2) in 25 µl of reaction mixture each: 5 µl cDNA, 0.5 µl Q5 Hot Start High Fidelity DNA polymerase, 12 µl nuclease-free water, 1.5 µl of Pool 1 or 2 M13 tailed primers, 1 µl 10 nM deoxynucleoside triphosphates and 5 µl Q5 Reaction Buffer. The cycle steps were denaturation (98°C/30 s), amplification, and extension (35 cycles of 98°C/15 s, 65°C/5 min). A second PCR was then performed using barcoding primers tailed with the universal M13 sequence. Reaction mixture: 12.5 µl Kapa HiFi HotStart ReadyMix, 4.5 µl nuclease-free water, 5 µl of M13 forward and reverse barcoded primer mixture, and 3 µl aliquots of first-round PCR products for Pool 1 or Pool 2. Cycling conditions: 98°C/3 min, 3 cycles of 98°C/30 s, 60°C/15 s, and 72°C/1 min, then 21 cycles of 98°C/20 s, 65°C/15 s, 72°C/1 min, final extension 72°C/5 min. Samples were pooled for library construction: 1 µl of each sample from Pool 1 and Pool 2 was transferred in DNA LoBind tubes, purified with AMPure PacBio beads (Pacific Bioscience) at 0.60X, and quantified with the Quantifluor DSDNA system running on a Roche LC480 instrument. SMRT bell libraries were constructed by pooling the barcoded samples. Barcoded amplicon libraries were prepared and sequenced with SMRTbell Express Template Prep 2.0 Kits according to the manufacturer's instructions on a Sequel IIe platform (Toulouse University Hospital).
Bioinformatic analysis was done with the Snakemake pipeline for complete genome construction. This starts by demultiplexing HiFi reads with lima (v.2.2.0, https://github.com/PacificBiosciences/barcoding), then creates the VCF file from pbAA clusters (PacBio tool, v.0.1.3 https://github.com/PacificBiosciences/pbAA), which is used, along with the samtools (v1.12) depth file, to build the consensus sequence (CoSA, coronavirus Sequence Analysis, v9.0.0, https://github.com/Magdoll/CoSA). A minimum base coverage of 10 reads was defined to report a position. The consensus sequences were then analyzed with Pangolin (v3.1.20, https://cov-lineages.org/resources/pangolin.html) for lineages and Nextstrain (v1.11.0, https://nextstrain.org/sars-cov-2) for clades. Our python script (v3.8.8) in-house analysis was used to generate a user-friendly report including the number of mapped reads, the median coverage, a list of S gene mutations (substitutions, insertions, and deletions), S gene missing positions, and the previously found clades and lineages. Amplicon protocol steps are summarized in Figure 1.
See Supporting Information: Table 1 for GISAID accession numbers.
2.5 Statistical analysis
Categorical variables were tested by a χ2 test. Continuous variables were tested by a Wilcoxon sign-rank sum test. Multivariable logistic regression models were used to assess the impact of the protocol and Ct values on genome coverage. p-values of <0.05 were considered to be significant.
3 RESULTS
3.1 Clades and lineages from full-length genomes determined by PacBio SMRT target capture
The complete genomes of positive controls (TB14, TB17, and EPI_ISL_804374) were covered 86.4%, 92.8%, and 99.6%, respectively, with a median read depth of 33, 33, and 1137. Fewreads (n = 16) were detected in the NTC and the SARS-CoV-2 negative sample (n = 11) without genome coverage. Samples 26 and 82 were tested in triplicate with N gene Ct of 14.4 and 14.6. Read numbers (sample 26: 16 993–32 327, sample 82: 48 382–67 576), median read depth (Sample 26: 355–662, Sample 82: 945–1413), and genome coverage (Sample 26: 99.6 ± 0.1%, Sample 82: 99.6%) were similar for each triplicate. Each of the triplicate analyses found the same clades and lineages (Sample 26: 20I (Alpha, V1)/B.1.1.7; sample 82: 21J (Delta)/AY125). S gene mutation profiles were also identical. We sequenced 84 samples (median Ct: 17.6 [interquartile range, IQR: 14.2–22.5]). Sequencing failed for five samples (6%). Sequence data were obtained for 79 samples with a median read number per sample of 13 693 [IQR: 1066–27 338], and a median read depth of 231 [IQR: 19–568 (Figure 2A and Supporting Information: Table S1). The median genome coverage was 98.9% [IQR: 85.9–99.5], and >95% for 53 (63%) samples. Most of the strains were clade 21J (Delta) (36; 45%) and clade 20I (Alpha, V1) (29; 37%). Eleven samples (14%) were determined only to the clade level but their complete genome coverage was too low to determine the lineage. Sample 16 (genome coverage of 30.9%) was flagged by SMRT link analysis as having multiple strains (probability > 0.95), indicating possible sample crossover.

3.2 Clades and lineages from full-length genomes determined by PacBio SMRT amplicon
We also sequenced the 84 samples with the amplicon protocol (Supporting Information: Table S1). Sequencing failed for 18 samples (21.4%). Sequence data were obtained for 66 samples, with a median read number per sample of 8036 [IQR: 2131–20 597] and a median read depth of 881 [IQR: 411–1289 (Figure 2B and Supporting Information: Table S1). The median genome coverage was 99.3% [IQR: 69.9–99.3], and >95% for 41 (62%) samples. The strains were clade 21J (Delta) (30; 45%) and clade 20I (Alpha, V1) (26; 39%). Sixteen samples (24%) were determined only to the clade level but their complete genome coverage was too low to determine the lineage. No sample crossover was observed.
3.3 Comparison of the complete genome sequences obtained with the two methods
Extraction and library preparation required 3 days for both protocols (Figure 1). The target capture protocol used a single plate for 96 samples throughout the whole process, whereas the amplicon protocol needed two plates for two steps. The target capture protocol sequencing run is shorter (8 h) than that of the amplicon protocol (15 h). Sequencing failed for more samples (p < 0.01) with the amplicon (n = 18) than with the target capture protocol (n = 5). Four samples failed to sequence with both protocols (2/4 with Ct > 25), 14 samples with the amplicon protocol only (5/14 with Ct > 25), and sample 32 with the target capture protocol only (Ct = 27). Among the 14 that failed to sequence with the amplicon protocol, 10 (71%) were sequenced with a genome coverage >74% with the target capture protocol, allowing clade and lineage assignment. Sample 32 was sequenced with a genome coverage of 63% with the amplicon protocol allowing only clade assignment. The median genome coverage compared of 65 samples was significantly higher (p < 0.05) for the target capture protocol (99.3 [IQR: 96.3–99.5]) than for the amplicon protocol (99.3 [IQR: 69.9–99.3]). Genome coverage was >95% for 52 (80%) samples analyzed by the target capture protocol and for 41 (63%) samples by the amplicon protocol (p < 0.05). Genome coverage for samples with high N gene Ct decreased for both protocols (Figure 3A). The clades of 65 samples were determined with both protocols, with 100% concordant results. The lineage of 49 samples was determined with both protocols with 98% concordant results. Sample 44 were determined AY.43 with the amplicon protocol (genome coverage = 69.7%) and AY.4 with the target capture protocol (genome coverage = 90.4%).

3.4 Comparison of the spike sequences obtained with the two methods
The target capture protocol failed to sequence the S gene in 7/84 (8%) samples and the amplicon protocol in 19/84 (23%) samples (p < 0.01). The median S gene coverage compared to 64 samples was significantly higher (p < 0.05) for the target capture protocol (100 [IQR: 100–100]) than for the amplicon protocol (100 [IQR: 52–100]). S gene coverage was >95% for 62 samples (81%) analyzed with the target capture protocol and for 44 samples (68%) with the amplicon protocol (p < 0.08). S gene coverage was higher for the target capture than for the amplicon protocol (Figure 3B). After stratification on Ct values (median 17 IQR: 13.8–20.6), the target capture protocol gave higher S gene coverage than the amplicon protocol only for Ct > 17 (p < 0.01). The mutations detected in 43/64 (67%) samples with the target capture and amplicon protocols were 100% concordant. The target capture protocol detected mutations that were missed by the amplicon protocol in 17/21 (81%) samples and the amplicon protocol detected mutations in four (19%) samples that were not detected by target capture (Table 1). Mutations missed by the amplicon protocol were due to failure to amplify 2/4 amplicons covering the S gene in 18 samples and 3/4 amplicons in three samples. Mutations missed by the target capture protocol were due to failure to sequence the almost whole S gene in 3/4 samples.
Sample | Cladea | Ct value | Target capture | Amplicon | ||||
---|---|---|---|---|---|---|---|---|
S gene coverage (%) | Missing position in spike (AA) | Mutation detected | S gene coverage (%) | Missing position in spike (AA)b | Mutation detected | |||
1 | 21J (Delta) | 25 | 99.6 | 522–526 | T19R, G142D, E156-, F157-, R158G, L452R, T478K, D614G, P681R, D950N | 51.98 | 683–1274 | T19R, G142D, E156-, F157-, R158G, L452R, T478K, D614G |
2 | 21J (Delta) | 21 | 100 | _ | T19R, G142D, E156-, F157-, R158G, L452R, T478K, D614G, P681R, D950N | 45.88 | 343–1032 | T19R, G142D, E156-, F157-, R158G |
3 | 21J (Delta) | 24.7 | 3 | 1–1236 | 51.98 | 683–1274 | T19R, G142D, E156-, F157-, R158G, L452R, T478K, D614G | |
4 | 21J (Delta) | 25 | 7.7 | 1–1176 | 45.88 | 343–1032 | T19R, T95I, G142D, R158G, E156-, F157- | |
9 | 21J (Delta) | 17.6 | 100 | _ | T19R, G142D, E156-, F157-, R158G, L452R, T478K, D614G, P681R, D950N | 45.88 | 343–1032 | T19R, G142D, E156-, F157-, R158G |
10 | 21J (Delta) | 22.3 | 89.7 | 255-296,522-607 | T19R, G142D, E156-, F157-, R158G, P251L, L452R, T478K, D614G, P681R, D950N | 51.98 | 683–1274 | T19R, G142D, E156-, F157-, R158G, P251L, L452R, T478K, D614G |
13 | 20I (Alpha, V1) | 18.3 | 100 | _ | H69-, V70-, Y144-, N501Y, A570D, D614G, P681H, T716I, S982A, D1118H | 26.46 | 343–1032, 1051–1274 | H69-, V70-, Y144- |
14 | 21J (Delta) | 21.6 | 85.9 | 22–26, 252–296, 512–577, 803–866 | T19R, T95I, G142D, E156-, F157-, R158G, L452R, T478K, D614G, Q677H, P681R, D950N | 47.58 | 343–661, 683–1032 | T19R, T95I, E156-, F157-, G142D, R158G |
19 | 21J (Delta) | 24 | 99.6 | 22–26 | T19R, T95I, G142D, E156-, F157-, R158G, L452R, T478K, D614G, Q677H, P681R, D950N | 51.98 | 683–1274 | T19R, G142D, E156-, F157-, R158G, L452R, T478K, D614G |
23 | 21J (Delta) | 19.9 | 99.2 | 252–256, 292–296 | T19R, G142D, E156-, F157-, R158G, L452R, T478K, D614G, P681R, D950N | 53.44 | 683–1032, 1051–-1274 | T19R, T95I, G142D, E156-, F157-, R158G |
24 | 21J (Delta) | 22.5 | 100 | _ | T19R, T29A, G142D, E156-, F157-, R158G, T250I, T299I, L452R, I468V, T478K, Q613H, D614G, P681R, D950N | 25.31 | 343–1274 | T19R, T29A, G142D, E156-, F157-, R158G, T250I, T299I |
29 | 21J (Delta) | 21.1 | 100 | _ | E156-, F157-, T19R, G142D, R158G, L452R, T478K, D614G, P681R, D950N | 47.58 | 343–661, 683–1032 | T19R, G142D, E156-, F157-, R158G |
32 | 19A | 27.0 | NA | 1–1273 | 51.98 | 683–1274 | E156-, F157-, T19R, G142D, R158G, L452R, T478K, D614G | |
35 | 21J (Delta) | 25.0 | 94.2 | 22–26, 522–526 | T19R, T95I, G142D, E156-, F157-, R158G, L452R, T478K, D614G, P681R, D950N | 47.58 | 343–661, 683–1032 | T19R,T95I,G142D,E156-,F157-,R158G |
37 | 21J (Delta) | 22.3 | 97.6 | 262–286, 292–296 | T19R, G142D, E156-, F157-, R158G, L452R, T478K, D614G, P681R, D950N | 47.58 | 343–661, 683–1032 | T19R, G142D, E156-, F157-, R158G |
44 | 21J (Delta) | 22.6 | 100 | _ | T19R, T95I, G142D, E156-, F157-, R158G, L452R, T478K, D614G, P681R, D950N | 53.47 | 683–1032, 1051–1274 | T19R, G142D, E156-, F157-, R158G, L452R, T478K, D614G |
46 | 21J (Delta) | 24 | 64.8 | 1–26, 252–296, 522-866, 1093-1126 | G142D, E156-, F157-, R158G, L452R, T478K,D950N | 45.88 | 343–1032 | L5F, T19R, G142D, E156-, F157-, R158G |
47 | 21J (Delta) | 25 | 87.6 | 1–26, 252–296, 522–526, 752–786, 799, 801, 803–816, 833–866 | G142D, E156-, F157-, R158G, L452R, L455F, T478K, D614G, P681R, D950N | 28.32 | 343–661, 681–1032, 1051–1274 | T19R, G142D, E156-, F157-, R158G |
63 | 20A | 22.7 | 7.7 | 1–1176 | 48.6 | 20–325, 683–1032 | S477N, D614G, V1094I | |
71 | 20E (EU1) | 17 | 100 | _ | A222V, D614G | 57.5 | 343–661, 1051–1274 | A222V |
77 | 20I (Alpha, V1) | 22.5 | 100 | _ | H69-, V70-, Y144-, N501Y, A570D, D614G, P681H, T716I, S982A, A1020S, D1118H | 48.1 | 1–661 | P681H, T716I, S982A, A1020S, D1118H |
- Note: Mutations in bold were detected by only one of the protocols.
- Abbreviations: AA, amino acid; NA, not applicable.
- a Clade based on target capture or amplicon analysis (if not determined with the target capture sequence).
- b Amplicons covering S gene with 1.2 kb protocol: NC_045512.2: 21 563–22 612 (spike AA mutations: 1–350), NC_045512.2: 22 538–23 631 (spike AA mutations, covering the receptor binding domain (RBD): 325–690), NC_045512.2: 23 545–24 736 (spike AA mutations: 661–1058) and NC_045512.2: 24 659–25 790 (spike AA mutations: 1032–1409).
4 DISCUSSION
Large-scale, high-throughput WGS of SARS-CoV-2 is essential for rapid surveillance and efficient follow-up of the spread of new variants, particularly immune-escaping variants that might interfere with vaccination or treatment. Our work focuses on SARS-CoV-2 WGS with Pacbio SMRT sequencing, while most published data are based on Illumina and ONT sequencing. We find that Pacbio SMRT sequencing is suitable for WGS of SARS-CoV-2, identifying variants and detecting mutations.
Both the target capture and amplicon protocols provided good genome coverage (median > 99%). Clades and lineages were concordant with both methods except for the lineage of one sample due to the low genome coverage obtained with amplicon protocol, leading to misclassification. Sample 16 was flagged as having multiple strains with the target capture protocol, indicating possible contamination whereas no contamination was observed with the amplicon protocol. Further investigations were not possible due to the low genome coverage and low read depth obtained with both protocols. As sample crossover has already been reported with sequencing protocols on the Illumina platform,27 decontamination strategies are clearly important.28
The target capture protocol was more sensitive than the amplicon protocol. In fact, a higher number of samples were successfully sequenced on the whole genome, a higher number of sequences had a genome coverage >95% (the minimum coverage value recommended by ECDC26) and a higher S gene coverage was observed for samples with Ct > 17. The target capture protocol amplifies a short 675 bp fragment of the SARS-CoV-2 genome and each base is covered by around 20 probes, while the amplicon protocol amplifies a 1.2 kb fragment. This is probably why target capture performed better and provide better genome coverage for samples with a low viral load.
For quasispecies studies where haplotyping is necessary, protocols amplifying long fragments should be preferred. With the amplicon protocol, the S gene can be sequenced with only four amplicons, one covering the RBD. This allows haplotyping of the main mutations in the S gene. PacBio SMRT sequencing of the S gene with amplicon lengths of 2.5–6.1 kb has been used to study spike gene quasispecies.22-25 Sun et al.24 showed that the virus population may consist of one predominant haplotype combined with numerous minor haplotypes and that different quasispecies complexity is observed depending on the tissue suggesting independent replication. SARS-CoV-2 spike protein evolution in an infected patient treated with mAbs indicated that key activity-reducing mutations can appear in patients treated with mAbs.25 Long amplicon sequencing can provide accurate monitoring of SARS-CoV-2 quasispecies for compartmentalization studies or surveillance of mutations escaping variants for patients treated with mAbs.
The two laboratory workflows take 3 days from extraction to sequencing. The target capture protocol workflow is simpler, using only one plate throughout the process, which reduces the risks of technical errors and sample contamination. Automation is possible, but only with specific small-volume liquid handlers, whereas the amplicon protocol can be automated on the usual liquid handlers more readily available in the laboratory. The target capture sequencing time is shorter, providing results sooner. Target capture sequencing is analyzed with an automatic Pacbio analysis application on an SMRT link and can be run by a biologist on a computer. The amplicon protocol, in contrast, requires the development of a bioinformatic pipeline by a bioinformatician and specific computing resources. The target capture protocol is a complete end-to-end solution that is easier and faster to implement than the amplicon protocol.
The target capture protocol for SARS-CoV-2 WGS provided data that were similar to those obtained in previous studies with Illumina or ONT sequencing. SARS-CoV-2 WGS with amplicons on the Illumina platform generated sequences with >95% genome coverage for 67% of Delta variant samples.29 We obtained similar results, with >95% genome coverage for 63% of samples. Some (19/84) of the samples sequenced by the target capture Pacbio SMRT system had been previously sequenced using the CovidSeq Illumina protocol with a similar median genome coverage (data not shown). Other studies obtained 90%–100% genome coverage with ONT sequencing.30-32 Genome coverage was >99% for all high viral load samples (Ct < 20) of SARS-CoV-2 sequenced using different protocols (mNGS, hybrid-capture-based enrichment, or amplicon-based protocols with Illumina sequencing and amplicon-based protocols with ONT sequencing).16 The genome coverage was >98% for 94% of our samples with a Ct < 20. The small differences between studies could be due to the protocol used or to the variants sequenced, depending on the pandemic wave during which the samples were collected.
Newly-emerging strains with increased numbers of mutations can be a challenge for sequencing, as mutated primer-binding sites may cause amplicon dropout or uneven sequencing coverage, resulting in lost or inaccurate data. The mutations in the Omicron variant introduced at the end of 2021 can decrease the enrichment efficiency of PCR amplification.33 No Omicron variant samples were sequenced in this study, but the amplification could be influenced if the primers hybridize to regions with mutations. The SARS-CoV-2 target capture probes and 1.2 kb amplicon primers will probably have to be optimized to ensure good amplification with Omicron or other new variants. For example, ARTIC Network V4 primers were proposed to optimize the sequencing of Delta variants29 and ARTIC Network V4.1 primers were recently proposed for Omicron sequencing (https://community.artic.network/t/sars-cov-2-v4-1-update-for-omicron-variant). A recent study demonstrated that combining short and long sequencing data, obtained by combining Illumina and ONT sequencing, improved genome coverage and provided uniform, maximum genome coverage.30 Short and long-read sequencing could be done on a single sequencing platform with the target capture and amplicon Pacbio SMRT sequencing protocols.
To conclude, we find the PacBio SMRT technology suitable for the WGS of SARS-CoV-2, the rapid identification of circulating SARS-CoV-2 variants, and mutation detection. The genome coverage of the target capture protocol was similar to that of Illumina or ONT sequencing and the accurate long reads produced by the amplicon SMRT sequencing protocol can be used to study quasispecies. Further studies are now needed to compare the performance of PacBio SMRT technology with those of other platforms for sequencing Omicron variants.
AUTHOR CONTRIBUTIONS
Pauline Trémeaux, Stéphanie Raymond, and Jacques Izopet designed the project. Noémie Ranger and Cécile Donnadieu carried out the experiments. Gérald Salin, Justine Latour, and Nicolas Jeanne performed bioinformatics analyses. Florence Nicot, Pauline Trémeaux, and Jacques Izopet analyzed the study data. Chloé Dimeglio carried out statistical analysis. Florence Nicot and Jacques Izopet wrote the manuscript. All the authors have approved the manuscript.
ACKNOWLEDGMENT
The authors would like to thank Dr Owen Parkes for editing the English text. Target capture reagents were provided by Pacific Biosciences.
CONFLICT OF INTEREST
The authors declare no conflict of interest.
Open Research
DATA AVAILABILITY STATEMENT
The raw data are available on demand. Sequences are available on GISAID (accession numbers in Supporting Information: Table 1).