As whole-genome sequencing (WGS) technology advances, massively parallel sequencing (MPS) remains the mainstream approach because of its accuracy, low cost, and high throughput. The development of analytical pipelines for MPS data has always been of great importance. Increasingly large population genomics studies, as a specific type of big data research, pose new challenges for analysis solutions.

Results

Here, we introduce ZBOLT, a comprehensive analysis system that incorporates both software and hardware advancements, making it an appropriate choice for large-scale population genomic studies that require extensive data processing. We first evaluate ZBOLT's calling accuracy using the Genome in a Bottle (GIAB) benchmark dataset. We then apply ZBOLT to a large-scale population genomics study of 5,616 high-depth samples totaling 1.16 Pbp (peta base pairs). ZBOLT processed 100 Tbp per day and consumed about 1 kWh per 100 Gbp sample, demonstrating high efficiency and low energy consumption.

Conclusion

This research serves as a valuable reference for analyzing sequencing data from large population cohorts and underscores the significant potential of ZBOLT in large-scale population genomics studies.

1 INTRODUCTION

Since the launch of the Human Genome Project (HGP), DNA sequencing technology has developed rapidly.¹ Massively parallel sequencing (MPS) technologies, such as sequencing by synthesis (SBS)² and DNA nanoball (DNB) sequencing,³ have been widely used in recent years because of their accuracy, throughput, speed, and cost.^{4, 5} As sequencing costs decrease, an increasing number of large-scale population genomics programs are being conducted to investigate genetic variation among populations, uncover diverse genetic backgrounds, and support precision medicine initiatives.⁶ Examples include UK Biobank,^{7, 8} TOPMed,⁹ and the NIH “All of Us” program.¹⁰ However, challenges remain in analysis efficiency, power consumption, and the cost of supporting infrastructure when performing analyses for larger population genomic studies.¹¹

To date, most MPS analysis pipelines have been designed primarily to enhance individual analysis performance. The first widely recognized MPS analysis pipeline, the Genome Analysis ToolKit (GATK) Best Practices,¹² was released by the Broad Institute in 2013. This workflow includes key software such as BWA¹³ and GATK,^{14, 15} and uses nontrivial statistical modules such as logistic regression, hidden Markov models, naive Bayes classification and Gaussian mixture models for variant calling.¹⁵

Since then, the demand for timely acute diagnostics in translational applications and for genomic studies of a growing number of research populations has driven the development of more efficient and accurate MPS analysis pipelines. Representative pipelines include Edico DRAGEN,¹⁶ Sentieon DNASeq¹⁷ and DeepVariant.¹⁸ Edico DRAGEN accelerates analyses primarily through field-programmable gate arrays (FPGAs).¹⁶ Sentieon DNASeq follows GATK Best Practices, rewriting the code to optimize analysis efficiency without modifying the statistical modules.¹⁷ With the advancement of deep learning, DeepVariant was developed, using deep convolutional neural networks to call variants.¹⁸ This approach has demonstrated higher accuracy and efficiency in calling SNPs and InDels compared to GATK Best Practices.^{18-20}

Population genomics studies, in contrast to individual studies, require the analysis and processing of petabyte-scale data, presenting challenges in terms of accuracy, efficiency, and affordability.^{21, 22} Several existing pipelines rely heavily on graphics processing unit (GPU) performance,^{23, 24} further complicating these challenges. In response to these issues, we explored ZBOLT, a high-performance resequencing analysis system. ZBOLT is designed to handle large-scale studies such as our newborn genomics and molecular epidemiology study consisting of 5,616 high-depth sequencing samples.

ZBOLT incorporates the MegaBOLT pipeline and is optimized for population analysis for WGS, including germline and somatic mutation calling, whole exome sequencing (WES), and targeted region sequencing. Adhering strictly to GATK Best Practices,¹² ZBOLT leverages a heterogeneous computing system, employing central processing unit (CPU) and FPGA resources, dynamic multi-task scheduling, and hardware configuration support for accelerated analysis. ZBOLT utilizes a customized scheduling algorithm designed specifically to optimize multi-node computing scheduling within heterogeneous environments. The acceleration of the process from FASTQ to variant call format (VCF) results is achieved through data segmentation, optimization of compression/decompression algorithms, and streamlined calculation models. Moreover, with ZBOLT's on-premise analysis system, we prioritize enhanced data security and privacy, providing a distinct advantage over traditional cloud computing solutions.

Using ZBOLT, we completed quality control, alignment, and variant calling of 5,616 high-depth WGS samples (approximately 1.16 Pbp) in 11.6 days. ZBOLT's efficient use of energy and cabinet space alleviates the efficiency and cost challenges associated with large-scale genomics data, particularly in the context of high-depth WGS analysis.

2 MATERIALS AND METHODS

2.1 Accuracy evaluation using benchmark datasets

To assess the accuracy of the ZBOLT system, we ran three replicate WGS analyses on benchmark datasets from five individuals (HG001-HG005). These consisted of five published paired-end FASTQ²⁵ datasets with a coverage depth of approximately 30X per sample, sequenced by MGISEQ2000 and stored in the GIAB²⁶ GitHub repository (Table S1). The reference genome used for analysis and evaluation was GRCh38. The corresponding benchmark (high-confidence) variant calls and regions were stored in VCF²⁷ and BED (Browser Extensible Data) files labeled NISTv3.3.2 and GRCh38 (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/*/NISTv3.3.2/GRCh38/). Results from GATK Best Practices (GATK 3.8) applied to the same samples were used as controls during the evaluation.

2.2 Experimental preparation in the newborn genomics study

To evaluate the performance of ZBOLT in large-scale data analysis, we collected samples from a newborn genomics study. We obtained 5,616 samples and extracted genomic DNA from newborn blood spots using the MagPure Tissue&Blood DNA LQ Kit (Magen, Guangzhou, China). After DNA extraction, a Qubit 3.0 fluorometer (Life Technologies, Paisley, UK) was used to measure DNA concentration, and 2% agarose gel electrophoresis was used to assess DNA fragment integrity. Following the manufacturer's instructions (MGI, Shenzhen, China), we constructed libraries and sequenced them on the MGI DIPSEQ platform with 100-bp paired-end reads. After sequencing, reads were stored in FASTQ²⁵ format for subsequent bioinformatics analysis with the ZBOLT system.

2.3 Accuracy evaluation process

We used the ZBOLT WGS analysis pipeline to call variants for the five public FASTQ datasets (HG001-HG005). We then compared the SNP and InDel VCF²⁷ results with the GIAB benchmark variants in the high-confidence regions using RTG vcfeval.²⁸ Precision, sensitivity, and F-measure were calculated to evaluate the accuracy of the ZBOLT system. Additionally, we assessed the consistency of the ZBOLT system by comparing its analysis results with those obtained using the widely recognized GATK Best Practices pipeline.
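The three metrics can be expressed in terms of true positives (TP), false positives (FP) and false negatives (FN). The following Python sketch is a simplified illustration and not the RTG vcfeval algorithm: it assumes each variant can be matched as an exact (chromosome, position, REF, ALT) tuple, whereas vcfeval performs haplotype-aware matching. The function name and the toy variants are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def evaluate_calls(baseline: Iterable[Variant], calls: Iterable[Variant]) -> Dict[str, float]:
    """Compare a call set with a baseline by exact tuple matching (an assumption)."""
    truth, called = set(baseline), set(calls)
    tp = len(truth & called)   # called variants that are in the baseline
    fp = len(called - truth)   # called variants absent from the baseline
    fn = len(truth - called)   # baseline variants that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    # F-measure is the harmonic mean of precision and sensitivity.
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {"precision": precision, "sensitivity": sensitivity, "f_measure": f_measure}


# Toy data: three of four calls match the baseline, one baseline variant is missed.
toy_baseline = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
                ("chr2", 300, "G", "GA"), ("chr2", 400, "T", "C")]
toy_calls = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 300, "G", "GA"), ("chr3", 500, "A", "T")]
toy_metrics = evaluate_calls(toy_baseline, toy_calls)
print(toy_metrics)
```

In practice the comparison would also be restricted to the high-confidence regions, which this sketch leaves out.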

2.4 ZBOLT system with MegaBOLT pipeline

The ZBOLT workflow, as illustrated in Figure 1, uses a heterogeneous CPU and FPGA computing system to accelerate WGS analysis from FASTQ²⁵ input to BAM²⁸ alignment output and VCF²⁷/GVCF (Genomic Variant Call Format) variant calling output. ZBOLT employs a parallel computing architecture for FPGAs and processes data in smaller granules to enhance efficiency. A multi-task scheduling system and parallelized computing architecture improve the efficiency of WGS/WES analysis modules, including SOAPnuke,²⁹ BWA,¹³ GATK¹⁴ and MuTect2.³⁰ The documentation for ZBOLT is available on the official software repository (https://en.mgi-tech.com/products/software_info/6/).

FIGURE 1. ZBOLT workflow that strictly follows GATK Best Practices. The workflow includes QC, read mapping, sorting, duplicate marking, BQSR and germline/somatic variant calling modules.

2.5 Population genomics study evaluation

In the newborn population genomics study, we quantified the main bottlenecks in processing a large population genomics dataset, including cabinet space, analysis time, power consumption, data transfer rate and data storage performance. To examine the reliability of analyses in large population studies, we then performed quality control on the results of the population analyses.

3 RESULTS

3.1 ZBOLT system efficiency and performance

A ping-pong buffer³¹ structure is adopted in the data organization of FPGA global memory (FPGA GM) to achieve high I/O efficiency. Data segmentation, compression/decompression algorithm optimization, calculation model simplification, and index optimization all contribute to accelerating the process from FASTQ to VCF results.
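The idea behind a ping-pong (double) buffer is that one buffer is filled while the other is being processed, so loading and computing overlap. The Python sketch below illustrates the pattern in software with two threads; it is a conceptual analogy only, not the FPGA global memory implementation used in ZBOLT, and all names are hypothetical.

```python
import queue
import threading


def ping_pong_process(chunks, process):
    """Process chunks in order while overlapping loading and computing.

    Two buffer slots are shared: the producer fills whichever slot is free
    while the consumer works on the slot that was filled previously.
    """
    buffers = [None, None]   # the "ping" and "pong" slots
    free = queue.Queue()     # indices of slots that may be overwritten
    ready = queue.Queue()    # indices of slots holding unprocessed data
    free.put(0)
    free.put(1)

    def producer():
        for chunk in chunks:
            idx = free.get()       # blocks until the consumer releases a slot
            buffers[idx] = chunk   # stands in for a load or DMA transfer
            ready.put(idx)
        ready.put(None)            # sentinel: no more data

    loader = threading.Thread(target=producer, daemon=True)
    loader.start()

    results = []
    while True:
        idx = ready.get()
        if idx is None:
            break
        results.append(process(buffers[idx]))
        free.put(idx)              # hand the slot back for refilling
    loader.join()
    return results


# Toy usage: each chunk is a list of numbers and processing sums the chunk.
toy_sums = ping_pong_process([[1, 2], [3, 4], [5]], sum)
print(toy_sums)
```

Because only two slots exist, the producer can run at most one chunk ahead of the consumer, which bounds memory use while keeping both sides busy.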

ZBOLT is supported by ZTRON (Figure 2), a dedicated genomics data management system designed for the multi-node collaboration and large-file processing needs of population genomics research. Table 1 compares ZTRON with regular products in terms of capacity, density, data transmission rate and data protection.

TABLE 1. Comparison results of ZTRON data management system and general products.

Capacity	1#	2#	3#	ZTRON data management system
	Regular products			ZTRON data management system
Density	36 disks per 4U chassis	60 disks per 4U chassis	84 disks per 5U chassis	106 disks per 4U chassis
Single node IO performance	2.8 GB/s	4 GB/s	7 GB/s	10 GB/s
Data transmission	Single 10/25 Gb/s			Multiple 100 Gb/s
Data protection	EC	RAID	RAID	RAID6 and hot disk redundancy backup

A RAID0 (Redundant Array of Independent Disks)³² configuration, consisting of two SSDs (Solid State Drives), enhances the I/O between ZBOLT and the storage module, caches all intermediate files in the calculation process and improves the system robustness. The customized scheduling algorithm facilitates multi-node computing scheduling in heterogeneous environments (Figure 3), intelligently allocating tasks to different nodes based on their resource utilization.
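The source does not describe the scheduling algorithm itself. As a rough illustration of utilization-aware allocation, the Python sketch below uses a generic greedy rule: each task goes to the node with the lowest accumulated load. This is an assumed stand-in, not ZBOLT's customized scheduler; the node names, task costs and function name are hypothetical.

```python
import heapq


def schedule(tasks, nodes):
    """Greedy least-loaded assignment.

    tasks: list of (task_id, estimated_cost) pairs
    nodes: list of node names
    Returns (assignment, loads), where assignment maps task_id -> node
    and loads maps node -> total assigned cost.
    """
    heap = [(0.0, name) for name in nodes]   # (current load, node)
    heapq.heapify(heap)
    assignment = {}
    # Placing the largest tasks first usually gives a more even balance.
    for task_id, cost in sorted(tasks, key=lambda t: -t[1]):
        load, name = heapq.heappop(heap)     # least-loaded node
        assignment[task_id] = name
        heapq.heappush(heap, (load + cost, name))
    loads = {name: load for load, name in heap}
    return assignment, loads


toy_tasks = [("sample1", 4), ("sample2", 3), ("sample3", 2), ("sample4", 1)]
toy_assignment, toy_loads = schedule(toy_tasks, ["node-a", "node-b"])
print(toy_assignment)
print(toy_loads)
```

A production scheduler would also need to track live CPU and FPGA utilization and handle task failures, which this sketch omits.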

A proprietary file system and rule engine for data management enable the dynamic management of genetic data throughout the entire cycle, improving storage utilization rates. Cyclic redundancy check is employed to guarantee both data integrity and efficient data transmission speed during the data transfer process. The portable workflow description language and visual interface design allow for convenient editing of the analysis process when facing various research demands, such as somatic mutation analysis. The performance of the ZBOLT system is supported by multiple hardware and software configurations.
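A cyclic redundancy check can be computed incrementally over a stream, which is why it adds little overhead to large file transfers. The Python sketch below shows the principle using CRC-32 from the standard `zlib` module. The source does not state which CRC variant or chunking scheme ZBOLT uses, so both are assumptions here, and the function names are hypothetical.

```python
import zlib


def crc32_of_chunks(chunks):
    """Compute a CRC-32 checksum incrementally over an iterable of byte chunks."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)   # continue from the running checksum
    return crc & 0xFFFFFFFF


def transfer_is_intact(sent_chunks, received_chunks):
    """Return True when sender and receiver checksums agree."""
    return crc32_of_chunks(sent_chunks) == crc32_of_chunks(received_chunks)


sent = [b"ACGT", b"TTGA"]
received_ok = [b"ACGTTT", b"GA"]     # same bytes, different chunk boundaries
received_bad = [b"ACGT", b"TTGC"]    # last byte corrupted

print(transfer_is_intact(sent, received_ok))
print(transfer_is_intact(sent, received_bad))
```

The checksum depends only on the byte stream, not on how it is split into chunks, so the receiver can use a different buffer size from the sender.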

3.2 ZBOLT performance in individual analysis

We evaluated the accuracy of the ZBOLT pipeline by comparing the variants called in the ZBOLT VCF²⁷ outputs with the GIAB benchmark as baseline variants. The GATK Best Practices outputs, produced from the same samples with the same parameters, were used as controls. The F-measure was used to evaluate the accuracy of SNP and INDEL calling, as shown in Table 2. The average F-measure was 99.65% for SNP calling and 99.098% for INDEL calling, indicating that ZBOLT had high accuracy for both. The results of the three replicates were also highly consistent, indicating the stability of the ZBOLT system (Table S2).
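The reported averages can be reproduced from the per-sample ZBOLT F-measures in Table 2. The short Python check below transcribes those ten values by hand; the variable names are illustrative only.

```python
from statistics import fmean

# ZBOLT F-measures (%) for HG001-HG005, transcribed from Table 2.
zbolt_f_measure = {
    "SNP": [99.68, 99.64, 99.66, 99.64, 99.63],
    "INDEL": [99.05, 99.35, 99.08, 98.89, 99.12],
}

mean_f = {variant_type: fmean(values) for variant_type, values in zbolt_f_measure.items()}

for variant_type, value in mean_f.items():
    print(f"{variant_type}: mean F-measure = {value:.3f}%")
```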

TABLE 2. Evaluation of the accuracy of GATK Best Practices and the ZBOLT system. The GIAB benchmark variants served as the baseline, and the VCF outputs of GATK Best Practices and ZBOLT served as the respective call sets.

Sample and pipeline (HG001-HG005)	SNP			INDEL
	Precision	Sensitivity	F-measure	Precision	Sensitivity	F-measure
HG001.GATK	99.55%	99.82%	99.68%	99.11%	98.99%	99.05%
HG001.ZBOLT	99.55%	99.82%	99.68%	99.11%	98.98%	99.05%
HG002.GATK	99.46%	99.82%	99.64%	99.29%	99.40%	99.34%
HG002.ZBOLT	99.46%	99.82%	99.64%	99.29%	99.40%	99.35%
HG003.GATK	99.49%	99.82%	99.66%	99.11%	99.04%	99.08%
HG003.ZBOLT	99.49%	99.82%	99.66%	99.11%	99.04%	99.08%
HG004.GATK	99.52%	99.76%	99.64%	99.04%	98.73%	98.89%
HG004.ZBOLT	99.52%	99.76%	99.64%	99.05%	98.73%	98.89%
HG005.GATK	99.43%	99.83%	99.63%	98.95%	99.28%	99.12%
HG005.ZBOLT	99.43%	99.83%	99.63%	98.95%	99.29%	99.12%

Moreover, the accuracy of the ZBOLT system was found to be comparable to that of GATK Best Practices when using the same inputs and parameters. The results of the GATK Best Practices analysis pipeline demonstrated that the average F-measure for SNP calling evaluation is 99.65%, while it is 99.097% for INDEL calling evaluation (Table S2).

An evaluation of the concordance between GATK Best Practices and ZBOLT was also conducted. As depicted in Table 3, the average F-measure in SNP calling was 99.99% and for INDEL calling was 99.97%. The results show that the two pipelines exhibit a high degree of consistency, further confirming that ZBOLT strictly follows GATK Best Practices while improving the analysis speed without compromising accuracy. Detailed evaluation results can be found in Table S3.

TABLE 3. Evaluation of the consistency of the two pipelines' outputs. The outputs of GATK Best Practices served as the baseline and the outputs of ZBOLT as the call set.

Sample (ZBOLT vs. GATK)	SNP			INDEL
	Precision	Sensitivity	F-measure	Precision	Sensitivity	F-measure
HG001	99.98%	99.99%	99.99%	99.96%	99.97%	99.96%
HG002	99.99%	99.99%	99.99%	99.98%	99.98%	99.98%
HG003	99.99%	99.99%	99.99%	99.97%	99.96%	99.97%
HG004	99.99%	99.99%	99.99%	99.96%	99.96%	99.96%
HG005	99.99%	99.99%	99.99%	99.96%	99.97%	99.97%

3.3 ZBOLT performance in population genomic study

To efficiently analyze the population study consisting of 5,616 high-depth sequencing samples, we deployed the MegaBOLT pipeline on 46 ZBOLT computing servers to run in parallel. With excellent storage density and well-designed layout, the 7.2 Pb storage system, including computing servers and storage devices, was aggregated into only 22U cabinet space, with 16U occupied by the storage device. The significant reduction in cabinet space, compared with conventional methods, alleviates the demands for server room space in big data research.

Data transfer rate is crucial in big data analysis.³³ The ZBOLT system employs multiple 100 Gb/s transmission channels, bringing the aggregate bandwidth to 33 GB/s. Once the system was prepared, the 1.16 Pbp of FASTQ files from the 5,616 WGS samples were transferred to it for WGS analysis.

The analysis from FASTQ to VCF was completed in 11.6 days, processing over 2.5 Pb of input and output data. On average, the system analyzed 100 Tbp of raw sequencing data per day. Energy consumption was measured after the analysis: the total was 11,306 kWh, an average of 0.98 kWh per 100 Gbp. In other words, analyzing one WGS sample at 30X sequencing depth consumed approximately 1 kWh.
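The throughput and energy figures follow from simple arithmetic on the totals quoted above. The Python sketch below reproduces them from the rounded values in the text. With these rounded inputs the energy works out to roughly 0.97 kWh per 100 Gbp, close to the reported 0.98 kWh; the small gap presumably reflects rounding of the 1.16 Pbp total, which is an assumption. The mean power draw is derived here and is not stated in the source.

```python
# Figures quoted in the text (rounded as reported).
TOTAL_BP = 1.16e15    # 1.16 Pbp of raw sequencing data
RUN_DAYS = 11.6       # duration of the FASTQ-to-VCF analysis
TOTAL_KWH = 11_306    # measured total energy consumption

# Throughput in tera base pairs per day.
tbp_per_day = TOTAL_BP / RUN_DAYS / 1e12

# Energy per 100 Gbp, roughly one 30X human genome.
kwh_per_100_gbp = TOTAL_KWH / (TOTAL_BP / 1e11)

# Mean electrical power over the run (derived, not reported in the source).
mean_power_kw = TOTAL_KWH / (RUN_DAYS * 24)

print(f"Throughput: {tbp_per_day:.1f} Tbp/day")
print(f"Energy:     {kwh_per_100_gbp:.3f} kWh per 100 Gbp")
print(f"Mean power: {mean_power_kw:.1f} kW")
```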

Following the analysis, we computed summary statistics for the population results. The mean sequencing depth was 68.7X, covering 99.52% of the genome. The mean transition-to-transversion ratio (Ts/Tv) was 1.951 for all SNPs. Representative summary statistics are shown in Figure 4. In total, we called 110 366 736 variants, of which 101 778 160 (92.2%) were SNPs and 9 326 668 were Indels; 1 098 959 variants (0.996%) were located in coding regions (Table 4). Overall, these summary statistics indicate that the population genomics analysis performed as expected and yielded normal results.
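For readers unfamiliar with the Ts/Tv metric, the Python sketch below shows how it is defined for biallelic SNPs: transitions are purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) substitutions, and all other substitutions are transversions. The function names and toy SNPs are hypothetical, and this is not the tool used in the study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """Return True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(snps) -> float:
    """Compute Ts/Tv for an iterable of (ref, alt) single-base substitutions."""
    snps = list(snps)
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


# Toy data: three transitions and one transversion.
toy_snps = [("A", "G"), ("G", "A"), ("C", "T"), ("A", "T")]
toy_ratio = ts_tv_ratio(toy_snps)
print(toy_ratio)
```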

TABLE 4. Variant statistics of 5,616 individuals.

Consequence	Number of variants
Total variants	110 366 736
SNPs	101 778 160
Indels	9 326 668
Coding variation	1 098 959
Coding sequence variant	743
Frameshift variant	25 762
Incomplete terminal codon variant	26
Inframe deletion	10 115
Inframe insertion	4813
Missense variant	663 556
Protein altering variant	285
Start lost	1623
Start retained variant	23
Stop gained	17 062
Stop lost	655
Stop retained variant	336
Synonymous variant	373 960
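As a consistency check on Table 4, the Python snippet below confirms that the coding consequence categories add up to the reported coding total and reproduces the two percentages quoted in the text. The counts are transcribed by hand from the table; the variable names are illustrative.

```python
TOTAL_VARIANTS = 110_366_736
SNPS = 101_778_160
CODING_TOTAL = 1_098_959

# Coding consequence categories, transcribed from Table 4.
coding_categories = {
    "Coding sequence variant": 743,
    "Frameshift variant": 25_762,
    "Incomplete terminal codon variant": 26,
    "Inframe deletion": 10_115,
    "Inframe insertion": 4_813,
    "Missense variant": 663_556,
    "Protein altering variant": 285,
    "Start lost": 1_623,
    "Start retained variant": 23,
    "Stop gained": 17_062,
    "Stop lost": 655,
    "Stop retained variant": 336,
    "Synonymous variant": 373_960,
}

coding_sum = sum(coding_categories.values())
snp_percent = 100 * SNPS / TOTAL_VARIANTS
coding_percent = 100 * CODING_TOTAL / TOTAL_VARIANTS

print(f"Sum of coding categories: {coding_sum}")
print(f"SNP share of all variants: {snp_percent:.1f}%")
print(f"Coding share of all variants: {coding_percent:.3f}%")
```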

4 DISCUSSION

In this study, we explored the ZBOLT analysis pipeline in population genomics studies. Adhering to GATK Best Practices, the ZBOLT analysis system enables a streamlined analysis by running a single command that specifies the GATK resource bundle release, eliminating the need for more complicated operations such as modifying the configuration file. Compared with GATK Best Practices, ZBOLT accelerates the whole analysis process by 10–20 times without sacrificing accuracy.

We assessed the accuracy of ZBOLT using the GIAB benchmark and its concordance with GATK Best Practices outputs. Our results demonstrate high accuracy, comparable to that of GATK Best Practices. Although the replicated and controlled experiments yielded the same evaluation metrics (precision, sensitivity, F-measure), some differences were observed in VCF metrics. These discrepancies were located outside the high-confidence regions and therefore did not affect the overall evaluation of the pipeline. Furthermore, the concordance evaluation indicated that ZBOLT is consistent with GATK Best Practices. Overall, our findings suggest that ZBOLT is an accurate and stable analysis system, well aligned with GATK Best Practices.

When applied to a newborn genomic study, ZBOLT analyzed 100 Tbp of sequencing data per day, highlighting its performance in large-scale population genomic studies. In addition, with a power consumption of about 1 kWh per 100 Gbp sample, ZBOLT shows advantages in power usage effectiveness,³⁴ making it a more eco-friendly and low-carbon option. The compact cabinet space required for ZBOLT installation further supports its application across population genomics research fields. Performance, power, and area are important measures for evaluating big data computing centers, and on these metrics ZBOLT is a strong analysis model for population genomics big data research.

Considering the system's throughput and costs, ZBOLT presents significant advantages when applied to population genomic studies comprising more than 5,000 WGS samples, especially projects involving upwards of 10,000 samples at 30X coverage.

Currently, the ZBOLT joint calling module is under development and optimization. Upon completion, it will be applied to the joint population analysis. In addition to the analysis pipeline, ZBOLT has developed an integrated system, ZLIMS, which enables automated scheduling of sample extraction, sequencing and analysis. The system can be used in conjunction with a sequencer to tightly integrate sequencing and analysis. Once sequencing is completed, the analysis is automatically initiated, streamlining the research process. Future research directions include innovations in data transmission technology based on compression, which will further accelerate data processing.

Compared with cloud computing solutions, the on-premise ZBOLT analysis system ensures data security and privacy. Moreover, ZBOLT is not reliant on a stable, high-speed network for data transfer, task scheduling and computing, which can be a limiting factor for cloud computing solutions. Users of the ZBOLT system can flexibly control all the software, hardware and data to meet various analysis needs.

To the best of our knowledge, ZBOLT is the first analysis solution specially designed for population genomics study that considers accuracy, efficiency and power consumption. The introduction of ZBOLT is expected to significantly advance population genomics, genetic disease diagnosis and precision medicine.

AUTHOR CONTRIBUTIONS

XJ, MF, BW, and RS conceived and organized the overall population genomic study. MF, YX and ZL designed the detailed research program. YG, XW, GH, and ZL collected the samples and conducted sequencing. YX, SG, JW, WH, LL, JT, QL, XZ, and YZ deployed the ZBOLT system and analyzed the WGS data. ZL, WH, and YH carried out the statistical analysis, including the evaluation of all outcomes. WZ, LL, XY, and RZ provided computing support. ZL, XW, and YH wrote the manuscript. MF, YX, and XJ revised the manuscript. All authors read and approved the final manuscript.

ACKNOWLEDGEMENTS

This study was supported by the National Key Research and Development Program of China (grant number: 2022YFC27031020), National Natural Science Foundation of China (grant numbers: 32171441 and 32000398), Natural Science Foundation of Guangdong Province, China (grant number: 2017A030306026), Guangdong-Hong Kong Joint Laboratory on Immunological and Genetic Kidney Diseases (grant number: 2019B121205005), and the specific research fund of The Innovation Platform for Academicians of Hainan Province (grant number: YSPTZX202118). This work was also supported by China National GeneBank (CNGB). The graphical abstract was created with Biorender.com.

CONFLICT OF INTEREST STATEMENT

The authors declare no potential conflict of interest.

ETHICS APPROVAL AND CONSENT TO PARTICIPATE

The studies involving human participants were reviewed and approved by the Institutional Review Board of BGI (BGI-IRB 20064). All participants' legal guardians signed the informed consent form.

Open Research

DATA AVAILABILITY STATEMENT

The HG001-HG005 datasets used for the accuracy evaluation are publicly available and can be accessed and downloaded through GIAB (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/). The detailed data paths are provided in Table S1.

Supporting Information

REFERENCES

1. Collins FS, Morgan M, Patrinos A. The human genome project: lessons from large-scale biology. Science. 2003; 300: 286-290. https://doi.org/10.1126/science.1084564
2. Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet. 2008; 24: 133-141. https://doi.org/10.1016/j.tig.2007.12.007
3. Drmanac R, Sparks AB, Callow MJ, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010; 327: 78-81. https://doi.org/10.1126/science.1181498
4. Thermes C. Ten years of next-generation sequencing technology. Trends Genet. 2014; 30: 418-426. https://doi.org/10.1016/j.tig.2014.07.001
5. Pereira R, Oliveira J, Sousa M. Bioinformatics and computational tools for next-generation sequencing analysis in clinical genetics. J Clin Med. 2020; 9: 132. https://doi.org/10.3390/jcm9010132
6. Stark Z, Dolman L, Manolio TA, et al. Integrating genomics into healthcare: a global responsibility. Am J Hum Genet. 2019; 104: 13-20. https://doi.org/10.1016/j.ajhg.2018.11.014
7. Sudlow C, Gallacher J, Allen N, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015; 12: 1-10. https://doi.org/10.1371/journal.pmed.1001779
8. Halldorsson BV, Eggertsson HP, Moore KHS, et al. The sequences of 150,119 genomes in the UK Biobank. Nature. 2022; 607: 732-740. https://doi.org/10.1038/s41586-022-04965-x
9. Taliun D, Harris DN, Kessler MD, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed program. Nature. 2021; 590: 290-299. https://doi.org/10.1038/s41586-021-03205-y
10. Denny JC, Rutter JL, Goldstein DB, et al. The ‘all of us’ research program. N Engl J Med. 2019; 381: 668-676. https://doi.org/10.1056/NEJMsr1809937
11. Muir P, Li S, Lou S, et al. The real cost of sequencing: scaling computation to keep pace with data generation. Genome Biol. 2016; 17: 1-9.
12. Van der Auwera GA, Carneiro MO, Hartl C, et al. From fastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013; 43(1110):11.
13. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010; 26: 589-595. https://doi.org/10.1093/bioinformatics/btp698
14. McKenna A, Hanna M, Banks E, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010; 20: 1297-1303. https://doi.org/10.1101/gr.107524.110
15. DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011; 43: 491-498. https://doi.org/10.1038/ng.806
16. Miller NA, Farrow EG, Gibson M, et al. A 26-hour system of highly sensitive whole genome sequencing for emergency management of genetic diseases. Genome Med. 2015; 7: 100. https://doi.org/10.1186/s13073-015-0221-8
17. Kendig KI, Baheti S, Bockol MA, et al. Sentieon DNASeq variant calling workflow demonstrates strong computational performance and accuracy. Front Genet. 2019; 10: 1-7. https://doi.org/10.3389/fgene.2019.00736
18. Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018; 36: 983. https://doi.org/10.1038/nbt.4235
19. Lin YL, Chang PC, Hsu C. Comparison of GATK and DeepVariant by trio sequencing. Sci Rep. 2022; 12: 8-13.
20. Zhao S, Agafonov O, Azab A, Stokowy T, Hovig E. Accuracy and efficiency of germline variant calling pipelines for human genome data. Sci Rep. 2020; 10: 1-12. https://doi.org/10.1038/s41598-020-77218-4
21. Schmidt B, Hildebrandt A. Next-generation sequencing: big data meets high performance computing. Drug Discov Today. 2017; 22: 712-717. https://doi.org/10.1016/j.drudis.2017.01.014
22. Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP. Computational solutions to large-scale data management and analysis. Nat Rev Genet. 2010; 11: 647-657. https://doi.org/10.1038/nrg2857
23. Supernat A, Vidarsson OV, Steen VM, Stokowy T. Comparison of three variant callers for human whole genome sequencing. Sci Rep. 2018; 8: 1-6. https://doi.org/10.1038/s41598-018-36177-7
24. Yang CH, Zeng JW, Liu CY, Hung SH. Accelerating variant calling with parallelized DeepVariant. Proceedings of the International Conference on Research in Adaptive and Convergent Systems, October 13-16, 2020, Gwangju, Republic of Korea. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3400286.3418243
25. Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2009; 38: 1767-1771. https://doi.org/10.1093/nar/gkp1137
26. Zook JM, McDaniel J, Olson ND, et al. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 2019; 37: 561-566. https://doi.org/10.1038/s41587-019-0074-6
27. Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics. 2011; 27: 2156-2158. https://doi.org/10.1093/bioinformatics/btr330
28. Li H, Handsaker B, Wysoker A. The sequence alignment/map format and SAMtools. Bioinformatics. 2009; 25: 2078-2079. https://doi.org/10.1093/bioinformatics/btp352
29. Chen Y, Chen Y, Shi C, et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience. 2018; 7: 1-6. https://doi.org/10.1093/gigascience/gix120
30. Benjamin D, Sato T, Lichtenstein L. Calling somatic SNVs and indels with Mutect2. bioRxiv. 2019. https://doi.org/10.1101/861054
31. Joo Y, McKeown N. Doubling memory bandwidth for network buffers. Proc IEEE INFOCOM. 1998; 2: 808-815.
32. Patterson DA, Gibson G, Katz RH. A case for redundant arrays of inexpensive disks (RAID). ACM SIGMOD Rec. 1998; 17: 109-116. https://doi.org/10.1145/971701.50214
33. Kashyap H, Ahmed HA, Hoque N, Roy S, Bhattacharyya DK. Big data analytics in bioinformatics: architectures, techniques, tools and issues. Netw Model Anal Heal Informatics Bioinforma. 2016; 5: 28. https://doi.org/10.1007/s13721-016-0135-4
34. Grealey J, Lannelongue L, Saw WY. The carbon footprint of bioinformatics. Mol Biol Evol. 2022; 39: 1-15. https://doi.org/10.1093/molbev/msac034

Volume 3, Issue 6, December 2023, e252

An efficient large-scale whole-genome sequencing analyses practice with an average daily analysis of 100Tbp: ZBOLT