Development of Intragenic InDel Markers From RNA-Seq Data for Map-Based Cloning and Marker-Assisted Selection in Maize (Zea mays L.)
Funding: This work was supported by the Major Science and Technology Project of Sichuan Province (2022ZDZX0013), the National Key Research and Development Program of China (2023YFD1201100), and the Natural Science Foundation Project of Sichuan Provincial (2022NSFSC0151).
Sijia Yang and Zhiqin Liu contributed equally to the work.
ABSTRACT
Maize is one of the most extensively grown and produced crops in the world. Functional molecular markers (FMM) are essential tools for enhancing the efficiency of both breeding and basic research. The transcriptome obtained from high-throughput sequencing is a valuable resource for developing FMM. In this study, we precisely identified high-density and highly polymorphic RNA-derived InDel markers from 368 high-quality, high read depth maize RNA-seq datasets using stringent criteria. The nonredundant marker dataset comprised 13,904 InDel markers with a Polymorphism Information Content (PIC) ≥ 0.5 and a principal allelic difference (length difference between the most and second most frequent alleles) ≥ 3 bp, covering 7749 genes. Subsequently, the polymorphism of InDel markers was verified experimentally. Fifty InDel markers were randomly selected, and their average PIC values in 20 maize inbred lines selected from 368 RNA-seq samples and 20 newly selected maize inbred lines were 0.53 and 0.54, with range intervals of 0.18 to 0.77 and 0.22 to 0.74, respectively. The high density and polymorphism of RNA-derived InDel markers identified in this study make them effective tools for maize gene cloning, genetic diversity studies, genome-wide association analysis, and marker-assisted selection.
1 Introduction
Maize, one of the most widely distributed crops globally, serves as a major source of feed, industrial raw materials, and food. However, the escalating environmental challenges and diminishing arable land (Juroszek and Tiedemann 2013), juxtaposed with the increasing demand for maize, underscore the imperative of enhancing breeding efficiency as a means to bolster maize yield. Traditional breeding techniques have reached a bottleneck in improving maize yields. Molecular marker technology, a revolutionary tool that has emerged in recent years, holds promise for improving crop breeding efficiency (Grover and Sharma 2016). Over the past decades, molecular markers have found successful applications in various facets of maize breeding, encompassing genetic diversity assessment, construction of genetic linkage maps, map cloning, and marker-assisted selection for target genes/QTL.
Various DNA markers have been developed in recent decades to distinguish DNA variants that differ in base composition or length. Markers that detect differences in DNA sequence composition (Khlestkina 2014) include restriction fragment length polymorphism (RFLP), amplified fragment length polymorphism (AFLP), and single-nucleotide polymorphism (SNP) markers (Mammadov et al. 2012a). Markers derived from length differences include randomly amplified polymorphic DNA (RAPD), simple sequence repeat (SSR) (Sharopova et al. 2002), and insertion/deletion (InDel) markers (Yang et al. 2016). Marker identification techniques with varying throughputs have been devised to accommodate the distinctive features of these markers. Low-throughput identification techniques comprise RFLP markers based on enzymatic digestion and DNA hybridization; RAPD, SSR, and InDel markers based on PCR and gel electrophoresis; and KASP, which identifies SNPs through fluorescence-based PCR amplification. With technological advancements, several high-throughput identification techniques, such as second-generation sequencing-based platforms (Grada and Weinbrecht 2013) (e.g., Roche Genome Sequencer FLX System, Illumina Genome Analyzer and ABI Ion Proton system) and microarray-based SNP identification methods (Ganal et al. 2011; Mammadov et al. 2012b; Tian et al. 2015), have been developed and widely adopted in research. However, length polymorphism marker techniques based on PCR and gel electrophoresis, especially agarose gel electrophoresis, remain popular due to their simplicity, cost-effectiveness and accessibility. Among these, InDel markers have emerged as the mainstream length polymorphism markers due to their high density.
Markers developed based on length differences are widely utilized because their variations are easy to detect. For instance, SSR markers have long been used extensively, primarily because of their high variability. Nevertheless, their low density constrains their application in studies such as fine mapping and molecular marker-assisted selection (MAS). InDels, being the most prevalent structural variation in the genome, hold great promise as molecular markers. However, the limited development of highly polymorphic InDel markers hinders their widespread application (Shete, Tiwari, and Elston 2000). Historically, the development of InDel markers relied primarily on traditional Sanger sequencing technology, which is slow and costly and therefore inhibited large-scale development. With the advent of second-generation sequencing technologies, rapid and cost-effective sequencers enable the genomes of a large number of individuals of the same species to be sequenced simultaneously with high accuracy (El-Metwally, Ouda, and Helmy 2014). Consequently, the efficiency of InDel locus identification has been significantly enhanced, and the resulting data furnish a high-quality resource for developing efficient InDel markers. For example, Bhattramakki et al. (2002) resequenced 502 maize loci from eight maize inbred lines and identified 655 InDel markers; Zhou et al. (2016) found 25,847 InDel sites distributed on 10 chromosomes by genome-wide structural variation assessment using sequencing data from 327 maize inbred lines; Liu et al. (2015) identified a total of 1,973,746 unique InDels in 345 maize genomes by next-generation sequencing; and Wang et al. (2024) identified a total of 89 SNPs and 11 InDels in 176 maize samples by analysing high-throughput sequencing data.
The aforementioned molecular markers primarily originate from intergenic regions, where they are utilized to obtain indirectly related information. In MAS (Hasan et al. 2021), markers linked to target genes are selected rather than the genes themselves, leading to the possibility of recombination between the markers and target genes, resulting in the selection of markers without the selection of target genes. Conversely, markers located within genes enable direct selection of genes, thereby avoiding recombination between markers and genes and enhancing selection efficiency. When assessing genetic diversity, intragenic molecular markers directly reflect the genetic variation of genes, providing a stronger functional correlation and greater accuracy than intergenic markers. Among these markers distributed in different regions of genes, those distributed in promoter and coding regions are more likely to be related to functions. In the coding region, an InDel with a base count that is not a multiple of three will cause a frame shift, which will affect protein structure, ultimately leading to changes in gene function. However, gene regions, especially the coding regions of genes, are often conserved, resulting in a limited number of available intragenic InDel markers. There is an urgent need to develop high-density intragenic molecular markers.
Although numerous InDel molecular markers have been developed for genetic research in maize, intragenic InDel markers derived from high-quality sequencing data are still lacking. In this study, leveraging 368 high-quality maize RNA-seq datasets and our previously developed InDel marker development method, we screened high-density and highly polymorphic RNA-derived InDel markers. The application of these markers is expected to enhance the efficiency of breeding and genetic research, especially in map-based cloning and molecular marker-assisted selection.
2 Methods and Materials
2.1 Genomic Data and Maize Materials
The B73 reference genomes (RefGen_v2, v4 and v5) and gene annotation information (5a and 5b) were obtained from MaizeGDB (https://maizegdb.org). The 368 RNA-seq datasets from developing maize kernels were downloaded from the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/?term=SRP026161) and used to mine InDels (Table S1).
A total of 40 samples were selected to assess polymorphism, including 20 maize inbred lines randomly chosen from the 368 RNA-seq samples (U8112, P178, DAN599, LX9801, JIAO51, W138, ES40, K22, DH3732, B73, MO17, DAN3130, SI273, S37, 18-599, YE478, QI205, DAN340, JH59 and ZHEG29) and 20 newly selected tropical and subtropical maize inbred lines (SCML203, SCML202, SAM3001, PH6WC, CA1108, 2142, 891, 78599-211, WA-1, YA8201, CA211, YA3237, SCML103, 21A, 9953, 08-641, Liao6082, JH961, SN8-1-1 and 9614).
2.2 Methods
2.2.1 InDel Marker Development Method
The InDel identification method used in this study was described previously by our research group (Liu et al. 2015). Rather than mapping reads to the reference genome, this method uses an electronic PCR (e-PCR) strategy to mine InDels and consists of three main steps. First, a set of e-PCR primer sequences covering the entire genome was designed iteratively from the reference genome sequence, with a primer length of 20 bp and an amplicon length of 60 bp. Second, the primer sequences were aligned to the reference genome by e-PCR to identify unique-locus primers, allowing a maximum of three mismatches during alignment. Allowing more mismatches during alignment to the reference genome can increase the probability that a primer also identifies a unique locus in other maize inbred lines, thereby enhancing the accuracy of the experiment. Third, after low-quality reads were filtered out with the NGS QC Toolkit v2.3.3 using a Q20 threshold (Patel and Jain 2012), the unique-locus primers were aligned by e-PCR to the RNA-seq reads of the 368 maize inbred lines. Finally, primers that yielded multiple e-PCR amplicon lengths within a single sample were removed.
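The e-PCR idea can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, the brute-force scan stands in for a real e-PCR tool, and applying the three-mismatch tolerance to reads (rather than only to the reference genome) is an assumption made for illustration. The sketch locates a primer pair in one read and reports the in silico amplicon length, returning nothing when the pair is absent or yields more than one length.

```python
from typing import List, Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def mismatches(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_sites(primer: str, seq: str, max_mismatch: int) -> List[int]:
    """Start positions in seq where primer matches with <= max_mismatch mismatches."""
    n = len(primer)
    return [i for i in range(len(seq) - n + 1)
            if mismatches(primer, seq[i:i + n]) <= max_mismatch]


def epcr_amplicon_length(forward: str, reverse: str, read: str,
                         max_mismatch: int = 3) -> Optional[int]:
    """In silico amplicon length for one primer pair on one read.

    Returns None when the pair does not amplify, or when it gives more than
    one amplicon length (such primers would be discarded as ambiguous).
    """
    fwd_sites = find_sites(forward, read, max_mismatch)
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement on the read strand.
    rev_sites = find_sites(reverse_complement(reverse), read, max_mismatch)
    lengths = {r + len(reverse) - f
               for f in fwd_sites for r in rev_sites
               if r >= f + len(forward)}
    return lengths.pop() if len(lengths) == 1 else None
```

Comparing the returned length with the 60-bp amplicon expected from the reference gives the InDel size at that locus; for example, a returned length of 56 would correspond to a 4-bp deletion.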
2.2.2 Polymorphism Evaluation of InDel Markers
InDel loci with at least 20 genotypes per allele were further analysed. The polymorphism information content (PIC) of each InDel locus was evaluated to assess the allelic polymorphism (Serrote et al. 2020), calculated using the formula $PIC_i = 1 - \sum_{j} p_{ij}^{2}$, where $p_{ij}$ is the frequency of the jth pattern of the ith marker.
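As a worked illustration, the PIC of a single locus can be computed from the observed allele calls. The sketch below assumes the commonly used form PIC_i = 1 - sum_j p_ij^2, which matches the definition of p_ij given above; the function name `pic` and the use of amplicon lengths as allele labels are choices made for this example only.

```python
from collections import Counter
from typing import Hashable, Iterable


def pic(alleles: Iterable[Hashable]) -> float:
    """Polymorphism information content of one locus.

    alleles: one allele call per genotyped line (e.g. amplicon length in bp).
    Assumes PIC = 1 - sum(p_j ** 2), where p_j is the frequency of allele j.
    """
    counts = Counter(alleles)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one genotype is required")
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Two alleles at equal frequency give the two-allele maximum of 0.5.
example = pic([60, 60, 63, 63])
```

Under this formula a locus with only two alleles can never exceed 0.5, so markers with higher PIC values necessarily carry three or more alleles.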
2.2.3 Selection of Highly Polymorphic Markers
The selection criteria for InDel markers were (1) each marker having over 20 genotype data; (2) PIC ≥ 0.5; (3) mapped to a unique location in the reference genome with no more than three mismatched bases during alignment; and (4) principal allelic difference (i.e., the size difference between the most common and the second most common alleles) ≥ 3 bp.
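The four criteria can be expressed as a single filter. The following is a minimal sketch, not the authors' code: the `Marker` record and function names are hypothetical, criterion (1) is interpreted as more than 20 genotyped lines in total, criterion (3) is assumed to have been precomputed as a boolean flag, and PIC is assumed to take the form 1 - sum(p^2).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict


@dataclass
class Marker:
    allele_counts: Dict[int, int]  # amplicon length (bp) -> number of lines
    unique_locus: bool             # unique in the reference with <= 3 mismatches


def pic(allele_counts: Dict[int, int]) -> float:
    """PIC assumed as 1 - sum(p_j ** 2) over allele frequencies."""
    total = sum(allele_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in allele_counts.values())


def principal_allelic_difference(allele_counts: Dict[int, int]) -> int:
    """Length difference (bp) between the two most frequent alleles."""
    ranked = Counter(allele_counts).most_common(2)
    if len(ranked) < 2:
        return 0
    return abs(ranked[0][0] - ranked[1][0])


def is_highly_polymorphic(marker: Marker, min_genotypes: int = 20,
                          min_pic: float = 0.5, min_diff: int = 3) -> bool:
    """Apply the four selection criteria to one marker."""
    total = sum(marker.allele_counts.values())
    return (total > min_genotypes
            and pic(marker.allele_counts) >= min_pic
            and marker.unique_locus
            and principal_allelic_difference(marker.allele_counts) >= min_diff)
```

Keeping the thresholds as parameters means the same function could also express a stricter subset, for instance by passing `min_pic=0.6` and `min_diff=8`.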
2.2.4 Gene Ontology (GO) Analysis
GO analysis was performed on genes with an InDel marker density higher than one marker per 2 kb. GO annotation and enrichment were conducted using the online agriGO toolkit (Du et al. 2010). Enrichment analysis was performed using singular enrichment analysis (SEA), with the maize reference genome (RefGen_v4) as the background. Significance was assessed with the hypergeometric test, and p values were adjusted using the Benjamini-Hochberg false discovery rate (FDR) method with default parameters.
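The statistics behind this kind of enrichment analysis can be sketched in a few lines of standard-library Python. This is an illustration of a hypergeometric test followed by Benjamini-Hochberg adjustment, not a reimplementation of agriGO; the function names and the argument layout are assumptions made for the example.

```python
from math import comb
from typing import List


def hypergeom_upper_tail(k: int, population: int, annotated: int, drawn: int) -> float:
    """P(X >= k) for a hypergeometric draw.

    population: genes in the background; annotated: background genes with the
    GO term; drawn: genes in the query list; k: query genes with the GO term.
    """
    upper = min(annotated, drawn)
    favourable = sum(comb(annotated, i) * comb(population - annotated, drawn - i)
                     for i in range(k, upper + 1))
    return favourable / comb(population, drawn)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A GO term would then be called enriched when its adjusted p value falls below the chosen FDR cutoff.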
2.2.5 PCR Primer Design
A Perl program was used to extract a 200-bp region, comprising a 20-bp InDel region and 90-bp flanking sequences on each side of the InDel region, for designing PCR primers with the Primer3 software. The primer length ranged from 18 to 24 nt, with an optimal length of 22 nt, and the melting temperature (Tm) ranged from 60°C to 64°C, with an optimal temperature of 62°C. PCR products ranged from 60 to 200 bp in length, with an optimal length between 60 and 100 bp, and primers with a G- or C-rich 3′ end were preferred.
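The template extraction step can be sketched as follows. The original work used a Perl program that is not shown here, so this Python version is only an illustration: the function name, the 0-based start coordinate and the settings dictionary are assumptions. The dictionary keys are standard Primer3 global tags, filled in with the values stated above; whether the authors set exactly these tags is not stated.

```python
def extract_primer_template(chrom_seq: str, indel_start: int,
                            indel_region: int = 20, flank: int = 90) -> str:
    """Return the InDel region plus flanks (200 bp with the default sizes).

    indel_start is assumed to be the 0-based start of the InDel region.
    """
    start = indel_start - flank
    end = indel_start + indel_region + flank
    if start < 0 or end > len(chrom_seq):
        raise ValueError("template would run off the end of the sequence")
    return chrom_seq[start:end]


# Primer3 global tags populated with the parameters described in the text.
PRIMER3_SETTINGS = {
    "PRIMER_MIN_SIZE": 18,
    "PRIMER_OPT_SIZE": 22,
    "PRIMER_MAX_SIZE": 24,
    "PRIMER_MIN_TM": 60.0,
    "PRIMER_OPT_TM": 62.0,
    "PRIMER_MAX_TM": 64.0,
    "PRIMER_PRODUCT_SIZE_RANGE": "60-200",
}
```

Raising an error for loci too close to a sequence end, rather than returning a shorter template, keeps every template at the same length so that primer positions stay comparable across markers.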
2.2.6 Experimental Validation of InDel Polymorphism
Fifty highly polymorphic InDel markers (PIC ≥ 0.5, principal allelic difference ≥ 3 bp) were randomly chosen for polymorphism validation. Genomic DNA was extracted from 3-week-old seedlings of the 20 maize inbred lines chosen from the RNA-seq panel and the 20 newly selected maize inbred lines using an improved CTAB (cetyltrimethylammonium bromide) DNA extraction protocol. PCR amplification of InDel loci was conducted in a 20-μL reaction mixture containing 50 ng of DNA, 2.0 μL of 10× buffer (with Mg2+), 3.0 μL of dNTPs (2.5 mM), 100 nM of each primer, 2 U of Taq polymerase and ddH2O. The reaction was carried out in a BIO-RAD CFX-96 RT-PCR thermal cycler with the following cycling parameters: initial denaturation at 95°C for 5 min; 35 cycles of denaturation at 95°C for 30 s, annealing at 55°C for 90 s and extension at 72°C for 90 s; and a final extension at 72°C for 10 min. PCR products were separated on a 6.0% polyacrylamide gel for InDel genotyping. PIC values for InDel markers were calculated using the formula given in Section 2.2.2.
3 Results
3.1 Identification of InDel Variations
Using 1,796,644 unique electronic PCR primers, 360,010 (20.04%) InDel loci were identified from the sequencing data of 368 maize inbred lines, with a density of 174.88 InDel/Mbp distributed throughout the genome. The average number of alleles per InDel locus was 2.50, ranging from 2 to 74. The PIC values of InDel loci ranged from 0.01 to 0.98, with an average of 0.20. Among the PIC values, 49.12% were less than 0.1. There were 40,564 (13.39%) InDel loci with PIC ≥ 0.5, localized to 15,660 genes, with an average of approximately 2.59 InDel loci per gene, about one InDel locus per 7264 bp in gene regions and one InDel locus per approximately 50,749 bp in the genome. As the PIC values increased, the number of InDel markers gradually decreased (Figure 1). The length of InDels ranged from −20 to 60 bp, with an average of approximately 4.30 bp. In general, the number of InDels decreases as the size of the InDel increases, but the number of InDels of size 3 and integer multiples thereof increases significantly (Figure 2).


3.2 Identification of Highly Polymorphic InDel Markers
Using stringent criteria (PIC ≥ 0.5 and principal allelic difference ≥ 3 bp), 13,904 high-quality InDel markers were identified, covering a total of 7749 genes. Primer sequences for amplicons of 60 to 200 bp and 200 to 300 bp are provided in the Supporting Information to accommodate the different resolutions of polyacrylamide and agarose gel electrophoresis. Furthermore, detailed information on the InDel markers, including PIC values, GO annotations, primer sequences, physical positions and gene models on the genome (RefGen_v2, v4 and v5), number of alleles, number of genotypes in the population, and size and number of principal alleles, is provided in the Supporting Information. Additionally, 690 core InDel markers were selected using stricter criteria (PIC ≥ 0.6 and principal allelic difference ≥ 8 bp) (Table S2).
3.3 Validation of InDel Marker Polymorphism
To validate the polymorphism of InDel markers, 50 highly polymorphic InDel markers were randomly selected for polymorphism validation using genomic DNA from 20 resequenced maize inbred lines and 20 new maize inbred lines as templates. The average PIC values were 0.53 and 0.54, ranging from 0.18 to 0.77 and 0.22 to 0.74, with average allele numbers of 2.92 and 3.08, ranging from 2 to 6 and 2 to 5, respectively (Figure 3 and Table S3).

3.4 GO Analysis
A total of 4303 genes contained highly polymorphic InDel markers at high density (> 1 InDel per 2 kb). Among them, 2328 genes were annotated to 732 GO terms. One hundred GO terms showed significant enrichment, including those related to biological processes (72 GO terms), cellular components (17 GO terms) and molecular functions (11 GO terms). These biological processes mainly involved biosynthetic processes, cellular biosynthetic processes and regulatory processes, particularly those related to nitrogen compound metabolic processes and responses to biotic and abiotic stresses, such as responses to nonliving stimuli and freezing, as well as inflammatory responses to antigen stimulation. Cellular components primarily included cellular parts and intracellular organelles, whereas molecular functions mainly involved ribosomal hydration, structural molecular activity and structural composition (Figure 4).

4 Discussion
4.1 A Variety of Molecular Marker Technologies
Over the past decades, advances in molecular marker technologies have focused on reducing costs, improving ease of use and increasing throughput. Currently, two main marker systems are used in crops to meet different research needs. One system focuses on high-density SNP variant identification in individual samples, using high-throughput sequencing or SNP genotyping arrays. High-throughput sequencing platforms can identify most SNP variants at once, whereas SNP arrays of various scales can genotype hundreds, thousands, tens of thousands or even hundreds of thousands of SNP loci. The other system is primarily suited to genotyping a small number of marker loci in individual samples. PCR-based genotyping coupled with gel electrophoresis of length polymorphism markers is the most common genotyping technique in biological laboratories because of its affordability, minimal equipment requirements and ease of use. InDel markers are well suited as length polymorphism markers: they share the advantages of SSR markers and are densely distributed in the genome, which compensates for the low density of SSRs. They have gradually become standard markers for genotype identification in laboratories. The high-density, highly polymorphic InDel markers developed in this study can help improve the efficiency of map-based cloning and marker-assisted selection in maize.
4.2 Impact of Sequencing Quality and Depth on Marker Development
High-quality data are essential for obtaining reliable results. Two approaches are commonly used to improve data quality: filtering out low-quality reads that do not meet defined standards such as Q20/Q30, and increasing the sequencing depth per locus. During the early rise of NGS, low-coverage sequencing was employed for large-scale resequencing of individuals within a species, trading depth per sample against the number of samples sequenced at a given cost. Several studies involving resequencing of hundreds of samples reported average coverage of 1–2×. With the rapid decrease in NGS costs, resequencing of 50 rice accessions, both cultivated and wild, at depths exceeding 10× yielded 6.5 million high-quality SNPs (Xu et al. 2012). Currently, many whole-genome resequencing projects employ a depth of 10× or higher. Compared to whole-genome resequencing, transcriptome sequencing at the same depth achieves significantly higher coverage. Furthermore, the larger and more complex the genome, the greater the difference in coverage; in maize, for example, the difference exceeds 20-fold. The quality of the RNA-seq-based InDel markers developed in this study was improved by evaluating and removing low-quality raw reads. The average depth per sample in this dataset exceeded 70×, which is, to our knowledge, the highest depth for maize sequence data to date. This coverage effectively ensured the quality of the InDel markers we developed. The number of highly polymorphic InDels developed in the mRNA region is nearly double the number of markers obtained in previous studies that used low-depth sequencing data to develop genome-wide InDel markers. In addition, each marker developed previously was supported by about 1 sequencing fragment on average, compared with about 21 in this study. The higher depth of coverage and the greater number of supporting sequencing fragments ensured the accuracy of the developed markers.
4.3 Unique Locus and Conserved Primer Sequences
The uniqueness and conservation of the flanking sequences at polymorphic loci affect PCR amplification efficiency. Nonspecific amplification reduces the accuracy of PCR genotyping and ultimately leads to inaccurate experimental results. This issue is particularly pronounced in species with complex genomes, such as maize, where 85% of the genome comprises repetitive sequences (Schnable et al. 2009). The reference genome sequence provides the foundational data for identifying primer sequences that map to a unique locus in the genome. During marker development, allowing up to three base mismatches when aligning primer sequences to the reference genome makes the uniqueness filter stricter, because primers with near-matches elsewhere in the genome are rejected, and thus increases the specificity of the markers during PCR amplification. In validation experiments, nonspecific bands in PCR products were almost nonexistent. Conserved primer sequences enhance the success rate of PCR amplification. This study evaluated sequence conservation based on the number of allelic genotypes in 368 resequenced samples. Furthermore, conservation was improved by removing highly polymorphic variant sites in flanking sequences. In validation experiments, the PCR amplification success rate exceeded 89.3%.
4.4 User-Friendly Molecular Markers
In this study, a set of InDel molecular markers was identified from large-scale datasets so that researchers do not have to process the extensive raw data themselves. To facilitate selection, detailed information for each marker is provided in Excel format, and researchers can give priority to the core markers. This detailed information helps users choose suitable markers. For instance, appropriate gel types and concentrations can be selected based on the size differences between alleles. Biallelic markers offer several advantages, including easy genotype partitioning when fine mapping with biparental populations, comparison or integration of analyses across different platforms, and high-throughput genotyping.
4.5 Functional InDel Molecular Markers
Many sets of InDel molecular markers have been successfully developed by various researchers, including our team. These InDel markers have significantly enriched the molecular marker identification system in maize. However, these markers are randomly distributed, with only a portion located in gene regions. InDels located in genes are more likely to be directly associated with phenotypes, making them more efficient for genetic diversity evaluation and MAS studies. In maize, an InDel in the coding region of the dwarf8 gene promotes early flowering, thereby shortening the growth period and reducing plant height (Thornsberry et al. 2001). This marker has been utilized in maize MAS breeding. In rice, the Pi54 gene is associated with resistance to rice blast disease. Researchers identified a 144-bp InDel variation within this gene and designed a specific marker, Pi54 MAS, based on this variation (Ramkumar et al. 2011). In wheat breeding, InDel variations within the Lr34 gene, which confers resistance to multiple diseases including leaf rust, stripe rust and powdery mildew, have been effectively utilized to develop molecular markers. One example is the CSSfr5 InDel marker, which was designed based on a 3-bp InDel polymorphism in exon 11 of the Lr34 gene. This marker has shown high efficacy in distinguishing between resistant and susceptible wheat varieties, thereby facilitating marker-assisted selection in breeding programmes aimed at improving disease resistance (Krattinger et al. 2009). To date, intragenic markers remain underdeveloped not only in maize but in nearly all crops. The RNA-derived InDel markers we developed constitute the largest set of intragenic InDel markers for maize, covering over 7000 polymorphic genes.
These markers can be applied to map-based cloning, support the improvement of maize quality and yield, increase the efficiency of maize research and breeding, and provide a reference for the development of intragenic InDel markers in other crops.
5 Conclusion
This study developed a set of genome-wide highly polymorphic RNA-derived InDel markers. Experimental verification results demonstrated that this set of markers exhibits high polymorphism. Each marker was accompanied by detailed information, enhancing the user-friendliness of this marker set. These InDel markers are not only beneficial for map-based gene cloning and MAS in maize research, but the method can also serve as a reference for marker development in other species.
Author Contributions
Sijia Yang: investigation, validation, data curation, visualization, writing–review and editing, writing–original draft. Zhiqin Liu: visualization, validation, investigation, data curation. Ting Liang: investigation, validation, data curation. Junle Wu: validation, investigation, data curation. Hang Mi: software, data curation, formal analysis. Ruifan Bao: investigation, validation. Pengxu Meng: investigation, validation. Jixing Ni: investigation, validation. Xueying Wang: investigation, validation. Wujiao Deng: investigation, validation. Haimei Wu: investigation, validation. Jinchang Yang: investigation, validation. Tingzhao Rong: conceptualization, methodology, resources. Jian Liu: conceptualization, methodology, project administration, supervision, funding acquisition, resources.
Acknowledgements
We are grateful for the financial support for this work from the Natural Science Foundation Project of Sichuan Provincial (Grant No. 2022NSFSC0151), the Major Science and Technology Project of Sichuan Province (Grant No. 2022ZDZX0013), and the National Key Research and Development Program of China (2023YFD1201100).
Conflicts of Interest
The authors declare no conflicts of interest.
Open Research
Data Availability Statement
The authors confirm that the major data supporting the findings of this study are available within the article and its supporting information.