RepeatProfiler: A pipeline for visualization and comparative analysis of repetitive DNA profiles
Abstract
Study of repetitive DNA elements in model organisms highlights the role of repetitive elements (REs) in many processes that drive genome evolution and phenotypic change. Because REs are much more dynamic than single-copy DNA, repetitive sequences can reveal signals of evolutionary history over short time scales that may not be evident in sequences from slower-evolving genomic regions. Many tools for studying REs are directed toward organisms with existing genomic resources, including genome assemblies and repeat libraries. However, signals in repeat variation may prove especially valuable in disentangling evolutionary histories in diverse non-model groups, for which genomic resources are limited. Here, we introduce RepeatProfiler, a tool for generating, visualizing, and comparing repetitive element DNA profiles from low-coverage, short-read sequence data. RepeatProfiler automates the generation and visualization of RE coverage depth profiles (RE profiles) and allows for statistical comparison of profile shape across samples. In addition, RepeatProfiler facilitates comparison of profiles by extracting signal from sequence variants across profiles which can then be analysed as molecular morphological characters using phylogenetic analysis. We validate RepeatProfiler with data sets from ground beetles (Bembidion), flies (Drosophila), and tomatoes (Solanum). We highlight the potential of RE profiles as a high-resolution data source for studies in species delimitation, comparative genomics, and repeat biology.
1 INTRODUCTION
Repetitive DNA elements (e.g., transposable elements, tandem repeats, high-copy genes) have been understudied for decades due to technical and computational challenges associated with their sequencing and assembly. As recent advances in sequencing technology overcome those obstacles, work in model organisms is shedding new light on the critical roles repetitive elements (REs) play in many processes including shifts in gene regulation that drive phenotypic change, rapid genome evolution, and mechanisms underlying reproductive isolation and speciation (Chuong et al., 2017; Ferree & Barbash, 2009; Niu et al., 2019; Schrader & Schmitz, 2019; Stuart et al., 2016). Because repetitive regions can evolve much more rapidly than unique DNA sequences (e.g., protein-coding genes), repetitive sequences can reveal signals of evolutionary history over short time scales that may not be evident in slower-evolving genomic regions.
Despite their critical roles in genome and phenotype evolution and their known rapid turnover between closely related species (e.g., Lower et al., 2017; Sproul, Khost, et al., 2020; Strachan et al., 1982; Tetreault & Ungerer, 2016; Ugarković & Plohl, 2002), repeat dynamics are seldom considered in evolutionary studies aiming to understand species boundaries and recent genome evolution. One caveat to using REs in evolutionary studies is that their abundance may fluctuate widely across samples, even below the species level (Bosco et al., 2007; McLain et al., 1987; Mestrović et al., 1998; Raskina et al., 2008). Within species, this variation may be largely caused by expansion and contraction of REs due to unequal exchange within recombining repeat clusters (Smith, 1976). Dynamic variation in repeat abundance may add noise when comparing patterns across samples, which can confound signals that may be present (Martín-Peciña et al., 2019). Development of approaches that can extract evolutionary signal despite repeat abundance variation can potentially increase the resolution of evolutionary studies among populations and species (Sproul, Barton, et al., 2020).
Many software tools are available for studying REs. Most tools rely on fully assembled genomes and repeat libraries (Goerner-Potvin & Bourque, 2018) and are therefore most useful in groups with extensive genomic resources. However, a number of “reference-free” tools enable discovery and annotation of REs in any group using low-coverage shotgun sequence data (Ewing, 2015; Nelson et al., 2017). Despite these advances, a general shortcoming of repeat software is that few programs have features that allow for quantitative comparison of repeat patterns across samples. Such comparative analysis will become increasingly important as the study of REs extends from few representative individuals to more extensive sampling (e.g., population-level sampling within species), which is important for understanding how REs shape genome evolution within and across species. In addition, REs present a special set of challenges to standard comparative genomics approaches (Fernandes et al., 2020). REs account for large fractions of missing sequence in assemblies—this remains true even in premier model organisms with “gold standard” genomic resources. It is difficult to conduct any comprehensive study of REs on incomplete assemblies. Although long-read technology is closing many of these gaps, the megabase-long blocks of tandem repeats that are common in many genomes remain a challenge to assemble for the foreseeable future. Furthermore, reads arising from repetitive sequences present multiple mapping issues (i.e., one read can map equally well to multiple places in the genome) adding uncertainty to analyses that rely on mapping reads to assembled REs. Approaches that measure repeat dynamics directly from low-coverage sequence data have advantages for detecting signatures of repeat dynamics that may not be evident in incomplete genome assemblies and for studies in groups with limited genomic resources.
Here we present RepeatProfiler as a tool for both exploration and comparative analysis of repetitive DNA element patterns with low-coverage, short-read data. The pipeline was motivated by a previous study by Sproul, Barton, et al. (2020), which highlighted the potential of RE profiles in species delimitation studies. RepeatProfiler maps reads to reference repeat sequences, generates enhanced read depth/copy number profiles for visualizing patterns, and facilitates statistical comparison of profiles within and among samples. The pipeline compares sources of profile variation (e.g., profile shape, and relative abundance of variants within profiles) that can be stable within species despite variation in absolute repeat abundance. Comparing profiles for the same repeat reference between multiple samples can reveal species-specific differences in sequence composition (e.g., substitutions, indels), relative abundance, truncations, and amplification of partial repeats throughout the genome. This approach to studying REs, which is also used by other recent tools (e.g., DeviaTE [Weilguny & Kofler, 2019]), bypasses issues related to multiple mapping and incomplete genome assemblies because all genomic reads are mapped to reference sequences and the resulting profiles capture variation in the repeat sequence, regardless of the genomic distribution of its copies (Fernandes et al., 2020; Sproul et al., 2020). Studying profiles of multiple REs in a comparative framework can provide evidence of gene flow boundaries, highlight potential mechanisms underlying repeat evolution, and detect signatures of rapid genome evolution not evident in less dynamic components of the genome (Sproul, Barton, et al., 2020; Sproul & Maddison, 2018).
RepeatProfiler uses Bowtie2 (Langmead & Salzberg, 2012), SAMtools (Li et al., 2009), and Python to map reads and process mapping output. It uses R (R Core Team, 2013), including the packages ggplot2 (Wickham, 2016a), ggpubr (Kassambara, 2017), reshape2 (Wickham, 2007), and scales (Wickham, 2016b), for data visualization and comparative analysis of profile features, and GNU parallel (Tange, 2018) for parallel processing. The pipeline itself is written in bash and published under the GPL 3.0 license. RepeatProfiler is available for download from https://github.com/johnssproul/RepeatProfiler and can be installed on Unix, Linux, and Windows platforms, with installation options through Homebrew and Docker to automate installation of dependencies.
2 MATERIALS AND METHODS
2.1 Overview
An overview of the pipeline workflow is provided in Figure 1. RepeatProfiler generates RE profiles by mapping input reads to reference sequences of REs of interest. Output includes two styles of enhanced read depth profiles (colour-enhanced and variant-enhanced) to facilitate comparison of patterns across samples and REs. Output also includes summary tables and results from comparative analysis. The pipeline design and parallelization at key steps enable efficient scaling to data sets that include thousands of references across hundreds of samples in a single run.

2.2 Input
RepeatProfiler has two input requirements: (i) short-read sequence data from one or more samples in fastq format; and (ii) a fasta file containing reference sequences of REs to be analysed. Short-read sequence data may include low-coverage (e.g., 0.1–10×) whole-genome shotgun data or reads produced from a target-capture sequencing approach (e.g., anchored hybrid enrichment or ultraconserved element approach). The latter assumes input reads contain off-target “background” reads such that profiles can be generated from off-target reads while ignoring reads from enriched targets. Repeat reference sequences may be obtained from existing repeat libraries or online databases (e.g., NCBI, Dfam). For organisms lacking reference libraries and/or representation in databases, reference sequences may be obtained using repeat assembly software such as RepeatExplorer (Novák et al., 2013, 2017) with possible workflows outlined in Supporting Information. The pipeline assumes input reads have undergone quality control (e.g., trimming and adapter removal) and downsampling (if desired) prior to analysis.
2.3 Generating profiles
The pipeline generates read depth profiles by mapping input reads to reference sequences using Bowtie2 (Langmead & Salzberg, 2012). We use Bowtie2 as it has been shown using real data to have higher rates of read mapping while still being faster than comparable programs (Langmead & Salzberg, 2012). The default settings of RepeatProfiler use Bowtie2 parameters that tolerate mismatches, given that users may be mapping reads from divergent species to a common reference; however, users can alter read mapping parameters for other uses of the tool. For cases where using a different read mapper is desired, users also have the option of feeding their own BAM files directly into the pipeline using the “-bam” flag. Following read mapping (or reading in user-supplied BAM files), BAM files are processed using SAMtools (Li et al., 2009), including the mpileup function which generates variant information at each site relative to the reference sequence. SAMtools output is then parsed to retain read depth information and simplify variant output prior to plotting for visualization.
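The parsing step described above can be illustrated with a minimal Python sketch. It is not the pipeline's own parser: it assumes the standard six-column text output of SAMtools mpileup, and the function names, the "indel" tally, and the example line are hypothetical. It reduces the read-base column of one site to a read depth plus counts of matches and mismatching bases.

```python
import re
from collections import Counter


def count_pileup_bases(bases: str) -> Counter:
    """Tally the read-base column (column 5) of one samtools mpileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # '^' marks a read start and is followed by one mapping-quality character.
            i += 2
            continue
        if ch in "+-":
            # Indel: sign, decimal length, then that many inserted/deleted bases.
            digits = re.match(r"\d+", bases[i + 1:]).group()
            counts["indel"] += 1
            i += 1 + len(digits) + int(digits)
            continue
        if ch in ".,":
            counts["match"] += 1        # read agrees with the reference base
        elif ch.upper() in ("A", "C", "G", "T"):
            counts[ch.upper()] += 1     # mismatch; both strands are pooled
        # '$' (read end), '*' (deletion placeholder) and 'N' are skipped in this sketch.
        i += 1
    return counts


def parse_pileup_line(line: str) -> dict:
    """Keep only the fields needed for a depth profile and a variant summary."""
    ref, pos, ref_base, depth, bases = line.rstrip("\n").split("\t")[:5]
    return {
        "reference": ref,
        "position": int(pos),
        "ref_base": ref_base.upper(),
        "depth": int(depth),
        "counts": count_pileup_bases(bases),
    }


# Hypothetical pileup line: six reads cover position 101 of a 28S reference.
example = "28S_rDNA\t101\tA\t6\t..,^]T.-2AGt$\tIIIIII"
site = parse_pileup_line(example)
print(site["position"], site["depth"], dict(site["counts"]))
```

Pooling forward-strand and reverse-strand bases is a simplification chosen here for brevity; a variant-enhanced profile only needs the per-site totals.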
2.4 Visualizing profiles
RepeatProfiler produces two types of enhanced read-depth profiles in R using the ggplot2 (Wickham, 2016a), ggpubr (Kassambara, 2017), scales (Wickham, 2016b), and reshape2 (Wickham, 2007) packages. Colour-enhanced profiles use a colour gradient to provide a visual indication of read depth at each site using a scale that is standardized across samples and reference sequences. Variant-enhanced profiles provide a visual summary of sites that show sequence variants relative to the reference sequence. For both colour-enhanced and variant-enhanced profiles, we combine plots into a single, multiplot PDF to simplify visual comparison of patterns across samples and REs. For all profile types, users have the option to annotate features within profiles using the optional flag “-bed” and supplying a BED file with coordinates of features within reference sequences. Users can also use the “-indel” flag to visualize sites in the reference sequence where indels are present in mapped reads, provided the indel-containing reads surpass a user-adjustable coverage threshold for that site (see Supporting Information Methods for additional description of the “-indel” flag).
2.5 Comparative analysis
2.5.1 Correlation analysis
Using an optional flag (-corr), the user can test the degree of correlation in profile shape (i.e., the pattern of coverage depth across a given reference sequence) within and among user-defined groups. Sample groups may represent different species, populations, sexes, tissues, or other groupings. The correlation analysis measures the degree of similarity in coverage depth for each position across the reference sequence between two samples. We use Spearman's rank correlation coefficient (or Spearman's Rho, denoted “ρ”) to calculate correlation values for pairwise comparisons of all samples for a given reference sequence. We use Spearman's as opposed to Pearson's correlation because the latter includes an assumption that the two profiles being compared have a linear relationship. When comparing profiles of a fast-evolving repetitive element from two different species, differences in terminal truncations, internal deletions, or proliferation of fragments of the larger element could all lead to violation of the linearity assumption. Spearman's correlation thus offers more flexibility for comparing RE profiles.
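As an illustration of the statistic, the following pure-Python sketch computes Spearman's ρ between two depth profiles by ranking the per-site depths (ties receive average ranks) and then taking the Pearson correlation of the ranks. The pipeline itself performs this analysis in R; the function names and the toy profiles below are assumptions made for the example.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a flat profile")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(depth_a, depth_b):
    """Spearman's rho between two depth profiles of the same reference."""
    if len(depth_a) != len(depth_b):
        raise ValueError("profiles must cover the same reference positions")
    return pearson(average_ranks(depth_a), average_ranks(depth_b))


# Toy profiles: same shape, but depth in the second sample grows non-linearly.
sample_a = [10, 20, 30, 40, 50]
sample_b = [100, 400, 900, 1600, 2500]
print(spearman_rho(sample_a, sample_b))
print(spearman_rho(sample_a, list(reversed(sample_b))))
```

Because only the ranks enter the calculation, the two toy profiles are perfectly correlated even though their absolute depths differ and are not linearly related.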
2.5.2 Phylogenetic analysis of variant signatures
RepeatProfiler also facilitates profile comparison in a phylogenetic framework. The pipeline summarizes information contained in variant-enhanced profiles by identifying abundant variants and encoding those variants as molecular-morphological characters. The pipeline outputs PHYLIP files with the encoded pattern of variants (relative to the reference sequence), which the user can then analyse as morphological data using phylogenetic software. If the user analyses multiple REs in the run, trees can be produced from the output of each repeat and summarized as a consensus tree to combine signal from all REs into one tree. The intent behind this feature is not to infer phylogenetic relationships among samples per se (although our validation below suggests the pattern of variants can accurately resolve phylogenetic relationships over short time scales), but rather to take advantage of the statistical framework of phylogenetic analysis to group profiles that have similar variant signatures, and to indicate the strength of that grouping (as indicated by nodal support). Similar to the correlation analysis, this approach is robust to absolute differences in read depth as it relies on the relative abundance of variants within a sample, rather than absolute values, which vary due to many factors described above. This feature may be used to test whether signal from profiles supports a priori grouping of samples or to generate grouping hypotheses in the absence of existing data. Additional details regarding the implementation of this feature are provided in Supporting Information.
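The general idea of turning variant signatures into discrete characters can be sketched as follows. The exact encoding used by the pipeline is described in Supporting Information and is not reproduced here; the state coding (0 for no abundant variant, 1 to 4 for the most abundant variant base, ? for low depth), the 10% fraction threshold, and the minimum depth are all hypothetical choices made for this example.

```python
VARIANT_STATE = {"A": "1", "C": "2", "G": "3", "T": "4"}


def encode_site(counts, min_fraction=0.10, min_depth=10):
    """Code one site from per-base counts ('match' plus variant bases)."""
    depth = sum(counts.values())
    if depth < min_depth:
        return "?"                      # too little data: treat as missing
    variants = {b: n for b, n in counts.items() if b in VARIANT_STATE}
    if not variants:
        return "0"
    # Most abundant variant; ties are broken alphabetically for determinism.
    top = min(variants, key=lambda b: (-variants[b], b))
    # The test uses the variant's share of reads, not its absolute count.
    return VARIANT_STATE[top] if variants[top] / depth >= min_fraction else "0"


def to_phylip(profiles):
    """Write a relaxed sequential PHYLIP matrix: one row of states per sample."""
    rows = {name: "".join(encode_site(c) for c in sites)
            for name, sites in profiles.items()}
    n_char = len(next(iter(rows.values())))
    width = max(len(name) for name in rows) + 2
    lines = [f"{len(rows)} {n_char}"]
    lines += [f"{name:<{width}}{states}" for name, states in rows.items()]
    return "\n".join(lines) + "\n"


# Hypothetical counts for three reference sites in two samples.
profiles = {
    "sample_1": [{"match": 90, "T": 10}, {"match": 100}, {"match": 3}],
    "sample_2": [{"match": 50, "A": 30, "C": 20}, {"match": 95, "G": 5},
                 {"match": 40, "T": 60}],
}
print(to_phylip(profiles))
```

Coding each site from the fraction of reads carrying a variant, rather than from the raw read count, is what would make such a matrix insensitive to differences in absolute read depth between samples.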
2.5.3 Single-copy normalization
As an optional feature, the pipeline conducts normalization of read depth across samples relative to single-copy genes. This feature is useful if the user desires to make inferences about relative repeat copy number differences within or across samples. When the “-singlecopy” flag is used, the pipeline maps reads to user-provided single-copy genes, estimates their average coverage, and calculates a normalized value of repeat coverage across samples based on single-copy estimates. Bases near the edges of a reference sequence are expected to show reduced coverage as an artefact of the read mapping algorithm, thereby underestimating coverage of single-copy genes (Pflug et al., 2019), which would lead to overestimation of repeat copy number. To address this problem, the pipeline calculates a corrected value of single-copy coverage by excluding data near reference sequence edges (length of excluded sequence = 1/2 of read length at each end of the reference) prior to calculating average coverage. The values from the correlation analysis are unaffected by this normalization since that analysis only considers ranks of coverage at each site.
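The edge correction and normalization can be written out as a short Python sketch. It assumes that per-site depths are already available as lists, and that single-copy coverage is summarized as the mean of per-gene trimmed means; how the pipeline combines genes is not stated here, so that step and all function names are assumptions.

```python
def trimmed_mean_coverage(depths, read_length):
    """Mean depth after dropping read_length // 2 sites from each end."""
    trim = read_length // 2
    core = depths[trim:len(depths) - trim]
    if not core:
        raise ValueError("reference is too short for this read length")
    return sum(core) / len(core)


def normalize_profile(repeat_depths, single_copy_profiles, read_length):
    """Express repeat depth in units of single-copy coverage."""
    per_gene = [trimmed_mean_coverage(d, read_length)
                for d in single_copy_profiles]
    single_copy = sum(per_gene) / len(per_gene)   # assumed: mean across genes
    return [d / single_copy for d in repeat_depths]


# Toy single-copy gene whose coverage drops off towards both edges.
gene = [1, 2, 4, 4, 4, 4, 2, 1]
naive = sum(gene) / len(gene)
corrected = trimmed_mean_coverage(gene, read_length=4)
print(naive, corrected)
print(normalize_profile([8, 40, 400], [gene], read_length=4))
```

In the toy example the uncorrected mean (2.75) is lower than the trimmed mean (4.0), so dividing by the uncorrected value would inflate every normalized repeat value, which is the overestimation the edge correction is meant to avoid.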
2.6 Validation methods
We validated the pipeline using previously published short-read data from ground beetles, tomatoes, and Drosophila, and simulated data from Drosophila. RepeatProfiler automates a general workflow of using RE profiles in evolutionary studies presented in Sproul and Maddison (2018) and Sproul, Barton, et al. (2020). Those studies explored reference bias and profile stability across varying read depth, tested whether profiles could be obtained from targeted sequencing workflows (i.e., hybrid capture) and museum specimens, and used fluorescence in situ hybridization to relate information in profiles back to repeat architecture on chromosomes. Our validation here complements and extends those findings. A list of data sets and species names used for the additional validation is provided in Table S1.
2.6.1 Read mapping parameter sensitivity analysis and normalization estimates
Parameter sensitivity analysis
We analysed the effects of changing Bowtie2 parameters to determine ideal default settings for RepeatProfiler and to orient users as to the impact of parameter settings on resulting profiles. We generated profiles using Bowtie2’s presets: “very-fast”, “fast”, “sensitive”, and “very-sensitive” with “end-to-end” (Bowtie2’s default) and “local” mapping strategies. We conducted this analysis in a ground beetle species, Bembidion breve, using the 28S ribosomal RNA gene as a reference sequence. This species has a recent history of ribosomal DNA (rDNA) mobilization in which a fragment of 28S rDNA escaped functional rDNA units, proliferated, and spread throughout the genome where it now evolves separately (i.e., not in concert with functional rDNA) (Sproul, Barton, et al., 2020). Thus, this species and reference sequence pair is ideal for understanding the effect of changing mapping parameters on both highly conserved (i.e., functional 28S rDNA) and more divergent (abundant rDNA-like fragments) reads relative to the same reference sequence.
Normalization estimates
An advantage of RepeatProfiler's approach to studying repeat dynamics is that informative profiles can be obtained with very low-coverage sequence data (e.g., much less than 1× coverage for abundant REs). Because normalization relies on mapping reads to single-copy genes, we tested whether accurate single-copy estimates could be obtained with low-coverage data by running the pipeline across a series of downsampled data sets ranging from 25 million to 0.5 million reads (or 4× to ~0.1× coverage). We generated downsampled data sets using seqtk (https://github.com/lh3/seqtk) and mapped reads to 10 single-copy genes in two size classes (i.e., 450 bp and 900 bp references) using the pipeline. Using R, we plotted the coverage estimates produced by each set of input reads to evaluate the stability of the estimate trend line at low coverage.
2.6.2 Read mapping and reference sequence divergence
We investigated the effect of sequence divergence on read mapping rates by simulating divergent reference sequences in four closely related Drosophila species (D. melanogaster, D. simulans, D. sechellia, and D. mauritiana) from species-specific reference sequences. We used SLiM 2 v.3.4 (Haller & Messer, 2017) to simulate sequence divergence over 2000 generations. We generated a sequence every 200 generations, resulting in a total of 10 new sequences with increasing sequence divergence such that the tenth showed ~25% pairwise divergence relative to the original sequence. We repeated simulations for two reference sequences (portions of the ribosomal DNA external transcribed spacer and the 28S ribosomal RNA gene) in each of the four species, ran the divergent reference sequences through the pipeline, and plotted the percentage of reads that mapped relative to the unmutated, conspecific reference sequence for each species.
2.6.3 Phylogenetic validation
We validated our approach to condensing patterns in variant-profiles into molecular-morphological characters for phylogenetic analysis using empirical data from Drosophila, ground beetles, and tomatoes. We present a detailed validation using Drosophila because D. melanogaster and its close relatives are a group with: a very recent history of divergence (with known dates) (Garrigan et al., 2012), well-established phylogenetic relationships, abundant genomic resources (i.e., sequence data and repeat libraries), and several REs with a recent history of dynamic activity (Jagannathan et al., 2017; Larracuente, 2014; Lohe & Roberts, 1988; Sproul, Khost, et al., 2020). We obtained consensus sequences of 59 abundant transposable elements (TEs) including LTR and non-LTR retrotransposons from a custom Drosophila repeat library (Chakraborty et al., 2020; Sproul, Khost, et al., 2020) and generated profiles from these TEs in a run that included four closely related Drosophila species (D. melanogaster, D. simulans, D. sechellia, and outgroup D. erecta), with 2–4 replicate individuals from each species. The ingroup species diversified in the last 2.5 million years and two species (D. simulans and D. sechellia) are only separated by an estimated ~240 k years (Garrigan et al., 2012). To reduce missing data in downstream analysis, we filtered run output to exclude any TEs with <70% average coverage of bases in the reference sequence for all samples. For the 37 remaining high-coverage TEs, we analysed the PHYLIP file produced by the pipeline in IQ-TREE (Nguyen et al., 2015) as morphological data using an MK model. We processed resulting trees to test: (i) how many of the five expected clades are present with greater than 50% bootstrap support; (ii) whether the branching pattern among those clades matches the species phylogeny; (iii) whether a lack of expected clades is due to unresolved groupings of the closest sister pairs (i.e., D. simulans + D. sechellia) or due to clades that group nonsister taxa; and (iv) whether any clades show >50% bootstrap support for relationships discordant with the species tree. In addition to analysing individual gene trees, we generated a consensus tree in IQ-TREE to evaluate the combined signal from all trees.
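The coverage filter applied before tree building can be sketched in Python. This is one possible reading of the criterion, not the authors' script: it assumes that "coverage of bases" means the fraction of reference positions with at least one mapped read, and that this fraction is averaged across samples before applying the 70% cutoff. The function names and toy depths are hypothetical.

```python
def breadth_of_coverage(depths, min_depth=1):
    """Fraction of reference positions covered by at least min_depth reads."""
    return sum(d >= min_depth for d in depths) / len(depths)


def keep_reference(sample_profiles, min_average_breadth=0.70):
    """Retain a reference if its mean breadth across samples meets the cutoff."""
    values = [breadth_of_coverage(p) for p in sample_profiles.values()]
    return sum(values) / len(values) >= min_average_breadth


# Toy depth profiles for two TEs in two samples (four reference sites each).
te_low = {"sample_1": [5, 3, 1, 0], "sample_2": [2, 1, 0, 0]}
te_high = {"sample_1": [5, 3, 1, 2], "sample_2": [2, 1, 0, 0]}
print(keep_reference(te_low), keep_reference(te_high))
```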
3 RESULTS
3.1 Standard output
Enhanced RE profiles produced by the pipeline (Figures 2 and 3) can provide a high-resolution data source for evolutionary studies over short time scales. The characteristics of individual profiles can reveal species-specific signatures in both the pattern of coverage depth across the reference sequence (i.e., profile shape) (Figures 2a,b and S1) and the signature of sequence variants relative to the reference (Figures 3b, S2 and S3).


In addition, profiles can show signatures that lend insight into repeat biology such as 5′ truncations in active non-LTR retroelements (Figure S1) or male-female differences that can indicate differential distribution of REs on sex chromosomes. The visual enhancements of colour and variant-enhanced profiles simplify quick comparison across samples and REs to understand general patterns while the comparative analysis features of the pipeline enable quantitative comparison for patterns of interest. Standard output also includes a table with summary statistics of coverage across samples and references. Optional output features let users annotate specific coordinates for a given reference sequence (i.e., “-bed” flag) and annotate sites with indels (i.e., “-indel” flag), allowing users to visualize additional information layers alongside profiles which can simplify the interpretation of observed patterns.
3.2 Comparative analysis of profiles
For multisample runs that include the correlation analysis flag (-corr), several additional output files are generated. For each reference sequence the program produces boxplots that summarize the distribution of pairwise correlation values (i.e., Spearman's Rho) for within and between-group comparison of profile shape (e.g., Figure 2c). An additional boxplot chart is produced that summarizes overall correlation patterns observed across references for each user group. Output also includes a histogram of correlation values within and between groups for each reference sequence. The program saves matrices containing raw correlation analysis output in addition to a summary table organized by reference sequence. For phylogenetic comparisons, the program outputs a PHYLIP-formatted file (Figure 3c) for each reference sequence that summarizes variants within profiles as molecular-morphological characters.
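The within-group and between-group summaries can be illustrated with a small Python sketch that partitions pairwise correlation values by user-defined group membership. The sample names, group labels, and ρ values are made up for illustration, and the function name is an assumption.

```python
from itertools import combinations
from statistics import median


def split_by_group(rho, groups):
    """Partition pairwise correlations into within- and between-group lists."""
    within, between = [], []
    for a, b in combinations(sorted(groups), 2):
        target = within if groups[a] == groups[b] else between
        target.append(rho[(a, b)])
    return within, between


# Made-up group assignments and pairwise Spearman values for one reference.
groups = {"s1": "species_A", "s2": "species_A",
          "s3": "species_B", "s4": "species_B"}
rho = {("s1", "s2"): 0.98, ("s1", "s3"): 0.41, ("s1", "s4"): 0.37,
       ("s2", "s3"): 0.45, ("s2", "s4"): 0.39, ("s3", "s4"): 0.95}

within, between = split_by_group(rho, groups)
print(median(within), median(between))
```

The two lists correspond to the two boxes that a within versus between boxplot would draw for a single reference sequence.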
3.3 Validation results
3.3.1 Read mapping parameter sensitivity analysis and normalization estimates
The Bowtie2 parameter sensitivity analysis showed only minor differences in the mapping rate of highly conserved reads across Bowtie2 presets ranging from “very-fast” to “very-sensitive”, with each run showing ~500× coverage of putatively functional 28S rDNA (Figure S4). However, parameter settings had a major effect on the mapping rate of divergent 28S-like fragments, with the most inclusive setting (i.e., “--very-sensitive-local”) resulting in a >7-fold increase in coverage of divergent 28S-like reads relative to the least inclusive setting (i.e., “--very-fast”), and a >4-fold increase relative to Bowtie2 default settings (“--sensitive”) (Figure S4). Using the “--local” setting as opposed to the Bowtie2 default “--end-to-end” setting reduced artefacts of low coverage near the edges of reference sequences (Figure S5). As a result of these findings, we use “--very-sensitive-local” as RepeatProfiler's default mapping parameters to provide flexibility when mapping reads to non-conspecific reference sequences. In addition, full-length REs may give rise to various truncated repeats distributed throughout the genome (i.e., small REs may arise from fragments of larger REs) as in the example of 28S rDNA in our model species. Thus, permissive default settings allow more layers of repeat evolutionary history to be reflected in profiles. For cases where less permissive mapping is needed, the user can provide their preferred Bowtie2 settings.
Normalization coverage estimates showed a linear relationship with input read number in downsampled data sets well below 1× coverage (Figure 4a–c). Below ~0.3× coverage, the relationship between reads and average coverage is more variable and leads to a slight overestimation of coverage. This suggests that the normalization feature of the pipeline is robust to low-coverage input reads down to less than 0.5× coverage.

3.3.2 Read mapping and reference sequence divergence
Analysis of read mapping trends using simulated data showed no significant difference in mapping rates across the different species and reference sequences analysed. A plot of the combined data shows that using RepeatProfiler default settings, reference sequences with ≤5% sequence divergence show little decrease in mapping rates relative to the unmutated reference sequences (Figure 4d). Greater than 50% of reads still mapped to references with 13% sequence divergence, and >10% of reads still mapped with 20% divergence (Figure 4d). These settings allow for considerable divergence between the reference sequence and the sample reads, enabling comparative study across clades spanning shallow to moderate sequence divergence.
3.3.3 Phylogenetic validation
Phylogenetic analysis of abundant variants in profiles showed that variant signatures can have strong phylogenetic signal over short evolutionary scales. Our in-depth validation using Drosophila showed that across 37 TEs from which we generated trees, 28 (75.7%) recovered at least three of the five expected clades with greater than 50% bootstrap support (average bootstrap support = 91.2%) (Figures S6 and S7). Of the remaining TEs, five recovered at least two expected clades and four recovered just a single clade. Profiles from 14 TEs recovered all five clades and the correct branching pattern of species. Nine additional TEs produced trees with correct branching patterns, except that one or more replicate samples within species formed a grade rather than a clade for that species. We found a single case where a phylogenetic tree had at least moderate support (i.e., greater than 75% bootstrap support) for a relationship in discordance with the species tree (Figures S6 and S7). Finding cases where phylogenetic signal in one RE contrasts strongly with the signal from the majority of REs analysed (or the known species tree) may be useful for identifying lateral transfer of TEs among the study taxa, as has been observed in cases such as the P-elements in Drosophila willistoni and D. melanogaster (Daniels et al., 1990).
For 18 of the TEs that underwent phylogenetic analysis, we found data in the literature regarding their recent evolutionary activity (Bergman & Bensasson, 2007). Nine of the 18 TEs are classified as recently active; these recovered an average of 4.3 of the five expected clades (Figure S7), and six of the nine recovered all expected clades and the correct branching pattern. The nine inactive TEs recovered an average of 2.4 of the five clades and only one TE recovered all five clades with the correct branching pattern. These findings provide preliminary evidence that, over very short evolutionary time scales, recently active TEs may show stronger phylogenetic signal than TEs whose activity predates speciation events in the study taxa.
4 DISCUSSION
RepeatProfiler incorporates signals from repetitive sequences into comparative evolutionary studies over short time scales. Unequal crossing over and gene conversion between repeated DNA sequences can cause the concerted evolution of REs within species and the rapid fixation of differences between species (Coen et al., 1982; Dover, 1982, 1994; Strachan et al., 1982). These rapid evolutionary dynamics can be useful for understanding species boundaries, but repetitive sequences are rarely used in this context. Our approach expands on previous findings (Sproul, Barton, et al., 2020) that RE profiles can be stable within, but variable between, closely related species.
The pipeline's comparative features allow a user to define groups (putative species, populations, male vs. female, etc.) a priori and test whether signatures in profile shape are correlated with group distinction. In cases where species-specific signatures are not present in profile shape, we show that patterns of sequence variants within profiles may contain such signatures (Figures S3 and S4). Phylogenetic analysis of variant patterns within profiles provides a quantitative approach to group samples based on profile signatures that may also give insights as to phylogenetic relationships of recently diverged taxa (e.g., within species groups) (Figures 3, 5, Figure S2–S3, Figure S5–S6), as discussed further below. Combining evidence from many REs can yield a subset in which recurring patterns become evident. Not all REs are expected to produce strong signal at the species level; however, the visual enhancements in profiles produced by this pipeline are designed to simplify identification of REs that show interesting patterns, such as dynamic satellite DNAs and recently active TEs.

Previous studies show that patterns of repeat abundance hold phylogenetic signal (Dodsworth et al., 2015); however, repeat abundance can evolve so dynamically (even between sister taxa, let alone across deep phylogenetic splits) that homoplasy can present a major challenge when encoding phylogenetic characters from raw abundance (Martín-Peciña et al., 2019). Rather than use repeat abundance as the character of interest, our approach encodes the signature of abundant variants relative to a common reference sequence. This signature can be highly stable despite fluctuation in absolute copy number that is expected even within species (e.g., due to unequal exchange). Our validation in Drosophila shows evidence of good phylogenetic signal (Figures 3, 5, Figure S2–S3, Figure S5–S6) across divergences as recent as 240 kya (Garrigan et al., 2012), particularly when analysing recently active TEs. Importantly, we observe this result using reads that have a mixed history of sample preparation and sequencing and a mixture of male and female specimens (Table S1).
In addition to providing high-resolution evidence of species boundaries, RE profiles can give useful insights into genome architecture and repeat biology. For example, major shifts in repeat abundance across samples can provide evidence of rapid genome evolution through repatterning of repeat architecture (Sproul, Barton, et al., 2020). Uneven coverage of REs with sharp boundaries can reveal differential amplification of truncated or fragmented copies of that repeat, which may indicate the spread of novel satellite DNA sequences (Figures 2 and S4) or recent TE activity (Figure S1). REs that show strongly supported phylogenetic relationships that are discordant with species trees can reveal evidence of horizontal transfer. Sex-specific patterns of relative abundance within a species can give insight into the dynamics of sex-linked REs, which can be difficult to study with assembly-based approaches given the highly repetitive nature of sex chromosomes.
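Sharp coverage boundaries of the kind described above can also be flagged programmatically once a profile is available as per-position depth. The following sketch is one simple way to do so and is not a feature of the pipeline; the function name, window size, and fold-change threshold are assumptions chosen for the toy example.

```python
def sharp_boundaries(depth, window=3, min_fold=6.0):
    """Return positions where mean depth changes sharply.

    A position is reported when the mean depth of the `window` positions
    to its right and to its left differ by at least `min_fold`.
    """
    hits = []
    for pos in range(window, len(depth) - window + 1):
        left = sum(depth[pos - window:pos]) / window
        right = sum(depth[pos:pos + window]) / window
        low, high = sorted((left, right))
        # Floor the lower mean at 1 read so near-zero coverage does not
        # produce arbitrarily large fold changes.
        if high >= min_fold * max(low, 1.0):
            hits.append(pos)
    return hits


# Toy profile: an internal segment (positions 4-7) amplified about eightfold.
depth = [10, 11, 9, 10, 80, 82, 79, 81, 10, 9, 11]
print(sharp_boundaries(depth))  # positions where the amplified block starts and ends
```

With looser thresholds, positions adjacent to a true boundary may also be reported, so in practice the window and fold change would need tuning to the coverage and noise of the data at hand.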
When RepeatProfiler is used to compare patterns across samples, all sample reads are mapped to a common reference sequence. This approach enables direct comparison of profiles at each position along the reference. Changes observed across samples may be due to differential abundance of that repeat, to sequence divergence or indels relative to the reference sequence, or to both. A potential limitation of mapping to a common reference sequence is that only a limited degree of divergence can be accommodated in a single analysis; however, the default parameters of the pipeline enable flexible read mapping across moderate sequence divergence (Figure 4d). We encourage users to consider the combined effects of mapping specificity and reference sequence choice as they design RepeatProfiler runs. By tailoring the choice of reference sequences and run parameters, they can ensure that analyses best meet the needs of their specific question. Additional discussion of this and other topics that users may find useful in considering how best to use the tool is available on the RepeatProfiler website under “Tips for Users” (https://johnssproul.github.io/RepeatProfiler/tips/).
4.1 RepeatProfiler and other programs
Studying repeat dynamics with short-read data has a few distinct advantages, including: (i) the low cost of data generation; (ii) the vast amounts of such data, from thousands of organisms, already available in public repositories; and (iii) the ability of short-read, whole-genome shotgun approaches to bypass problems caused by gaps and multimapping in genome assembly-based approaches, allowing repeat dynamics to be measured across the entire genome (albeit without the context of their position in genomes). A growing number of tools are designed for studying REs using short-read data, some of which are outlined in Table 1. We developed RepeatProfiler to fill a gap in available tools by allowing efficient study of specific REs in a quantitative comparative framework.
Program | RepeatProfiler | DeviaTE | RepeatExplorer | Transposome | dnaPipeTE | RepDeNovo |
---|---|---|---|---|---|---|
Input | Low-coverage short-read data; repeat reference sequences | Short-read data; repeat library | Low-coverage short-read data | Low-coverage short-read data | Low-coverage short-read data | Short-read data |
Primary function | Visualization and quantitative comparison of RE profiles | Visualization of TE profiles, inference of TE structural features | De novo assembly, annotation | De novo assembly, annotation | De novo assembly, annotation | De novo assembly, annotation |
Output | Two types of enhanced RE profiles; statistical comparison of profile shape; PHYLIP files for phylogenetic analysis of variant profiles; summary statistics | RE profiles with variants and annotated structural features; abundance estimates of TEs | Tandem repeat analysis (with consensus sequences); cluster/supercluster annotations and contigs; repeat annotation summary | Genomic TE abundance estimates; summary of TE families; repeat contigs | Chart of repeat abundance; chart of repeat landscapes/repeat age distribution; repeat contigs | Repeat coverage summary; repeat contigs |
A growing number of tools exist for de novo repeat assembly and initial annotation of REs from short-read data, such as RepeatExplorer2 (Novák et al., 2013, 2017) and dnaPipeTE (Goubert et al., 2015). In addition to assembly and annotation, both programs are also useful in quantifying repeat abundance from short-read data. RepeatProfiler complements these tools, allowing for more in-depth study of specific REs downstream of these programs. In addition, repeat assembly/annotation programs can be a useful source of reference sequences. For example, RepeatExplorer2 includes consensus sequences for satellite DNA, LTRs, and rDNA as standard output, which can be directly fed into RepeatProfiler as reference sequences. We provide a more detailed discussion of this approach to obtaining reference sequences, as well as special considerations for making profiles of tandem repeats such as satDNAs, in Supporting Information Methods and our online tutorial (https://johnssproul.github.io/RepeatProfiler/).
DeviaTE (Weilguny & Kofler, 2019) is most similar to RepeatProfiler in that both tools enable the visualization of read mapping profiles for one or more samples relative to repeat reference sequences. Both tools produce publication-quality graphical output and have minimal data input requirements enabling study of REs in groups with limited genomic resources. The two programs have complementary strengths. DeviaTE includes features that offer deeper insight into transposable element biology through analysis of split-reads to detect and annotate specific structural features (e.g., internal deletions, truncations) of transposable elements. RepeatProfiler offers features that enable quantitative comparison of profile signatures across samples through correlation analysis of profile shape and through enabling phylogenetic analysis of abundant variants as molecular morphological characters.
ACKNOWLEDGEMENTS
We thank Ching-Ho Chang, Xiaolu Wei, Beatriz Navarro-Domínguez, Lucas Hemmer, Danna Eickbush, Cécile Courret, David Maddison, James Pflug, Antonio Gomez, Olivia Boyd, Ricardo Utsunomia, and Caitlin Hudecek for helpful feedback on the pipeline and help testing its functionality. This work was supported by National Institutes of Health General Medical Sciences grant R35GM119515 and National Science Foundation grant MCB-1844693 awarded to A.M.L. J.S.S. is supported by an NSF Postdoctoral Research Fellowship in Biology (DBI-1811930).
AUTHOR CONTRIBUTIONS
J.S.S. originally conceived the pipeline and its core features. J.S.S., A.M.L., and S.N. improved conceptual design and features. S.N., J.S.S., and A.G. wrote the code. S.N., J.S.S., and A.G. conducted validation experiments. S.N. wrote the first draft of the manuscript and all authors contributed to subsequent drafts.
Open Research
DATA AVAILABILITY STATEMENT
RepeatProfiler is available through GitHub (https://github.com/johnssproul/RepeatProfiler). The data used to demonstrate and validate the tool were downloaded from NCBI’s Sequence Read Archive with accession numbers provided in Table S1.