Volume 53, Issue 6 pp. 754-762
INVITED REVIEW
Open Access

A renaissance of microRNAs as taxonomic and phylogenetic markers in animals

Bastian Fromm

The Arctic University Museum of Norway, UiT – The Arctic University of Norway, Tromsø, Norway

Correspondence

Bastian Fromm, The Arctic University Museum of Norway, UiT – The Arctic University of Norway, Tromsø, Norway.

First published: 21 June 2024

Abstract

Molecular markers for tracing animal sample origins and compositions are critical for applications such as parasite detection, contamination screening, and sample authentication. Among these, microRNAs have emerged as promising candidates due to their deep conservation, near-hierarchical evolution, and stability. I here review the suitability of microRNAs as taxonomic and also phylogenetic markers and show how careful annotation efforts and the establishment of the curated microRNA gene database MirGeneDB and tools like MirMachine have revitalized microRNA research. These advancements enable accurate phylogenetic and taxonomic studies, highlighting microRNAs' potential in resolving long-standing questions in animal relationships and extending to applications in ancient DNA and environmental RNA analysis. Future research must focus on expanding microRNA complements across all Metazoa and further improving annotation methodologies.

1 INTRODUCTION

Molecular markers that make it easy to trace the origin or the composition of a given animal sample are highly sought after, for example to test for parasites in the blood of livestock, to screen for biological contaminants in experimental data, or to detect animal traces in environmental samples. Moreover, it is often desirable to confirm the authenticity of samples and their exact taxonomic origin down to the species level. This may, for instance, be the case for samples that are very old and likely contaminated, such as those from permafrost or from scientific museum collections. Suitable molecular markers are also of interest for understanding whether groups of organisms belong to the same taxonomic rank, i.e. whether they belong to the same species or to the same management unit. Such questions of relatedness are of course interesting at all taxonomic and, ultimately, phylogenetic levels, where, after centuries of research, many open questions about the relationships of animals remain (Bleidorn, 2019; Cannon et al., 2016; Dunn et al., 2008; Edgecombe et al., 2011; Laumer et al., 2018; Marlétaz et al., 2008; Najle et al., 2023; Philippe et al., 2011; Ryan et al., 2013; Schiffer et al., 2022; Simion et al., 2017).

Suitable markers should, on the one hand, be deeply conserved so that they can be detected across geological time scales and, on the other hand, trace both evolutionarily deep and more recent splits (e.g. of phyla and species) in a clock-like manner, for instance through sequence changes or the evolution of novel genes, without evolving convergently. Importantly, such markers should be readily detectable and ideally be very stable in a range of relevant samples. Beginning with Carl Woese, the pioneer of metagenomics, who used ribosomal RNA to differentiate (and discover!) domains of life (Woese & Fox, 1977), continuing with mitochondrial markers such as cytochrome oxidase (Brown et al., 1979) (or ITS in plants [Baldwin et al., 1995]) for taxonomic barcoding, and ultimately phylogenomics using full or near-complete complements of all protein-coding gene sequences for a set of species, each approach could fulfil only one or the other criterion (Delsuc et al., 2005). Today, no marker is established that is widely used as both a taxonomic and a phylogenetic marker in animals, or in any other group of organisms.

With the discovery of microRNAs in 1993 (Lee et al., 1993) and the realization in 2000 (Pasquinelli et al., 2000) and 2001 (Lagos-Quintana et al., 2001; Lau et al., 2001; Lee & Ambros, 2001) that they are highly conserved in all animals, a new field of research was established (see Zamore, 2020). Shortly after, a wave of initially very promising studies appeared showing the deep conservation of microRNAs and their patterns of near-hierarchical evolution (Grimson et al., 2008; Hertel et al., 2006; Sempere et al., 2006). microRNAs are short non-coding gene regulators that regulate the translation of messenger RNAs (mRNAs) and can act either as switches or with more subtle effects (Bartel, 2018). Today, microRNAs are widely studied because of their roles in development and disease, with potential as clinical biomarkers. Within the exponentially expanding non-coding RNA field, microRNAs are the most studied class, and more than 100,000 papers have been published on microRNAs alone. However, although there was an initial euphoria for microRNA-based phylogenetics (Campbell et al., 2011; Devor & Peek, 2008; Erwin et al., 2011; Fromm et al., 2013; Sperling & Peterson, 2009; Tarver et al., 2013; Wiegmann et al., 2011) and taxonomic studies (Fromm et al., 2014; Helm et al., 2012), the suitability of microRNAs for addressing such questions was called into question due to supposed evidence for homoplastic (convergent) evolution of microRNAs in some groups and a proclaimed relatively high rate of independent losses in some animal lineages (Dunn, 2014; Thomson et al., 2014). Furthermore, only a relatively small community advocated for the usefulness of microRNAs as phylogenetic and taxonomic markers, and together this led to a substantial decline in the popularity of microRNAs for phylogenetics and taxonomy.
However, it also turned out that the correct identification and annotation of microRNAs was an issue of concern, and a set of carefully conducted studies was initiated that identified the source of the contradicting reports: while previous pioneering studies had taken a labour-intensive manual annotation approach to microRNAs, the new study had used a public repository of published microRNA complements (miRBase [Kozomara et al., 2019]) at face value without curation (Tarver et al., 2018). Indeed, when the same datasets were carefully reanalyzed after curation, the concerns raised were not supported by the data, and it became apparent that an effort to address the underlying issues in the microRNA annotation field was needed. For years, a major concern in microRNA research had already been the quality of miRBase, with an estimated two-thirds of entries being false positives (Axtell & Meyers, 2018; Castellano & Stebbing, 2013; Chiang et al., 2010; Fromm et al., 2015; Fromm, Domanska, et al., 2020; Fromm, Høye, et al., 2022; Fromm, Keller, et al., 2020; Guo et al., 2020; Jones-Rhoades, 2012; Langenberger et al., 2011; Ludwig et al., 2017; Meng et al., 2012; Tarver et al., 2012; Taylor et al., 2014; Wang & Liu, 2011). It is important to note that miRBase hosts published microRNA sequences and is, thus, highly dependent on the quality of published studies. Furthermore, the database is heavily biased toward the organisms that have been studied most. Therefore, miRBase annotations are often incomplete for non-model species, and this incompleteness, in turn, may have been interpreted as absence or loss of specific microRNAs and led to the ‘critical appraisal’ (Thomson et al., 2014). Moreover, miRBase uses an outdated and inconsistent nomenclature, which leads to problems in identifying homologous genes in different organisms.
Through the years this led to numerous scattered efforts to develop new naming systems (Budak et al., 2016) and study-specific reannotations of microRNA complements (Chiang et al., 2010; Fromm et al., 2013; Grimson et al., 2008; Jan et al., 2011; Ruby et al., 2006, 2007). Unfortunately, such progress never led to an update of miRBase.

To tackle this problem, MirGeneDB, a database of manually curated microRNA genes, was launched in 2015 and has since become the new gold standard for microRNA identification, annotation and analysis (Fromm et al., 2015; Fromm, Domanska, et al., 2020; Fromm, Høye, et al., 2022). Building upon earlier studies in the field, MirGeneDB has been instrumental in facilitating a new era of microRNA-based research, including, but not limited to, phylogenetic and taxonomic studies.

2 microRNAs AS PHYLOGENETIC MARKERS

microRNAs are well suited as phylogenetic markers in animals due to (1) their deep conservation, (2) their near-hierarchical evolution and (3) their rare secondary loss events (Figure 1). An in-depth review by Tarver et al. (2013) lists additional features, such as the rarity of nucleotide substitutions in mature microRNAs, and in the seed in particular, which readily allows the identification of orthologues between species. Combined with the very rarely observed convergent evolution of microRNAs (see Wang et al., 2024, for the very few examples), this allows for the computational reconstruction of phylogenetic trees based both on the presence and absence of microRNAs, i.e. the microRNA complements, and on their concatenated nucleotide sequences, including flanking regions (Kenny et al., 2015). The phylogenetic analysis of microRNA complements is particularly useful for addressing higher-level systematic issues, whereas phylogenetic analysis at the nucleotide sequence level may also be useful at lower taxonomic ranks such as the species, genus, family or order levels.
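As a rough illustration of the presence/absence approach, the following minimal sketch encodes microRNA complements as a binary matrix and computes pairwise Jaccard distances between species. The species names, family names and memberships are hypothetical placeholders rather than MirGeneDB data, and the Jaccard distance is a simple stand-in chosen for illustration; it is not the tree reconstruction method used in the studies cited above.

```python
from itertools import combinations

# Hypothetical toy complements: species -> set of microRNA family names.
# All names and memberships are placeholders, not MirGeneDB annotations.
complements = {
    "species_A": {"FAM1", "FAM2", "FAM3", "FAM4"},
    "species_B": {"FAM1", "FAM2", "FAM3", "FAM5"},
    "species_C": {"FAM1", "FAM2", "FAM6"},
    "species_D": {"FAM1", "FAM7"},
}


def presence_absence_matrix(complements):
    """Return (sorted family list, {species: [0/1 per family]})."""
    families = sorted(set().union(*complements.values()))
    matrix = {
        species: [1 if fam in fams else 0 for fam in families]
        for species, fams in complements.items()
    }
    return families, matrix


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two sets of families."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(complements):
    """Jaccard distance for every unordered pair of species."""
    return {
        (s1, s2): jaccard_distance(complements[s1], complements[s2])
        for s1, s2 in combinations(sorted(complements), 2)
    }


families, matrix = presence_absence_matrix(complements)
distances = pairwise_distances(complements)

# The pair sharing the largest fraction of families has the smallest distance.
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

A matrix of this kind could in principle be passed to a dedicated phylogenetic package for parsimony or likelihood analysis of binary characters; the distance step here only shows how shared families translate into a measure of relatedness.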

Figure 1. Deep conservation and hierarchical evolution of microRNAs in Metazoa. (a) microRNA schematic highlighting the hairpin-like structure of the primary microRNA (2D view), composed of the precursor microRNA (pre-microRNA), which gets processed into mature (red), star (blue) and loop (yellow) parts (note that 3′ mature and co-mature microRNAs also exist but are not shown here). (b) Multiple sequence alignment of LET-7 family orthologues across Metazoa. Blue coloration of sequences indicates sequence similarity. Note that LET-7s are 5′ matures and how differently mature, loop and star sequences are conserved. Icons highlight human, C. elegans and D. melanogaster sequences that are fully conserved in their mature microRNA sequences. (c) Secondary structures of LET-7 examples from the same species. (d) Banner plot of 75 Metazoan microRNA complements from MirGeneDB 2.1 highlighting the strong conservation even at the paralogue level (heat-function) and the near-hierarchical evolution (see Figure S1 for a fully detailed version of the banner plot). Coloured bars on top depict selected phylogenetic nodes of origin that can also be found in (e). (e) Reconstructed phylogenetic tree of Metazoan species based on the presence and absence of microRNA families. Branch lengths indicate changes in microRNA complements, highlighting specific bursts of microRNA evolution (colours as in d).

As indicated before, the establishment of MirGeneDB marked major progress for microRNA research. Very recently, the development of the microRNA prediction algorithm MirMachine (Umu et al., 2023) has further substantially expanded the methodological portfolio. Briefly, MirMachine applies covariance models of microRNA families trained on the manually annotated microRNA complements in MirGeneDB and allows for the prediction of conserved microRNA complements from reference genomes without the need for specific smallRNA sequencing data (see also the similar tool ncOrtho, which uses covariance models of human microRNAs from MirGeneDB [Langschied et al., 2023]). MirMachine enables the fast annotation of microRNAs, their assignment to families and the identification of their sequences for comparative, e.g. phylogenetic, studies. Such studies are relevant because only a relatively small proportion of the relationships between major animal groups has been resolved (Figure 2).

Figure 2. Simplified phylogenetic tree of Metazoan phyla highlighting open questions in our understanding of their relationship (yellow circles). The green highlighted phyla have microRNA complements annotated from at least one representative species (MirGeneDB).

It is likely that MirMachine-derived conserved microRNA sequences will be able to resolve many of these outstanding questions. However, given that more than half of the animal phyla have not yet been sampled for their microRNA complements, it is possible that MirMachine models are not accurate or sensitive enough to capture all conserved microRNAs in these genomes. Importantly, clade-specific microRNAs might have to be discovered de novo and, hence, novel microRNA prediction using smallRNA sequencing data might be required in addition to a near-mandatory MirMachine step. For the annotation of novel microRNAs, a range of widely used tools has been published over the years, such as miRDeep (Friedländer et al., 2008, 2012), miRanalyzer (sRNAbench) (Aparicio-Puerta et al., 2019; Hackenberg et al., 2009) and MirMiner (Wheeler et al., 2009). Today, there is no one-size-fits-all tool, and each of the tools comes with its own challenges: (1) they require command-line skills, (2) their outputs need careful manual curation and (3) they are not necessarily designed to process today's ultra-high-throughput datasets (see Fromm, 2016; Fromm, Zhong, et al., 2022) (Figure 3a). For smallRNA sequencing-based analyses, the tool miRTrace was developed specifically for microRNA quality control and should be included as a first step in the de novo prediction of microRNAs. In addition, it has a feature that makes it very relevant for taxonomic questions: taxonomic tracing with microRNAs (Kang et al., 2018).

Figure 3. Pipeline for microRNA studies and an example of microRNAs as taxonomic markers. (a) RNA should be extracted, ideally from several tissues, and smallRNA libraries sequenced in single-end mode to approximately 20 million reads per sample. For quality control of smallRNA sequencing data, miRTrace (Kang et al., 2018) should be used as a first step. miRTrace can already give basic taxonomic information on the likely origin and the presence of contaminants. The next step is a MirMachine analysis (Umu et al., 2023), which requires a reference genome and predicts conserved microRNAs. Tools such as MirMiner (Fromm et al. in prep; Wheeler et al., 2009) that use smallRNA sequencing data and reference genomes can be used for the verification of MirMachine results and the prediction of novel microRNAs. (b) Relevant part of known microRNA family origins in the evolution of Metazoans. The detection of these microRNAs enables the association of samples with taxonomic groupings. In the example of the study of smallRNAs in Octopus vulgaris (Zolotarov et al., 2022), we detected Eumetazoan, Protostome, Lophotrochozoan, Platytrochozoan and molluscan microRNA families that clearly authenticated the sequencing experiment. (c) Expansion of Octopus microRNA complements relative to selected other metazoans. Note the coloration of annotated nodes of origin and how detecting these families in other samples, e.g. environmental or ancient ones, would make it possible to associate any sample with the corresponding node.

3 microRNAs AS TAXONOMIC MARKERS

Because microRNAs evolve near-hierarchically, they can not only help to resolve relationships between groups of organisms based on shared microRNAs, but can also be used, in a barcode-like way, to assign species of unknown relatedness to known taxonomic groups, such as phyla or, ideally, down to species level (see Table S1 for an exhaustive list of taxonomic node-specific microRNA family gains and losses). For instance, in a study on Octopus (Zolotarov et al., 2022), the microRNA complements of two Octopus species and two other cephalopods were annotated and compared to known microRNA complements in MirGeneDB (Figure 3b,c). The cephalopods all shared eumetazoan, bilaterian and protostome microRNA families in addition to Lophotrochozoan and Platytrochozoan families and, most importantly, molluscan microRNA families, clearly identifying them as molluscs. Using MirMiner (Wheeler et al., 2009), a large number of novel microRNA families shared among all cephalopods was identified, along with a large number shared among coleoid cephalopods and a similar number shared by both Octopus species. Based on the presence of any of these families in a dataset, we can today confidently trace the origin of any sample.
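The barcode-like logic described above can be sketched as a walk along a lineage of nested nodes, assigning a sample to the deepest node that is supported by detected node-specific families. In this minimal sketch the order of nodes follows the listing in the text, whereas the family names in `NODE_FAMILIES`, the `min_hits` threshold and the function `trace_origin` are hypothetical and introduced only for illustration; this is not the miRTrace implementation.

```python
# Lineage from the deepest to the most recent node, in the order the nodes
# are listed in the text for the cephalopod example.
LINEAGE = [
    "Eumetazoa",
    "Bilateria",
    "Protostomia",
    "Lophotrochozoa",
    "Platytrochozoa",
    "Mollusca",
]

# Hypothetical node-specific families (placeholder names, not real families).
NODE_FAMILIES = {
    "Eumetazoa": {"FAM_E1"},
    "Bilateria": {"FAM_B1", "FAM_B2"},
    "Protostomia": {"FAM_P1"},
    "Lophotrochozoa": {"FAM_L1"},
    "Platytrochozoa": {"FAM_PT1"},
    "Mollusca": {"FAM_M1", "FAM_M2"},
}


def trace_origin(detected, lineage=LINEAGE, node_families=NODE_FAMILIES, min_hits=1):
    """Return (most recent supported node, per-node supporting families).

    The walk stops at the first node with fewer than `min_hits` detected
    families, so a sample is only assigned to a node if every more
    inclusive node along the lineage is supported as well.
    """
    assigned = None
    support = {}
    for node in lineage:
        hits = detected & node_families[node]
        support[node] = sorted(hits)
        if len(hits) < min_hits:
            break
        assigned = node
    return assigned, support


# Usage with a hypothetical set of families detected in a sample.
sample = {"FAM_E1", "FAM_B2", "FAM_P1", "FAM_L1", "FAM_PT1", "FAM_M1"}
node, support = trace_origin(sample)
print(node)
```

Requiring unbroken support along the lineage is a conservative design choice that relies on secondary losses being rare; a more tolerant variant could allow a node to be skipped when only more recent nodes are supported.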

In the miRTrace tool mentioned above (Kang et al., 2018), part of this information is implemented at a relatively coarse level and is used, for example, to test for contamination in smallRNA sequencing data in a quality control step and to get a first idea of data quality and composition after sequencing. Efforts to incorporate annotation and phylogenetic information on microRNAs from MirGeneDB are under way and will substantially improve the resolution of miRTrace and massively expand its potential applications.

Another example of using microRNAs as taxonomic markers is the study of sequence variation in the usually very strongly conserved microRNAs to test for differences between populations of a species or between different species. One example from the flatworm field is the use of microRNAs in the salmonid ectoparasites Gyrodactylus salaris and Gyrodactylus thymalli, which showed that the two supposed species differ more within each species than between them, suggesting a synonymization of the two, with wide-reaching consequences for management (Fromm et al., 2014). In a study on the Opisthorchis-Clonorchis species complex, an important group of endoparasitic trematodes (liver flukes), microRNAs were shown to be useful for differentiating between the three species and, importantly, between O. viverrini, which is classified as a biological carcinogen, and the others.
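The comparison of within-group and between-group variation can be illustrated with mean pairwise Hamming distances over aligned sequences. The sequences and population labels below are made-up toy data, and the helper functions are a hypothetical sketch of the general idea, not the analysis performed in the cited studies.

```python
from itertools import combinations, product
from statistics import mean


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def mean_within(group):
    """Mean pairwise distance among the sequences of one group."""
    pairs = list(combinations(group, 2))
    return mean(hamming(a, b) for a, b in pairs) if pairs else 0.0


def mean_between(group1, group2):
    """Mean distance over all pairs drawn from two different groups."""
    return mean(hamming(a, b) for a, b in product(group1, group2))


# Made-up aligned sequences of one microRNA locus from two putative species.
population_1 = ["ACGUACGUAC", "ACGUACGUAU", "ACGAACGUAC"]
population_2 = ["ACGUACGUAC", "ACGUACGUAU"]

within_1 = mean_within(population_1)
within_2 = mean_within(population_2)
between = mean_between(population_1, population_2)

# If variation within a group is at least as large as variation between
# groups, the sequences give no support for treating them as separate units.
supports_separation = between > max(within_1, within_2)
print(within_1, within_2, between, supports_separation)
```

In this toy example the within-group distances exceed the between-group distance, which mirrors the kind of pattern that would argue against separating the two groups.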

Such differences in microRNAs between pathogen species of different severity, and between pathogens and hosts, are being exploited in studies that detect pathogens in blood by using microRNAs as taxonomic blood markers (Cheng et al., 2013; Ghalehnoei et al., 2020; Quintana et al., 2015; Tritten et al., 2014), although more systematic approaches using ultra-deep sequencing or targeted assays are currently missing.

Recently, the new field of paleotranscriptomics, the study of historic or ancient samples for their RNA expression profiles, has created new interest in accurate and stable markers to verify the taxonomic authenticity of the data (Smith et al., 2019; Smith & Gilbert, 2019). In a detailed analysis of the microRNA complement of the 14,300-year-old Tumat puppy from small RNA sequencing, we have shown that, although the majority of microRNA reads are identical between human and dog and hence not taxonomically informative, we can detect microRNAs that either show canid-specific sequence differences or are truly canid specific (Fromm, Tarbier, et al., 2020). We followed up on this study by sequencing historical museum samples of the extinct Tasmanian tiger (thylacine) and could, among many conserved families and microRNAs specific to marsupials, identify microRNAs that are specific to the thylacine and novel to science (Mármol-Sánchez et al., 2023). Those microRNAs are especially exciting as they represent extinct genes that could not have been found without RNA sequencing data.
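The distinction between shared and taxonomically informative reads can be sketched by comparing read sequences against the mature microRNA sequences of a target species and of a potential contaminant. All sequences, counts and names below are made-up placeholders, and exact string matching is a simplifying assumption; this is not the pipeline of the cited studies.

```python
from collections import Counter


def classify_reads(read_counts, target_seqs, other_seqs):
    """Sum read counts per category: shared, target_specific,
    other_specific or unassigned."""
    summary = Counter()
    for seq, count in read_counts.items():
        in_target = seq in target_seqs
        in_other = seq in other_seqs
        if in_target and in_other:
            summary["shared"] += count  # identical in both, not informative
        elif in_target:
            summary["target_specific"] += count
        elif in_other:
            summary["other_specific"] += count
        else:
            summary["unassigned"] += count
    return summary


# Made-up placeholder sequences standing in for mature microRNAs.
SHARED = "AAAACCCCGGGGUUUUAAAACC"
TARGET_VARIANT = "AAAACCCCGGGGUUUUAAAAGG"  # sequence difference in the target
TARGET_ONLY = "CCCCAAAAGGGGUUUUCCCCAA"     # family present only in the target
OTHER_ONLY = "GGGGUUUUAAAACCCCGGGGUU"      # present only in the contaminant

target_seqs = {SHARED, TARGET_VARIANT, TARGET_ONLY}
other_seqs = {SHARED, OTHER_ONLY}

# Hypothetical collapsed read counts from a small RNA sequencing run.
read_counts = {SHARED: 900, TARGET_VARIANT: 60, TARGET_ONLY: 30, OTHER_ONLY: 10}

summary = classify_reads(read_counts, target_seqs, other_seqs)
total = sum(read_counts.values())
informative_fraction = summary["target_specific"] / total
print(dict(summary), informative_fraction)
```

Even when most reads fall into the shared category, as in this toy example, the smaller target-specific fraction is what carries the taxonomic signal.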

4 FUTURE PERSPECTIVES

microRNAs are currently experiencing a renaissance as suitable markers in various fields of research, among them taxonomic and phylogenetic studies. Today, the limiting factors for microRNA-based research are the lack of a complete picture of microRNA complements from all animal phyla and the lack of more user-friendly and scalable computational tools for the prediction and annotation of novel microRNA sequences from smallRNA sequencing data. Given the extensive number of studies that have produced smallRNA sequencing data in the past, it is surprising and worth mentioning that, currently, a substantial number of datasets have either not been publicly released or, much more commonly, are very hard to find in public repositories. Adherence to FAIR principles (Wilkinson et al., 2016) and more accessible repositories are warranted.

Although the number of reference databases continues to increase (see Fromm, Keller, et al., 2020), the focus should be on the use of highly curated databases that contain bona fide microRNA annotations (see Fromm, Zhong, et al., 2022). For MirGeneDB, the most limiting factors are the availability of smallRNA sequencing data for available nuclear genomes of missing phyla, as well as the laborious and time-consuming manual annotation of microRNAs, especially the correct identification of paralogues. Improvement of available genome annotations and of automated microRNA annotation algorithms that use genome references only, such as MirMachine, might pave the way to a de novo prediction of full complements with accurate naming and a significantly reduced need for manual curation.

As exemplified in the paleotranscriptomics field, microRNAs have proven to be the molecule of choice as taxonomic markers because they are apparently also very stable, owing to their relatively small size, the fact that they are capped and their association with proteins (see Friedländer & Gilbert, 2024). The ability to accurately differentiate parasite species with severely divergent pathogenicity, the detection of parasitic species in cattle or other livestock species, and the detection of novel taxonomically informative microRNAs in species that have not yet been studied hold huge potential for humanity and offer a unique chance to resolve outstanding scientific questions in biosystematics. For the latter, microRNAs will continue to contribute to open questions of Metazoan phylogeny, e.g. gastrotrich paraphyly (Fromm et al., 2019), internal molluscan (Rosani et al., 2021) and ecdysozoan relationships (Campbell et al., 2011) and the Xenacoelomorpha discussion (Philippe et al., 2011; Schiffer et al., 2022), and, depending on available data, to questions concerning hitherto uncharted animal phyla (Figure 2). While the focus of this review is animals, there is clearly a large potential for microRNAs as taxonomic and phylogenetic markers in plants (as shown before by e.g. Taylor et al., 2014), but plants have an even larger backlog of annotation work and a less clear phylogenetic resolution.

We are only beginning to understand the full potential of microRNAs as taxonomic and phylogenetic markers in other fields. For instance, it was recently suggested that RNA could be used as an environmental marker (see Stevens & Parsley, 2023), i.e. that RNA could be extracted from sea water or sediment samples to reconstruct information on taxonomic distribution. Because RNA, and microRNAs in particular, is often not single stranded but either protected by associated proteins (Argonaute) or structured in hairpins or other tertiary structures, RNA might be more stable than DNA and very well suited for this task. This is further supported by the fact that each cell contains many thousands of copies of a range of RNA molecules, which raises the probability of detecting RNA by several orders of magnitude compared with the often diploid DNA copies in a cell.

Finally, it is exciting to think that microRNAs can not only be used to distinguish between species as phylogenetic or taxonomic markers but also show tissue- and cell-type-specific expression in addition to clear expression differences between developmental stages. This could theoretically give insights into the developmental stage, age or injuries of ancient samples. Today, such approaches are already being established in forensics (Sauer et al., 2017).

ACKNOWLEDGEMENTS

I would like to acknowledge the Tromsø forskningsstiftelse grant (TFS) [20_SG_BF ‘MIRevolution’] and am grateful for the invitation.
