Volume 117, Issue 1 pp. 58-70
Research Article

The use of museum samples for large-scale sequence capture: a study of congeneric horseshoe bats (family Rhinolophidae)

Sebastian E. Bailey

School of Biological and Chemical Sciences, Queen Mary University of London, London, E1 4NS UK

Corresponding author. E-mail: [email protected]

Xiuguang Mao

School of Biological and Chemical Sciences, Queen Mary University of London, London, E1 4NS UK

SKLEC, Institute of Molecular Ecology and Evolution, East China Normal University, Shanghai, 200062 China

Monika Struebig

School of Biological and Chemical Sciences, Queen Mary University of London, London, E1 4NS UK

The Genome Centre, John Vane Science Centre, Queen Mary University of London, Charterhouse Square, London, EC1M 6BQ UK

Georgia Tsagkogeorga

School of Biological and Chemical Sciences, Queen Mary University of London, London, E1 4NS UK

Gabor Csorba

Hungarian Natural History Museum, Baross 13, 1088 Budapest, Hungary

Lawrence R. Heaney

The Field Museum of Natural History, 1400 S. Lake Shore Drive, Chicago, IL, 60605-2496 USA

Jodi Sedlock

The Field Museum of Natural History, 1400 S. Lake Shore Drive, Chicago, IL, 60605-2496 USA

William Stanley

The Field Museum of Natural History, 1400 S. Lake Shore Drive, Chicago, IL, 60605-2496 USA

Jean-Marie Rouillard

MYcroarray, 5692 Plymouth Road, Ann Arbor, MI, 48105 USA

Stephen J. Rossiter

School of Biological and Chemical Sciences, Queen Mary University of London, London, E1 4NS UK

First published: 03 August 2015
Citations: 7

Abstract

Museums hold most of the world's most valuable biological specimens and tissues, including type material that is often decades or even centuries old. Unfortunately, traditional museum collection and storage methods were not designed to preserve the nucleic acids held within the material, often reducing its potential viability and value for many genetic applications. High-throughput sequencing technologies and associated applications offer new opportunities for obtaining sequence data from museum samples. In particular, target sequence capture offers a promising approach for recovering large numbers of orthologous loci from relatively small amounts of starting material. In the present study, we test the utility of target sequence capture for obtaining data from museum-held material from a speciose mammalian genus: the horseshoe bats (Rhinolophidae: Chiroptera). We designed a ‘bait’ for capturing 3520 genes and applied this to 10 species of horseshoe bat that had been collected between 7 and 93 years ago and preserved using a range of methods. We found that the mean recovery rate per species was approximately 89% of target genes with partial sequence coverage, ranging from 3024 to 3186 genes recovered. On average, we recovered 1206 genes with ≥ 90% sequence coverage, per species. Our findings provide good support for the application of large-scale bait capture across congeneric species spanning approximately 15 Myr of evolution. However, we observed no clear association between the success of capture and the phylogenetic distance from the bait model, although sample sizes precluded a formal test.

Introduction

Museum collections have long served as repositories for physical biological data in the form of specimens and tissue samples. Some museums have amassed material from across the globe with the intention of documenting the existence of all unique forms of life. As a consequence, collections frequently house many poorly known, rare, endangered, and extinct taxa, some of which may not have been seen again subsequent to their initial collection (Sinha, 1973; Novotny & Basset, 2000; Thessen, Patterson & Murray, 2012).

Aside from the importance of museum collections for research and education, they may also represent immensely rich genetic resources. For many years, researchers have incorporated sequences from museum specimens into studies of systematics, evolution, and conservation (Carstens & Knowles, 2007; Paijmans, Gilbert & Hofreiter, 2013; Wiley et al., 2013). More recently, several larger-scale campaigns have been launched to genetically index specimens held in museums (Schindel et al., 2011; Lim, 2012; Brugler et al., 2014). However, many of the traditional collection, storage and curation methods found in museums are incompatible with contemporary genetic research. In particular, the practice of fixing material in formaldehyde or ethanol may damage or degrade nucleic acids, and thereby reduce its potential viability for genetic studies that use traditional Sanger sequencing. As a result, studies of museum specimens have commonly recovered only a single gene, such as the mitochondrial cytochrome oxidase 1 (COI) (Almeida et al., 2014; Brace et al., 2014; Li & Liu, 2014). For example, the frozen tissue collections of mammals and birds held at, respectively, the Royal Ontario Museum and the National Museum of Natural History of the Smithsonian Institution, have been DNA barcoded using COI sequences (Schindel et al., 2011; Lim, 2012).

Furthermore, even when material is suitable, obtaining sufficient tissue from specimens for genetic analyses is often viewed as unacceptably destructive, with some rarer material placed ‘off-limits’ to genetic investigation. These conflicts of interest are unfortunate, not least because the systematics of many taxa, including even relatively well-described groups such as vertebrates, is frequently unclear (Guschanski et al., 2013; Tsagkogeorga et al., 2013). In such cases, the best hope of settling taxonomic disputes and standardizing nomenclature is to match curated specimens to genetic sequences.

The advent of high-throughput sequencing technologies (‘Next Generation Sequencing’) has provided new and exciting opportunities for utilizing collection-derived material (Bi et al., 2013; Nachman, 2013). In particular, the short fragment sizes that often characterize degraded DNA isolated from museum specimens pose less of a problem for high-throughput sequencing technologies than for traditional Sanger sequencing. This is because current sequence read lengths obtained from common platforms, such as Illumina, are relatively short (approximately 100–250 bp), whereas traditional Sanger sequencing read lengths are much longer and so are more suited to longer stretches of DNA. The second main advantage of high-throughput sequencing is the sheer volume of data generated, allowing millions of short reads to be sequenced in parallel and assembled into potentially longer ‘contigs’. Indeed, the ability to obtain very large volumes of sequence data from a small amount of starting material means that, at least in theory, high-throughput sequencing can also reduce the need to resample valuable specimens on multiple occasions for different projects. It follows that new sequencing methods allow researchers to sequence a wide range of taxa and loci, and to a much greater depth, for a fraction of the cost per base when compared to Sanger sequencing (Blencowe, Ahmad & Lee, 2009; Zhou et al., 2010; Rowe et al., 2011; Treangen & Salzberg, 2012).

Previous efforts towards high-throughput sequencing of museum specimens have been limited to either relatively few taxa (Bi et al., 2012, 2013) or to small numbers of loci (Schindel et al., 2011; Guschanski et al., 2013). However, some recent studies have applied shotgun sequencing to assemble the genomes of collection-based specimens of mammals, insects, fungi and plant species, although coverage has tended to be low (Rowe et al., 2011; Staats et al., 2013).

One application of high-throughput sequencing that appears to be especially promising for working with museum collections is so-called ‘target sequence capture’. Here, a set of focal genes is ‘pulled down’ from a larger pool of DNA fragments through the use of sequence-specific baits. Because baits manufactured from a single taxon have been shown to recover loci across related taxa, this method is potentially very useful for phylogenomic studies (Mason et al., 2011; Enk et al., 2014). Moreover, the bait sequences themselves are short, and so this approach is also well-suited to fragmented DNA (Mason et al., 2011; Tin, Economo & Mikheyev, 2014) of the sort that is often isolated from specimens in collections.

Although a relatively new method, several studies have applied target sequence capture to study samples held in collections. Templeton et al. (2013) used baits to recover the mitogenomes from 2500-year-old human archaeological remains, and other studies have also reported successfully capturing and sequencing mitochondrial genes of other species held in collections (Hebert et al., 2003; Mason et al., 2011; Sawyer et al., 2012; Paijmans et al., 2013). Compared to nuclear DNA, mitochondrial DNA is less prone to deterioration during preservation and, as a result of its high copy number, is more readily recovered from limited material (Mason et al., 2011; Guschanski et al., 2013). Nonetheless, mitochondrial DNA also displays several characteristics that may limit its use for evolutionary inference, such as its maternal mode of inheritance, lack of recombination, and the frequent phylogenetic discordance seen between mitochondrial and nuclear gene trees (Desalle & Giddings, 1986; Cronin, 1991; Guschanski et al., 2013; Morgan, Creevey & O'Connell, 2014). Recently, sequence capture has also been used to pull-down nuclear DNA sequences of museum specimens of rodents (Bi et al., 2013), endangered primates (Guschanski et al., 2013), and extinct species of proboscideans (Enk et al., 2014).

The success of recent advances in the detection and recovery of DNA from so-called ‘ancient’ tissue (Bos et al., 2011; Fu et al., 2013; Devault et al., 2014; Meyer et al., 2014) has allowed access to the wealth of preserved specimens and tissues in museums across the world. Many similarities can be drawn between ancient tissues and those from museums, and most techniques are transferable and may prove to be more effective when applied to museum material.

In the present study, we applied a target sequence capture method to samples obtained from museum specimens of horseshoe bats (Rhinolophidae: Chiroptera) with the aim of recovering several thousand candidate genes, including both nuclear and mitochondrial genes. Horseshoe bats number approximately 80 recognized species (Csorba, Ujhelyi & Thomas, 2003), all of which belong to the single genus Rhinolophus. Despite their relative lack of overall morphological divergence, horseshoe bats can often be differentiated from each other on the basis of body size, the shape and size of their noseleaf (Csorba et al., 2003), and their peak echolocation call frequency (Kingston & Rossiter, 2004; Stoffberg, 2007; Sedlock & Weyandt, 2009; Stoffberg et al., 2010; Stoffberg, Jacobs & Matthee, 2011; Taylor et al., 2012). Horseshoe bats have long been recognized as important model organisms in auditory research (Long & Schnitzler, 1975; Fenton & Bell, 1981; Jones & Ransome, 1993; Li et al., 2007), and have also been the subject of many molecular studies ranging from population genetics (Rossiter et al., 2000; Chen et al., 2009) and phylogeography (Flanders et al., 2009, 2011; Mao et al., 2010, 2013; Dool et al., 2013) to social structure (Rossiter et al., 2000; Puechmaille et al., 2014; Ward et al., 2014).

Molecular dating analyses calibrated with fossils suggest horseshoe bats arose approximately 15 Mya (Teeling et al., 2005; Foley et al., 2015) when they diverged from their sister taxon, the Hipposideridae. They subsequently radiated into a number of clades, currently recognized as approximately 15 species groups, with rapid diversification being seen over the past few million years (Csorba et al., 2003; Guillén-Servent, Francis & Ricklefs, 2003). Findings have revealed numerous new cryptic taxa present within the genus, suggesting that current numbers of species are underestimated (Kingston & Rossiter, 2004; Sun et al., 2008; Monadjem et al., 2010; Taylor et al., 2012). To date, most phylogenetic studies of horseshoe bats have focused on one or only a few loci (Guillén-Servent et al., 2003; Li et al., 2006; Stoffberg et al., 2010; Taylor et al., 2012). Recently, Foley et al. (2015) examined approximately 10 kb in 40 species, including ten Rhinolophus spp., and produced a well-resolved phylogeny for the taxa included.

To examine the utility of target sequence capture across congeneric taxa, we designed baits against coding sequences in two species of horseshoe bat, and used these to obtain new data from across the group. As a result of the scarcity of many of these species, we used material held in museum collections that encompasses a range of ages and preservation methods. In particular, we were interested in determining to what extent the success of capture varied with phylogenetic relatedness between ‘bait’ and target species.

Material and methods

Bait design and synthesis

For target sequence capture, we designed a set of baits based on sequences obtained from the transcriptomes of two horseshoe bat species, Rhinolophus pearsoni and Rhinolophus yunanensis (X. Mao, unpubl. data). To identify genes present within the R. pearsoni and R. yunanensis RNA-seq datasets, we performed similarity searches in TBLASTX (Altschul et al., 1997) against the coding gene sequences of the human and the little brown bat, Myotis lucifugus, from ENSEMBL (Flicek et al., 2014). All searches were conducted with an expected threshold (e-value) of 10⁻⁶, and only sequences with a length of 100 bp were retained. In total, this approach yielded the sequences of 12 300 genes.
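
The filtering step described above can be sketched as follows. This is only a minimal illustration, assuming the search results were exported in the standard 12-column tabular BLAST+ format; the function name `filter_hits`, the example hit lines, and the reading of the length criterion as a minimum on the alignment-length column are assumptions made for illustration and are not taken from the authors' pipeline.

```python
# Sketch: keep BLAST hits that pass an e-value and a length threshold.
# Assumes 12-column tabular output (BLAST+ "-outfmt 6"). Thresholds mirror
# the text (e-value 1e-6, length 100); treating the length criterion as a
# minimum on the alignment-length column is an assumption.

def filter_hits(lines, max_evalue=1e-6, min_length=100):
    """Return (query, subject, length, evalue) for hits passing both filters."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        length = int(fields[3])      # column 4: alignment length
        evalue = float(fields[10])   # column 11: e-value
        if evalue <= max_evalue and length >= min_length:
            kept.append((query, subject, length, evalue))
    return kept


# Hypothetical hit lines, for illustration only.
example = [
    "contig1\tGENE_A\t92.0\t250\t20\t0\t1\t250\t1\t250\t1e-50\t300",
    "contig2\tGENE_B\t80.0\t60\t12\t0\t1\t60\t1\t60\t1e-10\t90",
    "contig3\tGENE_C\t70.0\t150\t45\t0\t1\t150\t1\t150\t0.01\t40",
]
passing = filter_hits(example)
```

In this toy input, the second hit fails the length filter and the third fails the e-value filter, so only the first is retained.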

For bait design, we then identified 3507 coding DNA sequences (CDSs) that were chosen because they were considered to be of interest for future studies of molecular evolution or speciation. In addition, we ensured that each of these target genes showed variant sites by comparing R. pearsoni and R. yunanensis alignments. We also added the CDSs of an additional 13 orthologues obtained from the published genome sequences of the human, M. lucifugus (Flicek et al., 2014) or Rhinolophus ferrumequinum (Bullejos et al., 2000; Davies et al., 2012; Parker et al., 2013; Yoon et al., 2013). Finally, to ensure that all 3520 CDSs (see Supporting information, Table S1) were single copies in the horseshoe bat transcriptomes with no obvious paralogues, we performed reciprocal blasts (Moreno-Hagelsieb & Latimer, 2008) with parameters identical to those described above.
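
The exact criterion used for the reciprocal searches is not given in the text. One common form of the idea, reciprocal best hits, can be sketched as below; the function names, the use of bit score to rank hits, and the example identifiers are all hypothetical.

```python
# Sketch: reciprocal best hits (RBH) between two sets of BLAST results.
# Each input is an iterable of (query, subject, bitscore) tuples.
# Ranking hits by bit score is an assumption made for this illustration.

def best_hits(hits):
    """Map each query to its single highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical identifiers and scores, for illustration only.
cds_vs_transcripts = [("cdsA", "tx1", 500), ("cdsA", "tx2", 300), ("cdsB", "tx2", 200)]
transcripts_vs_cds = [("tx1", "cdsA", 480), ("tx2", "cdsA", 310), ("tx2", "cdsB", 150)]
pairs = reciprocal_best_hits(cds_vs_transcripts, transcripts_vs_cds)
```

Here `cdsB` is not retained, because its best hit (`tx2`) has a better match elsewhere; this is the kind of pattern that might flag a possible paralogue.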

Baits were designed based on bat CDS sequences by taking into consideration exon–intron boundaries as inferred from the human genome. Baits were designed to be 80 nucleotides long, and overlapped by 40 nucleotides (2 × tiling density), yielding 84 550 bait candidates. These candidates were used in BLAST searches (Altschul et al., 1997) against R. ferrumequinum genomic data (Parker et al., 2013). A melting temperature of cross-hybridization was predicted for every BLAST hit outside the target regions. Bait candidates likely to generate cross-hybridization to nonspecific targets were filtered out using a proprietary algorithm developed by MYcroarray, resulting in 75 817 unique baits. A MYbaits kit containing these biotinylated RNA baits was supplied by MYcroarray.
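
The tiling scheme (80-nucleotide baits offset by 40 nucleotides) can be illustrated with a short sketch. It ignores exon–intron boundaries and the cross-hybridization filter, and the function name `tile_baits` and its handling of sequence ends are assumptions rather than a description of the design software actually used.

```python
# Sketch: tile a coding sequence with overlapping baits (2x tiling density).
# Bait length and step follow the text (80 nt baits, 40 nt offset).
# End handling is an assumption: a final bait is anchored to the 3' end
# when the regular windows do not reach it.

def tile_baits(sequence, bait_length=80, step=40):
    """Return a list of (start, bait) tuples using 0-based start positions."""
    if len(sequence) < bait_length:
        return []  # too short for a full-length bait (assumed behaviour)
    last_start = len(sequence) - bait_length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, sequence[s:s + bait_length]) for s in starts]


# Hypothetical 200-nt sequence, for illustration only.
example_cds = "ACGT" * 50
baits = tile_baits(example_cds)
```

With a 40-nucleotide offset, every interior position of the sequence falls within two baits, which is what the 2 × tiling density refers to.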

Tissue collection

For target sequence capture, we obtained tissue samples from ten Rhinolophus species (Table 1): Rhinolophus subrufus, Rhinolophus inops, Rhinolophus shameli, Rhinolophus bocharicus, Rhinolophus clivosus, Rhinolophus virgo, Rhinolophus malayanus, Rhinolophus macrotis, Rhinolophus imaizumi, and Rhinolophus formosae. These species cover six of the 15 recognized species groups based on Guillén-Servent et al. (2003) and were selected to cover a range of phylogenetic distances from the two sister taxa whose transcriptomes were used for the bait design (Fig. 1). Briefly, the bait model species belong to the pearsoni-group (R. pearsoni and R. yunanensis), which is sister to the euryotis-group, represented here by R. subrufus, R. inops, and R. shameli. For the purposes of relating capture to phylogenetic distance, we defined these two sister groups (pearsoni and euryotis) (Fig. 1) as having a most recent common ancestor (MRCA) 11 Mya. The [pearsoni, euryotis] clade is then sister to the [megaphyllus, [pusillus, philippinensis]] clade containing three species groups, all of which share a common ancestor with the [pearsoni, euryotis] bait model clade 12 Mya. The most phylogenetically distant species groups from the bait taxa are the ferrumequinum species group (MRCA = 14 Mya) and, finally, the trifoliatus-group (MRCA = 15 Mya).

Table 1. Sample collection, preservation information and summary of sequence data
| ID | Species | Species group | Most recent common ancestor (Mya) | Range | Museum | Collection date | Storage media | Tissue type | DNA extraction concentration (ng μL⁻¹) | DNA integrity number (DIN) | Number of contigs assembled | Percentage unique contigs matching genes | Number of genes with sequence coverage ≥ 90% | Number of genes recovered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 190759 | subrufus | euryotis | 11 | Philippines | FMNH | 2006 | DMSO | Muscle | 2.71 | 8.5 | 24361 | 77.41 | 1204 | 3151 |
| 206207 | inops | euryotis | 11 | Philippines | FMNH | 2008 | DMSO | Muscle | 6.36 | 8.5 | 31212 | 59.72 | 1296 | 3163 |
| 23674 | shameli | euryotis | 11 | Mainland SE Asia | HNHM | 1996 | 96% ETOH | Muscle | 59 | 7.3 | 34123 | 56.41 | 1562 | 3177 |
| 16417 | bocharicus | ferrumequinum | 14 | Middle East | HNHM | 1921 | 70% ETOH | Internal organ | 2.41 | NA | 52357 | 86.55 | 835 | 3024 |
| 150066 | clivosus | ferrumequinum | 14 | Africa | FMNH | 1992 | Flash frozen | Heart | 16.3 | 8.9 | 32545 | 66.36 | 1344 | 3186 |
| 168915 | virgo | megaphyllus | 12 | Philippines | FMNH | 2000 | DMSO | Muscle | 3.54 | 8.1 | 35926 | 58.66 | 1411 | 3165 |
| 22726 | malayanus | megaphyllus | 12 | Mainland SE Asia | HNHM | 2007 | 96% ETOH | Muscle | 46.3 | 8.7 | 85249 | 16.61 | 662 | 3093 |
| 202721 | macrotis | philippinensis | 12 | Philippines | FMNH | 2008 | DMSO | Muscle | 2.13 | NA | 84271 | 20.10 | 956 | 3141 |
| 16382 | imaizumi | pusillus | 12 | Japan | HNHM | 1996 | 70% ETOH | Muscle | 0.4 | 1.9 | 30167 | 75.74 | 1446 | 3168 |
| 21237 | formosae | trifoliatus | 15 | Taiwan | HNHM | 2005 | 96% ETOH | Wing punch | 36.9 | 8.6 | 30351 | 69.06 | 1352 | 3144 |

Figure 1. Phylogeny of horseshoe bats and their respective species groups used in the present study (based on Guillén-Servent et al., 2003). Phylogenetic relatedness from the bait model species was estimated based on the time to the most recent common ancestor (MRCA) (Mya).

All specimens sampled were held at the Field Museum of Natural History, Chicago, USA (FMNH) or the Hungarian Natural History Museum, Budapest, Hungary (HNHM). The tissues selected encompass a range of conditions expected to impact on nucleic acid integrity, including collection dates that ranged from 1921 to 2008, and different storage conditions that included 70% and 96% ethanol, flash freezing, and dimethyl sulphoxide (DMSO).

DNA extraction, library preparation, and bait hybridization

Tissue samples were initially cut into approximately 5 mm³ (10–15 mg) pieces and, for those samples that had been stored in media, washed with ddH2O. DNA was extracted using DNeasy Blood and Tissue kits (Qiagen) and eluted with 100 μL of Buffer AE. The DNA concentration was quantified using a Qubit 2.0 Fluorimeter, and subsequently either diluted or concentrated to give standardized 100–200 ng of DNA. Extraction was carried out in a dedicated laboratory area, which was cleaned of exogenous material before and after each extraction to minimize the potential for cross-contamination (Wandeler, Hoeck & Keller, 2007). To obtain information on the level of degradation of each extracted DNA sample, we used a TapeStation (Agilent), which provides a profile of fragment sizes, as well as a ‘DNA integrity number’ (DIN).

For target sequence capture, DNA was first sheared to give a standardized fragment size using a Covaris Sample Preparation System, and then quantified using a High-Sensitivity Bioanalyzer assay (Agilent). Two samples (R. bocharicus and R. imaizumi) (Table 1) were not sheared as a result of low concentrations. Libraries were prepared from the whole sheared DNA for bait capture using the NEBNext Ultra DNA Library Prep kit (New England Biolabs). Illumina adaptors were attached to the sheared DNA, followed by amplification of the libraries prior to bait capture. We performed a polymerase chain reaction with the universal primers, P1 and P2 (Agilent), using the thermocycler conditions in accordance with the manufacturer's instructions.

Targeted capture was performed in accordance with the MYbaits kit protocol (MYcroarray, manual version 1.3.7). The sample libraries were combined with the biotinylated RNA baits and left to hybridize on a thermocycler for approximately 36 h. Following hybridization, the baited libraries were recovered with Streptavidin C1 magnetic beads (MyOne), which attach to the biotinylated baits, and then subjected to intensive washing to remove nonhybridized and nonspecifically hybridized library molecules. Indexing of samples was performed during post-amplification of the hybridized samples using NEBNext Multiplex Oligos index primers (New England Biolabs).

For sequencing, all libraries were pooled together and run on a MiSeq platform (Illumina) at The Genome Centre (London) to give an estimated total of 15 million reads per run, representing approximately 1.5 million reads per sample. This was then repeated to double the volume of data (approximately 30 million reads in total, or 3 million reads per sample).

Quantification of bait capture success

For each species, the raw reads from both MiSeq runs were assembled into contigs using the software TRINITY r20140717 (Grabherr et al., 2011). TRINITY will only assemble contigs with a reasonable level of overlap between k-mers. All contigs were then blasted against the original bait model sequences using the same parameters as described above (Altschul et al., 1997). To quantify the number of recoverable genes from sequencing, we recorded the percentage of unique contigs constructed by TRINITY that were successfully matched to a gene from the bait design (Table 1). Because each locus was recovered as a series of short sequences, to estimate the success of recovering whole genes, we recorded the total number of genes that exhibited ≥ 90% coverage across their entire sequence (Fig. 2).
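
One way to compute the coverage measure described above is to merge the positions at which contigs match each target gene and divide the covered length by the gene length. The sketch below is a hypothetical illustration of that idea, assuming 1-based inclusive match coordinates on the target; the function name, the example coordinates, and the gene lengths are not from the study.

```python
# Sketch: fraction of a target gene covered by the union of contig matches.
# Intervals are 1-based, inclusive (start, end) positions on the gene;
# reversed coordinates (start > end) are normalised first.

def gene_coverage(gene_length, intervals):
    """Return the covered fraction of the gene (0.0 to 1.0)."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        start, end = max(start, 1), min(end, gene_length)  # clip to the gene
        if start > end:
            continue  # interval lies entirely outside the gene
        if cur_end is None:
            cur_start, cur_end = start, end
        elif start <= cur_end + 1:
            cur_end = max(cur_end, end)  # overlapping or adjacent: extend
        else:
            covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / gene_length


# Hypothetical matches on two genes, for illustration only.
matches = {
    "GENE_A": (1000, [(1, 400), (350, 700), (900, 650)]),
    "GENE_B": (400, [(1, 100), (201, 300)]),
}
high_coverage = [g for g, (length, hits) in matches.items()
                 if gene_coverage(length, hits) >= 0.90]
```

In this toy example, `GENE_A` is covered over positions 1 to 900 of 1000 and so meets the ≥ 90% threshold, whereas `GENE_B` is only half covered.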

Figure 2. Distribution of target genes recovered in comparison to genetic distance (A), concentration of DNA at extraction (B), collection date (C), and type of storage medium (D). Blue points mark the total number of genes recovered per species, including those with only partial sequences, and red points mark the number of genes with at least 90% of their sequence recovered.

Results

DNA extraction

Quantification of DNA extracts indicated wide variation in sample concentration, ranging from 59 ng μL⁻¹ for R. shameli down to 0.4 ng μL⁻¹ for R. imaizumi (Table 1). Similarly, DNA samples varied in their level of degradation, with most samples showing a fragment distribution consisting almost entirely of very long fragments, resulting in a high DNA integrity number (DIN > 8). Only R. bocharicus and R. imaizumi (which were not sheared) appeared to display a greater proportion of short fragments (< 14 500 bp). Rhinolophus bocharicus (the oldest sample) appeared to have almost exclusively fragments < 1500 bp in length, with most clustering around 200–400 bp. Rhinolophus imaizumi showed a wider distribution of lengths, perhaps as a result of the shorter length of time allowed for DNA to fragment, with a slope falling off steadily from a peak of fragments around 200 bp (see Supporting information, Fig. S1).

As might be expected, the oldest sample (R. bocharicus) yielded a low DNA concentration, whereas more recent tissue samples (collected in approximately the last 20 years) gave highly variable results (see Supporting information, Fig. S2). Storage conditions appeared to have some impact on DNA quantity: flash-frozen specimens and samples stored in high-concentration ethanol appeared to yield higher concentrations of DNA (see Supporting information, Fig. S2).

Quantification of bait capture success and phylogenetic distance to model species

We found no association between the success of capture and the phylogenetic distance between the model species and the target species (Fig. 2).
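
For readers who wish to explore the values in Table 1 themselves, a rank correlation between MRCA and the number of genes recovered could be computed as sketched below. This is an illustrative sketch only and is not an analysis reported by the authors; with ten samples and heavily tied MRCA values, any such coefficient would be descriptive at best. The function names are hypothetical, and Spearman's coefficient is computed here as the Pearson correlation of average ranks.

```python
# Sketch: Spearman rank correlation with average ranks for ties.
# Data are the MRCA (Mya) and "genes recovered" columns of Table 1,
# listed in the table's row order.

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


mrca_mya = [11, 11, 11, 14, 14, 12, 12, 12, 12, 15]
genes_recovered = [3151, 3163, 3177, 3024, 3186, 3165, 3093, 3141, 3168, 3144]
rho = spearman(mrca_mya, genes_recovered)
```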

Despite the extracted DNA concentration being standardized across all samples prior to bait capture, the number of assembled contigs recovered was highly variable. Six of the species assembled between approximately 30 000 and 36 000 contigs, whereas three samples (R. bocharicus, R. macrotis, and R. malayanus) yielded much higher numbers (up to 85 249) and one sample (R. subrufus) yielded substantially fewer contigs (approximately 24 000) (Table 1). However, these numbers did not scale well with the number of contigs that could be mapped to the bait sequences; for example, only 16.61% of 85 249 contigs could be matched to target genes in R. malayanus compared to 86.55% of 52 357 contigs for R. bocharicus (Table 1). As a result, the number of genes recovered was more consistent across samples than the number of contigs.

Of the 3520 genes included in the baits, the mean number of genes recovered per sample was 3141 (89.2%), with a range from 3024 (85.9%) for R. bocharicus to 3186 (90.5%) for R. clivosus. However, when considering only those genes with sequence coverage of ≥ 90%, the mean was only 1206.8 (34.3%) and ranged from 662 (18.8%) for R. malayanus to 1562 (44.4%) for R. shameli (Table 1). Overall, 449 of these genes with high coverage were recovered in all 10 target species.
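
The summary figures above follow directly from the last two columns of Table 1, as the short sketch below shows. The variable and function names are illustrative only.

```python
# Sketch: reproduce the summary statistics from the Table 1 columns
# "Number of genes recovered" and "Number of genes with sequence
# coverage >= 90%", out of 3520 target genes.

TARGET_GENES = 3520

genes_recovered = {
    "subrufus": 3151, "inops": 3163, "shameli": 3177, "bocharicus": 3024,
    "clivosus": 3186, "virgo": 3165, "malayanus": 3093, "macrotis": 3141,
    "imaizumi": 3168, "formosae": 3144,
}
genes_high_coverage = {
    "subrufus": 1204, "inops": 1296, "shameli": 1562, "bocharicus": 835,
    "clivosus": 1344, "virgo": 1411, "malayanus": 662, "macrotis": 956,
    "imaizumi": 1446, "formosae": 1352,
}


def summarise(counts):
    """Return mean, min, and max counts with each as a percentage of targets."""
    values = list(counts.values())
    mean = sum(values) / len(values)

    def pct(x):
        return round(100 * x / TARGET_GENES, 1)

    return {
        "mean": mean, "mean_pct": pct(mean),
        "min": min(values), "min_pct": pct(min(values)),
        "max": max(values), "max_pct": pct(max(values)),
    }


recovered_summary = summarise(genes_recovered)
high_coverage_summary = summarise(genes_high_coverage)
```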

Discussion

To assess the viability of target sequence capture on museum specimens across a diverse clade, we performed target sequence capture of 3520 genes using a bait designed from RNA-Seq data from two species of horseshoe bat. In total, we were able to recover most (89.23%) of the target genes from multiple, previously unsequenced taxa separated by up to 15 Myr (Guillén-Servent et al., 2003). We detected no clear pattern of variation in the number of genes recovered in each of the target species in relation to their phylogenetic distance to the ‘bait’ species (Fig. 2). Indeed, we were able to recover most of the genes with at least partial sequence coverage in 10 distinct species, which together span a wide range of the phylogenetic diversity in this group (Guillén-Servent et al., 2003; Stoffberg et al., 2010).

Although the DNA concentration of samples was standardized following extraction, the amount of DNA extracted from museum tissue is often indicative of its degree of fragmentation, and can thus be a proxy for the quality of the DNA. This was confirmed when running all samples on a TapeStation (Agilent). Samples that were low in concentration at extraction appeared to be fragmented into much shorter sequences compared to samples with a high DIN. The DIN is a measure of fragmentation, which is used to infer the quality of the DNA (Gassman & McHoull, 2014). It is perhaps important to note that the overall DNA concentration from a museum specimen is not a good correlate of capture success, again as a result of the detrimental effects of preservation (Enk, Rouillard & Poinar, 2013). Simply by omitting the shearing step for the most degraded samples, we were able to avoid any further damage and retain the samples for successful target sequence capture. Most samples used in the present study were recent collections (10–20 years); however, one sample (R. bocharicus) was collected in 1921. Although the date of collection did appear to have some bearing on the number of genes recovered based on results of R. bocharicus (Fig. 2), a wider spread of collection dates would be needed to verify this trend. Because R. bocharicus was also stored in a low-quality preservation medium, this may also be the reason for its somewhat lower recovery of genes.

Although some tissues preserved in low-quality storage media did recover numbers of genes comparable to high-quality storage methods, low-quality media appear to be highly variable in their performance. To corroborate this, further testing of high-quality storage media would be desirable, given the low sample number in the present study. Sample storage media for the tissues used in the present study are also confounded with the date of specimen collection, with older samples tending to be stored in poorer quality media. Future work would ideally look at capture success across a larger sample of older specimens and more recent specimens preserved in high-quality media.

We used a range of tissues in the present study and found that DNA can be recovered from internal organs (heart, liver, etc.) with favourable results, minimizing the external destructive sampling of specimens and making use of tissues often destroyed in the preservation process (e.g. skin preparation). Even very old tissues (R. bocharicus) and tissues with low quantity and quality (fragmented) DNA (R. imaizumi) were able to be sequenced with comparable success to tissues collected within the past 10 years (R. malayanus) and those flash frozen in the field (R. clivosus), with this latter field preservation method being the most favourable for recovery of nucleic acids.

Although the mean number of genes recovered with high coverage was 1207, there was some variation among species and individuals. As a consequence, the number of genes recovered in all 10 species was relatively low (449 genes with ≥ 90% sequence coverage). Apart from the factors considered (e.g. storage media and phylogenetic distance), other factors inherent to the short read length of Illumina sequencing may affect capture success. Multiple filtering stages and overly short sequence length for contig recovery and BLAST results can negatively affect gene sequence recovery. Minor adjustments to BLAST parameters, such as sequence length or expected value, may mean important portions of genes are not ‘recovered’ for the final analysis (Altschul et al., 1997; Moreno-Hagelsieb & Latimer, 2008).

The overall success of the bait capture method supports its use as a robust technique for museum specimens from related species, in this case separated by up to 15 Myr. As is demonstrated by the general lack of distinct patterns of correlation between all measurements taken, there appears to be little to support any single factor affecting the success of bait capture in museum specimens. None of the variables measured showed a significant correlation with the final number of target genes recovered by the baits. It should be noted, however, that, in the present study, we were unable to account for many external factors, such as the time between collection and preservation and the number of freezing/thawing cycles, and, as a result of small sample sizes, we were unable to draw explicit conclusions regarding age of specimen or variety of storage media.

One important concern with utilizing museum-based material is that of contamination from outside sources. DNA from humans, bacteria, fungi, and other specimens (Malmstrom et al., 2005; Miller et al., 2009) can all drastically affect the output of whole genome sequencing because multiple taxa may be sequenced simultaneously. The problems of exogenous DNA may also not be resolved during the analysis stage, especially where short sequences are hard to tell apart, or where data from nontarget taxa have completely swamped the signal of the target species. Although we did not focus on contamination per se, we found no evidence of this based on preliminary analyses of approximately 100 genes. In particular, when our captured gene sequences were aligned with published mammal orthologues and used to build phylogenetic trees, we found that the former almost always formed monophyletic clades to the exclusion of human and M. lucifugus sequences. Here, we found no direct link between the initial DNA concentration at extraction and the number of contigs successfully assembled by TRINITY or between the number of contigs assembled and the number of genes recovered (see Supporting information, Figs S3 and S4).

Our results provide further support for the application of target sequence capture within the field of systematics. Target sequence capture could have particular importance for museum collections by opening up possibilities for incorporating genetic material from type specimens. These specimens underpin the morphological descriptions of almost all recognized species and, as such, serve as benchmarks for field identification and taxonomy (Kingdon, 1997; Csorba et al., 2003), as well as for conservation action (DeSalle & Amato, 2004). Yet, despite this, only a relatively small proportion of type specimens has been sequenced for inclusion in studies of phylogenetics and phylogenomics (Schindel et al., 2011; Staats et al., 2013; Brugler et al., 2014). Notable exceptions include the current efforts by the American Museum of Natural History to obtain COI barcodes from their mammal type specimens (Brugler et al., 2014). The rarity of such initiatives is unfortunate, not least because surprisingly large numbers of field researchers are unable to assign a taxonomic name to their study organisms with complete confidence (Wilson, 2000; Kim & Byrne, 2006; Bennett & Balick, 2014). This is especially true of researchers working on taxa that are cryptic or otherwise morphologically difficult to tell apart (Jones & Vanparijs, 1993; Kingston et al., 2001; Soisook et al., 2008). Sequencing of type material would benefit large numbers of taxonomists, including collectors and museum staff, who are engaged in laborious comparisons involving costly international loans and visits. If more type specimens were sequenced, then verifying the status of new species would be much simpler (Schindel et al., 2011; Brugler et al., 2014) and, effectively, digital through the use of online databases (Rhead et al., 2010; Kasprzyk, 2011; Goodstein et al., 2012). Additionally, it would help settle long-standing taxonomic debates, including the verification of the taxonomic distinctiveness of numerous specimens that were described in the past but are now viewed as questionable (Zhang et al., 2009; Helgen et al., 2013).

To conclude, the present study demonstrates the value of sequence capture for obtaining large volumes of target sequence data from museum specimens, including some that show high levels of degradation. Moreover, the method worked well across congeneric species that diverged up to 15 Mya. The comparative ease with which this method can recover large numbers of orthologous loci unlocks new opportunities for incorporating museum specimens, including types, into phylogenomic studies (Bi et al., 2013; Nachman, 2013; Tin et al., 2014). In particular, the removal of tiny amounts of material from museum specimens may add enormous value to the content of collections.

Acknowledgements

We are very grateful to K. Davies and J. Parker (Queen Mary, University of London) for their guidance and input throughout the bioinformatics stage of our study. We would like to thank C. Mein and his team at the Barts and London Genome Centre for their assistance during the laboratory work and for the use of their facilities, as well as P. Syrris and C. Dalageorgou at the UCL Cancer Institute. We thank D. S. Balete and E. A. Rickart for assistance with collecting specimens in the Philippines, the Protected Areas and Wildlife Bureau of the Department of Environment and Natural Resources for permits, and the Negaunee Foundation and the FMNH's Brown Fund for Mammal Research for financial support. We thank one anonymous reviewer for helpful comments. SEB was supported by a NERC studentship. GC was supported by the Hungarian Scientific Research Fund (OTKA) K112440 and SJR was supported by a Syntax research grant.
