Volume 48, Issue 8 pp. 2171-2184
RESEARCH ARTICLE
Equivalence of citizen science and scientific data for modelling species distribution of birds from a tropical savanna

Eduardo Guimarães Santos

Corresponding Author

Programa de Pós-graduação em Ecologia, Instituto de Ciências Biológicas, Universidade de Brasília, Brasília, Brazil

Correspondence

Eduardo Guimarães Santos, Programa de Pós-graduação em Ecologia, Instituto de Ciências Biológicas, Universidade de Brasília, 70919-970 Brasília, DF, Brazil.

Contribution: Conceptualization (equal), Data curation (equal), Formal analysis (equal), Investigation (equal), Methodology (equal), Validation (equal), Visualization (equal), Writing - original draft (equal), Writing - review & editing (equal)

Helga Correa Wiederhecker

Independent Researcher, Brasília, Brazil

Contribution: Conceptualization (equal), Investigation (equal), Methodology (equal), Validation (equal), Writing - review & editing (equal)

Leonardo Esteves Lopes

Laboratório de Biologia Animal, IBF, Universidade Federal de Viçosa – Campus Florestal, Florestal, Brazil

Contribution: Data curation (equal), Validation (equal), Writing - review & editing (equal)

Miguel Ângelo Marini

Departamento de Zoologia, Instituto de Ciências Biológicas, Universidade de Brasília, Brasília, Brazil

Contribution: Conceptualization (equal), Investigation (equal), Methodology (equal), Supervision (equal), Writing - review & editing (equal)

First published: 05 November 2023

Abstract

The Wallacean deficit continues to be a challenge to species distribution modelling. Although some authors have suggested that data collected by citizen scientists can be relevant for a better understanding of biodiversity, to our knowledge, no work has quantitatively tested the equivalence between scientific and citizen science data. Here, we investigate the hypothesis that data collected by citizen scientists can be equivalent to data collected by professional scientists when generating species spatial distribution models. For 42 bird species in the Cerrado region we generated and compared species distribution models based on three data sources: (1) scientific data, (2) citizen science data and (3) sample size corrected citizen science data. To test our hypothesis, we compared the equivalence of these datasets. We rejected the hypothesis of equivalence for about one-third (38%) of the evaluated species, revealing that, for most of the species considered, the models generated were equivalent irrespective of the data set used. The distances between centroids of the models that were equivalent were on average smaller than the distances between non-equivalent models. Also, the direction of change in the models showed no pattern, with no trend towards more populated regions. Our results show that the use of data collected by citizen scientists can be an ally in filling the Wallacean deficit gap. In fact, the lack of use of this wide range of data collected by citizen scientists seems to be an unjustified caution. We indicate the potential of using citizen science data for modelling the distribution of species, mainly due to the large set of data collected, which is impracticable for scientists alone to collect. Conservation measures will be favoured by the union of professional and amateur data, aiming for a better understanding of species distribution and, consequently, biodiversity conservation.

Resumo

O déficit wallaceano continua a ser um desafio para a modelagem da distribuição das espécies. Embora alguns autores tenham sugerido que os dados recolhidos por cientistas cidadãos podem ser relevantes para uma melhor compreensão da biodiversidade, pelo nosso conhecimento, nenhum trabalho testou quantitativamente a equivalência entre dados científicos e de ciência cidadã. Aqui, investigamos a hipótese de que os dados coletados por cientistas cidadãos podem ser equivalentes aos dados coletados por cientistas profissionais na geração de modelos de distribuição espacial de espécies. Para 42 espécies de aves na região do Cerrado, geramos e comparamos modelos de distribuição de espécies baseados em três fontes de dados: 1) dados científicos, 2) dados da ciência cidadã, e 3) dados da ciência cidadã corrigidos pelo tamanho da amostra. Para testar a nossa hipótese, comparamos a equivalência destes conjuntos de dados. Rejeitamos a hipótese de equivalência para cerca de 1/3 (38%) das espécies avaliadas, revelando que, para a maioria das espécies consideradas, os modelos gerados eram equivalentes independentemente do conjunto de dados utilizado. As distâncias entre os centroides dos modelos equivalentes foram, em média, menores do que as distâncias entre os modelos não equivalentes. Ainda, a direção da mudança nos modelos não mostrou nenhum padrão, sem tendência para regiões mais populosas. Os nossos resultados mostram que a utilização de dados recolhidos por cientistas cidadãos pode ser um aliado no preenchimento da lacuna do déficit wallaceano. De fato, não utilizar esta vasta gama de dados recolhidos por cientistas cidadãos parece ser uma precaução injustificada. Indicamos o potencial da utilização de dados da ciência cidadã para a modelação da distribuição das espécies, principalmente devido ao grande conjunto de dados recolhidos, cujo recolhimento é impraticável apenas para os cientistas. As medidas de conservação serão favorecidas pela união de dados profissionais e amadores, visando uma melhor compreensão da distribuição das espécies e, consequentemente, a conservação da biodiversidade.

INTRODUCTION

One of the most dramatic aspects of the biodiversity crisis is the mismatch between the pace of biodiversity destruction and the speed at which species information is produced: biodiversity is being lost much faster than it is being documented. Species distribution modelling has improved over the last 20 years and now plays an important role in understanding species spatial distribution patterns. In fact, understanding the distribution patterns of species through time can help researchers and managers intervene in the event of a distribution shift or decline in biodiversity, with direct implications for species conservation (Melo-Merino et al., 2020; Peterson et al., 2011, 2015). Despite this improvement, the Wallacean deficit, that is, the lack of knowledge about where species occur geographically (Lomolino, 2004; Proença et al., 2017; Whittaker et al., 2005), continues to be a challenge to species distribution modelling, which in turn limits knowledge of species' geographic ranges. This is because the higher the number of records, the greater the probability of generating adequate predictive models (Feeley & Silman, 2011). Thus, methods that accelerate the accumulation of biodiversity information, such as citizen science, can be an ally of conservationists.

Citizen science is a way of engaging volunteer citizens (mostly non-specialists) with scientific production, including recording data of potential scientific use, such as species occurrence (Auerbach et al., 2019; Heigl et al., 2019). Citizen science began ~1900 (Cohn, 2008) and has advanced worldwide. It is regarded as a promising area, given that it can increase our knowledge of biodiversity (Follett & Strezov, 2015; Hannibal, 2016; Kerstes et al., 2019; MacPhail & Colla, 2020). Although it is not the only objective of citizen science (see Cohn, 2008; Haklay et al., 2021), involving volunteers in the collection of data often increases the volume of data produced. However, one of the limitations that has plagued scientists using citizen science data is the spatial bias associated with human population centres, as most citizen science data are collected in the vicinity of these urban centres (Geldmann et al., 2016). Thus, it is important to understand the strengths and limitations of these large data sets that are being collected around the world.

Citizen science has become an ally for scientists interested in understanding the natural world. In fact, citizen-collected data have been used to supplement scientific data in many studies. Birds have received the majority of citizen science attention. For example, citizen science data are improving understanding in several fields, such as community structuring (La Sorte et al., 2018; La Sorte & Somveille, 2020), breeding (Ferreira et al., 2019; Turella et al., 2022), migration (Schubert et al., 2019), conservation (Steven et al., 2019), patterns of occurrence and abundance (Lepczyk et al., 2017) and, more recently, the impact of COVID-19 (Schrimpf et al., 2021). However, although some authors have highlighted that data collected by citizen scientists can be relevant for a better understanding of biodiversity (Chandler et al., 2017; La Sorte & Somveille, 2020; Poisson et al., 2020; Steven et al., 2019), including demonstrating its importance (e.g. Aceves-Bueno et al., 2017; Kosmala et al., 2016), to our knowledge, no work has quantitatively tested the equivalence of data sets collected by citizen scientists and data sets collected by scientists.

Here, we used data on 42 Neotropical bird species to investigate the hypothesis that citizen scientists' (CIT) data are equivalent to professional scientists' (SCI) data when generating species geographic distribution models. To compare the equivalence of these data sets, we explored two hypotheses. First, we compared the equivalence of models using an identity test, controlling for the discrepancy in the number of records (H1). Second, to verify the spatial distribution of SCI and CIT models, we compared their overlaps and the centroid shifts of each generated model (H2). We expected that models generated only with CIT data would be geographically biased towards locations with higher human concentration, due to data collection biases. We then pinpointed the cases in which data from CIT can help fill in the gaps of species occurrence. Finally, we investigated the influence of biological and ecological characteristics on SCI and CIT data equivalence and discussed the implications of our results.

METHODS

Species evaluated

We used 42 species that occur in the Cerrado, a savanna-like biome, which is a biodiversity hotspot that has been intensely threatened by landscape changes due mainly to pasture and agriculture expansion (Klink & Machado, 2005; Myers et al., 2000). The 42 species differ in: extent of occurrence (min = 17 800 km2, max = 7 160 000 km2, www.datazone.birdlife.org); body size (min = 9.5 cm, max = 34.5 cm – Gwynne et al., 2010; Ridgely & Tudor, 2009); conservation status (Least Concern, Near Threatened, Vulnerable, Endangered – www.iucnredlist.org); population trend (decreasing, increasing, or stable – www.datazone.birdlife.org); and habitat use (grasslands – GR, savannas – SV, and forests – FO) and sensitivity to disturbance (low, medium and high) (Sousa et al., 2021). The 42 species chosen for our analysis have been extensively studied by scientists over the last 20 years in our region (e.g. Lopes, 2008; Marini et al., 2009; Sousa et al., 2021), which has guaranteed reliable scientific data compiled from different databases (for more details see the topic Range delimitation in Appendix S1).

Species occurrence data

We collated CIT occurrence points from three data sources: iNaturalist (n = 1287, research-grade data from www.inaturalist.org), eBird (n = 53 423, www.ebird.org) and WikiAves (n = 47 816, www.wikiaves.com.br). We collated SCI occurrence points (professionally collected) from GBIF (n = 1652, www.gbif.org), ‘Portal da Biodiversidade’ (n = 7781, www.portaldabiodiversidade.icmbio.gov.br), and from scientific articles published until 2021 (n = 5172). We proceeded with the literature review using both the scientific name (including synonyms and previous combinations), and the common name (Portuguese and English) for each species. Also, our review included specimens deposited in 11 Brazilian institutions, six North American institutions, and seven European institutions (to access the complete scientific data review methodology, see Range delimitation in Appendix S1). We collated a total of 117 075 occurrence points (SCI = 14 605 and CIT = 102 470). We obtained, for each species, three distribution model ensembles (SCI, CITcorr and CITuncorr; Table S1).

Species distribution modelling – SDM

We used 19 bioclimatic variables from WorldClim (Hijmans et al., 2005; www.worldclim.org), together with altitude, slope (Amatulli et al., 2018), and slope aspect (Holland & Steyn, 1975). Spatial layers were adjusted to 5 km resolution. We condensed environmental information with a Principal Component Analysis (PCA) (De Marco & Nóbrega, 2018), using 95% of the total variation as a cut-off (Table S2). For each set of points (CIT and SCI) we filtered and removed occurrences that were within a distance of 10 km from each other (2× cell size) (Andrade et al., 2020; Veloz, 2009). Since CIT records are typically more abundant than SCI records (see Table S1), we randomly selected from the CIT data the same number of points as in the SCI data to create the sample-size corrected subset CITcorr. This controlled for the difference in sample size between the two datasets. We created two comparison scenarios: (1) Total occurrences, using all records of SCI and CIT (SCI × CITuncorr); and (2) Equal occurrences, using a reduced set of CIT records (SCI × CITcorr).
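The filtering and sample-size correction described above can be illustrated with a minimal Python sketch. It is not the ENMTML implementation used in the study: the greedy thinning rule, the great-circle distance, the function names (`haversine_km`, `thin_records`, `rarefy`) and the coordinates are all assumptions made for illustration. The sketch rarefies the citizen science records to the number of scientific records, which is how the figure captions define CITcorr.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def thin_records(points, min_km=10.0):
    """Greedy thinning: keep a record only if it lies at least min_km
    from every record already kept (an assumed rule, not ENMTML's)."""
    kept = []
    for p in points:
        if all(haversine_km(p, q) >= min_km for q in kept):
            kept.append(p)
    return kept

def rarefy(cit_points, n_sci, seed=0):
    """Draw n_sci CIT records at random without replacement
    (returns all of them if fewer than n_sci exist)."""
    if n_sci >= len(cit_points):
        return list(cit_points)
    return random.Random(seed).sample(cit_points, n_sci)

# Hypothetical (lat, lon) records, for illustration only.
sci = [(-15.78, -47.93), (-15.79, -47.93), (-16.50, -49.00)]
cit = [(-15.60, -47.65), (-15.61, -47.65), (-14.10, -47.50),
       (-17.20, -46.90), (-13.50, -48.20)]

sci_thin = thin_records(sci)                  # records about 1 km apart collapse to one
cit_uncorr = thin_records(cit)                # all thinned CIT records (CITuncorr)
cit_corr = rarefy(cit_uncorr, len(sci_thin))  # CIT reduced to the SCI sample size (CITcorr)
print(len(sci_thin), len(cit_uncorr), len(cit_corr))
```

Fixing the random seed keeps the rarefied subset reproducible; in practice one might repeat the draw several times to check that conclusions do not depend on a single subsample.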

Next, we generated species distribution models using the three data sets (SCI, CITcorr and CITuncorr). We used five widely used algorithms: Maximum Entropy (MXD); Support Vector Machine (SVM); Random Forest (RF); Generalized Linear Models (GLM); and Bayesian Gaussian Process (GAU). We used the bootstrap technique (10 replicates) (Fielding & Bell, 1997), separating 70% of occurrences for model training and 30% for testing. For algorithms that require pseudo-absences, we kept a 1:1 ratio with the presence records in each scenario, allocating pseudo-absences in geographic areas with lower environmental suitability, as predicted by a Bioclim model (Engler et al., 2004). We used the True Skill Statistic (TSS) as a performance metric for each model generated (Allouche et al., 2006). To generate the final model for each data set, we built an ensemble of the best models, those with TSS greater than the overall mean value; these ensembles are hereafter referred to as distribution models. All procedures were performed using the ENMTML package (Andrade et al., 2020) in R.
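As a rough illustration of the evaluation step, the sketch below computes TSS from a confusion matrix (TSS = sensitivity + specificity - 1; Allouche et al., 2006) and keeps the algorithms whose TSS exceeds the mean. It assumes that the "overall mean" is the mean TSS across the algorithms fitted to one data set. The confusion counts and the function names are hypothetical, and the rule used to combine the retained models into an ensemble is not shown because the text does not specify it.

```python
def tss(tp, fn, tn, fp):
    """True Skill Statistic: sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0

def select_above_mean(scores):
    """Keep the algorithms whose score is strictly greater than the mean score."""
    mean_score = sum(scores.values()) / len(scores)
    return {name: s for name, s in scores.items() if s > mean_score}

# Hypothetical test-set counts (tp, fn, tn, fp) for the five algorithms,
# with 30 presences and 30 pseudo-absences each (1:1 ratio).
counts = {
    "MXD": (27, 3, 24, 6),
    "SVM": (24, 6, 21, 9),
    "RF": (28, 2, 26, 4),
    "GLM": (21, 9, 21, 9),
    "GAU": (25, 5, 24, 6),
}
scores = {name: tss(*c) for name, c in counts.items()}
ensemble_members = select_above_mean(scores)
print(sorted(ensemble_members))
```

With these illustrative counts, three of the five algorithms pass the above-mean filter and would contribute to the final distribution model.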

Comparison between models

We compared the three distribution models for each species using the identity test (also called the equivalence test). We performed two pairwise comparisons, SCI × CITuncorr and SCI × CITcorr, contrasting the models' values for each cell. We considered two similarity metrics, D and I, which range from 0 to 1 (0 = no overlap, 1 = niche models are identical) (Warren et al., 2008). Afterwards, we performed the hypothesis test based on null distributions of D and I values derived from 1000 randomized null models for each compared dataset. We tested equivalence by comparing the observed value with the null distribution, with a cut-off limit of p < 0.05 (Warren et al., 2008). Comparisons were performed using the ENMTools package (Warren et al., 2021; Warren & Dinnage, 2021) in R (R Core Team, 2020).
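For readers unfamiliar with the two metrics, the following sketch computes Schoener's D and the Hellinger-based I from two suitability vectors defined over the same cells, after normalising each vector to sum to 1: D = 1 - 0.5 Σ|p_x - p_y| and I = 1 - 0.5 Σ(√p_x - √p_y)². This is a simplified stand-in for the ENMTools routines, not a reproduction of them. The one-sided p-value convention, the function names and the numbers are assumptions, and the null values would in practice come from models refitted on pooled, randomly re-partitioned records.

```python
import math

def normalise(suitability):
    """Rescale non-negative suitability values so that they sum to 1."""
    total = sum(suitability)
    return [v / total for v in suitability]

def schoener_d(x, y):
    """Schoener's D: 1 minus half the summed absolute differences."""
    px, py = normalise(x), normalise(y)
    return 1.0 - 0.5 * sum(abs(a - b) for a, b in zip(px, py))

def warren_i(x, y):
    """I statistic: 1 minus half the squared Hellinger distance."""
    px, py = normalise(x), normalise(y)
    return 1.0 - 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                           for a, b in zip(px, py))

def identity_p_value(observed, null_values):
    """One-sided permutation p-value: share of values (null replicates plus
    the observed one) that are at most the observed overlap. Assumed convention."""
    at_or_below = sum(v <= observed for v in null_values)
    return (1 + at_or_below) / (1 + len(null_values))

# Hypothetical suitability values for three cells under two models.
sci_map = [0.2, 0.4, 0.4]
cit_map = [0.4, 0.4, 0.2]
d_obs = schoener_d(sci_map, cit_map)
i_obs = warren_i(sci_map, cit_map)

# Hypothetical null D values (the study used 1000 replicates per comparison).
null_d = [0.90, 0.85, 0.60, 0.95]
p_d = identity_p_value(0.66, null_d)
print(round(d_obs, 3), round(i_obs, 3), p_d)
```

An observed overlap that falls in the lower tail of the null distribution (p < 0.05) leads to rejecting equivalence, which matches the direction of the test described above.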

We used the equivalence test to classify each species' models as either equivalent or non-equivalent. To identify the most relevant variables for distinguishing equivalent from non-equivalent models, we used the species characteristics described above as predictors in a guided regularized random forest analysis (GRRF, Breiman, 2001), using the ‘RRF’ (Deng, 2013) and ‘randomForest’ packages (Liaw & Wiener, 2002). We used the Mean Decrease Accuracy (MDA) value to evaluate variable importance and assessed model accuracy with the confusionMatrix function of the ‘caret’ package (Kuhn, 2021).

To test the spatial differences in the models, using the scientific model (SCI) as a base, we characterized every modelled distribution (CITuncorr and CITcorr) by its range and position (latitude and longitude of the range centroid). Thus, we calculated: (1) the models' range increase for each species, as the percentage of the predicted area that does not intersect the SCI model, that is, modelled areas lying outside the model generated with the scientific data; and (2) the range shift, as the distance and direction between the centroids of the models, using the ‘sp’ (Bivand et al., 2013) and ‘rearrr’ (Olsen, 2023) packages. Our objective was to present a spatial view of the predicted change, testing the hypothesis of change directed towards more populated areas, in southeastern Brazil (IBGE, 2022).
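The two spatial measures can be sketched as follows. This is an illustration under stated assumptions rather than the ‘sp’/‘rearrr’ workflow used in the study: ranges are represented as sets of hypothetical grid-cell identifiers, the range increase is expressed relative to the SCI range (as in the Table 1 caption), and the range shift is computed as a great-circle distance with an initial bearing measured clockwise from north. The function names and coordinates are invented for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0

def range_increase_pct(sci_cells, cit_cells):
    """Percentage of cells predicted by the CIT model that fall outside
    the SCI model, relative to the SCI range size (assumed denominator)."""
    sci, cit = set(sci_cells), set(cit_cells)
    return 100.0 * len(cit - sci) / len(sci)

def centroid_shift(sci_centroid, cit_centroid):
    """Return (distance_km, bearing_deg) from the SCI centroid to the CIT
    centroid; centroids are (lat, lon) in degrees, bearing clockwise from north."""
    lat1, lon1 = map(math.radians, sci_centroid)
    lat2, lon2 = map(math.radians, cit_centroid)
    dlon = lon2 - lon1
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    distance = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))
    y = math.sin(dlon) * math.cos(lat2)
    x = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(dlon))
    bearing = math.degrees(math.atan2(y, x)) % 360.0
    return distance, bearing

# Hypothetical grid-cell identifiers and centroids, for illustration only.
sci_cells = {1, 2, 3, 4}
cit_cells = {3, 4, 5}
increase = range_increase_pct(sci_cells, cit_cells)

# A CIT centroid one degree west of the SCI centroid at 15 degrees south.
distance_km, bearing_deg = centroid_shift((-15.0, -47.0), (-15.0, -48.0))
print(increase, round(distance_km), round(bearing_deg))
```

A systematic bias towards the more populated southeast would appear as bearings clustered in the southeastern quadrant (roughly 90° to 180° under this convention) rather than scattered in all directions.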

RESULTS

Comparison between scientific and citizen science distribution models

All models presented adequate results, with TSS average values greater than 0.5 (Figure 1, Table S3, Figure S1). Mean overlap values (D and I) between SCI and CIT models were similar for the full (CITuncorr) and reduced (CITcorr) citizen science models (SCI × CITcorr, D = 0.655, I = 0.910; SCI × CITuncorr, D = 0.659, I = 0.910; Figure 2, Figure S1).

FIGURE 1. Results from the species distribution models generated in our study. True Skill Statistic (TSS) values and standard deviations of the three generated models. SCI = Model generated with scientific data only; CITcorr = Model generated with data collected by citizen scientists that have been rarefied to equal the number of records of scientific data; and CITuncorr = Model generated with all data collected by citizen scientists. *Represents our hypothesis test, with p-values <0.05 (representing non-equivalence). Colourblind-friendly colour combinations generated with the colorBlindness package (Ou, 2021).

FIGURE 2. Examples of distribution models generated in our comparisons for two species: SCI = Model with scientific data; CITcorr = Model with data collected by citizen scientists that have been rarefied to equal the number of records of scientific data; and CITuncorr = Model with data collected by citizen scientists. We present the suitability map with the occurrence records (blue dots) and the spatial distribution generated (hatched). Top: models indicated as equivalent in our analysis for Helmeted Manakin (Antilophia galeata); Bottom: models indicated as non-equivalent in our analysis for Coal-crested Finch (Charitospiza eucosma).

Comparison of the models revealed that most of the species in our investigation obtained equivalent distribution models. We rejected the hypothesis of equivalence between CIT and SCI distribution models for 38% of the evaluated species (p < 0.05); for the remaining 62%, the models were equivalent irrespective of the data set used. This result was consistent regardless of the number of citizen science records used (CITcorr or CITuncorr). According to the Random Forest results, extent of occurrence, body size and habitat use were the most important variables distinguishing species with equivalent models from species with non-equivalent models (Figure 3).

FIGURE 3. Main results of the equivalence comparison between scientific (SCI) and citizen science (CIT) models. (a) Total occurrences results, using all records of SCI and CIT (SCI × CITuncorr, on the left); and Equal occurrences results, using a reduced set of CIT records (SCI × CITcorr, on the right). The pairwise comparison for scientific and citizen science data sets is represented in the graph according to body size and extent of occurrence, using different symbols to indicate model comparison significance (p-value <0.05), threat level and habitat: grasslands – GR, savannas – SV, and forests – FO. (b) Random Forest results reporting the Mean Decrease Accuracy (MDA), which indicates each variable's importance for distinguishing species with equivalent models from species with non-equivalent models. Black areas indicate a large cluster of points.

The models indicated as equivalent showed a lower average percentage difference in their spatial distribution (Equivalent SCI × CITuncorr: 20%, Equivalent SCI × CITcorr: 21%) than the non-equivalent models (Non-equivalent SCI × CITuncorr: 32%, Non-equivalent SCI × CITcorr: 28%). Also, the distances between centroids of the models that were equivalent were on average smaller (Equivalent SCI × CITuncorr: 99 km, Equivalent SCI × CITcorr: 105 km) than the distances between non-equivalent models (Non-equivalent SCI × CITuncorr: 214 km, Non-equivalent SCI × CITcorr: 233 km), thus demonstrating that the equivalent models are spatially closer (Table 1). The direction of change in the models showed no pattern, with no trend towards more populated regions, contrary to what we had predicted (Figure 4).

TABLE 1. Estimated range size from scientific data (SCI), percentage of range increase relative to SCI, and direction of range shift from SCI under CITcorr and CITuncorr estimated ranges for 42 bird species.
Species | Area SCI (km2) | CITcorr: range increase (%) | CITcorr: range shift (km) | CITcorr: direction of range change (° to north) | CITuncorr: range increase (%) | CITuncorr: range shift (km) | CITuncorr: direction of range change (° to north)
Nothura minor | 1 043 182 | 9 | 114 | 31 | 30 | 149 | 25
Taoniscus nanus | 1 117 099 | 31 | 130 | 210 | 22 | 120 | 288
Uropelia campestris | 3 871 768 | 15 | 135 | 238 | 15 | 130 | 246
Augastes scutatus | 144 334 | 58 | 282 | 260 | 85 | 153 | 161
Heliactin bilophus | 4 130 344 | 7 | 126 | 263 | 7 | 120 | 269
Campylopterus calcirupicola | 115 352 | 23 | 103 | 7 | 29 | 75 | 11
Campylopterus diamantinensis | 23 564 | 20 | 80 | 164 | 20 | 98 | 171
Celeus obrieni | 947 739 | 32 | 22 | 115 | 30 | 20 | 98
Alipiopsitta xanthops | 3 553 351 | 4 | 82 | 256 | 6 | 83 | 267
Pyrrhura pfrimeri | 16 027 | 18 | 49 | 271 | 23 | 50 | 290
Herpsilochmus longirostris | 2 844 509 | 26 | 186 | 245 | 28 | 183 | 243
Thamnophilus torquatus | 3 295 494 | 12 | 43 | 25 | 16 | 76 | 23
Cercomacra ferdinandi | 350 698 | 45 | 72 | 272 | 23 | 60 | 227
Melanopareia torquata | 3 709 474 | 29 | 192 | 289 | 29 | 224 | 281
Scytalopus novacapitalis | 44 646 | 46 | 449 | 155 | 36 | 132 | 177
Geositta poeciloptera | 1 477 521 | 55 | 280 | 160 | 56 | 430 | 149
Syndactyla dimidiata | 1 308 886 | 24 | 82 | 89 | 30 | 80 | 84
Clibanornis rectirostris | 1 842 529 | 27 | 26 | 330 | 26 | 83 | 59
Asthenes luizae | 39 441 | 25 | 55 | 274 | 17 | 50 | 253
Synallaxis simoni | 3 328 488 | 19 | 105 | 266 | 11 | 100 | 225
Antilophia galeata | 2 633 265 | 5 | 101 | 254 | 5 | 70 | 258
Phylloscartes roquettei | 228 768 | 26 | 112 | 135 | 25 | 100 | 120
Euscarthmus rufomarginatus | 4 921 760 | 13 | 158 | 257 | 7 | 150 | 248
Phyllomyias reiseri | 1 618 435 | 27 | 230 | 293 | 24 | 233 | 289
Culicivora caudacuta | 3 751 684 | 8 | 230 | 233 | 12 | 165 | 246
Polystictus superciliaris | 532 913 | 25 | 99 | 274 | 55 | 150 | 196
Guyramemua affine | 3 768 105 | 22 | 482 | 200 | 21 | 457 | 194
Alectrurus tricolor | 2 505 625 | 11 | 267 | 271 | 12 | 223 | 278
Knipolegus franciscanus | 398 409 | 27 | 112 | 339 | 30 | 110 | 266
Cyanocorax cristatellus | 3 336 090 | 2 | 119 | 241 | 4 | 91 | 273
Myiothlypis leucophrys | 1 381 891 | 35 | 131 | 300 | 28 | 86 | 213
Charitospiza eucosma | 3 374 720 | 7 | 284 | 291 | 8 | 308 | 287
Coryphaspiza melanotis | 4 253 439 | 18 | 310 | 335 | 17 | 314 | 331
Embernagra longicauda | 330 734 | 37 | 32 | 214 | 15 | 26 | 223
Porphyrospiza caerulescens | 3 835 194 | 5 | 168 | 232 | 8 | 163 | 235
Saltatricula atricollis | 4 250 500 | 34 | 103 | 281 | 37 | 126 | 293
Conothraupis mesoleuca | 162 115 | 43 | 177 | 232 | 43 | 112 | 204
Cypsnagra hirundinacea | 3 688 177 | 30 | 254 | 358 | 27 | 223 | 1
Microspingus cinereus | 1 155 530 | 10 | 153 | 60 | 45 | 150 | 53
Neothraupis fasciata | 3 776 245 | 24 | 96 | 274 | 24 | 150 | 262
Schistochlamys ruficapillus | 3 259 912 | 17 | 123 | 105 | 7 | 120 | 229
Paroaria baeri | 654 666 | 50 | 124 | 349 | 33 | 50 | 149
* Represents our hypothesis test, with p-values <0.05 (representing non-equivalence).

FIGURE 4. Main results when comparing the direction of change in the models generated with data collected by citizen scientists compared to the scientific model. The base of arrows represents the centroid of the SCI model, and the tip of the arrow represents the centroid of models generated with data collected by citizen scientists (CIT). (a) Centroid shift for the uncorrected model (CITuncorr), (b) centroid shift for the corrected model (CITcorr).

DISCUSSION

Our results show that data collected by citizen scientists can help fill the Wallacean deficit. In our case, data from citizen scientists represented more than 88% of all records found for the studied species (Figure S1). Furthermore, almost two thirds of the evaluated species presented model equivalence when using SCI and CIT data sets, and the models generated with the different data sets showed a high geographical overlap, as expected, especially in the models that were equivalent. In addition, we did not observe bias of the CIT models related to data collection in more populated regions. Our findings demonstrate the high value of CIT data, which cannot be overlooked in light of the current biodiversity crisis and the need to implement effective conservation planning (Butchart et al., 2010; Isbell et al., 2023; Jaureguiberry et al., 2022).

The selection of reliable data sets is challenging and can influence the quality of the distribution model generated (Duputié et al., 2014; Guisan et al., 2017). Thus, caution has been required in the use of citizen science data due to possible identification biases and inaccuracies in occurrence data (Anderson, 2012). Nevertheless, we did not detect a pattern relating species ecology to whether SCI and CIT models were equivalent (Figure 3b). Several factors might contribute to this, including specialized amateur birdwatchers' groups that can make remarkable records even of species that are rare and difficult to detect. We also did not observe bias due to possible collection in more populated regions (Figure 4). Further investigation of the determinants of equivalence might reveal which species cannot be satisfactorily recorded by citizen scientists and thus guide species prioritization. Even so, we argue that data from citizen scientists are an important ally and can be used in distribution modelling. This is supported by the test of species distribution model equivalence, the high geographic overlap of models, and the closeness of centroids between models. Thus, data from citizen scientists are not only useful to model the distribution of most species but can also contribute to other relevant aspects (La Sorte et al., 2018; La Sorte & Somveille, 2020; Zulian et al., 2021) such as biological invasion (Encarnação et al., 2021), phenology and diversity gradients (Soroye et al., 2018; Suzuki-Ohno et al., 2017), conservation (MacPhail & Colla, 2020) and evaluating the impacts of global changes on present and future distribution patterns.

Well-planned citizen science has enormous potential for collecting much-needed data more quickly and for improving species distribution models, and can reach a level of quality that matches the data collected by experts (Hoyer et al., 2012; van der Velde et al., 2017). Our results, however, show that high quality (demonstrated by the equivalence of CIT models) can be achieved without elaborate planning or scientific supervision. Our hypothesis of data bias due to concentrated collection in more populated regions was not corroborated, which supports our argument for the usefulness of data collected by citizen scientists for species distribution modelling. In fact, one could expect high reliability whenever the requirement is solely the correct identification and location, which is the case of correlative species distribution models (e.g. Hedblom et al., 2014). In addition, some citizen science projects can persist for long periods, as they are not dependent on scarce funding, which limits sampling time (Poisson et al., 2020; Steven et al., 2019; Theobald et al., 2015). At present, it is essential to recognize that most bird species are under some level of human-mediated threat; therefore, scientific effort alone will not be able to gather species data at a desirable speed.

The mismatch between biodiversity information demand and production is even more acute in tropical regions, which harbour a significant number of bird species and lack proper historical efforts to register and share species occurrence data. Moreover, the studied species occur within the Cerrado biome, the world's most biodiverse savanna. In 35 years, more than half of the 2 million km2 biome has been converted to agriculture (Klink & Machado, 2005). These figures reinforce the Cerrado as one of the most threatened biogeographic provinces in the world, urging the use of all available information to support conservation decision-making, including citizen science data. Thus, we emphasize that, regardless of the intrinsic characteristics of the evaluated species (size, area, habitat use, threat level, etc.), the non-use of citizen science data for SDMs is unjustifiably cautious.

Although the two methods of data collection are able to generate statistically equivalent and highly overlapping distribution models, we strongly disagree that scientific data are directly replaceable. It is essential to recognize that scientific and citizen science data have strengths and limitations, and we can derive most benefit by exploring their complementarities (La Sorte et al., 2018; La Sorte & Somveille, 2020; Zulian et al., 2021). Scientific data, such as biological collection vouchers, harbour unique historical information that allows historical changes to be mapped (e.g. Marini et al., 2020; Navarro et al., 2021). In addition, scientific efforts can reach isolated areas within reserves or less populated areas (least visited by citizens), directing data production to redress information gaps. Nevertheless, we point out that part of the species distribution data present in museums worldwide is not available for public consultation, hindering spatial analyses from supporting conservation (Marini et al., 2020; Peterson et al., 2005). Indeed, there is a debate regarding maintenance cost and the resources directed to curators and museums (for further details, see Graves, 2000; Peterson et al., 2005). In contrast, citizen science data may lack standardization and might fail to present detailed information about each specimen, such as geographical coordinates, measurements and sex. Moreover, this can be aggravated for species that need to be closely assessed to determine the identification. Finally, although we are demonstrating a high equivalence of citizen-collected data for generating species distribution models, standardized scientific collection is important and should be preferred.

In summary, we have shown the equivalence of SDMs generated with data collected by scientists and citizen scientists for most of the species analysed. Although we approached the question using only one set of birds present in the Cerrado region, we indicate the potential of using citizen science data for modelling the distribution of species, mainly due to the large set of data collected, which is impracticable for scientists to collect alone. Understanding the distribution patterns of species is important and will be key to mitigating current population declines of wildlife. Conservation measures will be favoured by the union of professional and amateur data, aiming for a better understanding of species distribution and, consequently, biodiversity conservation.

AUTHOR CONTRIBUTIONS

Eduardo Guimarães Santos: Conceptualization (equal); data curation (equal); formal analysis (equal); investigation (equal); methodology (equal); validation (equal); visualization (equal); writing – original draft (equal); writing – review and editing (equal). Helga Correa Wiederhecker: Conceptualization (equal); investigation (equal); methodology (equal); validation (equal); writing – review and editing (equal). Leonardo Esteves Lopes: Data curation (equal); validation (equal); writing – review and editing (equal). Miguel Ângelo Marini: Conceptualization (equal); investigation (equal); methodology (equal); supervision (equal); writing – review and editing (equal).

ACKNOWLEDGEMENTS

We thank the Brazilian research agency ‘Conselho Nacional de Desenvolvimento Científico e Tecnológico’ (CNPq) and the Brazilian education agency ‘Coordenação de Aperfeiçoamento de Pessoal de Nível Superior’ (CAPES – Finance Code 001) for fellowships.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT

Data available in article Appendix S1.
