Volume 3, Issue 2 pp. 112-126
RESEARCH ARTICLE
Open Access

The implications of estimating rarity in Brazilian reptiles from GBIF data based on contributions from citizen science versus research institutions

基于公众科学和研究机构的贡献从GBIF数据中估算巴西爬行动物稀有性的意义

Implicações da estimativa de raridade de espécies de répteis brasileiros baseada nas contribuições da ciência cidadã versus instituições de pesquisa

Lucas Rodriguez Forti


Programa de Pós-Graduação em Ecologia e Conservação, Universidade Federal Rural do Semi-Árido, Mossoró, Rio Grande do Norte, Brazil

Contribution: Conceptualization, Data curation, Formal analysis, ​Investigation, Methodology, Supervision, Writing - original draft, Writing - review & editing

Jandson Lucas Camelo da Silva


Programa de Pós-Graduação em Ecologia e Conservação, Universidade Federal Rural do Semi-Árido, Mossoró, Rio Grande do Norte, Brazil

Contribution: Formal analysis, Writing - original draft, Writing - review & editing

Eveline Almeida Ferreira


Departamento de Biociências, Universidade Federal Rural do Semi-Árido, Mossoró, Rio Grande do Norte, Brazil

Contribution: Formal analysis, Visualization, Writing - review & editing

Judit K. Szabo

Corresponding Author


College of Engineering, IT and Environment, Charles Darwin University, Casuarina, Northern Territory, Australia

Correspondence Judit K. Szabo, College of Engineering, IT and Environment, Charles Darwin University, Casuarina, Northern Territory 0909, Australia.

Email: [email protected]

Contribution: Conceptualization, Supervision, Visualization, Writing - review & editing

First published: 17 June 2024
Citations: 1


Editor-in-Chief & Handling Editor: Ashima Campos-Arceiz

Abstract


Understanding the distribution of rare species is important for conservation prioritisation. Traditionally, museums and other research institutions have served as depositories for specimens and biodiversity information. However, estimating abundance from these sources is challenging due to spatiotemporally biased collection methods. For instance, large-bodied reptiles that are found near research institutions or in popular, easily accessible sites tend to be overrepresented in collections compared to smaller species found in remote areas. Recently, a substantial number of observations have been amassed through citizen (or community) science initiatives, which are invaluable for monitoring purposes. Given the unstructured nature of this sampling, these datasets are often affected by biases, such as taxonomic, spatial and temporal preferences. Therefore, analysing data from these two sources can lead to different abundance estimates. This study compiled data on Brazilian reptile species from the Global Biodiversity Information Facility (GBIF). It employed a community-ecology approach to analyse data from research institutions and citizen science initiatives, separately and collectively, to assess taxonomic and spatial species coverage and predict species rarity. Using a 1-degree hexagonal grid, we analysed the spatial distribution of reptile communities and calculated rarity indices for 754 reptile species. Our findings reveal that 87 species were exclusively recorded in the citizen science subset, while 212 were recorded only by research institutions. The number of observations per species in the citizen science data followed a Gambin distribution, which aligns with the expected pattern of abundance in natural communities, unlike the data from research institutions. This suggests that citizen science data may be a more accurate source for estimating species abundance and rarity.
The discrepancies in rarity classifications between the datasets were likely due to differences in sample size and potentially other sampling parameters. Nevertheless, combining data collected by both research institutions and citizen science initiatives can help to fill knowledge gaps in reptile species occurrence, thus enhancing the foundation for conservation efforts on a national scale.

摘要


了解珍稀物种的分布对保护优先具有重要意义。传统上,博物馆和其他研究机构一直作为标本和生物多样性信息的储存库。然而,由于时空有偏的收集方法,从这些来源估计丰度是具有挑战性的。例如,在研究机构附近或在受欢迎的、容易到达的地点发现的大型爬行动物,与在偏远地区发现的较小物种相比,在收集中的代表性往往较高。最近,通过公民(或社区)科学倡议积累了大量的观测数据,这些数据对于监测目的来说是非常有价值的。考虑到这种采样的非结构化性质,这些数据集通常会受到偏差的影响,例如分类、空间和时间偏好。因此,分析这两种来源的数据可以得到不同的丰度估计。本研究从全球信息生物多样性设施(GBIF)中整理了关于巴西爬行动物物种的数据。它采用群落生态学的方法,分别和共同分析来自研究机构和公民科学倡议的数据,以评估分类学和空间物种覆盖率,并预测物种稀有性。采用1度六边形网格,分析了754种爬行动物群落的空间分布,并计算了稀有性指数。我们的研究发现,有87个物种仅在公民科学子集中被记录,212个物种仅在研究机构中被记录。公民科学数据中每个物种的观测数量服从Gambin分布,这与自然群落中的预期丰度模式一致,而不是来自研究机构的数据。这表明,公民科学数据可能是估计物种多度和稀有度更准确的来源。数据集之间稀有性分类的差异可能是由于样本量和潜在的其他采样参数的差异造成的。尽管如此,结合研究机构和公民科学倡议收集的数据,有助于填补爬行动物物种出现方面的知识空白,从而为国家尺度的保护工作奠定基础。【审阅:杨毅】

Resumo


Conhecer a distribuição de espécies raras é importante para definir áreas prioritárias para conservação. Tradicionalmente, museus e outras instituições de pesquisa têm sido utilizados como acervos de espécimes e outras informações sobre biodiversidade. No entanto, é difícil estimar a abundância a partir dessa fonte de dados, considerando a tendência de métodos de coleta enviesados espacial e temporalmente. Por exemplo, répteis de grande porte que ocorrem próximos a instituições de pesquisa ou em locais populares e de fácil acesso são melhor representados em coleções científicas do que espécies menores localizadas em áreas remotas. Recentemente, um grande número de observações foi acumulado por iniciativas de ciência cidadã que podem ser usadas para biomonitoramento. Dada a amostragem não estruturada, esses conjuntos de dados também são afetados por vieses, como as preferências taxonômicas, espaciais e temporais. Portanto, analisar dados dessas duas fontes pode levar a diferentes estimativas de abundância de espécies. Compilamos dados para espécies de répteis brasileiros do Global Biodiversity Information Facility (GBIF). Utilizando uma abordagem de ecologia de comunidades, analisamos dados de instituições de pesquisa e de ciência cidadã separadamente e em conjunto para descrever a cobertura taxonômica e espacial das espécies e prever a raridade das mesmas. Utilizando uma grade hexagonal de 1 grau, analisamos a organização espacial das comunidades de répteis e calculamos índices de raridade para 754 espécies de répteis. Descobrimos que 87 espécies estavam presentes apenas no subconjunto de dados da ciência cidadã, enquanto 212 foram registradas apenas por instituições de pesquisa.
O número de observações por espécie nos dados de ciência cidadã se ajustou à distribuição de Gambin, o padrão esperado para abundância em comunidades naturais, ao contrário dos dados de instituições de pesquisa, sugerindo que o primeiro é uma fonte mais precisa para estimar a abundância e a raridade das espécies. A razão por trás das diferentes classificações de raridade provavelmente se deve a diferenças no tamanho da amostra nos dois subconjuntos de dados e possivelmente também em outros parâmetros de amostragem. No entanto, combinar dados coletados por instituições de pesquisa e iniciativas de ciência cidadã pode ajudar a preencher lacunas no conhecimento sobre a ocorrência de espécies de répteis, fornecendo melhores evidências para a conservação em escala nacional.

Plain language summary


Accurately classifying rare species is essential for guiding conservation actions. Traditional data collection for biodiversity, often based on museum specimens originating from expeditions conducted by professional scientists, does not accurately reflect true patterns of species abundance. These efforts are frequently limited by financial constraints and logistical issues that restrict spatial and taxonomic coverage. Sampling is particularly intensive in areas near professionally employed scientists' home institutions, field bases or museums. In contrast, citizen science—where members of the public engage in scientific activities—has revolutionised the way species occurrence data are collected. Over the past two decades, volunteers have increasingly contributed observations from locations around the world that are often overlooked by paid scientists, thereby generating large occurrence datasets. By combining citizen-generated observations with data from research institutions, we can enhance our understanding of the distribution of reptile species across Brazil. Our study reveals differences in the number of observations per species between the two data subsets, with citizen science providing a more accurate indication of species rarity. Therefore, citizen science can broaden our knowledge of species abundance while also supporting more effective conservation actions on a larger scale.

简明语言摘要


珍稀物种的准确分类对于指导保护行动至关重要。传统的生物多样性数据收集往往基于专业科学家进行的科学考察所获得的博物馆标本,并不能准确地反映物种丰富度的真实格局。这些努力往往受到资金限制和后勤问题的限制,这些问题限制了空间和分类的覆盖范围。在专业受雇科学家的家乡机构、野外基地或博物馆附近的地区,采样尤为密集。相反,公民科学(即公众成员参与科学活动)彻底改变了物种出现数据的收集方式。在过去的二十年中,志愿者越来越多地贡献了来自世界各地的地点的观测,而这些地点往往被有偿科学家忽略,从而产生了大量的出现记录数据集。通过将公民生成的观察与研究机构的数据相结合,我们可以增强对巴西各地爬行动物物种分布的了解。我们的研究揭示了两个数据子集之间每个物种的观测数量的差异,公民科学为物种稀有性提供了更准确的指示。因此,公民科学可以拓宽我们对物种丰富度的认识,同时也可以在更大范围内支持更有效的保护行动。

Resumo em linguagem simples


Uma classificação com precisão para espécies raras é essencial para informar as ações de conservação. Embora úteis para avançar no conhecimento da biodiversidade, os dados de presença baseados apenas em espécimes de museu, originados de expedições conduzidas por cientistas profissionais, muitas vezes não refletem os padrões naturais de abundância, pois questões financeiras e logísticas limitam a cobertura espacial e taxonômica da amostragem. A amostragem é particularmente intensiva em áreas próximas às instituições de origem dos cientistas profissionais ou às bases de campo, levando a uma amostragem tendenciosa. A ciência cidadã, na qual o público contribui com observações, revolucionou a forma como coletamos dados de ocorrência de espécies. Nas últimas duas décadas, voluntários têm contribuído com um número cada vez maior de observações de locais ao redor do mundo que geralmente não são amostrados por cientistas remunerados, gerando grandes conjuntos de dados de ocorrência. Integrar observações da ciência cidadã com dados fornecidos por instituições de pesquisa melhora nosso conhecimento sobre a distribuição de espécies de répteis no Brasil. Descobrimos que os dois subconjuntos de dados tinham um número diferente de observações por espécie e que a ciência cidadã poderia indicar com mais precisão a raridade das espécies do que os dados fornecidos por instituições de pesquisa, destacando seu potencial para aprimorar nosso conhecimento sobre a abundância de espécies e informar as ações de conservação em uma ampla escala.

Practitioner points


  • Integrating data from citizen science initiatives with those from research institutions can enhance our understanding of reptile species distribution and richness.

  • Citizen science data can be used to determine patterns of species abundance and rarity for Brazilian reptiles.

  • Reptiles that are classified as Data Deficient or Not Evaluated and show high rarity values should be prioritised for assessment by the International Union for Conservation of Nature (IUCN).

实践者要点


  • 整合来自公民科学倡议和研究机构的数据,可以增强我们对爬行动物物种分布和丰富度的理解。

  • 公民科学数据可用于确定巴西爬行动物的物种丰富度和稀有性模式。

  • 被归类为“数据缺乏”或“未评估”且稀有值较高的爬行动物,应优先由国际自然保护联盟(IUCN)进行评估。

Pontos para profissionais


  • Integrar dados fornecidos por iniciativas de ciência cidadã a dados de instituições de pesquisa pode aprimorar nosso conhecimento sobre a distribuição e riqueza de espécies de répteis.

  • Os padrões de abundância e raridade das espécies podem ser calculados a partir de dados de ciência cidadã para répteis brasileiros.

  • Espécies com dados deficientes e não avaliadas globalmente, com valores elevados de raridade, devem ser priorizadas para avaliações da IUCN.

1 INTRODUCTION

In the face of the current global biodiversity crisis, understanding species distributions and population sizes is increasingly critical (Hortal et al., 2015; Nori et al., 2023). Data on species distribution can elucidate the environmental requirements of a species by examining its fundamental niche (Kearney, 2019; Takola & Schielzeth, 2022). When a species is suspected to be declining, these data are key for informing conservation actions (White et al., 2023). Many studies have focused on this issue to inform decision-making in this area (Kondratyeva et al., 2019; Loiseau et al., 2020). The need to systematically assess the condition of wildlife populations and related threats on a global scale was first recognised in the 1960s (Mace, 1994), leading to the publication of the IUCN Red List of Threatened Species in 1991. This list is based on the systematic assessment of species extinction risk (Mace et al., 2008). Currently, taxa are assigned into one of eight categories (from Least Concern to Extinct) based on geographic range, population trend, size and structure, as well as their temporal trends (IUCN Standards and Petitions Committee, 2022). Several of these criteria are linked to population size and focus on aspects such as decline (criterion A), ongoing decline or extreme fluctuations (B and C), and range, including severe fragmentation (B) and the extent of occurrence or area of occupancy (B and C; IUCN Standards and Petitions Committee, 2022). Nevertheless, a large number of species worldwide have yet to be evaluated, particularly in megadiverse countries, due to the absence of data, restricted data access, and inaccuracies in datasets concerning population size and geographic distribution (Hochkirch et al., 2020). Furthermore, insufficient funding often drastically limits the rate of such assessments (Juffe-Bignoli et al., 2016; Rondinini et al., 2014). 
Conservationists need efficient analytical tools to provide evidence of high extinction risks and to prioritise actions that reduce the number of Not Evaluated and Data Deficient species. Species that are poorly understood are frequently overlooked when allocating conservation resources (Woinarski et al., 2021). Considering that species most vulnerable to extinction are often naturally rare (Harnik et al., 2012), identifying such species can serve as an indicator of potential threats and can flag species as conservation priorities (Gauthier et al., 2010; Veach et al., 2017).

Although rarity is often intuitively interpreted as a species having few individuals, it is, in reality, a complex, multidimensional concept. To address this complexity, Rabinowitz (1981) proposed seven types of rarity based on three properties: (1) geographic distribution (widespread vs. narrow-ranged species), (2) habitat specificity and (3) local population size (abundance-based rarity), along with their various combinations. For instance, a rare species might be characterised by a small population that is widely dispersed geographically or by an abundant population restricted to a limited habitat area (Rabinowitz, 1986). This classification framework is still frequently used, for instance, for plants at regional (Quiroga & Souto, 2022) and national levels (Choe et al., 2019) and for deep-sea bivalves (McClain, 2021), serving as a base for further theoretical work (Maciel, 2021). Under nonexperimental conditions, the uneven distribution of individuals among species is a common pattern observed in biological communities (Magurran, 2004). Typically, a few species are dominant while many others are rare, causing the abundance distribution curve for communities to fit a near-logarithmic shape (McGill et al., 2007). This abundance distribution pattern can be considered a natural power law for biodiversity datasets and can be explained by the different abilities of species to access resources in a given space (Marquet et al., 2007). However, inaccurate or incomplete knowledge of species rarity can lead to erroneous allocation of resources or impede conservation actions (Dibner et al., 2017; Katzner et al., 2011). This is particularly concerning in countries with high biodiversity that are often poorly inventoried. 
Therefore, there is a growing demand for increased biodiversity surveying, especially in nations that harbour global biodiversity hotspots facing threats from escalating deforestation pressure and the effects of climate change (Habel et al., 2019; Kong et al., 2021). In response to this urgent demand, more efficient methods need to be adapted to aid in the understanding of biodiversity distribution and trends across extensive geographic scales. Data produced by citizen science have been making substantial contributions to biodiversity monitoring at the global scale (Chandler et al., 2017; Johnston et al., 2023; Mesaglio et al., 2023). These efforts have also influenced public policy (Fritz et al., 2019; Roger et al., 2023) and raised awareness among both the public and policymakers (Danielsen et al., 2014). With the widespread adoption of internet-enabled smartphones and the development of user-friendly applications to record biodiversity, members of the public, working in a nonprofessional and unpaid capacity, have been documenting the location of species worldwide (Deacon et al., 2023; Pocock et al., 2024; Tulloch, Possingham, et al., 2013). These records are often accompanied by photographic and video documentation, providing valuable secondary information (Klinger et al., 2023; Pernat et al., 2024). Nevertheless, the spatial adequacy (Backstrom et al., 2024) and overall quality of citizen-science data are highly variable, influenced by the heterogeneous behaviour of the observers (Callaghan, Poore, Hofmann et al., 2021; Pocock et al., 2023) and the accuracy of observations (Gorleri & Areta, 2022; Gorleri et al., 2023). Spatiotemporal biases, particularly those related to observer behaviour, can be identified by comparing unstructured (or semi-structured) data with results from structured surveys (Balázs et al., 2021; Szabo et al., 2012). 
This comparison facilitates the calibration of different datasets, making them suitable for trend analysis and enhancing their reliability (Forti et al., 2024; Hertzog et al., 2021). In fact, the integration of citizen-science data and structured surveys has been shown to offer effective complementary insights (Dimson et al., 2023; Robinson et al., 2020; Tulloch, Mustin, et al., 2013).

The Global Biodiversity Information Facility (GBIF) collates georeferenced species occurrence data from a variety of sources or research institutions, including academic institutions, government research facilities, museums, herbaria, as well as various fauna and flora inventories (henceforth referred to as RI data). A second stream of data originates from citizen-science initiatives (henceforth referred to as CS data). GBIF currently hosts over 1.5 billion records for taxa from around the globe (gbif.org). Despite this extensive database, GBIF data are not without limitations, some of which are inherent to the database itself (i.e. biases originating from the amalgamation of datasets collected through different methods), and others that stem from the incoming data, such as taxonomic bias, identification errors and incorrect or missing geographic coordinates (Petersen et al., 2021; Troudet et al., 2017). For smaller datasets, some of these issues can be mitigated by manual data cleaning, while for larger ones, automated filtering techniques are applied (Zizka et al., 2020). Additionally, the quality of different taxonomic and geographical subsets varies significantly (Szabo et al., 2023). In spite of these limitations, GBIF provides a viable alternative to conducting surveys of ecological communities at large spatial scales, particularly in biodiverse countries with limited scientific knowledge and resources (Heberling et al., 2021; Ivanova & Shashkov, 2021).

In regions such as Europe, the United States and Australia, the volume of data generated through citizen science has contributed substantially to the assessment of population trends across various taxa, including snakes (Santos et al., 2022), bats (Barlow et al., 2015) and birds (Fink et al., 2020; Szabo et al., 2010). In Brazil, by contrast, the integration of citizen science into biodiversity research is still at an early stage. Despite this, certain taxonomic groups, particularly birds and amphibians, have received disproportionately high interest (Forti & Szabo, 2023). Citizen science has notably advanced our understanding of reptile distribution in Brazil, revealing previously unknown areas of occurrence (Oliveira et al., 2023). Yet, in spite of these advances, citizen science data have not been formally incorporated into decision-making processes in the country. An important initial step in leveraging these data for conservation and policy-making is to evaluate whether observations from citizen science can accurately reflect population sizes and species distributions and whether they can reliably classify species as ‘rare’ or ‘common’.

Reptiles represent one of the most diverse animal groups on the planet, and Brazil stands as a significant biodiversity hotspot, holding the third-highest diversity of reptiles globally with 856 species, a number that continues to rise as new species are described each year (Guedes, Entiauspe-Neto, et al., 2023). Within these species, there is a broad range of environmental tolerances. Some species exhibit wide environmental tolerance, enabling them to occupy diverse habitats, while others, with specialised habits, can only survive within narrowly defined environmental conditions and thus have restricted distributions (Birskis-Barros et al., 2019). In spite of their diverse adaptations, reptiles are threatened worldwide due to habitat loss caused by expanding agriculture, deforestation and urban development, as well as illegal exploitation (Böhm et al., 2013). Consequently, 23.5% of reptile species are at some risk of extinction globally, with the figure standing at 14.4% in Brazil (IUCN, 2024). A high proportion (73%) of Near Threatened and threatened Brazilian species are classified under Criterion B concerning geographical range and population trends (IUCN, 2024).

Reptiles are integral to diverse trophic interactions within ecosystems. Through their activities as grazers, browsers, apex predators and scavengers, they play important roles in trophic networks, facilitating the balance and functioning of these systems (Pinto-Coelho et al., 2021). Beyond these roles, reptiles also contribute to other ecological processes, including seed dispersal, pollination, nutrient cycling and ecosystem engineering by creating habitats such as burrows and pools, which serve other species (Miranda, 2017). The socioeconomic importance of reptiles is equally notable. They contribute to tourism (Cohen, 2019), their bioactive compounds are used in pharmacological research (Mishra et al., 2020), and in Brazil, they also serve as a protein source for rural communities (Cajaiba et al., 2015). Considering their ecological and socioeconomic roles, coupled with the threats they face, the conservation of reptiles should be prioritised, particularly in tropical countries where biodiversity is rich, and the impacts of biodiversity loss can be profound (Miranda, 2017).

Data deposited in research institutions often originate from localised studies that focus on one particular species or on a small number of related species. Due to variations in the design and aims of these studies, certain species may be overrepresented, while many others remain neglected, leading to potential biases in the data collected (Meineke & Daru, 2021). In contrast, citizen science initiatives often employ gamified apps to motivate volunteers to collect observations of a diverse array of species across larger geographic scales (Callaghan, Poore, Mesaglio, et al., 2021; Sandbrook et al., 2015). As representation often reflects availability, rare or less detectable species will have fewer observations than dominant and conspicuous species (Johnston et al., 2018). Considering that RI and CS biodiversity data are generally collected using differing methodologies (aims, design, and scale), we hypothesised that the two datasets would yield different relative species abundance estimates. In particular, we predicted that CS data would provide more accurate estimates of species relative abundance than data contributed by research institutions. We tested this hypothesis through a community ecology approach, using the number of observations (records) as a proxy for species abundance in local communities to compare species abundance and rarity estimates based on reptile occurrence data from Brazil, as presented in GBIF, contributed by CS versus RI. We also discuss the limitations of GBIF data, particularly in relation to the biases associated with these two types of data contribution.

2 METHODS

2.1 Data collection and organisation

We downloaded data on reptile occurrences in Brazil from GBIF (https://doi.org/10.15468/dl.j7ajhx) on 30 October 2023. Following the taxonomy and species distribution in the Reptile Database (http://www.reptile-database.org/), we verified species names and their classification as native or exotic to Brazil, considering them exotic if the country was not listed in their native distribution. To ensure data accuracy, we removed observations that displayed taxonomic inconsistencies (e.g. non-recognisable synonymy). We also eliminated duplicate observations of the same species that occurred at the same geographic location on the same day using the distinct function of the dplyr package (Wickham et al., 2022) in R version 4.2.1 (R Core Development Team, 2022).
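The duplicate-removal rule described above (same species, same location, same day) can be illustrated with a minimal Python sketch. This is not the authors' code, which used dplyr's distinct function in R; the field names (species, lat, lon, date) and the toy records are assumptions made purely for illustration.

```python
from typing import Iterable


def drop_duplicate_observations(records: Iterable[dict]) -> list[dict]:
    """Keep the first record for each (species, lat, lon, date) combination."""
    seen = set()
    unique = []
    for rec in records:
        # Hypothetical field names; real GBIF exports use their own column names.
        key = (rec["species"], rec["lat"], rec["lon"], rec["date"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Toy records (invented for illustration only).
records = [
    {"species": "species_a", "lat": -5.19, "lon": -37.34, "date": "2021-03-02"},
    {"species": "species_a", "lat": -5.19, "lon": -37.34, "date": "2021-03-02"},
    {"species": "species_a", "lat": -5.19, "lon": -37.34, "date": "2021-03-05"},
    {"species": "species_b", "lat": -5.19, "lon": -37.34, "date": "2021-03-02"},
]
deduplicated = drop_duplicate_observations(records)
```

In this toy input only the second record is dropped: the third differs in date and the fourth in species, so both are retained.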

We classified observations as originating from CS when the institutional code was (1) BioDiversity4All, (2) Diveboard, (3) iNaturalist or (4) naturgucker. All other observations were designated as RI data. To organise the data spatially, we used the geographical coordinates of each observation and aggregated them using a hexagonal grid generated over the extent of Brazil (in angular geographic coordinates, EPSG 4674). The grid was configured with a horizontal and vertical spacing of 1 degree each, resulting in 1188 grid cells. We overlaid reptile records from GBIF on the grid using QGIS v. 3.28.5 (QGIS Development Team, 2021). Using these spatial units, we assigned a unique grid cell ID to each reptile observation. We adopted a community ecology approach, defining each grid cell as a local community based on observations submitted from the same grid cell. Within this framework, the number of observations per species in each community was used as a proxy for the relative abundance of a particular species (Callaghan et al., 2024). After this spatial organisation, we excluded non-continental observations (i.e. those submitted from oceanic locations or from oceanic islands) when calculating rarity indices. Therefore, we did not calculate rarity indices for three island endemic species (Amphisbaena ridleyi, Bothrops insularis and Bothrops sazimai). Finally, to assess the representativeness of observations across different biomes, we overlaid species occurrence data with a layer representing the six major Brazilian biomes: Amazonia, Atlantic Forest, Caatinga, Cerrado, Pampa and Pantanal, according to the Brazilian Institute of Geography and Statistics (IBGE, 2019). This analysis enabled us to determine the frequency of each species within these biomes.
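As a hedged illustration of the CS versus RI split, the rule can be written as a simple lookup on the institutional code. The four codes come from the text; the function name, the exact-match comparison and the example museum code are assumptions, and the authors' actual implementation is not described in this detail.

```python
# Institutional codes treated as citizen science (CS), as listed in the text.
CITIZEN_SCIENCE_CODES = {"BioDiversity4All", "Diveboard", "iNaturalist", "naturgucker"}


def classify_source(institution_code: str) -> str:
    """Return 'CS' for the four listed codes and 'RI' for everything else.

    Assumes an exact, case-sensitive match after trimming whitespace.
    """
    return "CS" if institution_code.strip() in CITIZEN_SCIENCE_CODES else "RI"


# "EXAMPLE_MUSEUM" is a hypothetical code used only to show the RI branch.
labels = [classify_source(code) for code in ["iNaturalist", "EXAMPLE_MUSEUM", " naturgucker "]]
```

Because every code outside the list defaults to RI, a misspelt or differently capitalised citizen science code would be counted as research-institution data, so real codes should be inspected before applying a rule like this.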

We prepared the data for analysis by compiling three ecological matrices, with species represented as columns and grid cell IDs (local communities) as rows. We created separate tables for the two subsets: one comprising observations from citizen science initiatives and another containing observations from professional researchers only. We also prepared a combined table using the full data set. We estimated the probability of each species being classified as common or rare based on the species abundance and number of grid cells occupied. Typically, a rare species is expected to have a low number of observations (low abundance) and to be absent from most local communities. We used the fuzzy clustering algorithm from the FuzzyQ package to quantify community-level coherence in the classification of species into common and rare clusters (Balbuena et al., 2021). This method simultaneously evaluates the dissimilarities in occupancy and abundance, producing indices of commonness (Ci) and rarity (Ri). These indices are derived from dissimilarity indices that reflect the probability of a given species, denoted as species i, being categorised as common and rare, respectively (Gower, 1971).
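The community matrices and the two quantities that feed the common versus rare classification (number of grid cells occupied and number of observations) can be sketched as follows. This is an illustrative Python sketch with invented cell and species labels; it does not reproduce the FuzzyQ clustering itself, only the kind of occupancy and abundance summary such a method starts from.

```python
from collections import defaultdict


def build_community_matrix(observations):
    """Return {cell_id: {species: count}} from (cell_id, species) pairs."""
    matrix = defaultdict(lambda: defaultdict(int))
    for cell_id, species in observations:
        matrix[cell_id][species] += 1
    return {cell: dict(counts) for cell, counts in matrix.items()}


def occupancy_and_abundance(matrix):
    """Per species: number of cells occupied and total number of observations."""
    summary = defaultdict(lambda: {"cells": 0, "observations": 0})
    for counts in matrix.values():
        for species, n in counts.items():
            summary[species]["cells"] += 1
            summary[species]["observations"] += n
    return dict(summary)


# Toy observations (hypothetical labels): one (grid cell, species) pair per record.
observations = [
    ("cell_1", "species_a"), ("cell_1", "species_a"), ("cell_1", "species_b"),
    ("cell_2", "species_a"), ("cell_3", "species_a"),
]
matrix = build_community_matrix(observations)
summary = occupancy_and_abundance(matrix)
```

In this toy example, species_a occupies three cells with four observations, whereas species_b occurs in one cell with a single observation, which is the profile expected of a rare species under the definition given above.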

Having obtained fuzzy quantification for each species present in each subset, we proceeded to identify species that were allocated to different clusters. Next, using cell ID, species name, geographical coordinates, biome and abundance per grid cell, we calculated four Rabinowitz rarity indices (GRI, HSI, PSI and RR) for each species across the two subsets and the full data set. These calculations were conducted using the rrindex package (Maciel, 2021). These indices are based on three dimensions of rarity: geographic range index (GRI), habitat specificity index (HSI) and population size index (PSI). We considered the number of biomes occupied by each species as a measure of habitat specificity and the absolute number of observations per grid cell as an indicator of species abundance. The fourth index calculated using this package was the rarity index (RR), which is the central axis of these dimensions, representing a synthesis of the three other indices calculated as RR = med(GRI+HSI+PSI) (Maciel, 2021).

We compiled data on global threat status (IUCN, 2024) using the rredlist package (Chamberlain, 2020). We cross-referenced these data with entries in the GBIF database and directly consulted the IUCN Red List website for taxa the package could not categorise (https://www.iucnredlist.org/). The five criteria used by IUCN are based on geographical range and population size (IUCN Standards and Petitions Committee, 2022). Generally, species that are more threatened are also rarer than those classified as Least Concern. For the purpose of this study, we classified the threat status of each species as (1) non-threatened (i.e. Least Concern) or (2) Near Threatened and threatened (Vulnerable, Endangered, and Critically Endangered). We included four species classified as Lower Risk/Near Threatened and two Lower Risk/Conservation Dependent species in the second category. Unfortunately, we were unable to find information for 177 species, and 21 were categorised as Data Deficient. These species were excluded from the threat category calculations. To represent the relationship between rarity and commonness visually, we plotted rarity-commonness indices on a scatterplot using the full data set. We computed species completeness based on the latest list of Brazilian reptiles (Guedes, Entiauspe-Neto, et al., 2023).

2.2 Data analysis

We compared the proportion of common species in the two (CS and RI) subsets using a χ²-test through the chisq.test function in R. To assess the consistency of species classification between the two subsets and the full data set (see Table S1), we performed multiple correlation analyses. We used the Spearman method through the cor.test function in R to evaluate the relationships between the two subsets and the full data set, focusing on the commonness index and the four Rabinowitz rarity indices.
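For readers unfamiliar with the Spearman method, the following Python sketch shows the underlying calculation: values are converted to ranks (ties receive the average rank) and a Pearson correlation is computed on the ranks. It is an illustration only, not the cor.test implementation, it returns no p-value, and the index values used are invented.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Hypothetical commonness indices for the same four species in two subsets.
commonness_cs = [0.9, 0.2, 0.6, 0.4]
commonness_ri = [0.8, 0.1, 0.7, 0.3]
rho = spearman_rho(commonness_cs, commonness_ri)
```

Because only the rank order matters, the two toy subsets correlate perfectly even though the index values differ.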

As an alternative evaluation of the quality of the two subsets for patterns of species abundance, we tested their Gambin model distribution fit. Gambin is a stochastic model that mixes gamma distribution with a binomial sampling method (Matthews et al., 2014). According to empirical tests, Gambin distribution provides a superior fit to species abundance distributions when compared to other classic models (Ugland et al., 2007). Therefore, Gambin distribution is very useful in describing ecological communities with species abundance curves represented by common species and a long tail of rare species, a pattern frequently observed in nature. We used the fit_abundances function of the gambin package (Matthews et al., 2014) to test the fit of species abundance patterns to the Gambin model for both subsets. This method provides an α-value, a parameter also used as a diversity metric reflecting the complexity of a community's interactions with its environment (Ugland et al., 2007). Based on the logic that threatened and Near Threatened species are rarer than Least Concern species, we used two generalised linear models (GLM) to test whether the indices of commonness differ between Least Concern versus Near Threatened and threatened species for citizen- versus professional-collected data. This was done by applying the glm function with a Gamma error distribution, chosen after assessing the distribution of the data.

We checked spatial bias in the two subsets by calculating the number of observations (considering all reptile species) per grid cell in each subset. We then analysed the difference in observation counts between the two subsets for each grid cell. We also tested for spatial bias among biomes by calculating the expected number of observations per biome based on the proportional size (in km²) of each biome, and compared the expected counts to the observed numbers using the chisq.test function in R.
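The area-based expectation can be made concrete with a short sketch. The Python below is illustrative only (the study used chisq.test in R); the function names are hypothetical, and any biome areas passed in are supplied by the user. It computes expected counts proportional to area, the χ² statistic, the degrees of freedom, and standardised residuals of the form (O − E) / √(E(1 − p)), where p is the biome's share of the total area. No p-value is computed.

```python
from math import sqrt
from typing import Dict, Tuple


def area_expected_counts(observed: Dict[str, int],
                         area_km2: Dict[str, float]) -> Dict[str, float]:
    """Expected observations per biome if effort were proportional to area."""
    total_obs = sum(observed.values())
    total_area = sum(area_km2[b] for b in observed)
    return {b: total_obs * area_km2[b] / total_area for b in observed}


def chi_square_gof(observed: Dict[str, int],
                   area_km2: Dict[str, float]
                   ) -> Tuple[float, int, Dict[str, float]]:
    """Return (chi-square statistic, degrees of freedom, standardised residuals)."""
    expected = area_expected_counts(observed, area_km2)
    total_obs = sum(observed.values())
    statistic = sum((observed[b] - expected[b]) ** 2 / expected[b] for b in observed)
    df = len(observed) - 1
    residuals = {}
    for b in observed:
        p = expected[b] / total_obs  # expected proportion for this biome
        residuals[b] = (observed[b] - expected[b]) / sqrt(expected[b] * (1 - p))
    return statistic, df, residuals
```

A negative residual marks a biome with fewer observations than its area would predict (underrepresented), and a positive residual marks an overrepresented biome.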

To analyse species diversity within the identified communities, we constructed rarefaction curves based on the two subsets and the full data set. These curves were generated using the specaccum function with the rarefaction method provided by the vegan package in R (Oksanen et al., 2020). We used these curves to assess whether species coverage was satisfactory at the national scale: a clear asymptote (a curve that levels off as sampling increases) indicates that further sampling is unlikely to yield additional species. We correlated species diversity between the two subsets across grid cells using Spearman correlation, implemented through the cor.test function in R. We produced graphs using the R base plot function and the ggplot2 package (Wickham, 2016).
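The idea behind a rarefaction curve can be illustrated with the classical individual-based formula, in which the expected number of species in a random subsample of n observations is the sum over species of 1 − C(N − Nᵢ, n) / C(N, n), where N is the total number of observations and Nᵢ the number for species i. The Python sketch below is an assumption-laden illustration of that formula, not a reimplementation of vegan's specaccum; the function names are hypothetical.

```python
from math import comb
from typing import List, Sequence


def rarefied_richness(counts: Sequence[int], n: int) -> float:
    """Expected number of species in a random subsample of n observations."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total number of observations")
    denom = comb(total, n)
    # comb(total - c, n) is 0 when fewer than n observations belong to other
    # species, so such a species is certain to appear in the subsample.
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)


def rarefaction_curve(counts: Sequence[int], step: int = 1) -> List[float]:
    """Expected richness at subsample sizes 0, step, 2*step, ... up to the total."""
    total = sum(counts)
    return [rarefied_richness(counts, n) for n in range(0, total + 1, step)]
```

A curve whose final values still rise noticeably, as reported for all three data sets here, suggests that additional observations would keep adding species. For data sets with tens of thousands of observations, a log-gamma formulation would be faster than exact binomial coefficients.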

3 RESULTS

3.1 Overview of reptile GBIF data from Brazil

Based on GBIF data, we identified 754 reptile species within Brazil's geographical boundaries. After the exclusion of duplicates and potential misidentifications, these species were represented by 42,580 observations, covering 82.6% of the total species recognised as native to Brazil. Among the 43 families recorded in the database, Colubridae had the highest representation (13,879 observations), while nonnative families (Acrochordidae, Bolyeriidae, Chamaeleonidae, Phrynosomatidae, Platysternidae, Psammophiidae and Varanidae) were represented by a single observation each. The database was mainly composed of Squamata (91%), while Testudines and Crocodylia accounted for only 5% and 4%, respectively.

The number of observations has increased substantially since the 1980s, particularly in the last 20 years, reaching a peak in 2022 (Figure 1). Historical data revealed the oldest observation from a research institute dates back to 1880, featuring a South American ground lizard (Ameiva ameiva) in the city of São Paulo. The earliest record from citizen science was a House gecko (Hemidactylus mabouia) logged in 1970 in the city of Rio de Janeiro.

Figure 1. Number of reptile observations in Brazil in the Global Biodiversity Information Facility between 1800 and 2023, comparing the amount of citizen science data (green) and observations by professional scientists (orange).

There were 24,828 observations (58.3%) and 17,756 observations (41.7%) in the RI and CS subsets, respectively. iNaturalist was the largest source of citizen-science data (99.2%). The Argentine black and white tegu (Salvator merianae) was the most observed species within the CS dataset (1464 observations), while the Amazon lava lizard (Tropidurus torquatus) had the most (1388) RI observations.

Native species accounted for 35,199 observations, including 5683 observations of 316 species endemic to Brazil. Among these endemic species, the Neotropical lava lizard (Tropidurus hispidus) had the highest number of observations (644). We also identified 66 exotic species in the database, represented by 1702 observations. The most common exotic species was the House gecko, with 1324 observations.

Considering the full data set (i.e. 436 species that were reported in both subsets), we classified 279 species as rare (Table S1). Ridley's worm lizard (Amphisbaena ridleyi), Noronha skink (Trachylepis atlantica) and the Endangered Calango (Tropidurus psammonastes) had the highest rarity values with regard to geographical range criteria (GRI), while the general values of rarity (RR) were highest for the Endangered Dumeril's worm lizard (Leposternon octostegum), and two other lizard species: Caparaonia itaiquara and Tropidurus pinima. Ameivula mumbuca, a species of teiid lizard, had the lowest commonness index (Ci).

3.2 Comparing species rarity between the two subsets

Among the 1188 hexagonal units of the grid, 308 contained no recorded reptile observations in the GBIF database. We identified a strong negative correlation between rarity and commonness indices across the full data set (ρ = –0.7564968, p-value < 2.2 × 10⁻¹⁶). While most threatened and Near Threatened species had higher RR and lower Ci values, some Near Threatened or threatened species were still classified as common based on the evaluation of the full data set (Figure 2a). Although the proportion of common species did not significantly differ between the two subsets (28% and 34%, respectively; χ² = 1.7603, p-value = 0.1859), 81 species were classified differently based on the two subsets (Table S1). Furthermore, the between-subset correlation coefficients for both the rarity index and the commonness index were <0.70, and the index values differed for many species between the subsets (Figure 2b–f). Correlation analyses of the indices that make up the Rabinowitz rarity dimensions (GRI, HSI, PSI and RR) showed that the greatest disparity between the subsets was in the population size index, which had the lowest correlation coefficient (ρ = 0.3324617; p-value = 9.626 × 10⁻¹²), followed by the commonness index (ρ = 0.6491; p-value < 2.2 × 10⁻¹⁶), rarity index (ρ = 0.6656551; p-value < 2.2 × 10⁻¹⁶), habitat specificity index (ρ = 0.7155195; p-value < 2.2 × 10⁻¹⁶) and geographic range index (ρ = 0.7700875; p-value < 2.2 × 10⁻¹⁶).

Figure 2. (a) Relationship between rarity and commonness indices for Brazilian reptile species based on the Global Biodiversity Information Facility database; black dots represent Least Concern species, red dots represent Near Threatened and threatened (VU, EN and CR) species, and grey dots are unclassified species. The other panels correlate indices calculated from citizen science (on the x axis) and professional-collected datasets (on the y axis) for (b) habitat specificity index (HSI), (c) geographic range index (GRI), (d) population size index (PSI), (e) rarity index (RR) and (f) commonness index (Ci) values. The red dashed lines at 0.5 on the last graph delimit rare (<0.5) and common (>0.5) species.

While RI data did not fit the Gambin distribution (AIC = 2837.02; α = 1.250859 to 5.550947; χ² = 17.344; df = 5; p-value = 0.004), CS data had a good fit (AIC = 2224.811; α = 1.836133 to 4.833614; χ² = 8.594; df = 5; p-value = 0.126). The commonness index (Ci) for threatened and Near Threatened species was consistently lower than for Least Concern species in both subsets, with statistical significance in both the CS (AIC = –226.1, t-value = –2.246, p = 0.0252, n = 406) and RI data (AIC = –72.722, t-value = –3.083, p = 0.00219, n = 406).

3.3 Spatial bias and species completeness in the two subsets

Data contributed by RI contained 81% of the total number of species reported from Brazil, while CS data contained only 63%. The former provided more observations, particularly from the south and southeast of the country, while the latter had a larger contribution in the northeast (Figure 3). Both subsets provided good species coverage from the central-western region. Species richness calculated from the two subsets did not show a significant correlation at the grid cell level (ρ = 0.069; p-value = 0.086).

Figure 3. Spatial distribution of reptile observations in Brazil based on the number of observations from citizen science and traditional survey data subsets of the Global Biodiversity Information Facility (GBIF) database.

The Atlantic Forest was the best-represented biome, with 15,543 observations, while the Pantanal only had 2109 observations. A total of 3929 observations were submitted from oceanic islands. The distribution of observations per biome differed significantly from the expected values, which were calculated based on the size of the area (χ² = 318,459, df = 5, p-value < 2.2 × 10⁻¹⁶). The Amazon, Caatinga and Cerrado were underrepresented, with standardised residuals of –178.71847, –11.72771 and –36.50577, respectively, while the Atlantic Forest, Pampa and Pantanal were overrepresented, with standardised residuals of 536.59507, 149.28487 and 51.20920, respectively.

Considering the two subsets, 212 species were exclusively observed in the RI subset, while 87 species were unique to the CS subset. Thus, with 695 species, species richness was estimated to be higher in the professional scientist subset compared to the citizen scientist subset (542 species). Neither of the rarefaction curves for the two subsets nor that of the full data set presented a well-defined asymptote (Figure 4).

Figure 4. Comparison of species richness interpolation curves of reptiles in Brazil considering the full Global Biodiversity Information Facility data set (in purple), and the two subsets, traditional survey (in orange) and citizen science data (in green).

4 DISCUSSION

While traditional surveys conducted by experts are indispensable for advancing knowledge of taxonomy, biology and geographical coverage of species, RI data, such as those available in natural history collections, usually do not suffice to determine species abundance patterns or rarity at larger scales. This limitation largely stems from inherent biases associated with particular research objectives of professional researchers or logistical constraints (Isaac & Pocock, 2015). For example, many taxonomists go to remote places to describe new species from poorly studied regions (Brito et al., 2021; Kennedy et al., 2019), focusing predominantly on collecting potentially new taxa. Similarly, population ecologists may concentrate on monitoring and sampling specific species, often overlooking others present at the study site. Such biases in RI data can lead to an overrepresentation of certain species while neglecting others. In the context of Brazilian reptiles, research attention is often skewed towards larger species, particularly those whose geographic ranges coincide with the locations of institutions housing experts (Guedes, Moura, et al., 2023). As a result, observations of specific species disproportionately contribute to natural history collections, independent of the actual abundance or distribution range of these species.

In contrast, CS data, despite being known to have spatiotemporal (Bowler et al., 2022; Di Cecco et al., 2021) and species trait biases (Callaghan, Poore, Mesaglio, et al., 2021; Marcenò et al., 2021), typically provide a more comprehensive picture of species abundance and distribution range. These biases significantly influence species detectability among observers, affecting the number of observations each species may have (Szabo et al., 2012). Our findings suggest that CS data provide more accurate estimates of species abundance or rarity than RI data. Nevertheless, neither data set reflects “real” abundances perfectly. The disparity in rarity classification between the two subsets is likely due to the limited capability of RI data to estimate population sizes accurately. This conclusion is supported by the observation that, unlike CS data, RI data did not fit the Gambin distribution, which represents the expected natural pattern for species abundance (Matthews et al., 2019). While geographic range data are well-documented for most terrestrial vertebrates, obtaining accurate population size data remains challenging for many reptile species (Ficetola et al., 2018). This issue is reflected in the relatively high number of reptile species classified under IUCN criterion B (geographic range) and in the relatively high correlation between the two datasets for the geographic range index but not for the population size index.

Although the use of observations from CS appears to offer fewer constraints for inferring species abundance compared to data from RI, we need to highlight the potential risk of false negatives (species that were present but remained undetected) and false positives (misidentifications or recording species that were in fact absent) related to data from citizen science initiatives (Gorleri et al., 2023; McDonald & Hodgson, 2021). Despite these challenges, integrating data from various sources is known to improve data quality (Brown & Williams, 2019; McDonald & Hodgson, 2021). Nevertheless, more evidence from empirical tests using robust estimators of population size among species is necessary to validate whether integrating data sources can improve abundance data for Brazilian reptiles.

The integration of different data sources has improved our understanding of species richness and taxonomic coverage in our data set, highlighting the benefits of spatial complementarity. Despite these advances, the rarefaction curve for the full data set did not reach a satisfactory asymptote. The RI data set provided a more extensive list of species, with 212 (28% of all Brazilian reptile species in GBIF) species that were exclusive to this data set. For instance, the only records of the colubrid Zonateres lanei were seven museum specimens. Among these 212 species, only six were classified as common: the Spotted ground-snake (Adelphostigma occipitalis), Yellow head mussurana (Boiruna maculata), Keeled sepia snake (Dryophylax hypoconia), Amazon coastal house snake (Dryophylax nattereri), Dark blind snake (Liotyphlops beui) and Coastal house snake (Mesotes strigatus). Among the 203 rare species, 18 were classified as threatened and three as Near Threatened (Table S1).

Conversely, 87 reptile species (12% of the total species in GBIF) were only present in the CS subset, making it highly valuable. All of these species were classified as rare, with nine of them listed as globally threatened and one as Near Threatened. This underscores the significant role that volunteer observations can play in capturing data on rare and globally threatened species (Báthori et al., 2022; Fontaine et al., 2022; Tiralongo et al., 2020). Our study identified certain grid cells that had a larger contribution from the CS subset than from the RI subset, especially in the Northeast (Figure 3). While the combination of RI and CS data improved the total coverage of reptile species at the national scale (GBIF includes information on 82.6% of the reptile species in Brazil), there is still potential for improvement. The rarefaction curve has not yet reached a plateau, suggesting the species count could increase with further sampling efforts, especially in understudied regions. Spatial gaps are primarily located in the Amazon Basin, where increased research effort is required to improve our understanding of species abundance and distribution patterns. These regions should be prioritised for attention by academic experts, and the involvement of organised citizen-science initiatives could contribute to filling these gaps (Brooks et al., 2023). An integrated approach that combines the efforts of professionals and volunteers in structured projects (including a programme for training citizens) could result in a highly effective strategy for increasing data availability in these under-researched regions (Callaghan et al., 2019).

Citizen science initiatives typically produce more observations in urban landscapes (Tulloch & Szabo, 2012), and our results indicate that such sources are currently the main contributors of occurrence data on reptiles in Brazil, having proportionally surpassed RI data since 2019 (Figure 1). The spatiotemporal differences in RI and CS data were also reflected in differences in the most observed species. We can also infer the distribution of exotic species, as GBIF contains data on 61 exotic species. Exotic species can affect native biodiversity negatively, and citizen science is considered an effective tool to monitor their distribution and trends (Encarnação et al., 2021; Johnson et al., 2020; Phillips et al., 2021). The abundance of CS data supports macroecological research (Altwegg & Nichols, 2019). For example, many studies have used citizen science data to build species distribution models, particularly for birds and mammals (Feldman et al., 2021). Despite the data available, other vertebrate groups have received less attention (Feldman et al., 2021). Based on GBIF data, citizen science can contribute to more accurate distribution models for some species (Robinson et al., 2020), such as the House gecko, which has over 1000 observations in Brazil.

The unprecedented global biodiversity crisis underscores the urgency of identifying and cataloguing species (Tilker et al., 2020), a task made more challenging by the widespread lack of information on population sizes for most species (Kindsvater et al., 2018). In this context, our study has provided valuable information on biases in large-scale public data on species abundance, which can guide the use of these data in ecological science and conservation decision-making (Johnston et al., 2023). We have demonstrated that CS data can be an important source for obtaining reptile species abundance patterns and rarity in Brazil. Nevertheless, biases need to be recognised and accounted for. CS and RI data differ in estimating population size, which can affect the accuracy of classifying rare species. In spite of these challenges, we recommend the integration of these two types of data to study spatial and taxonomic coverage.

5 IMPLICATIONS AND RECOMMENDATIONS

While rare species are not necessarily threatened, a detailed evaluation of data sources and the underlying causes of rarity can serve as an early warning to trigger conservation actions. For example, the teiid lizard Glaucomastix cyanura, which displays high values across all rarity indices in both RI and CS data, is currently classified as Data Deficient by the IUCN. This classification positions it as a potential candidate for evaluation, with a high chance of being recognised as threatened. More concerning is the case of the Pantanal coral snake (Micrurus tricolor), currently evaluated as Least Concern, even though it reached the maximum values for all rarity indices in both subsets. The convergence of these rarity dimensions strongly suggests that this species needs to have its threat status revised, given its endemicity to the Pantanal, its limited geographical range, and apparently low densities. Furthermore, the population trend of this species is currently unknown (https://www.iucnredlist.org/species/15202936/15202939). The ongoing large-scale land conversion and extreme fires in the Pantanal (Garcia et al., 2021), together with predicted future threats (Lima et al., 2020), make reassessment more urgent. A similar example is the Black-headed coral snake Micrurus averyi, also classified as Least Concern, yet exhibiting concerning rarity index values. Restricted to northern Amazonia and a small area in Guyana (http://www.reptile-database.org/), its limited observations and small geographical range are cause for concern. Similarly, other species such as the colubrid snake Helicops tapajonicus and the tropidurid lizard Tropidurus insulanus should also have their global threat status carefully re-evaluated. Despite certain limitations, our results indicate that CS data have reached a threshold, having accumulated a sufficient amount of data to inform species conservation in Brazil.
This approach could also benefit other countries, particularly those that have traditionally lacked extensive population data but are currently experiencing increased volunteer activity. This methodology aids in filling critical data gaps, while also empowering local communities to actively partake in biodiversity conservation.

AUTHOR CONTRIBUTIONS

Lucas Rodriguez Forti: Conceptualization; data curation; formal analysis; investigation; methodology; supervision; writing—original draft; writing—review and editing. Jandson Lucas Camelo da Silva: Formal analysis; writing—original draft; writing—review and editing. Eveline Almeida Ferreira: Formal analysis; visualisation; writing—review and editing. Judit K. Szabo: Conceptualization; supervision; visualisation; writing—review and editing.

ACKNOWLEDGEMENTS

We would like to thank all the researchers, both community scientists and traditional scientists, who have contributed data. Jandson da Silva received a CAPES scholarship. No animal or human ethics approval was necessary for this study.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT

All data are freely available in the Global Biodiversity Information Facility.
