Influence of sample size on ecological status assessment using marine benthic invertebrate-based indices
Abstract
Routine monitoring of the quality status of water bodies demands the best cost-benefit relation and sample-size reduction is therefore welcomed. However, great caution is needed because such reduction affects the accuracy and variation of the results. In the present study we tested the influence of sample size (number of replicate samples) on reference condition values and within-sample ecological quality ratio (EQR) variability of six commonly used ecological indices (taxa richness, Shannon–Wiener diversity, AMBI, Medocc index, Bentix and M-AMBI). Analysis of soft-bottom benthic invertebrate data from Slovenian coastal waters showed that sample size influenced the reference condition values of richness/diversity indices (taxa richness and Shannon–Wiener diversity) but not of the sensitivity/tolerance indices (AMBI, Medocc index, Bentix). Increased sample size decreased the within-sample EQR variability and, concomitantly, increased the accuracy of site ecological status classification for all indices. The size of EQR variability differed depending on the index used. EQR variability of M-AMBI, an index composed of metrics with different within-sample EQR variability, was statistically the same as that of the metric with the lowest within-sample EQR variability. Whether this is a common principle for multimetric indices remains to be confirmed. Based on these results, the use of at least three replicates is suggested to obtain reliable measures of reference condition and EQR for the assessment of ecological status. This level of replication is particularly necessary in areas with high diversity and environmental patchiness, and when richness/diversity measures and indices that include these measures are used.
Introduction
The European Water Framework Directive (WFD) 2000/60/EC established a framework for the protection and improvement of all waters. Achieving at least ‘good’ Quality Status (QS) of water bodies (WB) by 2015 is the final objective of the directive (EC 2000). The predominant role in assessing the QS of WB has been given to the evaluation or the assessment of the ecological status (ES) of WB. The assessment of ES is based mainly on biological quality elements (BQE). ES is derived from ecological quality ratios (EQR), which correspond to the ratio of the value of the considered parameter at each sampled station and the value of the same parameter from the reference conditions (Wallin et al. 2003). Besides phytoplankton, angiosperms and macroalgae, some of the BQE to be considered in coastal waters are benthic invertebrates.
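The ratio described above can be written compactly. The symbols are introduced here for illustration only and are not taken from the WFD guidance; the exact form applied to a particular index (for example one whose value decreases with improving quality) may differ.

$$
\mathrm{EQR} = \frac{x_{\mathrm{obs}}}{x_{\mathrm{ref}}}
$$

Here $x_{\mathrm{obs}}$ is the value of the parameter at the sampled station and $x_{\mathrm{ref}}$ is the value of the same parameter under reference conditions.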
Several coastal benthic invertebrate-based indices have been developed in recent years, e.g. the AZTI marine biotic index (AMBI) and multivariate AMBI (M-AMBI) (Borja et al. 2000; Muxika et al. 2007a,b), Bentix (Simboura & Zenetos 2002), the Medocc index (Pinedo & Jordana 2008), the Benthic Quality Index (BQI) (Rosenberg et al. 2004) and the Infaunal Quality Index (IQI) (WFD-UKTAG 2008). Although the WFD outlined the benthic invertebrate parameters to be included in assessing ES (e.g. the level of diversity, abundance of invertebrate taxa and the proportion of disturbance-sensitive and tolerant taxa), not all indices use all of the parameters. Conversely, some indices also incorporate other parameters (e.g. feeding guilds, community similarity) (Borja et al. 2009). The AMBI, Medocc index and Bentix, for example, are univariate and are based solely on the proportion of disturbance-sensitive and tolerant species. Other methods are multivariate or multimetric and include all required parameters, but use different indices for assessing species diversity and the sensitivity/tolerance composition of the respective assemblage (e.g. M-AMBI, BQI and IQI).
Setting the reference conditions (RC) is the crucial aspect of ecological assessment and WFD classification. When reliable reference conditions (perceived as the set of conditions to be expected in the absence of or under minimal anthropogenic disturbances) are defined, it is possible to set proper quality class boundaries, to set criteria for ecological status classification (Economou 2002) and to compare boundary values among European Union member states (MSs). Four options are proposed within the WFD for deriving reference conditions: (i) comparison with an existing ‘pristine’/undisturbed site (or a site with very minor disturbance); (ii) using historical data and information; (iii) using models and (iv) using expert judgment (Wallin et al. 2003). In some countries and for some water typologies, reference conditions are still being settled (Carletti & Heiskanen 2009). There are also cases in which countries did not derive RC, but instead used the maximal/minimal possible index value when calculating the EQR (Carletti & Heiskanen 2009).
In routine water management, managing time and costs is a major challenge. Sample size can have an effect on both, so any reduction in sample size is welcomed. Sample size can be reduced by reducing the physical size of the sample (sample surface area), by decreasing the number of sampling units (usually replicates) or by using laboratory sub-sampling (Metzeling & Miller 2001). Caution is needed because this reduction can affect the representativeness of the sample and metric output. The same is true also for certain other parts of the preparation process, such as sieving. Larger mesh size reduces sample processing time and the accompanying costs, but at the same time can affect accuracy and comparability of the obtained results (Pinto et al. 2009; Couto et al. 2010). Much work on a potential sample size effect on sample representativeness has already been undertaken for aquatic benthic invertebrates (Pielou 1966; Sanders 1968; Elliot 1977; Soetaert & Heip 1990; Rumohr et al. 2001; Vlek et al. 2004; Fleischer et al. 2007). Petkovska & Urbanič (2010) showed, based on riverine benthic invertebrates, that increasing sample size decreases the variability of metrics output, but also that the variability depends on the group of metrics; these were defined by Hering et al. (2004) as composition/abundance, richness/diversity, sensitivity/tolerance and functional metrics. High variability among metric values creates problems in an assessment because it reduces reliability of the possible results (e.g. RC, ES) (Doberstein et al. 2000; Vlek et al. 2004; Clarke et al. 2006). Some authors (Boudouresque 1974; Niell 1977) argue that minimal sample size depends upon geographical region, heterogeneity of the area and community analysed. Accordingly, Niell (1977) proposed that a minimal sample size should be carefully estimated, based on the above-mentioned criteria, whenever a study is commenced for the first time.
Until now, very few of the methodologies used in the WFD presented information on the robustness of their metrics or considered the accuracy and variation of the results obtained with them (Borja et al. 2008). In assessing ES, MSs use different sample sizes and in many cases only one replicate per sample (Borja et al. 2007; Carletti & Heiskanen 2009). With the current contribution, we assess the influence of sample size (here taken as the number of replicates) on reference condition values and within-sample EQR variability. Based on the results of this assessment, we also consider the accuracy of ecological status classifications.
Material and methods
Study area
The Slovenian coastal sea is situated in the southern part of the Gulf of Trieste and represents the northernmost part of the Adriatic and the Mediterranean Sea (Fig. 1). The Gulf of Trieste is a shallow semi-enclosed gulf, characterized by the largest tidal differences (semidiurnal amplitudes reach 30 cm) and the lowest winter temperatures (below 10 °C) in the Mediterranean Sea (Boicourt et al. 1999). These conditions are accompanied by high temperature and salinity variations, and strong stratification of the water column (Stravisi 1983). The hydro-dynamism of the Gulf of Trieste is linked mainly to the ascending eastern current coming from the Istrian coast (Stravisi 1983). The general circulation pattern is predominantly counter-clockwise in the lower layer and clockwise in the surface layer. This circulation, especially in the surface layer, can be modulated by prevailing winds, mostly the Bora (Stravisi 1983). Based on granulometric and mineralogical analysis of superficial sediment, seven sedimentary zones can be distinguished in the gulf (Ogorelec et al. 1991). Within 1 mile offshore of the baseline, defined in the WFD as coastal waters (EC 2000), only three zones are present along the Slovenian coast: the Coastal zone of sandy silt, Central parts of the bays with clayey silt, and the Inner transitional zone with silt. The Slovenian coastline is 46.7 km long and consists almost entirely of flysch, which is the major source of detrital material. The coastline is characterized by two major bays, the Bay of Koper and the Bay of Piran, which are wide, submerged valleys of the small rivers Rižana and Dragonja, respectively. Mainly sandstone cliffs represent the remaining coastline. The coastline is under high anthropogenic influence and currently only about 18% remains in its natural state (Turk 1999). 
Slovenian coastal waters are affected by freshwater inflows, bottom deposit re-suspension, pollution and other anthropogenic impacts (heavy metals, Port of Koper, untreated or partially treated sewage outfalls, intensive farming, overfishing and mariculture), and therefore no reference sites according to the WFD (EC 2000) are available.

Fig. 1. Map showing the position of soft-bottom invertebrate sampling sites along the Slovenian coast in the Adriatic Sea. VT2_P1 and VT2_P2 represent the two least impacted sampling sites.
Sampling
Altogether, 54 macrozoobenthic samples were taken between 2005 and 2008 at 30 sampling sites scattered along the Slovenian coast (Fig. 1). During 2005 and 2006, the initial sampling of all sites was performed, and in 2007 and 2008 six sites were resampled in the spring and late summer periods for monitoring purposes. We assume that temporal differences in sampling occasion do not influence the results. Benthic samples were obtained with a Van Veen grab (0.1 m²) at depths between 7 and 10 m on mainly clayey silt sediment bottom (40% clay, <5% sand, mean grain size 3–10 μm, carbonate content 30–40%; Ogorelec et al. 1991). Sampling sites were chosen to be as close to the coast as possible while avoiding seabed covered by seagrass. Three replicates were taken at each sampling occasion. Replicates were treated as separate samples during preparation, identification, and part of the analysis. All benthic samples were sieved through 3-mm and 1-mm mesh sieves to simplify sorting, and the retained material was fixed with an 80% ethanol–seawater solution. In the laboratory, benthic invertebrates were sorted into main taxonomic groups, identified to the lowest possible taxonomic level (mainly to species level, rarely to genus or family) and counted. After sorting, the 3-mm and 1-mm fractions were combined and treated together. The use of a large mesh size, such as 3 mm, is common in environmental quality assessments undertaken in the Slovenian region.
Data analysis
We applied two indices commonly used in aquatic biological assessment (Shannon–Wiener diversity index, H′, and taxa richness, S) and four indices used in assessing coastal waters (Bentix, Medocc index, AMBI and M-AMBI). All were selected because they are among the most frequently applied indices in ES assessment in the Mediterranean MSs (Carletti & Heiskanen 2009), and were calculated to test the sample size effect on reference condition values and within-sample EQR variability. H′ and S fall within the richness/diversity metric group (sensu Hering et al. 2004), Bentix, Medocc index and AMBI fall within the sensitivity/tolerance group, while M-AMBI combines metrics from both metric groups. The recommended guidelines were followed: for H′ (calculated on log2 basis) Shannon & Wiener (1963), for M-AMBI Muxika et al. (2007a,b), for Bentix Simboura & Zenetos (2002), for Medocc index Pinedo & Jordana (2008) and for AMBI Borja et al. (2000). S was defined as the number of taxa in a sample.
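The two richness/diversity indices can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the taxon names and counts are hypothetical, and pooling replicates by summing their counts is an assumption made here to show why S can grow with the number of replicates.

```python
import math
from collections import Counter

def taxa_richness(counts):
    """S: number of taxa with at least one individual in the sample."""
    return sum(1 for n in counts.values() if n > 0)

def shannon_wiener(counts):
    """H' = -sum(p_i * log2(p_i)) over the taxa present in the sample."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    h = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log2(p)
    return h

# Hypothetical replicates: taxon name -> number of individuals
rep_a = {"taxon_a": 8, "taxon_b": 4, "taxon_c": 2, "taxon_d": 2}
rep_b = {"taxon_a": 5, "taxon_e": 3}

s_one = taxa_richness(rep_a)
h_one = shannon_wiener(rep_a)

# Assumption: a two-replicate sample is formed by summing the counts
pooled = Counter(rep_a) + Counter(rep_b)
s_two = taxa_richness(pooled)
```

In this toy example a taxon absent from the first replicate appears in the second, so the pooled richness exceeds the one-replicate richness.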
Sample size effect on reference condition values of five indices (H′, S, Bentix, Medocc index and AMBI) was defined using data from the two least impacted sampling sites – VT2_P1 and VT2_P2. The indices’ reference condition values were set following the Slovenian approach (Occhipinti-Ambrogi et al. 2009) where median values of the two least impacted sampling sites were increased (H′, S, Bentix) or decreased (Medocc index, AMBI) by 15% of the absolute difference between the lower anchor (index value indicating the worst possible conditions) and the median value. Using two best-available samples with three replicates each, nine possible combinations were calculated for each index. In case of one-replicate samples, each replicate of site VT2_P1 was combined with each replicate of site VT2_P2. For two-replicate samples, all three pairs of two replicates obtained at site VT2_P1 were combined with all three pairs of two replicates obtained at site VT2_P2. All newly derived reference values for one- and two-replicate samples were normalized with reference condition values calculated using three-replicate samples. A one-way ANOVA was run for each sample size (1 and 2 replicates), with a Tukey HSD post-hoc test performed to determine differences among tested indices in reference condition value variability. Occasionally, deviations from homogeneity of variances or normality were detected but we assumed the ANOVAs to be robust to these deviations. Nevertheless, differences between index values of one- and two-replicate samples were tested with the non-parametric Mann–Whitney U-test.
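The reference value rule and the nine combinations per sample size can be sketched as follows. This is one possible reading of the procedure, not the authors' code: the index values and lower anchors are hypothetical, and combining the two sites by taking the median of their two values is an assumption.

```python
from itertools import combinations, product
from statistics import median

def reference_value(site_values, lower_anchor, higher_is_better, shift=0.15):
    """Shift the median of the least impacted sites away from the lower anchor
    by 15% of the absolute difference between the anchor and the median."""
    m = median(site_values)
    step = shift * abs(lower_anchor - m)
    return m + step if higher_is_better else m - step

# Hypothetical H' values for the three replicates of each site
site_p1 = [4.1, 4.4, 4.6]
site_p2 = [3.9, 4.2, 4.5]

# One-replicate samples: each replicate of one site with each replicate
# of the other site gives 3 x 3 = 9 reference values
one_rep_refs = [
    reference_value(pair, lower_anchor=0.0, higher_is_better=True)
    for pair in product(site_p1, site_p2)
]

# Two-replicate samples: the three possible replicate pairs per site,
# again 3 x 3 = 9 combinations (replicate positions only; the index would
# be recalculated for each pair before the rule is applied)
pairs_per_site = list(combinations(range(3), 2))
two_rep_designs = list(product(pairs_per_site, pairs_per_site))
```

For an index that decreases with improving quality, the same function is called with `higher_is_better=False`, so the reference value moves below the median.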

Results
Sample size effect on reference conditions
The reference condition values for AMBI (Mann–Whitney U-test, U = 40, P > 0.05), Bentix (Mann–Whitney U-test, U = 39, P > 0.05) and Medocc index (Mann–Whitney U-test, U = 40, P > 0.05) showed no statistically significant differences between the examined sample sizes, whereas for S (Mann–Whitney U-test, U = 0, P < 0.001) and H′ (Mann–Whitney U-test, U = 7, P < 0.01) the difference was statistically significant (Fig. 2). With one-way ANOVA, a significant effect of index selection was observed for both examined sample sizes (F = 29.16–130.76, P < 0.0001). Using Tukey HSD post-hoc tests (α = 0.05), three groups of metrics were established for one-replicate samples and two groups for two-replicate samples (Fig. 2). Independent of the number of replicates per sample, the normalized reference condition values for the AMBI, Bentix and Medocc index did not differ significantly (Tukey HSD post-hoc tests, P > 0.05). For two-replicate samples, the normalized reference condition value for H′ also did not differ significantly from this group (Tukey HSD post-hoc tests, P > 0.05). For one-replicate samples, however, the value for H′ differed both from this group and from the normalized reference value for S (Tukey HSD post-hoc tests, P < 0.002). The normalized reference condition values for S were significantly lower than those of all other indices, for both sample sizes (Tukey HSD post-hoc tests, P < 0.00001).
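To help read the U values above, the statistic can be computed by hand. The sketch below assumes that the reported U is the smaller of the two possible statistics and that each sample size contributes nine values, as in the combinations described in the methods; the data are hypothetical and no P value is calculated.

```python
def mann_whitney_u(x, y):
    """Smaller of the two Mann-Whitney U statistics; ties count one half."""
    u_x = 0.0
    for a in x:
        for b in y:
            if a > b:
                u_x += 1.0
            elif a == b:
                u_x += 0.5
    u_y = len(x) * len(y) - u_x
    return min(u_x, u_y)

# Hypothetical normalized reference values, nine per sample size
one_rep = [0.55, 0.58, 0.60, 0.61, 0.63, 0.64, 0.66, 0.68, 0.70]
two_rep = [0.80, 0.82, 0.83, 0.85, 0.86, 0.88, 0.89, 0.91, 0.93]

u_separated = mann_whitney_u(one_rep, two_rep)  # groups do not overlap
u_identical = mann_whitney_u(one_rep, one_rep)  # groups overlap completely
```

With nine values per group, U ranges from 0 (every value of one group lies below every value of the other) to 40.5 (complete overlap), which places the reported values of 0 for S and 39 to 40 for the sensitivity/tolerance indices at opposite ends of that scale.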

Fig. 2. Normalized reference condition values of five indices with one- (filled bars) and two-replicate (open bars) samples; means (bars) and 95% CI are given. For a given sample size, bars with different letters are significantly different (Tukey HSD post-hoc test, P < 0.05). S, taxa richness; H′, Shannon–Wiener diversity index; n.s., not significant; **P < 0.01, ***P < 0.001.
Sample size effect on within-sample EQR variability
Within-sample variation of EQR values was higher for one-replicate than for two-replicate samples for all six indices (Student’s t-tests, t = 4.76–5.95, P < 0.0001) (Fig. 3, Table 1). Indices differed significantly in within-sample EQR variation for one-replicate (ANOVA, F = 16.41, P < 0.0001) and two-replicate samples (ANOVA, F = 18.60, P < 0.0001). With Tukey HSD post-hoc tests (α = 0.05) of one- and two-replicate samples, four and three groups of metrics were established, respectively (Fig. 3). The highest statistically significant within-sample EQR variability was detected for S, for both sample sizes (Tukey HSD post-hoc tests, P < 0.002). Independent of the number of replicates per sample, within-sample EQR variability of the H′, AMBI and M-AMBI on one hand, and Bentix and Medocc index on the other, did not differ significantly (Tukey HSD post-hoc tests, P > 0.05). For two-replicate samples, the latter two indices showed the lowest within-sample EQR variability (Tukey HSD post-hoc tests, P < 0.0001). For one-replicate samples, M-AMBI and Medocc index did not differ in within-sample EQR variability (Tukey HSD post-hoc tests, P > 0.05). Examining only the metrics of the sensitivity/tolerance group reveals that EQR variability for AMBI is higher than for the Medocc index and Bentix.
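The summary statistics in Table 1 are expressed as EQR ranges. A minimal sketch of how such a summary could be produced is given below, assuming that the within-sample range is the difference between the largest and smallest EQR obtained from the subsamples of one sampling occasion; the EQR values are hypothetical and the authors' exact procedure may differ.

```python
from statistics import mean, stdev

def eqr_range(eqr_values):
    """Within-sample EQR range: largest minus smallest EQR of one occasion."""
    return max(eqr_values) - min(eqr_values)

# Hypothetical EQR values of the three one-replicate subsamples
# for each of three sampling occasions
one_rep_eqr = [
    [0.62, 0.80, 0.71],
    [0.55, 0.60, 0.75],
    [0.90, 0.84, 0.70],
]

ranges = [eqr_range(occasion) for occasion in one_rep_eqr]
summary = {
    "mean": mean(ranges),
    "min": min(ranges),
    "max": max(ranges),
    "sd": stdev(ranges),
}
```

The same function would be applied to the EQR values of the three possible replicate pairs to obtain the two-replicate rows.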

Fig. 3. Within-sample EQR variability of six indices with one- (filled bars) and two-replicate (open bars) samples; means (bars) and 95% CI are given. For a given sample size, bars with different letters are significantly different (Tukey HSD post-hoc test, P < 0.05). S, taxa richness; H′, Shannon–Wiener diversity index; ***P < 0.001.
Index | No. of replicates | Mean EQR range | Min. EQR range | Max. EQR range | SD
---|---|---|---|---|---
S | 1 | 0.20 | 0.03 | 0.50 | 0.10
S | 2 | 0.11 | 0.03 | 0.24 | 0.06
H′ | 1 | 0.13 | 0.01 | 0.32 | 0.07
H′ | 2 | 0.08 | 0.01 | 0.16 | 0.04
AMBI | 1 | 0.14 | 0.02 | 0.36 | 0.08
AMBI | 2 | 0.07 | 0.01 | 0.18 | 0.04
M-AMBI | 1 | 0.13 | 0.02 | 0.26 | 0.06
M-AMBI | 2 | 0.08 | 0.00 | 0.16 | 0.04
Bentix | 1 | 0.09 | 0.01 | 0.27 | 0.05
Bentix | 2 | 0.04 | 0.01 | 0.14 | 0.03
Medocc index | 1 | 0.10 | 0.00 | 0.24 | 0.06
Medocc index | 2 | 0.05 | 0.00 | 0.12 | 0.03
Discussion
Several patterns regarding a sample size effect on the metrics used for the study of benthic invertebrate assemblages have been reported. These studies involve metrics such as composition/abundance, richness, diversity, and sensitivity/tolerance. For richness it is commonly accepted that values increase with sample size until the asymptotic value is reached (e.g. Sanders 1968). In our study, the difference in taxa richness was statistically significant for the two observed sample sizes. Regarding diversity metrics, the pattern depends on the diversity index used (Sanders 1968; Chadwick & Canton 1984; Soetaert & Heip 1990; Petkovska & Urbanič 2010). For H′, the information is contradictory. Wilhm (1970) reported that H′ increases in the first few samples until it reaches the asymptote, whereas Chadwick & Canton (1984) claimed that H′ is relatively independent of sample size. In our study, the difference in H′ was statistically significant for one- and two-replicate samples. This is consistent with the findings of Wilhm (1970), but more replicates would be necessary to be certain. Simboura & Zenetos (2002) reported that H′ is moderately dependent on sample size but added that it also depends on habitat type and taxonomic effort. In our study, the observed differences could be related entirely to sample size because taxonomic effort and habitat type were the same for all samples. Soetaert & Heip (1990) argued that the pattern also depends on the diversity of the sampled assemblages, with the diversity/sample-size dependency being more pronounced in high-diversity assemblages. Mavrič et al. (2010) reported that the benthic invertebrate assemblages of the present study area are quite diverse. In running waters, the metrics of the sensitivity/tolerance group were found to be independent of sample size (Vlek et al. 2004; Petkovska & Urbanič 2010). The same was observed for Bentix and AMBI in coastal waters (Simboura & Zenetos 2002; Fleischer et al. 2007).
Our results confirmed the general pattern of independence for all three tested sensitivity/tolerance indices (Bentix, AMBI and Medocc index). This finding is not surprising because these indices are weighted averages of sensitivity/tolerance levels based on species abundance. Thus, rare species and the number of species have little impact on them.
Reference condition values must be determined for all metrics of the benthic invertebrates included in the assessment of ES. As is evident from the dependence of richness/diversity metrics on sample size, RC for some metrics can be influenced by sample size. For M-AMBI, which was not included in the formal assessment of the influence of the number of replicates on reference condition values, sample size is also important. This is because M-AMBI is composed of three different indices, of which two (taxa richness and H′) were shown in this study to be influenced by sample size. For these metrics we suggest using at least three replicates to obtain appropriate precision and representativeness.
Decreasing values of variation coefficients with increasing sampling effort have already been reported for biological metrics, including in coastal waters. Fleischer et al. (2007), for example, reported this for AMBI and Rumohr et al. (2001) for H′. Our research showed a statistically significant decrease in within-sample EQR variability with increased number of replicates for all six indices (S, H′, AMBI, M-AMBI, Medocc index and Bentix). Biotic populations are usually distributed heterogeneously in their habitat and the distribution itself is usually patchy (Downing 1991; Dhar 2000). Mavrič et al. (2010) demonstrated the high heterogeneity of the environment from which the present data originate, and consequently the high heterogeneity of the community of benthic invertebrates. The EQR variability within one sampling station most probably reflects variability related to microscale distribution. Having a number of replicates that provides high ecological status classification accuracy, or otherwise a high confidence level in the obtained results, is crucial in monitoring programmes and management plans. The confidence level can depend on EQR variability. ECOSTAT (the group in charge of WFD implementation) stated that a range of 0.05 EQR units is an acceptable deviation from the mean and can be considered an agreement in the ecological quality classification (Borja et al. 2008). In our study, the maximum EQR range for one-replicate samples varied between 0.24 for the Medocc index and 0.50 for S (Table 1). Such differences would translate into different ecological status (ES) classes for a site because the width of one class is usually 0.2 or less (Carletti & Heiskanen 2009). One-replicate samples, with a sampling area of 0.1 m², are therefore inappropriate for ES assessment, at least in the Gulf of Trieste.
Based on data from a Basque coastal and estuarine area, Muxika et al. (2007a,b) pointed out that two replicates are sufficient for AMBI to classify a sampled site into a proper disturbance level. Simboura et al. (2005, 2007) and Simboura & Reizopoulou (2007) assessed the ES of coastal embayments of the Aegean Sea (Eastern Mediterranean) by applying the Bentix index to benthic data from two replicate samples. However, it is worth noting that these Aegean Sea ecosystems are oligotrophic, euhaline and microtidal, and that the benthic fauna is usually very diverse and evenly distributed, with no single species exceeding 10% dominance under natural conditions (Simboura & Reizopoulou 2007). The results of our study suggest that an increase of sample size to two replicates would yield a more accurate ES assessment. Nonetheless, the maximum EQR ranges for two-replicate samples, which vary between 0.12 for the Medocc index and 0.24 for S (Table 1), are still high and show the need for more replicate samples. As was proposed by Soetaert & Heip (1990) for the diversity/sample-size dependency (more pronounced in high-diversity assemblages), this could also be valid for the EQR variability/sample-size dependence. However, the selection of the biological metric is even more important for this dependence. We showed that EQR variability of taxa richness for a given number of replicates is higher than that of H′ and of the metrics of the sensitivity/tolerance group. Moreover, examining only the metrics of the sensitivity/tolerance group revealed that EQR variability for AMBI is higher than for the Medocc index and Bentix (but the same as for H′). This result might reflect (i) the AMBI’s higher number of different weight coefficients given to ecological groups (EG) (5 for AMBI, 4 for the Medocc index and 2 for Bentix), although the Medocc index and Bentix showed no difference, and (ii) the fact that the AMBI has the smallest index range, so that the same absolute change in index value translates into a larger EQR change.
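The link between EQR range and misclassification can be illustrated with a short sketch. It assumes five status classes of equal width (0.2 EQR units); actual class boundaries are index-specific and are not given in this text, so the boundaries and EQR values below are hypothetical.

```python
import bisect

# Assumed equal-width class boundaries; real boundaries differ among indices
BOUNDARIES = [0.2, 0.4, 0.6, 0.8]
CLASSES = ["bad", "poor", "moderate", "good", "high"]

def status_class(eqr):
    """Ecological status class for a given EQR under the assumed boundaries."""
    return CLASSES[bisect.bisect_right(BOUNDARIES, eqr)]

def classes_spanned(eqr_min, eqr_max):
    """Number of status classes covered by the interval [eqr_min, eqr_max]."""
    lo = bisect.bisect_right(BOUNDARIES, eqr_min)
    hi = bisect.bisect_right(BOUNDARIES, eqr_max)
    return hi - lo + 1

# Two hypothetical replicates of one site that differ by 0.50 EQR units,
# the largest one-replicate range reported for S in Table 1
wide = (status_class(0.45), status_class(0.95))
n_wide = classes_spanned(0.45, 0.95)

# A range of 0.12, the largest two-replicate range for the Medocc index
narrow = (status_class(0.45), status_class(0.57))
n_narrow = classes_spanned(0.45, 0.57)
```

Under these assumed boundaries, replicates 0.50 EQR units apart fall into classes two steps apart, whereas replicates 0.12 units apart can still share a class, although they may also straddle a boundary depending on where the site mean lies.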
M-AMBI is a multimetric index composed of metrics with different within-sample EQR variability (S has the highest, AMBI and H′ have modest variability), but its within-sample EQR variability was modest. Combining different metrics in a multimetric index thus reduced the within-sample EQR variability relative to the most variable component. The EQR variability of this multimetric index was statistically the same as that of its component metrics with the lowest within-sample EQR variability. Whether this finding is a common principle still needs to be confirmed.
Conclusions
The present study shows that sample size (number of replicates) influenced the reference condition values for S and to a lesser extent for H′, whereas no influence was observed for AMBI, Medocc index and Bentix reference condition values (up to two replicates). More importantly, sample size affected the EQR variability and thus the accuracy of site ecological status classification, although the degree of the effect depended on the index used. Increasing sample size from one to two replicates considerably decreased within-sample EQR variability, but the need for an even further decrease was observed. The results suggest that at least three replicates are needed to obtain reliable measures of reference condition and EQR for the assessment of ecological status based on indices of marine benthic invertebrates. This suggestion is appropriate especially in cases when diversity measures and indices that include them are used, and for areas with environmental properties similar to those in the Gulf of Trieste (high diversity and environmental patchiness).
Acknowledgements
The dataset was created with financial support from the Ministry of Environment and Spatial Planning of Slovenia funding a project for the implementation of the Water Framework Directive and from the Agency for Environment of the Republic of Slovenia funding the national monitoring program. The authors thank Milijan Šiško for technical assistance, Žiga Dobrajc and Marko Tadejevič for their help during fieldwork and sorting, and Michael Stachowitsch, Martina Orlando Bonaca and Janja France for their support and suggestions during the writing process. Identification work by Nicola Betosso, Lorenzo C. Saitz and Cene Fišer, and help with index application by A. Borja and S. Pinedo are also gratefully acknowledged.