Volume 43, Issue 15 pp. 4620-4639
RESEARCH ARTICLE
Open Access

A comparison of intracranial volume estimation methods and their cross-sectional and longitudinal associations with age

Stener Nerland

Corresponding Author

Stener Nerland

Department of Psychiatric Research, Diakonhjemmet Hospital, Oslo, Norway

NORMENT, University of Oslo, Oslo, Norway

Correspondence

Stener Nerland and Claudia Barth, Department of Psychiatric Research, Diakonhjemmet Hospital, Oslo, Norway.

Email: [email protected]; [email protected]

Search for more papers by this author
Therese S. Stokkan

Therese S. Stokkan

Department of Psychiatric Research, Diakonhjemmet Hospital, Oslo, Norway

NORMENT, University of Oslo, Oslo, Norway

Search for more papers by this author
Kjetil N. Jørgensen

Kjetil N. Jørgensen

NORMENT, University of Oslo, Oslo, Norway

Department of Psychiatry, Telemark Hospital, Skien, Norway

Search for more papers by this author
Laura A. Wortinger

Laura A. Wortinger

Department of Psychiatric Research, Diakonhjemmet Hospital, Oslo, Norway

NORMENT, University of Oslo, Oslo, Norway

Search for more papers by this author
Geneviève Richard

Geneviève Richard

NORMENT, Division of Mental Health and Addiction, Oslo University Hospital, Oslo, Norway

Search for more papers by this author
Dani Beck

Dani Beck

Department of Psychiatric Research, Diakonhjemmet Hospital, Oslo, Norway

NORMENT, University of Oslo, Oslo, Norway

Search for more papers by this author
Dennis van der Meer

Dennis van der Meer

School of Mental Health and Neuroscience, Faculty of Health, Medicine and Life Sciences, Maastricht University, Maastricht, The Netherlands

Search for more papers by this author
Lars T. Westlye

Lars T. Westlye

NORMENT, Division of Mental Health and Addiction, Oslo University Hospital, Oslo, Norway

Department of Psychology, University of Oslo, Oslo, Norway

Search for more papers by this author
Ole A. Andreassen

Ole A. Andreassen

Department of Psychiatric Research, Diakonhjemmet Hospital, Oslo, Norway

NORMENT, Division of Mental Health and Addiction, Oslo University Hospital, Oslo, Norway

Search for more papers by this author
Ingrid Agartz

Ingrid Agartz

Department of Psychiatric Research, Diakonhjemmet Hospital, Oslo, Norway

NORMENT, University of Oslo, Oslo, Norway

Centre for Psychiatry Research, Department of Clinical Neuroscience, Karolinska Institutet, Stockholm, Sweden

Stockholm Health Care Services, Stockholm Region, Stockholm, Sweden

Search for more papers by this author
Claudia Barth

Corresponding Author

Claudia Barth

Department of Psychiatric Research, Diakonhjemmet Hospital, Oslo, Norway

NORMENT, University of Oslo, Oslo, Norway

Correspondence

Stener Nerland and Claudia Barth, Department of Psychiatric Research, Diakonhjemmet Hospital, Oslo, Norway.

Email: [email protected]; [email protected]

Search for more papers by this author
First published: 16 June 2022
Citations: 2

Stener Nerland and Therese S. Stokkan contributed equally to this study.

Funding information: Helse Sør-Øst RHF, Grant/Award Numbers: 2017-097, 2019-104, 2020-020; Norges Forskningsråd, Grant/Award Numbers: 213700, 223273, 250358; Stiftelsen Kristian Gerhard Jebsen, Grant/Award Number: SKGJ-MED-008

Abstract

Intracranial volume (ICV) is frequently used in volumetric magnetic resonance imaging (MRI) studies, both as a covariate and as a variable of interest. Findings of associations between ICV and age have varied, potentially due to differences in ICV estimation methods. Here, we compared five commonly used ICV estimation methods and their associations with age. T1-weighted cross-sectional MRI data was included for 651 healthy individuals recruited through the NORMENT Centre (mean age = 46.1 years, range = 12.0–85.8 years) and 2410 healthy individuals recruited through the UK Biobank study (UKB, mean age = 63.2 years, range = 47.0–80.3 years), where longitudinal data was also available. ICV was estimated with FreeSurfer (eTIV and sbTIV), SPM12, CAT12, and FSL. We found overall high correlations across ICV estimation method, with the lowest observed correlations between FSL and eTIV (r = .87) and between FSL and CAT12 (r = .89). Widespread proportional bias was found, indicating that the agreement between methods varied as a function of head size. Body weight, age, sex, and mean ICV across methods explained the most variance in the differences between ICV estimation methods, indicating possible confounding for some estimation methods. We found both positive and negative cross-sectional associations with age, depending on dataset and ICV estimation method. Longitudinal ICV reductions were found for all ICV estimation methods, with annual percentage change ranging from −0.293% to −0.416%. This convergence of longitudinal results across ICV estimation methods offers strong evidence for age-related ICV reductions in mid- to late adulthood.

1 INTRODUCTION

Intracranial volume (ICV), defined as the volume within the cranium including the brain, meninges, and cerebrospinal fluid (CSF), is an important measure in brain magnetic resonance imaging (MRI) studies. It is frequently used to adjust for individual variation in head size (O'Brien et al., 2011; Voevodskaya et al., 2014) and as a proxy for premorbid brain volume in the study of neurodegenerative diseases (Davis & Wright, 1977). Manual delineation of structural brain MRI scans is considered the most accurate in vivo method for determining ICV (Huo et al., 2016; Klasson et al., 2015; Whitwell et al., 2001). However, this approach is labor intensive and requires training, making it impractical for large datasets. To overcome these limitations, a variety of automated methods for computing ICV using T1-weighted structural MRI have been developed. Previous studies have reported varying consistency and agreement between ICV estimation methods (Malone et al., 2015; Sargolzaei et al., 2015). In one study, associations between hippocampal volume, education and a cognitive measure differed between ICV estimation methods (Nordenskjöld et al., 2013). Some studies have also indicated that the accuracy of automated ICV estimation varies as a function of head size (Klasson et al., 2018). Such findings highlight the importance of assessing ICV estimation methods for potential sources of bias, which may otherwise introduce spurious effects in studies relying on ICV estimates. In the present study, we pursued two related aims: First, we compared ICV estimation methods and assessed potential sources of bias influencing ICV estimation. Secondly, we assessed both cross-sectional and longitudinal associations between age and ICV.

Automated ICV estimation methods typically use T1-weighted MRI images and can be broadly classified as either registration- or segmentation-based. Registration-based methods estimate ICV via an atlas scaling factor given by the determinant of an affine transformation of individual MRI images to a template. The two most common registration-based ICV estimation methods are estimated Total Intracranial Volume (eTIV; Buckner et al., 2004) in FreeSurfer and SIENAX from FSL (FMRIB Software Library; Smith et al., 2002, 2004). With segmentation-based methods, MRI images are first segmented into tissue compartments which are then used to calculate volumetric estimates of the intracranial cavity. One popular segmentation-based method is the Tissue Volumes utility in Statistical Parametric Mapping (SPM; https://www.fil.ion.ucl.ac.uk/spm/). CAT12 is an extension of SPM12 that aims to provide a more robust segmentation algorithm (http://www.neuro.uni-jena.de/cat/). Both SPM12 and CAT12 compute ICV as a probability-weighted sum of grey matter (GM), white matter (WM) and CSF. Sequence Adaptive Multimodal Segmentation (SAMSEG; Puonti et al., 2016) was recently introduced in FreeSurfer version 7 and computes the segmentation-based Total Intracranial Volume (sbTIV).

Throughout the lifespan, ICV increases rapidly from early childhood until early adolescence and is thought to remain relatively stable throughout adulthood (Mills et al., 2016; Pfefferbaum et al., 1994). These findings are consistent with studies on head circumference and computed tomography (Bergerat et al., 2021; Huda et al., 2004; Neubauer et al., 2009). In adulthood, the presence of ICV change is less well-established, and it remains uncertain if continued changes occur and to what extent. However, past non-MRI studies on ICV and head size have often been limited to childhood and adolescence. A notable exception is the cross-sectional study by Weaver and Christian (1980), reporting no significant association between age and occipitofrontal head circumference in 567 participants (50% female) aged 18 to approximately 67 years.

A number of cross-sectional MRI studies have assessed the association between age and ICV in adulthood, some of which report significant negative associations with age while others find no significant associations. For instance, DeCarli et al. (2005) estimated ICV using in-house software in 2081 participants from 34 to 96 years of age and found subtle negative associations with age (~0.1% ICV change per year), independent of sex. Similarly, in a study of 147 participants between the ages of 15 and 96, Buckner et al. (2004) reported a slight negative association between manually determined ICV and age (−1.05 cm3/year). A notable recent example is the study by Ma et al. (2018), where cross-sectional associations with age were analyzed in a large dataset consisting of 7656 scans for 1727 elderly subjects (55–95 years of age) from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Three diagnostic groups were included; cognitive normal, mild cognitive impairment, and an Alzheimer's disease group. They observed no significant cross-sectional associations between age and ICV as estimated with three methods: eTIV, SPM12, and multiatlas label fusion (MALF; Huo et al., 2016).

Most studies on putative associations between ICV and age are cross-sectional rather than longitudinal (Good et al., 2001; Kim et al., 2018; Kruggel, 2006). Such study designs are limited in their ability to resolve age trajectories across the lifespan and can be influenced by generational effects such as secular growth rates (Miller & Corsellis, 1977). In Caspi et al. (2020), both longitudinal and cross-sectional age effects on ICV were assessed in participants between the ages of 16 and 55 years. They included 528 participants at baseline, 378 at follow-up, and 309 at the second follow-up, where the mean period between time points was 3.3 years. In the longitudinal analysis, they reported nonlinear associations where the annual percentage change (APC) of ICV was initially positive (0.03% APC at age 20) and then negative in later adulthood (−0.09% APC at age 55). In the cross-sectional analysis, the data indicated a predicted ICV change of 0.21% for males and 0.22% for females at 20 years of age. They attributed the different magnitudes of effects in the longitudinal and cross-sectional analyses to different secular growth rates. We found only one other study on longitudinal ICV changes in the literature, which showed a statistically significant increase in ICV in the youngest group (≤34 years) but no significant longitudinal age changes in mid- (35–54 years) to late (≥54 years) adulthood (Liu et al., 2003) where ICV was estimated with Exbrain (Lemieux et al., 2003). However, this study was limited by a small sample size with only 44 participants in the first group, 37 in the second, and 9 in the third.

In the present study, we compared five of the most frequently used automated ICV estimation methods to test their relative consistency and absolute agreement, as well as their cross-sectional and longitudinal associations with age. Given the robustness of the ICV measure, we expected that correlation coefficients between estimation methods would exceed 0.9. We further hypothesized that the agreement between ICV estimates would be higher between the two registration-based methods and between the three segmentation-based methods, than across the two groups of ICV estimation methods. To test potential confounding by age, sex, height, body weight, and mean ICV across methods, we computed their explained variance of the pairwise differences between ICV estimation methods. To examine cross-sectional associations with age, we performed exploratory regression analyses for each dataset and ICV estimation method assessing both linear and nonlinear effects of age. The influence of sex, height, and body weight on cross-sectional age trajectories was investigated in separate interaction models. Finally, based on the study by Caspi et al., 2020, we expected lower ICV at follow-up compared with baseline with an APC of ~0.09%.

2 MATERIALS AND METHODS

2.1 Participants and MRI

Participants from two MRI datasets were included in this study: (1) Norwegian Centre for Mental Disorders Research (NORMENT), a cross-sectional dataset recruited in the greater Oslo region of Norway, and (2) UK Biobank (UKB), a longitudinal dataset recruited in the United Kingdom. See Table 1 for demographics for each dataset and Figure 1 for the age distribution for each dataset.

TABLE 1. Sample demographics. Continuous variables reported as mean ± SD. For the UKB dataset, we report age, height, and body weight at baseline imaging session.
Sample demographics
n Sex ratio (F/M) (% female) Age ± SD (years) Age range (years) Height ± SD (cm) Weight ± SD (kg)
UKB 2410 1203/1207 (49.9%) 63.2 ± 7.1 47.0–80.3 170.9 ± 9.6 75.8 ± 14.7
NORMENT 651 366/285 (56.2%) 46.1 ± 18.7 12.0–85.8 173.5 ± 9.3 74.5 ± 14.8
Y-TOP 61 38/23 (62.3%) 17.0 ± 1.9 12.0–20.6 170.8 ± 7.9 62.2 ± 11.3
TOP 275 131/144 (47.6%) 36.9 ± 9.8 17.7–56.7 175.2 ± 9.4 77.7 ± 14.8
StrokeMRI 315 197/118 (62.5%) 59.8 ± 14.5 20.0–85.8 172.5 ± 9.2 74.1 ± 14.1
Combined dataset 3061 1569/1492 (51.3%) 59.6 ± 12.8 12.0–85.8 171.4 ± 9.6 75.5 ± 14.7
Details are in the caption following the image
Raincloud plot of the age distributions for each subsample. It is composed of flat violin plots, points representing the age of each participant, and horizontal box plots. The lower and upper hinges correspond to the first and third quartiles. The upper and lower whiskers extend to within 1.5 times interquartile range of the hinges. For UKB we report age at baseline imaging session.

2.1.1 NORMENT dataset

Healthy participants were pooled from three clinical studies at the NORMENT Centre: the Youth-Thematically Organized Psychosis (Y-TOP) study (n = 61), the Thematically Organized Psychosis (TOP) study (n = 275), and the StrokeMRI study (n = 315). Participants with complete data, that is, age, sex, body weight, height, and T1-weighted MRI data, were included. For Y-TOP and TOP, invitations to healthy participants were sent out to a random sample, stratified by age and region, from the Norwegian National Population Register in the greater Oslo region. For the StrokeMRI study, healthy participants were recruited via advertisement in local newspapers, social media, and word-of-mouth.

Exclusion criteria for Y-TOP and TOP included a history of neurological disorders or moderate to severe head injury, current or previous diagnosis of a psychiatric disorder, a family history of severe mental disorders, IQ below 70 points, and meeting the criteria for alcohol or substance dependency at the time of MR. For TOP, cannabis use within the last 3 months prior to assessment was an additional exclusion criterion. For StrokeMRI, exclusion criteria included serious head trauma, a history of stroke, dementia, or other severe neurological and psychiatric diseases, alcohol- and substance abuse, and medication use thought to affect the nervous system.

Adult participants gave written informed consent to participate. For adolescent participants below 16 years of age, written assent and parental consent was given. Participants were remunerated with a gift card worth 500 NOK. The studies were approved by the Regional Committee for Research Ethics (REK) and the Norwegian Data Protection Authority, and were carried out in accordance with the Helsinki Declaration.

T1-weighted images were acquired at the Oslo University Hospital, Ullevål, on a 3 T General Electric Discovery MR750 scanner, with a 32-channel head coil, between June 2015 and May 2019. An inversion recovery-prepared 3D gradient recalled echo (BRAVO) sequence was employed with the following parameters: repetition time = 8.16 ms, echo time = 3.18 ms, inversion time = 400 ms, field of view = 256 mm, flip angle = 12°, matrix = 188 × 256, voxel size = 1 mm isotropic, 188 sagittal slices.

2.1.2 UKB dataset

Participants with complete longitudinal MRI data and data on age, sex, weight, and height for both baseline and follow-up imaging sessions were selected from the UKB cohort (www.ukbiobank.ac.uk). Time from baseline to follow-up was 2.3 years on average with a standard deviation of 0.1 years. MRI data was acquired at two different sites and participants were scanned at the same site for baseline and follow-up.

We excluded participants with a height difference greater than 5 cm between baseline and follow-up, as we considered this to be indicative of measurement error. Participants were also excluded if they had been diagnosed with disorders known to influence brain structure based on diagnoses from the International Statistical Classification of Diseases and Related Health Problems (ICD-10; World Health Organization, 2004). Diagnostic exclusion criteria included disorders in chapter V and VI, field F; mental and behavioral disorders, including F00–F03 (Alzheimer's disease and dementia), F06.7 (mild cognitive disorder), and field G (diseases of the nervous system), including inflammatory and neurodegenerative diseases (except G55-59; “Nerve, nerve root, and plexus disorders”).

An overview of the UK Biobank acquisition protocols is available in Alfaro-Almagro et al. (2018) and Miller et al. (2016). For MRI, a magnetization prepared rapid acquisition gradient echo (MPRAGE) sequence was employed on a Siemens Skyra 3 T scanner with a standard Siemens 32-channel RF receive head coil. The following parameters were used: repetition time = 2000 ms, echo time = 2.01 ms, inversion time = 880 ms, field of view = 256 mm, flip angle = 8°, matrix = 208 × 256, voxel size = 1 mm isotropic.

2.2 MRI image processing

T1-weighted MRI images were processed to yield two registration-based ICV measures (eTIV, FSL) and three segmentation-based ICV measures (sbTIV, SPM12 and CAT12). For SPM12 and CAT12, we used MATLAB (The MathWorks, Inc., Massachusetts) version R2018b.

2.2.1 eTIV

We calculated eTIV using the standard processing pipeline, recon-all, in FreeSurfer (v5.3.0 for the UKB dataset and v6.0.0 for the NORMENT dataset; https://surfer.nmr.mgh.harvard.edu/). This pipeline performs intensity nonuniformity correction and normalization, skull stripping, and registration to the fsaverage template which is based on the MNI305 template (Evans et al., 1993). The linear scaling factor of this transformation is converted to an ICV estimate by multiplication with the ICV of fsaverage.

2.2.2 FSL

FSL (v6.0.1; https://fsl.fmrib.ox.ac.uk/fsl/) computes ICV with the SIENAX package by first extracting brain and skull images from a single T1-weighted MRI image which is then affinely registered to the MNI152 template (Grabner et al., 2006). Points along the skull are used when determining the registration scaling factor. The final ICV estimate is calculated by multiplying the scaling factor with the measured ICV of the MNI152 template. We ran the bet command with a fractional intensity threshold of 0.35 and enabled the bias field and neck cleanup flags. The threshold of 0.35 was chosen as it gave the best results in a previous study (Berg et al., 2018), where a systematic test of fractional intensity thresholds was performed. The use of a lower fractional intensity threshold than the default of 0.5 is in line with previous studies (Popescu et al., 2012; Vichianin et al., 2018).

2.2.3 SPM12

SPM12 (https://www.fil.ion.ucl.ac.uk/spm/) uses a unified segmentation algorithm to perform tissue classification, bias correction and image registration within the same generative model. Based on prior tissue probability maps, it segments the image into tissue classes weighted by the probability of the tissue membership of each tissue type. We used the “Tissue Volumes” utility in SPM12 with the default parameters to calculate ICV as the sum of WM, GM, and CSF.

2.2.4 CAT12

As with SPM12, CAT12 (v12.7; http://www.neuro.uni-jena.de/cat/) uses tissue probability maps to spatially normalize, skull-strip, and initialize the segmentation. In contrast to SPM12, CAT12 uses an adaptive maximum a posteriori segmentation approach for determining the final segmentation which accounts for local intensity variations in the original image (Tavares et al., 2020). The goal of this procedure is to provide a segmentation algorithm that is less sensitive to differences in image intensity. As with SPM12, ICV is calculated as the partial volume-adjusted sum of WM, GM and CSF. CAT12 processing was performed with the default parameters.

2.2.5 sbTIV

To compute sbTIV, we used SAMSEG (https://surfer.nmr.mgh.harvard.edu/fswiki/Samseg; Puonti et al., 2016), which creates probability-weighted segmentations of the input image, including skull, non-brain tissue and CSF. sbTIV (https://surfer.nmr.mgh.harvard.edu/fswiki/sbTIV) is computed as a sum of the WM, GM, and CSF volumes. SAMSEG was run with the default parameters using a single T1-weighted image as input.

2.3 Statistical analyses

In the combined NORMENT and UKB dataset, we computed Pearson correlations between each ICV estimate and performed Bland–Altman and relative importance analysis. To accommodate for the presence of any sample-specific associations, for example, due to generational or scanner differences, cross-sectional associations with age were assessed in the two datasets separately. Longitudinal analyses were limited to the UKB dataset. To ensure the soundness of the fitted regression models, we visually assessed residuals versus fitted values, scale-location, quantile-quantile, and residuals versus leverage plots. All statistical tests were performed using R Statistical Software (v3.6.3; R Core Team, 2020).

2.3.1 Outlier correction

To avoid the excess influence of outliers due to measurement errors, we assessed each ICV estimation method for outliers and excluded the corresponding participants in all subsequent analyses. To identify outliers, we used the median absolute deviation method (Leys et al., 2013) implemented in the R package Routliers (https://CRAN.R-project.org/package=Routliers). Briefly, this method finds the median of the absolute deviations from the median, which yields the median absolute deviation (MAD). Upper and lower thresholds are determined by taking the median of the original dataset ± b·MAD where b is the specified threshold (unitless). This approach has the advantage of being robust with respect to sample size and the presence of extreme values.

For the cross-sectional datasets, we used a deviation threshold of 3 on the ICV estimates to identify cross-sectional outliers. For the UKB dataset, we also excluded longitudinal outliers based on the pairwise differences between ICV at baseline and follow-up, where we used a less strict deviation threshold of 4. Since outlier removal can affect results, analyses without outlier correction were also performed.

2.3.2 Pearson correlation analyses

To assess pairwise linear relationships between ICV estimation methods in the combined dataset, we calculated Pearson correlation coefficients (r) with 95% confidence intervals (CI) for each pair of estimation methods using the function cor.test from the stats R package. To compare the correlations between registration- and segmentation-based methods, we calculated pooled correlations by first applying the Fisher transformation to the correlation coefficients before averaging and back-transforming using the inverse Fisher transformation. We calculated the pooled correlations for each registration-based method with respect to all the segmentation-based methods and compared these pooled correlations to the correlation between the registration-based methods.

2.3.3 Bland–Altman analysis

For each pair of ICV estimation methods in the combined dataset, we created Bland–Altman plots by plotting differences (ΔICV) between ICV method pairs against their means, which can be seen as a proxy for head size. In the Bland–Altman plots, we used 95% agreement intervals with upper and lower limits of agreement calculated as ΔICV ±1.96 standard deviation (SD) (Altman & Bland, 1983; Giavarina, 2015). A deviation of ΔICV from zero shows the presence, magnitude, and direction of the difference, or bias, between methods. The bias can be constant or proportional in relation to mean ICV. In the latter case, the difference between methods varies as a function of mean ICV. We quantified these associations by calculating Pearson correlation coefficients between ΔICV and mean ICV and testing their statistical significance.

2.3.4 Relative importance analysis

To assess the influence of age, sex, height, body weight, and mean ICV across the compared estimation methods on the differences between ICV estimates (i.e., ΔICV) in the combined dataset, we performed relative importance analyses with the relaimpo package in R (https://CRAN.R-project.org/package=relaimpo) using the Lindeman, Merenda, and Gold (LMG) metric. This method performs an averaging over the orderings of the explanatory variables to decompose the explained variance of the full model, ΔICV ~ Age + Sex + Weight + Height + Mean, into the non-negative contributions of each of the variables. The decomposition is constrained so that its sum equals the R2 of the full model. The advantage of this method over sequential or nested approaches is that the internal correlational structure of the regressors is taken into account, and the resulting variance decomposition is unbiased (Lindeman et al., 1980).

If neither measure is confounded by age, sex, height, body weight, or mean ICV across methods, we would expect them to explain a negligible proportion of the nonshared variation between them, that is, the variation of the difference. Thus, we interpreted the degree to which ICV estimate differences were explained by these variables as an indication of possible bias. Note that wherever these variables explain a large proportion of the variance in the difference between two methods, it is not possible to say which of the methods are confounded.

2.3.5 Cross-sectional associations between age and ICV

To test for cross-sectional associations between age and ICV, we first fitted linear regression models for each method with ICV as the outcome variable and age and sex as independent variables. These analyses were conducted separately for the NORMENT and UKB datasets. In additional models, we also included age2 to account for possible quadratic associations with age. To visualize the relationship between age and ICV, we created partial regression plots (Velleman & Welsch, 1981) where both age and ICV were residualized with respect to sex and plotted against each other. Cross-sectional Annual Percentage Change (CS-APC) was calculated by expressing the estimated coefficient of the age term in the fitted model as a percentage relative to the mean ICV for each dataset.

In addition to the linear models, we tested for nonlinear relationships between ICV and age using a generalized additive model (GAM). This method models possible nonlinear relationships using smooth functions (Hastie & Tibshirani, 1986) which can account for higher order nonlinearities. We used cubic regression splines to model age while adjusting for sex. The restricted maximum likelihood method was used as in Sørensen et al. (2021) with the R package mgcv (https://CRAN.R-project.org/package=mgcv).

To test the influence of height and body weight on cross-sectional age associations, we fitted additional models including age, sex, height, and body weight as independent variables, as well as separate models with age-by-height and age-by-body weight interactions. To test the influence of sex on age associations, we included models with age, sex, and an age-by-sex interaction term.

To compare the relative qualities of the linear, quadratic and regression spline models, we used the Akaike Information Criterion (AIC; Akaike, 1974). We calculated and compared AIC scores within each ICV estimation method. A low score indicates less information loss in the model, which is a trade-off between goodness of fit and the simplicity of the model. A difference in AIC scores greater than 2 is considered to indicate a significantly better relative quality for the model with the lower score.

Since the age range differed between the NORMENT and the UKB datasets, we performed additional age-matched analyses where we only included a subset of 289 participants in the NORMENT dataset with an age above 47 years.

In post-hoc analyses, we ran additional linear regression models where we examined if the age effect differed in the two datasets. Here, ICV was used as the outcome variable and age, sex, and the age-by-cohort interaction as independent variables. The analyses were run both with the complete NORMENT dataset, as well as with the age-matched subset of the NORMENT dataset as described above.

2.3.6 Longitudinal associations between age and ICV

To test for longitudinal associations between age and ICV, we fitted linear mixed-effects (LME) models for each ICV estimation method. We entered baseline age, baseline age2, scanner, and sex as fixed effects. The primary variable of interest was time, which was defined as 0 for measurements at baseline and as time in years since baseline for follow-up measurements. We used individual intercepts as a random effect, in order to allow for between-participant variation in ICV estimates.

To fit the models we used the lmer function from the lmerTest package (https://CRAN.R-project.org/package=lmerTest) using the restricted maximum likelihood criterion (REML). The P-values were estimated using Satterthwaite's approximation. This approach to evaluating significance in LME models compared favorably to other approximations in Luke (2017). Annual Percentage Change (APC) was estimated by taking the estimated coefficient of the time term in the fitted model and expressing it as a percentage relative to mean ICV at baseline.

We fitted additional models to examine the influence of sex, height, and body weight on longitudinal age associations, including a sex-by-time, height-by-time, or weight-by-time interaction term in addition to the terms above. To test the influence of baseline age on longitudinal age associations, we included models with an age-by-time interaction term.

3 RESULTS

3.1 Outlier analysis

We identified and excluded 7 outliers in the NORMENT dataset (1.1% of sample) and 12 outliers in the cross-sectional UKB dataset (0.5% of sample). We also identified and excluded 63 longitudinal outliers for the UKB dataset (2.6% of sample). Two participants were marked both as cross-sectional and longitudinal outliers in the UKB dataset.

We found that eTIV was the largest single contributor of both cross-sectional outliers in the NORMENT dataset and longitudinal outliers in the UKB dataset. See Table S1 for demographic information on participants identified as outliers and Figure S1 for an UpSet plot depicting the ICV estimation methods for which outliers were identified.

The results of analyses without outlier correction were similar to those of the outlier-corrected analyses, with a slight decrease in the correlations between ICV estimation methods, especially between eTIV and SPM12. Specifically, the estimated cross-sectional and longitudinal associations with age were similar for with and without outlier correction.

3.2 Pearson correlation analyses

Pearson correlations between ICV estimation methods were overall high. The lowest correlation coefficient was observed between eTIV and FSL (r = .873, CI = [0.865, 0.882]), followed by FSL and CAT12 (r = .895, CI = [0.887, 0.902]). The highest correlation coefficients were observed between sbTIV and SPM12 (r = .969, CI = [0.967, 0.971]) and sbTIV and CAT12 (r = .961, CI = [0.958, 0.963]). See Figure 2 for a correlogram showing Pearson correlations with 95% confidence intervals.

Details are in the caption following the image
Correlations between each pair of ICV estimation methods in the combined dataset (n = 2981). The main diagonal shows the distribution of each ICV estimate, the lower diagonal shows scatter plots for each pair of ICV estimates, and the top diagonal shows Pearson correlations with 95% confidence intervals in brackets.

The Pearson correlation between the registration-based methods (eTIV and FSL) was lower (r = .873) than the pooled correlation between eTIV and the segmentation-based methods (pooled r = .936) and the pooled correlation between FSL and the segmentation-based methods (pooled r = .925). The pooled correlation between the segmentation-based methods (SPM12, CAT12, and sbTIV) was .959.

3.3 Bland–Altman analysis

See Figure 3 for Bland–Altman plots. We found that, on average, FSL systematically estimated lower ICV (63–143 ml negative bias) and sbTIV higher ICV (56–143 ml positive bias) compared with the other methods. We also observed statistically significant proportional bias, that is, correlations between ΔICV and mean ICV for most pairwise comparisons, as indicated by the regression lines in Figure 3. The presence of proportional bias indicates that the agreement between methods differs as a function of the mean ICV across estimation methods. The strongest proportional bias was seen in comparisons of sbTIV with the other ICV estimation methods, where the magnitude of correlations ranged from 0.07 to 0.30 (P < 10−4). Weaker proportional bias was seen for SPM12 compared with the other methods (excluding sbTIV), here the correlations between ΔICV and mean ICV ranged from 0.10 to 0.05 (P < 10−4). We also found weak proportional bias for CAT12 compared with eTIV (r = −.05, P < .01). Inspection of diagnostic plots revealed no unexplained nonlinearity in the residuals, indications of heteroscedasticity, or influential cases exceeding a Cook's distance of 1. See Figures S2 and S3 for diagnostic plots.

Details are in the caption following the image
Bland–Altman plots for each pair of ICV estimates in the combined dataset (n = 2981). The means of each pair of estimates (x-axis) are plotted against the percentage differences of the estimates, ΔICV (y-axis). The Pearson correlation coefficient between mean ICV and ΔICV is shown on the top of each plot along with its P-value.

3.4 Relative importance analysis

When including age, sex, height, body weight, and mean ICV across the ICV estimation methods in the comparison as explanatory variables, the explained variance of the total model ranged from 4.35% for the FSL–sbTIV difference to 30.05% for the SPM12–sbTIV difference. For the SPM12–sbTIV difference, a large proportion of explained variance was due to sex (9.81%) and body weight (10.05%). For the SPM12–CAT12 difference, a large proportion of explained variance was due to body weight (12.37%) and age (9.11%). Notably, mean ICV across methods explained 10.48% of the variance of the eTIV–sbTIV difference and 7.38% of the eTIV–SPM12 difference. Across pairwise differences, body weight, age, sex, and mean ICV across methods were the best explanatory variables for the differences in ICV estimates. Height explained a small proportion of the variance in most ICV differences, except for the SPM12-sbTIV difference (4.42%). See Figure 4 for bar plots showing the variance decomposition for each explanatory variable.

Details are in the caption following the image
Relative importance of sex, height, body weight, and age, and mean ICV across methods on the differences between each ICV measure in the combined dataset (n = 2981).

3.5 Cross-sectional associations between age and ICV

Linear models provided the best fit for both datasets and all ICV estimates, except for sbTIV in the UKB dataset where the age2 term was significant (P < .05) and the difference in AIC of the two models exceeded −2, that is, markedly lower for the quadratic model. Diagnostic plots revealed no violations of the standard assumptions of linear regression. See Figures S4 and S5 for diagnostic plots for each of the main linear regression models and Table S2 for the AIC scores for each model and ICV estimation method for both the NORMENT and the UKB datasets.

For the NORMENT dataset, we found significant positive cross-sectional associations with age (ranging from 12.0 to 85.8 years) for FSL (b = 1.22 ml/year, P < 10−6) and SPM12 (b = 1.01 ml/year, P < 10−5). These changes corresponded to a CS-APC of 0.086% for FSL and 0.069% for SPM12. Age was not significantly associated with any other ICV measure in the NORMENT dataset. For the UKB dataset, we found significant negative cross-sectional associations with age (ranging from 47.0 to 80.3 years) for all ICV estimation methods. The greatest CS-APC were seen for CAT12 (−0.107% CS-APC) and eTIV (−0.107% CS-APC). The smallest effect size was seen for FSL with an estimated −0.049% CS-APC. Where the CS-APC estimates are percentages relative to the mean ICV of each dataset.

In the full model, including body weight and height as independent variables in addition to sex and age, we found significant contributions of height for all methods. In the NORMENT dataset, body weight was also significantly associated with ICV for FSL (b = 1.27 ml/kg, P < .001), SPM12 (b = 1.38 ml/kg, P < 10−4), and sbTIV (b = 0.87 ml/kg, P < .05). In the UKB dataset, body weight was associated with ICV for FSL (b = 0.51 ml/kg, P < .01) and SPM12 (b = 1.04 ml/kg, P < 10−9), but not for eTIV, CAT12, or sbTIV. In the NORMENT dataset, the significant associations between age and ICV remained after additionally adjusting for body weight and height for FSL (b = 1.06 ml/years, P < 10−5) and SPM12 (b = 0.84 ml/years, P < 10−3). However, in the UKB dataset the cross-sectional associations with age were no longer significant for FSL, SPM12, and sbTIV when adjusting for body weight and height.

In the NORMENT dataset, we observed positive interactions between age and sex for SPM12 (b = 0.98 ml/years for males, P ≤ .05) and CAT12 (b = 1.07 ml/years for males, P < .05), indicating a more positive slope in males. In the UKB dataset, age-by-sex interactions were found only for FSL (b = −1.52 ml/years for males, P < .05), and in contrast to the findings in the NORMENT dataset, this showed a negative interaction between age and sex, indicating a more negative slope in males. We found no significant age-by-height interactions for any ICV estimation method. A significant age-by-weight association was observed for eTIV in the UKB dataset (b = 0.05 ml/years-by-kg, P < .05), but not for any of the other ICV estimation methods or in the NORMENT dataset.

In the analyses including a subset of 289 participants from the NORMENT dataset that were age-matched with the UKB dataset, we observed no significant effects of age for any of the ICV estimation methods in the linear model covarying for sex only. When we also adjusted for height and weight, we saw a significant effect of age only for SPM12 (b = 1.26 ml/years, P < .05).

For the direct comparisons of the age effects between the two datasets, we found significant age-by-cohort interactions for each ICV estimation method when using the complete NORMENT dataset. These interactions indicated a more positive effect of age in the NORMENT dataset compared with the UKB dataset. We also saw a main effect of cohort, indicating higher ICV estimates in the NORMENT dataset compared with the UKB dataset for all ICV estimation methods except sbTIV. When we used the age-matched NORMENT dataset, we found similar significant age-by-cohort interactions for SPM12 (b = 1.71 ml/years for NORMENT, P < .05), CAT12 (b = 2.21 ml/years for NORMENT, P < .01), and sbTIV (b = 1.63 ml/years for NORMENT, P < .05).

See Figures 5 and 6 for partial regression plots for each dataset where age and ICV are residualized with respect to sex and plotted against each other. See Tables S3 and S4 for further details concerning the main linear regression models used to test cross-sectional associations between ICV and age, and Tables S5 and S6 for the full models including body weight and height. See Tables S7 and S8 for the results of the age-matched analyses and Tables S9 and S10 for the results of the direct comparisons of age effects across datasets.

Details are in the caption following the image
Partial regression plots in the NORMENT dataset (n = 644) for the associations between age and ICV for each estimation method. The effect of sex has been regressed out for both age and ICV and the regression lines show the residual effect of age on ICV. As age has been residualized with respect to sex, the x-axis is centered at the sex-adjusted mean.
Details are in the caption following the image
Partial regression plots in the cross-sectional UKB dataset (n = 2337) for the associations between age and ICV for each estimation method. The effect of sex has been regressed out from both age and ICV and the regression lines show the residual effect of age on ICV. As age has been residualized with respect to sex, the x-axis is centered at the sex-adjusted mean.

3.6 Longitudinal associations between age and ICV

In the main longitudinal analyses, with fixed factors sex, baseline age, baseline age2, scanner, and time (years since baseline), in the UKB dataset (age range = 47.0–80.3 years), we found significant associations with time since baseline for all ICV estimates. Effect sizes ranged from an APC of −0.293% for sbTIV to −0.416% for CAT12. The longitudinal effects of time on ICV remained for all ICV estimation methods when we included height and weight as fixed factors. Where the APC estimates are percentages relative to the mean ICV at baseline.

We found significant interactions with time for sex and height for eTIV, but not for any of the other ICV estimation methods. For sex, the interaction indicated lower longitudinal reduction for male participants (b = 2.53 ml/years for males, P < 10−4). For height, the interaction showed a lower longitudinal reduction for taller participants (b = 0.11 ml/years-by-cm, P < .001). There were no statistically significant interactions between longitudinal ICV change and baseline age.

See Table S11 for details on the main LME model used to investigate longitudinal associations with age. See Figure 7 for spaghetti plots depicting longitudinal ICV change in the UKB dataset stratified by sex. See Figure S6 for histograms depicting the raw annual percentage change between baseline and follow-up for each ICV estimation method. Diagnostic plots are included as Figure S7.

Details are in the caption following the image
Normalized spaghetti plots depicting longitudinal change for each ICV estimate stratified by sex. The x-axis depicts the interscan interval normalized by time of first imaging session, the y-axis represents ICV change from baseline to follow-up as a percentage of the total ICV across both time points, and the color indicates positive ICV change in red and negative ICV change in blue. The black line depicts the linear trend of longitudinal ICV change.

4 DISCUSSION

We compared five commonly used ICV estimation methods, eTIV, FSL, SPM12, CAT12, and sbTIV, and tested their cross-sectional and longitudinal associations with age. Correlations were overall high, but we found notable differences between the estimation methods. In particular, age, body weight, sex, and mean ICV across methods explained a large proportion for the nonshared variation between several ICV estimation methods. Different cross-sectional effects of age were observed in the two datasets. In the UKB dataset (age range = 47.0–80.3 years), all ICV estimation methods indicated negative cross-sectional associations with age, whereas for the NORMENT dataset (age range = 12.0–85.8 years), we observed significant effects only for two of the ICV estimation methods indicating positive cross-sectional associations with age. Finally, we observed a striking convergence of results across ICV estimation methods in the longitudinal analyses in the UKB dataset, with an average of 2.3 years from baseline to follow-up, supporting past reports of longitudinal ICV reduction in mid- to late adulthood.

4.1 Outlier analysis

In the outlier analyses, we found that eTIV was the largest single contributor (i.e., outliers detected for eTIV, but no other ICV estimates) both for cross-sectional outliers in the NORMENT dataset (86% of outliers) and for longitudinal outliers in the UKB dataset (25% of outliers). In comparison, no participants were marked as outliers for sbTIV alone. This could indicate that eTIV is especially prone to measurement errors and requires a more comprehensive quality control procedure than the other methods, including assessment of the Talairach registration that is performed in the recon-all processing stream in FreeSurfer.

4.2 Correlations between ICV estimation methods

Contrary to our hypothesis, we found higher correlations for each of the registration-based methods, eTIV and FSL, with the segmentation-based methods (eTIV pooled r = .936; FSL pooled r = .925) than between eTIV and FSL (r = .873). The pooled correlation between the segmentation-based methods (SPM12, CAT12, and sbTIV) was .959. This points to greater internal consistency within the segmentation-based compared with the registration-based methods. While both eTIV and FSL estimate ICV via an atlas scaling factor, the target template for eTIV is the MNI305-derived fsaverage template (Evans et al., 1993; Fischl et al., 1999), whereas FSL uses the MNI152 template as its target (Mazziotta et al., 1995). Furthermore, FSL uses skull points to constrain the transformation to the template (Smith, 2002; Smith et al., 2001), which may give better registration results compared with eTIV (Buckner et al., 2004). Together, these differences could explain the relatively low correlation between eTIV and FSL, and the differences in their pairwise correlations with the segmentation-based estimation methods.

4.3 Quantitative differences between estimated ICV

Consistent with the previous literature, we found systematic quantitative differences between ICV estimates. In particular, FSL estimated lower ICV, while sbTIV estimated higher ICV than the other methods. Since registration-based methods calculate ICV as the product of an atlas scaling factor and the predetermined ICV of the template, under- or over-estimation of ICV can occur due to the estimated ICV of the template. This could explain the tendency of FSL to underestimate ICV, and a similar overestimation for eTIV relative to manually determined ICV has been reported previously (Klasson et al., 2018). As such, one should be cautious to interpret registration-based methods as absolute measures of ICV.

It is unclear why the segmentation-based method sbTIV appeared to overestimate ICV, but it is worth noting that the anatomical extent of the cranial cavity inferiorly along the spinal cord is not clearly defined, and systematic differences in the cut-off in the different ICV estimation methods may contribute to systematic volumetric differences. Therefore, while segmentation-based methods estimate ICV using a more direct approach, that is, probability-weighted voxel counts, these methods can also be subject to systematic biases and it remains important to assess their agreement with manually determined ICV if the goal is quantitative ICV estimation.

4.4 Associations between mean ICV and ICV differences

Bland–Altman analyses revealed widespread associations between mean ICV across methods and the differences between them. In other words, the agreement between ICV estimates differed as a function of the mean ICV across methods. This finding was supported by the relative importance analyses where mean ICV across methods explained a substantial proportion of the variance in the eTIV-sbTIV, eTIV-SPM12, SPM12-CAT12, and CAT12-sbTIV differences. Klasson et al. (2018) reported findings consistent with a bias due to head size on eTIV. If mean ICV is considered a proxy for head size, our results lend support to these findings and indicate that similar bias is also present for other ICV estimation methods. It should be noted that while our methods revealed systematic associations in the differences between ICV estimation methods and mean ICV, they cannot say which of the ICV estimation methods have a higher accuracy. We encourage future methodological studies on the validity of automated ICV estimation methods to also include manually determined ICV.

4.5 Relative importance of explanatory variables on ICV differences

In the relative importance analyses, we explored the explained variance of age, sex, body weight, height, and mean ICV across methods on the pairwise differences between ICV estimates. If ICV estimation is unbiased by these variables, we would expect to see low explained variance of the estimated ICV differences. Contrary to this expectation, we found a high proportion of explained variance for the total model in the comparison of SPM12 with sbTIV (30.05%), CAT12 (28.38%), and eTIV (29.61%), and in the comparison of sbTIV with eTIV (18.73%) and CAT12 (15.36%).

Body weight explained 10.05% of the variance in the SPM12–sbTIV difference and 12.37% in SPM12–CAT12 difference. This suggests a systematic impact of body weight on ICV differences between these estimation methods. Such bias can be grounds for concern in studies where the variable of interest is associated with weight differences. For example, weight gain is a known side effect of antipsychotic medication (Dayabandara et al., 2017).

We found that sex explained 9.81% of the SPM12–sbTIV difference. There are contradictory results in the literature on the presence and strength of sexual dimorphism in brain volumes and age-related associations with brain volumes (Fjell et al., 2009; Inano et al., 2013; Ritchie et al., 2018). It remains an open question whether sex-dependent volumetric differences remain after correction for ICV. Past studies have also shown that the statistical correction method can affect sex differences in brain volumes (Pintzka et al., 2015; Sanchis-Segura et al., 2020). The choice of ICV estimation method may explain some of the discrepancy in the past findings. We encourage future studies on sex dimorphism of brain volumes to carefully assess the accuracy of the chosen ICV estimation method, given that confounding by sex can affect results.

4.6 Cross-sectional associations with age

We found negative cross-sectional associations between ICV and age across all ICV estimation methods in the UKB dataset. Notably, associations with age did not reproduce across all ICV estimation methods in the NORMENT dataset. For this dataset, we only observed statistically significant associations between ICV and age for two (FSL and SPM12) out of five ICV estimation methods. This supports our hypothesis that the choice of ICV estimation method can, at least in some cases, cause discrepancies in reported associations between ICV and age. In post hoc analyses, we found statistically significant differences in the age effects in the two datasets for all ICV estimation methods indicating more negative age trajectories in the UKB dataset.

Unlike the study by Ma et al. (2018), we found statistically significant negative associations with age for all ICV estimation methods in the UKB dataset, which had a similar age range. While this study included a large number of MRI scans (7656 scans), the number of participants was less than that of the UKB dataset (1727 participants). The data was also acquired on both 1.5 T and 3 T scanners which could reduce the sensitivity of the analyses. Furthermore, their statistical models may not have been optimal for distinguishing between within-subject longitudinal associations and between-subject effects. It should also be noted that differences in the secular growth rates in the ADNI dataset compared with the UKB dataset may have contributed to the discrepancy with our results.

Since the mean age and the age range differed between the UKB and the NORMENT datasets, we performed additional analyses on a subset of 289 participants from the NORMENT dataset that were age-matched to the UKB dataset. In these analyses, none of the main models showed significant associations between age and ICV. However, in the full models, corrected for age, sex, height, and body weight, we saw a significant positive association between age and ICV for SPM12—in line with our findings in the complete dataset. Given the loss of statistical power in these analyses it is difficult to draw firm conclusions. However, the replication of the positive association with age for SPM12 in the full model, and the positive signs of the coefficients for the age term in the main models for FSL and SPM12, suggest that the age associations we observed were not entirely explained by differences in the age ranges. Furthermore, we observed significant age-by-cohort differences in the direct comparisons between the UKB dataset and the age-matched NORMENT dataset for SPM12, CAT12, and sbTIV, but not for eTIV and FSL.

We did not see evidence for nonlinear effects of age on ICV in the NORMENT dataset, and the quadratic age term was statistically significant only for sbTIV in the UKB dataset. Past studies on ICV and head size, including computed tomography and head circumference studies (Bergerat et al., 2021; Huda et al., 2004; Neubauer et al., 2009; Weaver & Christian, 1980), lend some support to our findings of linear age trajectories across most ICV estimation methods.

In the NORMENT dataset, we observed sex-by-age interactions for SPM12 and CAT12, indicating a greater cross-sectional ICV increase in males for SPM12 and CAT12. In the UKB dataset, on the other hand, we found a negative sex-by-age interaction for FSL, indicating that male sex is associated with greater cross-sectional ICV decrease for ICV estimated with FSL. Since the sex-by-age interactions did not replicate with the other ICV estimation methods, they may point toward sex-dependent bias to ICV estimation with these methods. These findings show that sex can, for some ICV estimation methods, affect the cross-sectional associations between age and ICV. This can be a particular concern in clinical group comparisons where sex distributions may vary between groups.

4.7 Longitudinal associations with age

In the longitudinal analyses, we observed statistically significant ICV reduction at follow-up compared with baseline across all estimation methods. The consistency of the longitudinal findings stands in contrast to the cross-sectional analyses in the NORMENT dataset where statistically significant associations with age were found for two ICV estimation methods, but not for the other three methods. This may be due to the superior ability of longitudinal data to resolve age trajectories (Sørensen et al., 2021). In particular, the estimation of longitudinal change has the advantage of being based on within-subject variation where potential biases from variables such as sex, height, and birth year may have a lesser influence on the results.

Our results indicated an APC ranging from −0.29% to −0.42%, which was considerably higher than the APC estimated by Caspi et al. (2020) where they reported an APC of −0.09% at age 55. However, their age range was from 16 to 55 years of age, and thus only partially overlapped with ours. Notably, they also included people with a diagnosis of schizophrenia and relatives of these participants in the analysis. It is known that ICV has a significant genetic component (Adams et al., 2016), and is affected by genes that confer risk for developing a psychotic disorder (Smeland et al., 2018). Furthermore, cross-sectional studies have found associations between a diagnosis of schizophrenia and ICV (Gurholt et al., 2020; van Erp et al., 2016). Thus, the sample used in Caspi et al. (2020) may be confounded by genetic risk of psychotic disorder, both directly in patients, and indirectly in relatives.

We observed significant sex-by-time interactions for eTIV, indicating less ICV decrease between baseline and follow-up in males, but no such interactions for the other ICV estimation methods. While this relationship was not replicated with the other ICV estimation methods, it is in line with a semi-longitudinal study of elderly participants (age range = 71.1–74.3 years) by Royle et al. (2013) where two sets of manual ICV segmentations were compared in a dataset of elderly participants. The first segmentation measured current ICV, whereas the other was derived from expert manual segmentation where inner table skull thickening (i.e., thickening of the inner bony structure of the skull) was taken into account. The segmentations were used to estimate the longitudinal effect of inner skull thickening on ICV change, yielding an estimated median ICV decrease of 6.2% in males and 8.3% in females across the lifespan, in line with the literature on physiological skull changes with age (Harding, 1949). However, given that eTIV was the only ICV estimation method that showed this sex-by-time interaction it may be due to biases inherent to eTIV. It is also unclear whether skull thickening occurs at a constant rate across the lifespan and consequently how much of this effect would be detectable in the age range of the UKB dataset. Furthermore, for eTIV alone we also found a significant height-by-time interaction indicating less longitudinal change in taller participants. It is possible that the sex-by-time interaction is driven by height rather than other physiological differences between males and females in this dataset.

We did not find statistically significant baseline age-by-time interactions for any ICV estimation methods, suggesting that the rate of longitudinal ICV change does not differ with age in mid- to late-adulthood. This finding should, however, be interpreted with caution, given the narrow age range of this sample. In line with the past literature on age-related ICV change, particularly in adolescence, it would be expected that the rate of change differs if longitudinal change is estimated for a greater age range.

4.8 Strengths and limitations

Strengths of our study include the use of large, well-characterized datasets composed of participants with no known neurological or psychiatric disorders thought to affect head size. In particular, the UKB dataset is the largest dataset used for assessing longitudinal ICV change to date. In the cross-sectional analyses, we had a broad age range suitable to assess the presence of nonlinear associations between age and ICV. The data was processed in a harmonized framework, facilitating the direct comparison between ICV estimation methods. For the NORMENT dataset, MRI acquisition was conducted on the same scanner system with no major upgrades.

Some limitations also apply. We processed a total of 5471 T1-weighted images (651 NORMENT scans + 4820 UKB baseline and follow-up scans) with five different ICV estimation methods. The resulting dataset of 27,355 ICV estimates precluded detailed quality control of each output. It also prevented us from performing manual segmentation of images, given its time-consuming nature (e.g., ~30 min per subject in Ambarki et al., 2012). Without manually determined ICV estimates, we cannot draw firm conclusions about which estimation methods are more accurate with respect to manual delineation. As with all longitudinal studies, systematic time-dependent bias (e.g., due to scanner drift) can affect the longitudinal results. Finally, our longitudinal dataset only covered a narrow age range, and we did not investigate longitudinal ICV change in adolescence and early adulthood.

5 CONCLUSIONS

Correlations between ICV estimation methods were lower between the registration-based methods than between segmentation-based methods, suggesting greater internal consistency for the segmentation-based methods. We found that sex, body weight, age, and mean ICV across methods explained a large proportion of the nonshared variation for some pairs of ICV estimation methods. This may represent bias that can be grounds for concern in clinical studies. We observed significant proportional bias for most pairwise comparisons, that is, varying agreement as a function of mean ICV across methods.

In the NORMENT dataset, spanning adolescence to old age, two ICV estimation methods were positively associated with age. In the UKB dataset, spanning mid- to late adulthood, all ICV estimates were negatively associated with age. The discovery of age-related effects only for two ICV estimation methods in the NORMENT dataset illustrate how the choice of ICV estimation method can affect findings of age-related associations with ICV. Relationships between ICV and age were significantly different between the two datasets for all estimation methods in the complete dataset, and for three methods in age-matched analyses. This may be due to different secular growth rates in the two cohorts. The convergence of longitudinal results across ICV estimation methods, in the largest dataset to date, offers strong evidence for longitudinal age-related ICV reductions in mid- to late adulthood.

In conclusion, the choice of ICV estimation method is a possible source of bias, not only in studies investigating ICV as a variable of interest, but also in studies that use ICV as an adjustment factor. We encourage future studies to investigate the validity of automated ICV estimation methods and implement quality control procedures to assess the accuracy of the ICV estimation method as with other morphometric brain measures.

AUTHOR CONTRIBUTIONS

Stener Nerland: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization, Supervision, Project administration. Therese S. Stokkan: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization. Kjetil N. Jørgensen: Methodology, Writing - original draft, Writing - review & editing. Laura A. Wortinger: Writing - original draft, Writing - review & editing. Geneviève Richard: Data curation, Writing - original draft, Writing - review & editing. Dani Beck: Writing - original draft, Writing - review & editing. Dennis van der Meer: Data curation, Writing - original draft, Writing - review & editing. Lars T. Westlye: Resources, Writing - original draft, Writing - review & editing, Funding acquisition. Ole A. Andreassen: Resources, Writing - original draft, Writing - review & editing, Funding acquisition. Ingrid Agartz: Resources, Writing - original draft, Writing - review & editing, Supervision, Funding acquisition, Project administration. Claudia Barth: Investigation, Writing - original draft, Writing - review & editing, Supervision, Project administration.

ACKNOWLEDGMENTS

This work was conducted within the Norwegian Centre for Mental Disorders Research, and was funded by the Research Council of Norway (grant numbers 213700, 223273, and 250358), the South-Eastern Norway Regional Health Authority (grant numbers 2017-097, 2019-104, and 2020-020), and the K.G. Jebsen Foundation (grant number SKGJ-MED-008). Access to the UK Biobank dataset was facilitated as part of application 27412. Image processing was performed on the TSD (Service for Sensitive Data) computing cluster and at the Imaging Psychosis morphometry lab at Diakonhjemmet Hospital. We are grateful to Sahar Hassani and Øystein Sørensen for their helpful input on the longitudinal analyses.

    CONFLICT OF INTEREST

    OAA has received a speaker honorarium from Lundbeck, and is a consultant for HealthLytix. The other authors report no financial relationships with commercial interest.

    DATA AVAILABILITY STATEMENT

    The data incorporated from the UK Biobank study is publicly available pending a standard registration procedure (https://www.ukbiobank.ac.uk/). Derived data for this dataset, i.e., estimated ICV for each estimation method employed in this study, will be made publicly available through the UK Biobank Returns Catalogue https://biobank.ndph.ox.ac.uk/showcase/docs.cgi?id=1). Given a formal data sharing agreement and approval from the local ethics committees, access to the NORMENT dataset can be provided.

      The full text of this article hosted at iucr.org is unavailable due to technical difficulties.