Benchmarking the generalizability of brain age models: Challenges posed by scanner variance and prediction bias

Abstract Machine learning has been increasingly applied to neuroimaging data to predict age, deriving a personalized biomarker with potential clinical applications. The scientific and clinical value of these models depends on their applicability to independently acquired scans from diverse sources. Accordingly, we evaluated the generalizability of two brain age models that were trained across the lifespan by applying them to three distinct early‐life samples with participants aged 8–22 years. These models were chosen based on the size and diversity of their training data, but they also differed greatly in their processing methods and predictive algorithms. Specifically, one brain age model was built by applying gradient tree boosting (GTB) to extracted features of cortical thickness, surface area, and brain volume. The other model applied a 2D convolutional neural network (DBN) to minimally preprocessed slices of T1‐weighted scans. Additional model variants were created to understand how generalizability changed when each model was trained with data that became more similar to the test samples in terms of age and acquisition protocols. Our results illustrated numerous trade‐offs. The GTB predictions were relatively more accurate overall and yielded more reliable predictions when applied to lower quality scans. In contrast, the DBN displayed the most utility in detecting associations between brain age gaps and cognitive functioning. Broadly speaking, the largest limitations affecting generalizability were acquisition protocol differences and biased brain age estimates. If such confounds could eventually be removed without post‐hoc corrections, brain age predictions may have greater utility as personalized biomarkers of healthy aging.


| INTRODUCTION
Establishing growth charts of normative brain development has been a longstanding objective for the neuroscience community. A standardized biomarker of brain development would have immense utility for identifying risks of adverse psychological outcomes, which may lead to more precise and personalized interventions. One approach toward establishing growth charts has been to leverage machine learning models that estimate the biological age of a person based on their brain structure or connectivity (i.e., brain age). Prior research using magnetic resonance imaging (MRI) has shown that brain age predictions can be fairly accurate when applying deep learning techniques (Kuo et al., 2021; Leonardsen et al., 2022), or more traditional machine learning methods (Cole & Franke, 2017), to multiple types of imaging modalities (Schulz et al., 2020) and derived brain features (Chen et al., 2020). Many studies also underscore how deviations between brain and chronological ages (i.e., brain age gaps) are useful for differentiating individuals as a function of cognitive impairment, psychopathology, and neurodegenerative disorders (Franke & Gaser, 2012; Gaser et al., 2013; Liem et al., 2017). Furthermore, longitudinal studies have demonstrated that brain age models can be reliable across time even when applied to clinical samples (Høgestøl et al., 2019; Richard et al., 2020). A few models from these studies were made publicly available with the intention that they would be applicable in novel research and clinical settings, yet systematic tests of generalizability are sparse.
The current study examined the out-of-sample predictions from two brain age models, which were chosen based on the size and diversity of their training data. One brain age model, termed the Deep Brain Network (DBN), used a 2D convolutional neural network that makes predictions from axial slices of T1-weighted images. The second model used gradient tree boosting (GTB) to compute sex-specific predictions based on extracted features of brain volume, surface area, and cortical thickness from T1-weighted scans. Both models were developed and evaluated using large training and test samples (DBN: 17,410; GTB: 45,615), which were aggregated across many sites and scanner types and comprised individuals from 3 to 95 years of age. The published results from both studies demonstrated that brain age predictions were accurate when applied to many healthy developing subsamples and useful for differentiating between groups that exhibited different forms of pathology. The predictive power of both models was established using cross-validation, demonstrating that the models were generalizable across sites, sexes, and developmental stages (Kaufmann et al., 2019).
Taken together, these two brain age models appear to have the most potential for becoming useful growth charts of brain development.
Despite the strengths of these brain age models, several challenges may hinder their utility. First, sampling biases can occur when characteristics of the training sample (e.g., age, sex, health status, etc.) are not evenly distributed across age bins (de Lange et al., 2022). Second, brain age predictions tend to regress towards the mean of their training sample, such that predictions for younger participants are more likely to be overestimated, while those for older participants are underestimated. Lastly, MRI scan properties may vary significantly due to differences in scanner types and acquisition parameters (Han et al., 2006; Jovicich et al., 2009), which can lead to biases when machine learning models are not trained and tested on balanced samples collected with similar protocols (Jonsson et al., 2019; Liem et al., 2017). The best way to account for these limitations is under debate (Butler et al., 2021); as such, we evaluated the potential strengths and challenges of each model under a variety of conditions. The generalizability of the DBN and GTB was assessed by applying them to three diverse early-developing cohorts (Luby et al., 2010; Somerville et al., 2018) that included both cross-sectional and longitudinal data from different scanner types and acquisition protocols.
Each model was applied to scans with varying levels of image resolution, gray/white matter contrast, and signal-to-noise ratios (Magnotta et al., 2006;Sadri et al., 2020). Furthermore, a multi-faceted approach was taken to understand generalizability in terms of accuracy, reliability, and utility to detect individual differences in cognition. The differing sampling characteristics and acquisition parameters of our three cohorts may provide a stronger test of generalizability, despite their sample sizes being relatively small and not encompassing the full lifespan. These additional tests are essential for assessing a model's capacity to provide meaningful predictions in clinical settings, where sampling and imaging properties might vary considerably.
Based on prior machine learning (Poldrack et al., 2020; Varoquaux et al., 2017) and brain age studies (de Lange et al., 2022; Liem et al., 2017), we hypothesized that the out-of-sample predictions from both models would be less accurate than their cross-validation results (Kaufmann et al., 2019). Specifically, we predicted that the models would encounter challenges due to sampling bias, scanner variance, and prediction bias, which may result in systematic over- or underestimations in brain age predictions. As such, we created additional variants of the DBN and GTB to understand how model performance changed when their training samples became more similar to our testing samples. Accordingly, a second set of the models was trained using only youths between the ages of 8 and 21 who were part of each model's original sample (see methods: "age-restricted models"). Since the training samples for the age-restricted models primarily consisted of Siemens Trio scans, another set of model variants was created by retraining the age-restricted models with Siemens Prisma scans (see methods: "retrained age-restricted models"). Given prior generalizability findings (Liem et al., 2017), we predicted that the retrained variants would yield the most accurate brain age estimates. To the best of our knowledge, there were no prior studies to inform our hypotheses of model reliability or utility to detect differences in cognition.

| Overview of testing data sets
Three youth cohorts were utilized to ensure our findings would generalize across different sampling characteristics and acquisition parameters. These data sets were gathered from the Preschool Depression Study (PDS; Luby et al., 2010) and the Human Connectome Project in Development (HCP-D). The HCP-D is a cross-sectional multi-site study consisting of 789 participants that underwent MRI scanning and cognitive assessments. As discussed further below, half of the HCP-D sample was used to retrain each of the brain age models so generalizability could be assessed as the training and testing data became more similar. The remaining half (n = 394) was used to evaluate the generalizability of all brain age models (HCP-Test). The split-half procedure ensured that both groups would be matched by age at scan, sex, and image quality metrics (i.e., Euler number; Rosen et al., 2018). The PDS is a 5-wave neuroimaging sample consisting of 167 participants who completed cognitive assessments in the final two waves. The first three waves of the PDS were completed with a Siemens TIM Trio scanner (sessions: 432), whereas the final two waves used a Siemens Prisma (sessions: 280). Given the differences in scanner types and availability of cognitive data, the two sets of PDS data were treated as different studies, with the first three waves referred to as PDS-Trio and the final two waves as PDS-Prisma.
Distributions of age differed across these three samples (Table 1), but all participants were youths between the ages of 8–22 years (Figure S1).

| Imaging acquisition
The PDS-Trio was the only sample that was scanned on a 3 T Siemens TIM Trio with a 12-channel head coil. These magnetization-prepared, rapid acquisition gradient-echo (MPRAGE) T1-weighted images had the following acquisition parameters: 1 mm isotropic resolution; TR 2.4 ms; TE 3.16 ms; 160 sagittal slices; flip angle 8°; FOV 256 × 256 × 224 mm³; 6:18 acquisition time. The PDS-Prisma utilized a 3 T Siemens Prisma with a 32-channel head coil, but the acquisition parameters were identical to those from the PDS-Trio, except that the TE was lowered from 3.16 to 2.22 ms and the acquisition time was 20 seconds longer. Lastly, the HCP-D was also scanned on a 3 T Siemens Prisma with a 32-channel head coil. However, the HCP acquisition parameters were further optimized to enhance image quality: 0.8 mm isotropic resolution; TR 2.4 ms; TE 2.14 ms; 208 sagittal slices; flip angle 8°; FOV 320 × 320 × 300 mm³; 6:54 acquisition time. This protocol also included embedded volumetric navigators (vNavs) to correct for in-scanner head motion and minimize the impact of such artifacts.

| Preprocessing pipelines and quality assurance
All three test samples were processed using the exact procedures described in the research articles that computed the original brain age models (de Lange et al., 2022; Kaufmann et al., 2019). Briefly, the DBN utilized a preprocessing procedure that involved bias correction, multi-atlas skull stripping (Doshi et al., 2013) [...] were derived from the Philadelphia Neurodevelopmental Cohort (PNC; Satterthwaite et al., 2014), and linear registration (Jenkinson et al., 2002) to the 1 mm MNI-152 template (Fonov et al., 2011). The resulting scans were divided into 80 horizontal slices before applying the 2D convolutional neural network, which generated a brain age prediction for each slice. The median prediction was taken to represent the final brain age for a given scan. All brain age estimates were derived from a single model, irrespective of sex differences in brain development.
TABLE 1 Note: Descriptive statistics are reported for each numerical variable from both subsets of the PDS and the HCP-Test. The first three columns denote the sample mean and standard deviation in parentheses. One-way ANOVAs demonstrated that the test samples differed across most sample characteristics and all image quality metrics. There were slightly more males than females for the PDS (males = 87; females = 80) as well as the HCP sample (males = 414; females = 411).
In preparation for the GTB, scans underwent automated surface-based morphometry and subcortical segmentation using FreeSurfer (Fischl, 2012) (Dale et al., 1999). Sensitivity analyses within the HCP-Test sample were performed by also covarying for individual differences in image quality through the vNavs measure, which represents the number of re-acquired slices due to head movement during acquisition.

| Original brain age models (DBN / GTB)
Three variants of the DBN and GTB were analyzed to better determine how performance changed based on alterations in their training data; six brain age models were evaluated altogether. The first set of brain age models was applied without any changes to the training data. Both models were optimized on their respective training data using fivefold cross-validation, which resulted in robust correlations between chronological and predicted brain ages (DBN = 0.978; GTB = 0.935).
2.6 | Age-restricted brain age models (rDBN / rGTB)
Given that the original models were trained across the lifespan, it could be problematic that the testing samples herein pertained exclusively to youths between the ages of 8 and 22 years. This concern led to the creation of a second set of DBN and GTB models, where the age range of the training data was restricted to only include scans that overlapped in age with the test samples. The resulting age-restricted training data was substantially reduced in size (rDBN = 1794; rGTB = 3382) and primarily consisted of scans from the PNC (Satterthwaite et al., 2014) [...]

2.7 | Retrained age-restricted brain age models (tDBN / tGTB)
A potential concern for the age-restricted model variations was scanner type, since the majority of PNC and PING data was collected using Siemens TIM Trio scanners and few were acquired from Siemens Prisma. To minimize the impact of potential scanner-[...]ing, the GTB model was retrained from scratch by combining the novel HCP-D scans with the age-restricted training data.

| Statistical analysis of model accuracy
The accuracy of each brain age model was evaluated by the linear fit between brain and chronological ages, deriving the following metrics: slope, y-intercept, mean absolute error (MAE), and Pearson correlation. The slopes and y-intercepts are useful metrics for quantifying potential prediction biases or scaling effects, which reflect systematic over or underestimations in brain age predictions (Butler et al., 2021).
Theoretically, prediction bias and scaling effects would be less substantial as slopes approach one and y-intercepts approach zero. The goodness of fit for a given model improves as MAEs approach zero, though recent evidence suggests that moderately fit models (MAE: 3-6 years) are most useful for detecting individual differences. It is worth noting that correlations weaken when the age range of the test sample is restricted (Poldrack et al., 2020), thereby limiting our ability to make comparisons with prior studies and between the HCP-Test (range: 16.6 years) and the PDS subsamples (Trio: 8.3 years; Prisma: 8.2 years). Given these challenges, we primarily assessed the goodness of fit for a model based on the prediction errors as opposed to Pearson correlation. Analyses of model accuracy were performed on raw brain age predictions, though supplemental analyses were conducted on "corrected" brain age predictions that underwent a post-hoc adjustment to remove any potential bias in brain age predictions (Smith et al., 2019).
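As an illustration of the four accuracy metrics described in this section, the following minimal Python sketch fits predicted brain age on chronological age and reports the slope, y-intercept, MAE, and Pearson correlation. The function name `accuracy_metrics` and the synthetic data are assumptions made for this example only; the study's own analyses were run in R and may differ in detail.

```python
import numpy as np


def accuracy_metrics(chron_age, brain_age):
    """Summarize the linear fit between chronological and predicted brain age."""
    chron_age = np.asarray(chron_age, dtype=float)
    brain_age = np.asarray(brain_age, dtype=float)
    # Least-squares line: brain_age ~ slope * chron_age + intercept
    slope, intercept = np.polyfit(chron_age, brain_age, deg=1)
    # Mean absolute error between predicted and chronological age (years)
    mae = float(np.mean(np.abs(brain_age - chron_age)))
    # Pearson correlation between chronological and predicted age
    r = float(np.corrcoef(chron_age, brain_age)[0, 1])
    return {"slope": float(slope), "intercept": float(intercept), "mae": mae, "r": r}


# Synthetic (hypothetical) youth sample with a compressed slope and a
# positive offset, i.e., predictions pulled toward an older training mean.
rng = np.random.default_rng(0)
age = rng.uniform(8, 22, size=200)
predicted = 0.6 * age + 8.0 + rng.normal(0.0, 1.5, size=200)
metrics = accuracy_metrics(age, predicted)
print(metrics)
```

In this toy sample the fitted slope falls below one and the intercept above zero, which is the pattern the text describes as prediction bias and scaling effects.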

| Statistical analysis of model reliability
Model reliability was assessed to investigate the consistency of both machine learning frameworks when applied to each test sample. As such, we quantified the degree of variability across the original, age-restricted, and retrained variants of the GTB and DBN, respectively.
Deviation scores were computed by min/max scaling the brain age gaps for each variant and subsequently calculating the standard deviation from all three variants of the DBN and GTB on an individual basis (e.g., a deviation value for each scan). Generalized additive models were used to further evaluate whether individual differences in deviation scores could be explained by age, sex-differences, or image quality. To understand the potential confounding influence of prediction bias on model reliability, supplemental analyses were conducted using the "corrected" brain age gaps that underwent a post-hoc adjustment so that they would be orthogonal with chronological age (Smith et al., 2019).
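The deviation score described above can be sketched in a few lines of Python. This is a hedged illustration, not the study's code: the function name `deviation_scores` is hypothetical, the min/max scaling is assumed to be applied within each variant across the scans of one test sample, and the sample standard deviation (`ddof=1`) is an assumption because the text does not state which estimator was used.

```python
import numpy as np


def deviation_scores(gaps):
    """Per-scan variability of brain age gaps across model variants.

    gaps: array of shape (n_scans, n_variants); each column holds the brain
    age gaps from one variant (e.g., original, age-restricted, retrained).
    """
    gaps = np.asarray(gaps, dtype=float)
    # Min/max scale each variant (column) to the [0, 1] interval.
    # Note: a constant column would divide by zero; not handled in this sketch.
    lo = gaps.min(axis=0)
    hi = gaps.max(axis=0)
    scaled = (gaps - lo) / (hi - lo)
    # Standard deviation across the variants, one value per scan.
    return scaled.std(axis=1, ddof=1)


# Hypothetical gaps for three scans and two variants that rank the scans
# in opposite order, so the first and last scans disagree the most.
example = [[0.0, 2.0],
           [1.0, 1.0],
           [2.0, 0.0]]
scores = deviation_scores(example)
print(scores)
```

Because each variant is rescaled before the standard deviation is taken, variants that differ only by a constant offset or a positive scale factor yield deviation scores of zero.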

| Statistical analysis of model utility
Model utility was operationalized by measuring each model's ability to detect individual differences in cognition. The raw brain age prediction was always the response variable, and the main predictor was an age-adjusted score for a given cognitive domain, while covarying for chronological age, Euler number, and sex. Given the collinearity between age and image quality in youth (Rosen et al., 2018), these models were chosen to ensure that the brain age gaps would be orthogonal to head motion confounds and prediction bias. All models herein used linear regressions for the HCP-Test data and mixed-effects models with random intercepts for each participant in the PDS-Trio and PDS-Prisma data sets. Analyses were performed using R version 4.0.2 (R Core Team, 2013). All code and model variants pertaining to this study have been made available through the following GitHub repository: https://github.com/ccplabwustl/RobertJirsaraie/tree/master/proj20-BrainAgeEval.
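The cross-sectional form of the utility model described above (brain age regressed on a cognitive score while covarying for chronological age, Euler number, and sex) can be sketched with ordinary least squares. This is only an illustration under stated assumptions: the function name `fit_utility_model`, the variable names, and the simulated data are hypothetical, significance testing is omitted, and the mixed-effects models with random intercepts used for the PDS data sets are not reproduced here.

```python
import numpy as np


def fit_utility_model(brain_age, cognition, age, euler, sex):
    """OLS fit of brain_age ~ cognition + age + euler + sex (with intercept)."""
    columns = [np.asarray(v, dtype=float) for v in (cognition, age, euler, sex)]
    # Design matrix: a column of ones for the intercept, then the predictors.
    X = np.column_stack([np.ones(len(columns[0]))] + columns)
    y = np.asarray(brain_age, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    names = ["intercept", "cognition", "age", "euler", "sex"]
    return dict(zip(names, (float(b) for b in beta)))


# Simulated (hypothetical) sample in which higher cognitive scores go with
# younger-looking brains, after accounting for the covariates.
rng = np.random.default_rng(2)
n = 300
age = rng.uniform(8, 22, size=n)
cognition = rng.normal(50, 10, size=n)   # stand-in for an age-adjusted score
euler = rng.normal(-60, 20, size=n)      # stand-in for the Euler number
sex = rng.integers(0, 2, size=n)         # 0/1 indicator
brain_age = (5 + 0.7 * age - 0.05 * cognition + 0.01 * euler
             + 0.3 * sex + rng.normal(0, 1, size=n))
coefs = fit_utility_model(brain_age, cognition, age, euler, sex)
print(coefs)
```

Because chronological age is a covariate, the cognition coefficient reflects variation in brain age that is not explained by age, which is why the text treats it as an association with the brain age gap.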

| RESULTS
3.1 | Model accuracy: How similar are chronological and predicted brain ages?
Altogether, we examined prediction bias (i.e., slopes), scaling effects (i.e., y-intercepts), and goodness of fit (i.e., prediction errors) for six brain age models (three variants from each machine learning framework), which were applied to three early-developing cohorts (Table 2). [...] the smallest prediction errors of all models. In addition to differences between models, performance varied considerably between early developing cohorts, whereby most models were least accurate when applied to scans acquired from Siemens Prisma scanners (PDS-Prisma & HCP-Test). This imprecision between test samples may stem from scanner-related variance, which also contributed to differences across all image quality metrics (p < 0.03) that were computed by the MRQy software package (Sadri et al., 2020) (Figure S2).
FIGURE 1 [...] The GTB was least susceptible to biases and overestimations, whereas the rDBN had the optimal amount of variation [...]
TABLE 2 Note: Accuracy metrics are reported for all six brain age models, which were applied to three test samples. A slope of one and y-intercept of zero would indicate that brain age predictions were not systematically over- or underestimated. A model's goodness of fit improves as MAEs approach zero and correlations approach one. However, moderately fit models (MAE: 3-6 years) may be most useful for detecting individual differences. Correlations become weaker when the age range of a test sample is restricted, which may explain why the HCP-Test consistently yielded the strongest correlations relative to the PDS subsamples. The range of prediction errors is reported in parentheses next to the MAEs.
Nearly all six models exhibited prediction bias when applied to each of the three early developing cohorts (Table S1). This is a well-known limitation of the brain age framework (Jonsson et al., 2019) and it is common to account for such artifacts using post-hoc corrections. We followed this practice by linearly regressing out chronological age from the brain age gaps. As such, the modified brain age gaps were residualized with respect to age, which completely removed the previously reported issues of prediction bias and scaling effects. As expected, these modified brain age gaps also displayed much smaller prediction errors (median MAE: 0.88), though recent research has characterized such improvements as artificial (Butler et al., 2021) (Table S2).
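The post-hoc correction described above, in which chronological age is linearly regressed out of the brain age gaps, can be sketched as follows. The function name `residualize_gaps` and the simulated data are assumptions for illustration, and the corrected MAE is computed here as the mean absolute corrected gap, which is one plausible reading of the text rather than a confirmed detail of the study.

```python
import numpy as np


def residualize_gaps(chron_age, brain_age):
    """Return raw gaps and gaps with the linear effect of age removed."""
    chron_age = np.asarray(chron_age, dtype=float)
    brain_age = np.asarray(brain_age, dtype=float)
    gap = brain_age - chron_age
    # Fit gap ~ slope * age + intercept, then keep only the residuals.
    slope, intercept = np.polyfit(chron_age, gap, deg=1)
    corrected_gap = gap - (slope * chron_age + intercept)
    return gap, corrected_gap


# Simulated (hypothetical) biased predictions: overestimation at younger
# ages that shrinks as chronological age increases.
rng = np.random.default_rng(1)
age = rng.uniform(8, 22, size=300)
predicted = 0.6 * age + 8.0 + rng.normal(0.0, 1.5, size=300)

raw_gap, corrected_gap = residualize_gaps(age, predicted)
raw_mae = float(np.mean(np.abs(raw_gap)))
corrected_mae = float(np.mean(np.abs(corrected_gap)))
age_corr_after = float(np.corrcoef(age, corrected_gap)[0, 1])
print(raw_mae, corrected_mae, age_corr_after)
```

The residuals are uncorrelated with age by construction, and the fitted line absorbs the systematic offset, so the error shrinks without the underlying model having improved. This is consistent with the caution that such gains may be artificial.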
3.2 | Model reliability: How consistent are brain age predictions across model variants?
Model reliability was assessed by examining the amount of variation across all three variants of the same machine learning framework, which produced mixed results. In particular, the GTB yielded smaller deviation scores across participants from the PDS-Trio sample, but the DBN yielded smaller deviation scores for the PDS-Prisma and HCP-Test samples (Table S3). Generalized additive models were used to further understand whether deviations in normalized brain age gaps varied as a function of age, sex, or Euler number (a proxy of image quality; Somerville et al., 2018). Reliability of both the DBN and GTB was robustly associated with age (Table S4), but these relationships were relatively more non-linear for the DBN (Figure 2a). The DBN was least reliable when scans were at the edges of the age range (youngest and oldest), whereas reliability of the GTB linearly improved with age (Figure 2b). Reliability of the DBN was also associated with Euler number (Figure 2c), suggesting that DBN predictions were more inconsistent when applied to lower quality MRI scans acquired from Siemens Prisma scanners. The reliability of the GTB variants was not related to image quality (Figure 2d).
Supplemental analyses were conducted to understand how these reliability results would change when deviation scores were based on the corrected brain age gaps. The raw and corrected brain age predictions were moderately to strongly correlated (Table S5), though deviation scores from the corrected brain age gaps were much smaller due to the reductions in prediction errors following the post-hoc adjustment. Nonetheless, the DBN continued to be more reliable across participants from the PDS-Trio and the GTB was more reliable across the HCP-Test sample (Table S3). These new sets of deviation scores also revealed even stronger relationships between model reliability and image quality (Table S6). However, the age-related differences in model reliability did not persist among deviation scores that were derived from age-corrected brain age gaps (Table S6).
FIGURE 1 [...] Table 2). The PDS-Trio sample is represented by green circles, the PDS-Prisma by blue triangles, and the HCP-Test by brown squares.

| Model utility: To what extent can brain age gaps detect differences in cognition?
FIGURE 2 Patterns of reliability exhibited age-related and image-quality differences within each test sample, which were unique for the DBN and GTB. (a) The variants of the DBN were most inconsistent among the youngest and oldest individuals in each test sample. (b) In contrast, the GTB variants were only the most inconsistent among the youngest individuals in each sample. (c) The reliability of the DBN predictions was significantly associated with image quality, which might be a downstream consequence of using a minimal preprocessing pipeline. (d) The GTB variants yielded stable predictions that did not vary with image quality.
FIGURE 3 The brain age gaps derived by the DBN variants were most useful for detecting differences in cognitive functioning. These effects were replicated across multiple cognitive domains using two distinct test samples. The original DBN model yielded the most useful brain age gaps, which were associated with each of the cognitive domains displayed above. All significant effects indicated that underestimated brain age predictions were associated with better cognitive abilities.
Significant associations between brain age gaps and cognitive functioning have been reported in a prior study, which found the largest effects with working memory and processing speed (Erus et al., 2015).
The current study attempted to replicate these relationships by examining five cognitive domains (Figure S3) using linear and mixed-effects models that controlled for chronological age, sex, and image quality. The original and age-restricted DBN models were the most able to detect significant associations between brain age gaps and cognitive performance (Table S7). Of the five domains measured separately for the PDS-Prisma and HCP-Test samples (10 possible correlations), the DBN had seven significant relationships and the rDBN had five, all of which suggested that underestimated brain ages were correlated with better performance (Figure 3). All other models had no more than one significant relationship, which were also negatively correlated. Of the 15 significant relationships, four were detected with working memory, six with language, and three with cognitive flexibility, suggesting that the models were most sensitive to individual differences in these three domains.

| DISCUSSION
In response to the widespread adoption and dissemination of brain age models, the current study benchmarked the generalizability of two models that were generated with the largest and most diverse training samples, which spanned from early childhood to late adulthood. We found that no single model outperformed across all facets of generalizability or all early developing test cohorts. Such findings present numerous trade-offs that can be used as a guide to maximize the utility of these two models. As detailed below, the GTB predictions were relatively more accurate overall and yielded more reliable predictions when applied to lower quality scans. In contrast, the brain age gaps from the DBN had the most utility for detecting differences in cognitive functioning.
Our accuracy results were generally not as optimistic as previous brain age research (Chen et al., 2020). Specifically, we observed much weaker correlations between chronological and predicted brain ages (0.29-0.84), relative to the original cross-validations that were performed across the lifespan (DBN: 0.98, Bashyam et al., 2020; GTB: 0.94, Kaufmann et al., 2019). Furthermore, only the retrained DBN and GTB yielded prediction errors that were analogous to prior neurodevelopmental studies (Brown et al., 2012; Erus et al., 2015; Niu et al., 2019) (Table 2). This challenge may be attributable to prediction bias and acquisition protocol differences between the train and test samples. Such confounds were not completely mitigated by any of the models in this study, emphasizing the need for better data aggregation and harmonization methods to achieve more generalizable models. This is in accordance with a previous study that applied ComBat harmonization (Fortin et al., 2017) and obtained more consistent brain age predictions across multi-scanner data.
Model reliability was defined as the amount of variation across models of the same machine learning framework, which produced mixed results that varied with scanner type. The DBN was most reliable when applied to test samples acquired from Siemens Prisma scanners (PDS-Prisma and HCP-Test), but the GTB was significantly more reliable when applied to the Siemens Trio scans (PDS-Trio). Patterns of reliability also exhibited robust age-related and image-quality differences within each test sample, which were unique for the DBN and GTB. The minimal preprocessing used by the DBN framework might contribute to it being less reliable among lower quality scans, a pattern that was not exhibited by the GTB. Therefore, employing robust pipelines that extract neuroimaging features with higher signal-to-noise ratios (Magnotta et al., 2006) may lead to more consistent results with less dependence on subtle differences in image quality.
However, implementing such methods in a clinical setting might come with practical challenges as they require processing time, computation resources, and programming expertise. The downstream implications of preprocessing and methodological choices on brain age predictions should be deliberated when building future models.
Model utility was assessed by how correlated brain age gaps were with cognitive functioning across five domains. The original and age-restricted DBN were the most able to detect individual differences in cognitive functioning, whereas the GTB variants only detected a few associations. These relationships with cognition were found using age-adjusted t-scores, indicating that the significant effects were not driven by prediction bias (Figure S3). Yet, it was unexpected that the brain age gaps from the original DBN were the most sensitive to cog-[...]
We observed notable differences in our results when analyzing the corrected brain age predictions. Post-hoc corrections to remove prediction biases resulted in improved accuracy, as the prediction errors were reduced such that most models became very tightly fit (median MAE: 0.88). However, recent research suggests that these improvements may be artificial (Butler et al., 2021). There are also additional challenges when interpreting the corrected brain age predictions in terms of identifying which brain regions significantly contributed to a given predictive pattern. Understanding the specific features involved in the brain age predictions could improve our knowledge of the underlying brain maturation process while also making the automated system transparent to human verification. However, the link between brain features and age predictions becomes complicated once post-hoc corrections are applied. In addition, the best way to remove prediction bias is not clear (Smith et al., 2019).
Lastly, applying group-level calibrations to brain age gaps might not be feasible in clinical settings where assessments are made on an individual basis. Given these challenges, it is essential to develop and validate brain age models that are less susceptible to prediction bias, thereby alleviating the need for post-hoc corrections altogether.
The current study contained several limitations. First, the age range of our test samples was at the lower limit of those used to train the DBN and GTB, which encompassed the entire lifespan. It is possible that different results might be obtained when evaluating these models with middle-age or late adulthood cohorts (Amoroso et al., 2019). Second, our accuracy results might have been worse than prior research, because most studies used cross-validations to evaluate a model's predictive power, but such methods are more optimistic compared to hold-out tests (Poldrack et al., 2020). Furthermore, it is challenging for us to interpret how our accuracy results compare to those from prior studies, because model accuracy depends on multiple factors, including age range, age distribution, sample size, specific accuracy metrics, and bias correction methods (de Lange et al., 2022;Smith et al., 2019). Third, the current study evaluated two machine learning frameworks that differed in their preprocessing methods, neuroimaging features, predictive algorithms, test samples, and retraining procedures. This diversity led to a more encompassing evaluation of the brain age framework, but it also presented challenges in narrowing down how each of these differences uniquely influenced brain age predictions. Subsequent studies may benefit from using ablation study designs, whereby comparisons are made between models with more similarities than differences.
The brain age framework has the potential to provide useful individual-level indices of brain development as long as its predictions are generalizable across diverse populations from all developmental stages. This study delineated numerous opportunities for improvement in the generalizability of brain-age models. The evaluated models have many practical uses provided that the biases revealed here can be accounted for (e.g., adjusted for systematic offsets in predicting age). Overall, the age-restricted DBN had reasonable accuracy and was the second most useful at detecting individual differences in cognition. The original GTB was the most accurate and its predictions were less susceptible to inconsistencies when applied to lower quality scans, but it was not as sensitive to differences in cognition. To conclude, the largest limitations affecting the generalizability of brain age models were acquisition protocol differences and prediction biases. If such confounds could eventually be removed without post-hoc corrections, brain age predictions may have greater utility as personalized biomarkers of healthy aging.
We thank participants and their families for their participation in the early-life data sets evaluated here.

CONFLICT OF INTEREST
All authors report no biomedical financial interests or potential conflict of interest.