Clinically Relevant Changes for Cognitive Outcomes in Preclinical and Prodromal Cognitive Stages

Background and Objectives Identifying a clinically meaningful change in cognitive test score is essential when using cognition as an outcome in clinical trials. This is especially relevant because clinical trials increasingly feature novel composites of cognitive tests. Our primary objective was to establish minimal clinically important differences (MCIDs) for commonly used cognitive tests, using anchor-based and distribution-based methods, and our secondary objective was to investigate a composite cognitive measure that best predicts a minimal change in the Clinical Dementia Rating-Sum of Boxes (CDR-SB).
Methods From the Swedish BioFINDER cohort study, we consecutively included cognitively unimpaired (CU) individuals with and without subjective or mild cognitive impairment (MCI). We calculated MCIDs associated with a change of ≥0.5 or ≥1.0 on CDR-SB for Mini-Mental State Examination (MMSE), ADAS-Cog delayed recall 10-word list, Stroop, Letter S Fluency, Animal Fluency, Symbol Digit Modalities Test (SDMT) and Trailmaking Test (TMT) A and B, and triangulated MCIDs for clinical use for CU, MCI, and amyloid-positive CU participants. For investigating cognitive measures that best predict a change in CDR-SB of ≥0.5 or ≥1.0 point, we conducted receiver operating characteristic analyses.
Results Our study included 451 cognitively unimpaired individuals, 90 with subjective cognitive decline and 361 without symptoms of cognitive decline (pooled mean follow-up time 32.4 months, SD 26.8, range 12–96 months), and 292 people with MCI (pooled mean follow-up time 19.2 months, SD 19.0, range 12–72 months). We identified potential triangulated MCIDs (cognitively unimpaired; MCI) on a range of cognitive test outcomes: MMSE −1.5, −1.7; ADAS delayed recall 1.4, 1.1; Stroop 5.5, 9.3; Animal Fluency −2.8, −2.9; Letter S Fluency −2.9, −1.8; SDMT −3.5, −3.8; TMT A 11.7, 13.0; and TMT B 24.4, 20.1. For amyloid-positive CU, we found the best predicting composite cognitive measure included gender and changes in ADAS delayed recall, MMSE, SDMT, and TMT B. This produced an AUC of 0.87 (95% CI 0.79–0.94, sensitivity 75%, specificity 88%).
Discussion Our MCIDs may be applied in clinical practice or clinical trials for identifying whether a clinically relevant change has occurred. The composite measure can be useful as a clinically relevant cognitive test outcome in preclinical AD trials.


The minimal clinically important difference (MCID) is defined as the smallest change on a measure that is reliably associated with a meaningful change in a patient's clinical status, function, or quality of life. 1 It is important to decide the smallest change in an outcome that constitutes a clinically meaningful change (that is, the MCID) to interpret whether, for example, the treatment effect measured using a cognitive test is clinically relevant or whether a change in cognitive testing during clinical follow-up represents a clinically meaningful change in cognition. MCIDs are thus necessary to make accurate clinical decisions and to design clinical trials with the statistical power to detect an effect equal to or greater than the MCID. 2 The 2018 US Food and Drug Administration (FDA) guidance for clinical trials in early AD introduced a clinical staging framework for AD stages 1-3. 3 Stage 1 includes individuals with abnormal biomarkers without cognitive complaints or detectable decline even on sensitive tests. Stage 2 includes individuals with subtle cognitive effects without functional deficits, and stage 3 includes individuals beginning to have difficulties with daily tasks. To presume that a drug has a clinically beneficial effect for individuals in stage 2, the agency states that a pattern of beneficial effects on neuropsychological assessments is more persuasive if seen on multiple tests and that, if seen on only 1 assessment, the effect needs to be of large magnitude to be persuasive. However, for many cognitive assessments, the magnitude that corresponds to a clinically meaningful effect compared with that for placebo is unknown.
Several methods exist for estimating a meaningful clinical effect, among which the most well-established are anchor-based and distribution-based estimates. 4 Anchor-based approaches to determine meaningful within-patient change involve the use of an external reference with an already established relevance. 5 Distribution-based, or internal, estimates use statistical properties of the measures themselves. Of these, the most common are effect size metrics such as the SD and the SEM, the latter incorporating a measure of scale reliability (e.g., test-retest reliability or Cronbach α as a measure of internal consistency reliability).
In addition to establishing clinically relevant cutoffs for test changes, it is also important to determine which tests best represent clinically relevant changes. The preclinical Alzheimer cognitive composite (PACC) has earlier been proposed as an outcome measure sensitive to early cognitive changes in AD (stages 1 and 2). 6 The PACC was initially created by selecting 4 well-established cognitive tests that are sensitive to change/worsening in prodromal and mild dementia and have sufficient range to also detect early decline in preclinical stages of disease. 6 However, the PACC was established purely on the basis of its presumed sensitivity to detect changes and not on whether the changes were clinically meaningful. We propose that by developing and validating cognitive composites and test batteries using predictive validity for a clinically important change incorporating anchor-based approaches, more relevant outcomes may be developed than by focusing on within-measure change/worsening, which is distribution-based alone.
The aims of this study were (1) to establish cutoffs for cognitive test changes that can be used to conclude whether a meaningful magnitude of treatment effect has been achieved and (2) to investigate which single cognitive test differences, and which combinations of them, best correspond to a clinically meaningful decline. In addition to examining the second aim in cognitively unimpaired (CU) participants and participants with mild cognitive impairment (MCI), we also examined it in Aβ-positive CU participants because this is a group of special interest in present and future clinical AD trials. 7

Population
The participants in the study were consecutively included from the prospective Swedish BioFINDER study (biofinder.se) and were enrolled from July 6, 2009, to March 4, 2015. The population consisted of 451 CU individuals and 292 people classified as experiencing mild cognitive impairment (MCI). In the CU group, 90 individuals experienced subjective cognitive decline, and 361 people were cognitively healthy controls based on a structured assessment. The individuals were followed up longitudinally (CU: pooled mean across all tests 32.4 months, pooled SD 26.8, range 12-96 months; MCI: pooled mean 19.2 months, pooled SD 19.0, range 12-72 months), with a mean number of data points of 1588 for CU and 727 for MCI.
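Each data point in the analyses is a difference between two consecutive assessments, so a participant with assessments at baseline, 2 years, and 4 years contributes 2 data points. The following is a minimal sketch of that bookkeeping; the `Visit` layout, the function name, and the example values are hypothetical and are not taken from the study data.

```python
from typing import List, Tuple

# (years since baseline, cognitive test score, CDR-SB); layout is an assumption
Visit = Tuple[float, float, float]


def paired_differences(visits: List[Visit]) -> List[Tuple[float, float]]:
    """Return (test score change, CDR-SB change) for each pair of consecutive visits."""
    ordered = sorted(visits, key=lambda visit: visit[0])
    return [
        (later[1] - earlier[1], later[2] - earlier[2])
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Hypothetical participant assessed at baseline, 2 years, and 4 years
visits = [(0.0, 29.0, 0.0), (2.0, 28.0, 0.5), (4.0, 26.0, 1.5)]
diffs = paired_differences(visits)
print(len(diffs), diffs)
```

Three visits yield two differences (baseline to 2 years, and 2 years to 4 years), each pairing a test score change with the CDR-SB change over the same interval.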
MCI was defined according to the performance on a comprehensive neuropsychological battery, as previously described. 8 All cognitively unimpaired individuals had a Clinical Dementia Rating-Sum of Boxes of 0 at inclusion. Participants with MCI were excluded after converting to major neurocognitive disorder. Participants were assessed by physicians well experienced in dementia disorders, underwent a physical examination, MRI scan, lumbar puncture, and cognitive assessments, and were rated with the CDR. Participants experiencing cognitive symptoms at baseline (subjective cognitive decline or MCI) were followed up annually, while participants without cognitive symptoms at baseline were examined every second year by physicians.
Glossary: AD = Alzheimer Disease; ADAS = Alzheimer's Disease Assessment Scale; ADL = activities of daily living; AIC = Akaike Information Criterion; AUC = area under the ROC curve; CDR-SB = clinical dementia rating-sum of boxes; CU = cognitively unimpaired; ES = estimates of effect size; FDA = Food and Drug Administration; MCI = mild cognitive impairment; MCID = minimal clinically important difference; MMSE = Mini-Mental State Examination; PACC = preclinical Alzheimer cognitive composite; RCI = reliable change index; SDMT = Symbol Digit Modalities Test; SRM = standardized response mean; TMT = Trailmaking Test.

Cognitive Tests
Eight cognitive tests were examined in this study, covering the cognitive domains of executive function, attention, episodic and semantic memory, and visuospatial function. Participants were examined with the Mini-Mental State Examination (MMSE), the Alzheimer Disease Assessment Scale (ADAS) 10-word delayed recall, Letter S Fluency, Animal Fluency, Stroop Color and Word Test (Stroop), Trailmaking Tests A and B, and Symbol Digit Modalities Test. Further explanation of the tests, what they assess, and how points are counted is provided in eMethods.

Clinical Dementia Rating
Clinical Dementia Rating (CDR) is an ordinal scale with scores of 0-3 points used to quantify the functional effect of cognitive impairment (0 = none, 0.5 = questionable, 1 = mild, 2 = moderate, and 3 = severe) in domains (box scores) of memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care. 9,10 Participants in this study were rated on the CDR by a thorough review of patient records where dementia experts assessed cognitive symptoms and activities of daily living (ADL) in comprehensive semistructured interviews with patients and informants, including informant-based questionnaires of ADL (the functional activities questionnaire) 11 and cognitive symptoms (the CIMP-QUEST) 12 (supported if necessary by cognitive test results). The CDR scale provides a quantitative index of impairment, referred to as the Sum of Boxes or CDR-SB (range of 0-18), and may also be scored in the CDR global severity stage score (range 0-3) using an algorithm. An increase in the CDR-SB score has been identified as having both face and predictive validity to identify people who are later diagnosed with probable AD or another dementia. 13 For predementia AD, a single box score increment of either 0.5 or 1.0 has been proposed to capture efficacy and clinical relevance for early AD. 14 A change of 1 in CDR-SB could be either a change of 0.5 points in 2 boxes or a change of 1 point in 1 box.
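As an illustration of how the Sum of Boxes and its change can be computed from the six box scores, here is a minimal sketch. The box identifiers and example ratings are hypothetical, and the check uses only the score levels listed in the text; any box-specific scoring rules of the CDR are not modelled.

```python
from typing import Dict

BOXES = (
    "memory",
    "orientation",
    "judgment_problem_solving",
    "community_affairs",
    "home_hobbies",
    "personal_care",
)
ALLOWED_SCORES = {0.0, 0.5, 1.0, 2.0, 3.0}  # score levels as listed in the text


def cdr_sb(box_scores: Dict[str, float]) -> float:
    """Sum the six box scores (possible range 0-18)."""
    for box in BOXES:
        if box_scores[box] not in ALLOWED_SCORES:
            raise ValueError(f"unexpected score for {box}: {box_scores[box]}")
    return sum(box_scores[box] for box in BOXES)


# Hypothetical ratings: two routes to a CDR-SB change of 1
baseline = {box: 0.0 for box in BOXES}
two_boxes = {**baseline, "memory": 0.5, "orientation": 0.5}
one_box = {**baseline, "memory": 1.0}

change_two_boxes = cdr_sb(two_boxes) - cdr_sb(baseline)
change_one_box = cdr_sb(one_box) - cdr_sb(baseline)
print(change_two_boxes, change_one_box)
```

Both hypothetical follow-up ratings give the same CDR-SB change, which is why the anchor treats them identically.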

Determining Amyloid Positivity
The procedure and analysis of CSF followed the Alzheimer Association Flowchart for CSF biomarkers. 15,16 We used the Aβ42/40 ratio obtained through CSF analysis. CSF Aβ42 and Aβ40 were analyzed using the Roche Elecsys CSF immunoassays (NeuroToolKit) on all participants. The cutoff for Aβ42/40 was established with mixture modeling statistics 17,18 and set at 0.066.
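The cutoff can be applied as in the sketch below. Two points are assumptions rather than statements from the text: a ratio below the cutoff is treated as amyloid positive (the usual direction for CSF Aβ42/40), and the comparison is strict. The function names and concentrations are hypothetical.

```python
ABETA_RATIO_CUTOFF = 0.066  # cutoff reported in the text


def abeta_ratio(abeta42: float, abeta40: float) -> float:
    """CSF Abeta42/40 ratio; both concentrations must be in the same unit."""
    if abeta40 <= 0:
        raise ValueError("Abeta40 concentration must be positive")
    return abeta42 / abeta40


def is_amyloid_positive(abeta42: float, abeta40: float,
                        cutoff: float = ABETA_RATIO_CUTOFF) -> bool:
    # Assumption: a ratio below the cutoff indicates amyloid positivity
    return abeta_ratio(abeta42, abeta40) < cutoff


# Hypothetical concentrations in the same unit
print(is_amyloid_positive(600.0, 12000.0))   # ratio 0.05
print(is_amyloid_positive(1200.0, 12000.0))  # ratio 0.10
```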

Statistical Analysis
The psychometric criterion reliable change index (RCI) is used to evaluate whether a change over time in an individual score is statistically significant. 19 The RCI provides a CI that represents the changes that would be expected if a patient's test score did not change significantly from one assessment to another. We calculated the RCI for all 8 test differences with a 90% CI (the most common CI for an RCI) 19 for CU participants and participants with MCI. This is performed for tests separately using the following equation:
Estimates of effect size (ES) are useful for determining the magnitude or size of an effect, the relative contribution of different factors or the same factor in different circumstances, and the power of an analysis. 20 ES is defined as the mean difference in score divided by the SD of baseline scores. An ES of 0.5 is generally considered a moderate clinically significant change, whereas an ES of 0.2 is considered a small change and 0.8 a large change. 21,22 The standardized response mean (SRM) is an effect size used to measure the responsiveness of outcome measures (the ability to detect change over time), defined as the mean difference in score divided by the SD of the change from the previous visit score. 23 ES and SRM were calculated for all test changes in CU and MCI participants. Experts have previously defined a clinically meaningful cognitive decline as a decline in cognitive function of 0.5 or more SDs from baseline cognitive scores, 24-26 which we present in our results.
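The distribution-based quantities described above can be sketched as follows. ES and SRM follow the definitions in the text. The SEM formula (baseline SD times the square root of 1 minus reliability) and the RCI threshold (z × √2 × SEM, with z = 1.645 for a 90% interval) are commonly used forms that are assumed here; the study's own RCI equation is not reproduced in this text. All scores and the reliability value are hypothetical.

```python
import math
import statistics
from typing import Sequence


def effect_size(baseline: Sequence[float], follow_up: Sequence[float]) -> float:
    """Mean change divided by the SD of baseline scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return statistics.mean(changes) / statistics.stdev(baseline)


def standardized_response_mean(baseline: Sequence[float],
                               follow_up: Sequence[float]) -> float:
    """Mean change divided by the SD of the change scores."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return statistics.mean(changes) / statistics.stdev(changes)


def standard_error_of_measurement(baseline: Sequence[float],
                                  reliability: float) -> float:
    """Assumed form: SD of baseline * sqrt(1 - reliability)."""
    return statistics.stdev(baseline) * math.sqrt(1.0 - reliability)


def reliable_change_threshold(sem: float, z: float = 1.645) -> float:
    """Assumed form: z * sqrt(2) * SEM; z = 1.645 gives a 90% interval."""
    return z * math.sqrt(2.0) * sem


# Hypothetical scores for five participants at two consecutive visits
baseline = [28, 29, 30, 27, 26]
follow_up = [27, 29, 28, 26, 25]

es = effect_size(baseline, follow_up)
srm = standardized_response_mean(baseline, follow_up)
sem = standard_error_of_measurement(baseline, reliability=0.8)  # hypothetical reliability
rci = reliable_change_threshold(sem)
half_sd = 0.5 * statistics.stdev(baseline)
print(round(es, 3), round(srm, 3), round(sem, 3), round(rci, 3), round(half_sd, 3))
```

Under these assumptions, an individual change smaller in magnitude than `rci` would fall within the range expected from measurement error alone.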
For the anchor-based approach, we analyzed mean differences in cognitive test scores anchored to differences in CDR-Sum of Boxes (CDR-SB) scores. For the CU individuals, we used a change of CDR-SB ≥0.5 points and for the MCI group a change of ≥1 point as anchors to represent a clinically meaningful change. We calculated the mean, SD, and ES for changes in the cognitive tests separately for meaningful decline (CDR-SB difference of ≥0.5 and ≥1) and no meaningful decline (CDR-SB difference of 0).
According to previously described methods, MCIDs are recommended to be triangulated (calculated as the arithmetic mean) to produce a final MCID for each measure. 27 Triangulation integrates results from ratings with clinical changes, statistical estimates, and qualitative data from patients and/or clinicians to derive guidelines. 28 A previously suggested method is to assign the anchor-based results a weight of two-thirds and the distribution-based method a weight of one-third. 29 The final triangulated MCID is then calculated as the mean value of these 3 parts (2 anchor-based parts and 1 distribution-based part). Our anchor-based MCIDs are estimated from ES (based on clinical changes measured with CDR), and distribution-based MCIDs are based on statistical measures (SEM). Because an ES of 0.5 is generally considered a clinically significant change, we used the estimated anchor-based MCID with the ES closest to 0.5.
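The weighting described above amounts to a simple weighted mean, sketched here with hypothetical numbers. The function names are illustrative, and comparing the absolute value of the ES with 0.5 (so that declines with negative ES are handled the same way as increases) is an assumption.

```python
from typing import List, Tuple


def pick_anchor_estimate(candidates: List[Tuple[float, float]]) -> float:
    """From (anchor-based MCID, ES) pairs, return the MCID whose |ES| is closest to 0.5."""
    return min(candidates, key=lambda pair: abs(abs(pair[1]) - 0.5))[0]


def triangulate(anchor_mcid: float, distribution_mcid: float) -> float:
    """Weight the anchor-based estimate two-thirds and the distribution-based estimate one-third."""
    return (2.0 * anchor_mcid + distribution_mcid) / 3.0


# Hypothetical anchor-based candidates (MCID, ES) and a hypothetical SEM-based estimate
anchor = pick_anchor_estimate([(2.0, 0.3), (3.0, 0.45), (5.0, 0.9)])
triangulated = triangulate(anchor, distribution_mcid=6.0)
print(anchor, triangulated)
```

Because the anchor-based estimate counts twice, the triangulated value always lies closer to it than to the distribution-based estimate.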
To examine which tests best represented a clinically meaningful change, we analyzed the cognitive tests as independent variables in logistic regression models with CDR-SB as the dependent variable. For the CU group, the CDR-SB difference was dichotomized as either 0 (no clinical change) or ≥0.5 point change (smallest clinically relevant change). For the MCI group, we dichotomized with a larger CDR-SB difference as either 0 or ≥1 point; this excluded between 55 and 159 data points depending on the test because of a CDR-SB difference of 0.5. The area under the ROC curve (AUC) and the sensitivity and specificity for each test difference were calculated using ROC analyses. Logistic regression models were performed on a subsample with complete data for the analyzed cognitive tests (i.e., all logistic regression models were performed on the same population).
To find the optimal combination of test differences to estimate a cognitive change, we examined all cognitive test changes in the model for CU to identify a model with the lowest Akaike Information Criterion (AIC). AIC accounts for the trade-off between model fit and sparsity (as few included biomarkers as possible) to protect against model overfitting and can be used as a tool for model selection. 30,31 A lower AIC indicates a better model. To find the optimal combination for MCI, we excluded Animal Fluency, Letter S, and TMT B because these tests were only conducted every second year, and including them would have excluded many participants for lack of complete cases. We added age at visit, education years, and gender to the models. Predictors with a p value >0.1 were removed from the model.
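The dichotomization, ROC quantities, and AIC comparison can be sketched in plain Python as below. This sketch skips the logistic regression fit itself: for a single predictor, the AUC of the raw test difference is computed directly by pairwise comparison. The data, cutoff, and log-likelihood values are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def dichotomize(cdr_sb_change: float, threshold: float) -> Optional[int]:
    """1 = meaningful change, 0 = no change, None = excluded (e.g. 0.5 when threshold is 1.0)."""
    if cdr_sb_change >= threshold:
        return 1
    if cdr_sb_change == 0:
        return 0
    return None


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive scores higher than a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores: Sequence[float], labels: Sequence[int],
                            cutoff: float) -> Tuple[float, float]:
    """Classify score >= cutoff as a meaningful change."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    return tp / labels.count(1), tn / labels.count(0)


def aic(log_likelihood: float, n_parameters: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L); lower is better."""
    return 2.0 * n_parameters - 2.0 * log_likelihood


# Hypothetical MMSE changes and CDR-SB changes for eight data points (CU anchor >= 0.5)
mmse_change = [-3, -2, -2, -1, 0, 0, 1, -1]
cdr_change = [1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.5]

pairs: List[Tuple[float, int]] = []
for test_change, anchor_change in zip(mmse_change, cdr_change):
    label = dichotomize(anchor_change, threshold=0.5)
    if label is not None:
        pairs.append((-test_change, label))  # larger value = larger decline

scores = [p[0] for p in pairs]
labels = [p[1] for p in pairs]
area = auc(scores, labels)
sens, spec = sensitivity_specificity(scores, labels, cutoff=2)
print(area, sens, spec)
print(aic(-10.0, 3), aic(-9.5, 5))  # hypothetical fits: the smaller model has the lower AIC
```

In the hypothetical AIC comparison, the larger model fits slightly better but is penalized for its two extra parameters, which illustrates the trade-off between fit and sparsity.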

Standard Protocol Approvals, Registrations, and Patient Consents
The study was approved by the regional ethical committee at Lund University, Lund, Sweden. All participants gave their written informed consent to participate in the study.

Data Availability
Anonymized data will be shared upon request from a qualified academic investigator for the sole purpose of replicating procedures and results presented in the article and as long as data transfer is in agreement with the European Union legislation on the general data protection regulation and decisions by the Swedish Ethical Review Authority and Region Skåne, which should be regulated in a material transfer agreement.

Results
Baseline characteristics are summarized in   Table 2 and Figure 1 and triangulated MCIDs for CU and MCI participants in Table 3. Supplementary data for MCIDs (1/2 SD of baseline and SRM) are summarized in eTable 2 (links.lww.com/WNL/C168).
Next, we examined how accurately differences in test scores could estimate a minimal clinically relevant change using logistic regression models (see Figure 2 and

Discussion
We have established minimal clinically important differences (MCIDs) for group-based worsening in test scores for 8 commonly used cognitive tests to help guide clinicians and researchers on clinically relevant cognitive decline with repeated assessments. We investigated changes in cognitive tests longitudinally with distribution-based and anchor-based methods. The distribution-based MCIDs were generally higher (i.e., a larger test change was required to indicate an MCID) than the anchor-based MCIDs, showing the importance of using clinical measures of importance (such as CDR) according to the population. We found that the best predicting model for a clinical change included differences in test results in ADAS delayed recall, MMSE, and TMT B for cognitively unimpaired (CU) participants; Stroop, MMSE, and age for participants with MCI; and ADAS delayed recall, MMSE, SDMT, and TMT B together with gender for amyloid-positive CU participants.
The novelty of this study is that we present MCIDs for several cognitive tests that, to our knowledge, have not been studied before, which could be used in future clinical AD trials for establishing a clinically meaningful treatment effect for treatments seeking to prevent or slow disease progression. In addition, this study has the advantage of presenting triangulated data for MCIDs representing clinical changes, statistical estimates, and qualitative data from clinicians using CDR ratings. When triangulating MCIDs, we integrate results from ratings of clinical changes from ES (based on clinical changes measured with CDR) and statistical estimates (SEM). To our knowledge, previous studies have not investigated which tests best predict a cognitive decline using anchor-based methods, and in this study, we present this for CU individuals, individuals with MCI, and specifically amyloid-positive CU participants, which is the target population of several large ongoing AD trials.
Note on data points: a participant can contribute several test/CDR-SB differences. For example, a participant with assessments at baseline, 2 years, and 4 years will contribute 2 data points (baseline to 2 years and 2 years to 4 years).
The present findings are important because there is no previous consensus on MCIDs for cognitive test outcomes in AD trials; yet the FDA specifically highlights that a clinically meaningful improvement on cognitive test scores should be shown before approval of a drug. 3 Recent trials of treatments for AD have investigated changes in cognition by comparing individuals receiving placebo with those under active treatment. In the EMERGE study, 2 the group receiving high-dose treatment with the antiamyloid drug aducanumab showed a statistically significant reduction in decline of 0.6 points on the MMSE compared with the placebo group. However, using our MMSE MCID (1.7 points) would render this mean difference clinically insignificant. The TRAILBLAZER-ALZ 2 study of donanemab, which included individuals with MCI to mild AD, found a difference on the MMSE of 0.64 between the placebo and donanemab groups, which again would not be clinically significant.
Previous studies have reported MCIDs for MMSE of between 1 and 4 points, similar to or larger than our results; 32-35 however, we have not found previously estimated MCIDs for the other examined tests. One previous study suggested a change of 0.4 SD from baseline as the MCID for MMSE, corresponding to 1.4 MMSE points; 33 another study reported an MCID for MMSE of 1.6 points for 0.4 SD from baseline MMSE. 34 Both are close to our calculated MCIDs (−1.5 MMSE points for the CU group and −1.7 for the MCI group). Another previous study showed an estimated MCID for MMSE of 1-3 points depending on disease severity, with larger estimates when only a distribution-based approach was used, similar to our study. 32 Yet another study has shown a far higher MCID for MMSE of 3.72 points. 35 We found a very low 0.5 SD of baseline MMSE (1.1 for a CDR-SB change of ≥0.5 in CU), which is partly explained by the BioFINDER inclusion criterion of an MMSE score ≥28 points for the CU group, although this does not explain why the MCI group had the same result. The estimated RCIs are much larger than the MCIDs. This is explained by the methodology, which uses a large SD (1.65) and is based on the individual patient, whereas the MCID represents the minimal change at the group level. Much smaller changes may in fact be relevant, as seen in our calculated MCIDs.
The CDR global has been used as an external anchor to establish meaningful change estimates for other scales. 36 While it has clinical validity as a meaningful change, progression from one stage to another represents a change that is much larger than what may be considered minimally important, which is why we have chosen to use the CDR-Sum of Boxes (CDR-SB) as an anchor for this study. Previous studies have also shown that CDR-SB might be more accurate than global CDR for identifying MCI. 37 Studies have reported a high internal consistency for the CDR-SB across the AD spectrum with a low variability in mean changes 38 and that mean scores decline nearly linearly. 39 In summary, we therefore chose to use CDR-SB as the anchor for determining clinically important differences in cognitive test results. A recent study showed that CDR-SB was not strongly correlated with the cognitive assessments MMSE or ADAS-Cog at baseline; however, there was a moderate correlation between change in CDR-SB and change in ADAS-Cog13 (r = 0.5) and MMSE (r = −0.4) at a 2-year follow-up. The same study showed that both CDR-SB and MMSE had a strong responsiveness to change. 10 41,42 Using our model selection approach, we could confirm that a combination of TMT B, ADAS delayed recall, Symbol Digit, and MMSE is not only sensitive to cognitive changes over time, as shown previously, but also represents a clinically meaningful change. We did not find that changes in Animal Fluency were accurate in estimating a minimal meaningful decline. Overall, we identified the best combined model of changes in cognitive tests using logistic regression models. For amyloid-positive CU individuals, the best model combined differences in cognitive test results in ADAS delayed recall, MMSE, TMT B, and Symbol Digit with the patient's gender, which includes all 3 cognitive domains. 6 We suggest that this technique could be used to develop other clinically relevant cognitive composites and test batteries for use in predementia populations, using broader cognitive test batteries to find the best model for predicting a cognitive change.
A potential limitation of the study is that the follow-up of participants is annual for MCI participants and every second year for most CU participants (annual for those with subjective cognitive symptoms at baseline), which might result in missing some fluctuation or decline in cognition in CU individuals. However, because the primary approach is based on an anchor, this should not largely affect MCID estimates. In addition, progression occurs more slowly and less frequently in CU participants, which is why the study was designed to have less frequent follow-ups for controls. and an increase in SD), and it should therefore be interpreted with caution. This sample-dependent nature is a challenge to the use of ES in general. An alternative would have been to use a CDR-SB difference of ≥0.5 in all cases as minimal, defined as the smallest difference a clinician is able to observe and score; however, we reason that the magnitude of the change in score could then potentially be too small to be clinically meaningful. This is why we use both anchor-based and distribution-based methods in generating estimates, rather than the latter alone, and give priority to the anchor.
Our triangulated MCIDs for cognitive test measures could potentially be applied in clinical practice to evaluate whether a clinical progression has occurred since the last visit or whether the patient has remained stable. However, further work would be needed to define cutoffs representing possible scores on the instruments, as opposed to aggregate, group-level changes. The results from the logistic regression models (Figure 2 and eTable 3, links.lww.com/WNL/C168) suggest suitable tests depending on setting (CU, MCI, or amyloid-positive CU), and Table 3 provides cutoffs indicating that a meaningful change in the test has occurred. However, in clinical practice, MCIDs need to be rounded up to the nearest higher integer to evaluate differences. This selection of tests and the identified cutoffs should, however, be validated in independent and more diverse populations with a wider range of age and education level. The MCIDs can also help to identify treatment benefits in clinical trials of therapies for early AD; as noted above, several recent studies of pharmaceutical treatments for AD have found statistically significant changes in cognitive outcomes that may not be clinically relevant.
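The rounding step can be illustrated with the triangulated MCIDs listed in the abstract. In this sketch, "rounded up" is interpreted as rounding the magnitude up while keeping the sign (so −1.5 becomes −2), which is an assumption, as are the function names and the reading of the sign as the direction of worsening.

```python
import math

# Triangulated MCIDs (CU, MCI) as listed in the abstract
TRIANGULATED_MCID = {
    "MMSE": (-1.5, -1.7),
    "ADAS delayed recall": (1.4, 1.1),
    "Stroop": (5.5, 9.3),
    "Animal Fluency": (-2.8, -2.9),
    "Letter S Fluency": (-2.9, -1.8),
    "SDMT": (-3.5, -3.8),
    "TMT A": (11.7, 13.0),
    "TMT B": (24.4, 20.1),
}
GROUP_INDEX = {"CU": 0, "MCI": 1}


def rounded_cutoff(mcid: float) -> int:
    """Round the magnitude up to the next integer, keeping the sign (assumed interpretation)."""
    magnitude = math.ceil(abs(mcid))
    return -magnitude if mcid < 0 else magnitude


def meets_mcid(test: str, group: str, observed_change: float) -> bool:
    """True if the observed change is at least as large as the rounded cutoff, in the direction of worsening."""
    cutoff = rounded_cutoff(TRIANGULATED_MCID[test][GROUP_INDEX[group]])
    if cutoff < 0:
        return observed_change <= cutoff
    return observed_change >= cutoff


print(rounded_cutoff(-1.5), rounded_cutoff(24.4))
print(meets_mcid("MMSE", "CU", -2), meets_mcid("MMSE", "CU", -1))
```

These cutoffs derive from group-level estimates, so a flag on an individual change is best read as a prompt for clinical judgment.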