Sample sizes for brain atrophy outcomes in trials for secondary progressive multiple sclerosis
Abstract
Background: Progressive brain atrophy in multiple sclerosis (MS) may reflect neuroaxonal and myelin loss and MRI measures of brain tissue loss are used as outcome measures in MS treatment trials. This study investigated sample sizes required to demonstrate reduction of brain atrophy using three outcome measures in a parallel-group, placebo-controlled trial for secondary progressive MS (SPMS).
Methods: Data were taken from a cohort of 43 patients with SPMS who had been followed up with 6-monthly T1-weighted MRI for up to 3 years within the placebo arm of a therapeutic trial. Central cerebral volumes (CCVs) were measured using a semiautomated segmentation approach, and brain volume normalized for skull size (NBV) was measured using automated segmentation (SIENAX). Change in CCV and NBV was measured by subtraction of baseline from serial CCV and SIENAX images; in addition, percentage brain volume change relative to baseline was measured directly using a registration-based method (SIENA). Sample sizes for given treatment effects and power were calculated for standard analyses using parameters estimated from the sample.
Results: For a 2-year trial duration, minimum sample sizes per arm required to detect a 50% treatment effect at 80% power were 32 for SIENA, 69 for CCV, and 273 for SIENAX. Two-year minimum sample sizes were smaller than 1-year by 71% for SIENAX, 55% for CCV, and 44% for SIENA.
Conclusion: SIENA and central cerebral volume are feasible outcome measures for inclusion in placebo-controlled trials in secondary progressive multiple sclerosis.
Glossary
 ANCOVA = analysis of covariance;
 CCV = central cerebral volume;
 FSL = FMRIB Software Library;
 MNI = Montreal Neurological Institute;
 MS = multiple sclerosis;
 NBV = normalized brain volume;
 PBVC = percent brain volume change;
 RRMS = relapsing–remitting multiple sclerosis;
 SPMS = secondary progressive multiple sclerosis.
Definitive clinical trials of potential new disease-modifying agents in multiple sclerosis (MS) often evaluate disability as the primary outcome measure. Because MS is characterized by a variable but generally slow clinical evolution, controlled studies with disability endpoints require large numbers of patients (several hundreds) to be studied over several years. Accordingly, there is considerable interest in developing surrogate laboratory markers of disease progression that, if more sensitive than disability, would enable trials to be performed more quickly and with fewer patients.
Irreversible and progressive disability in MS is likely due to neuroaxonal loss and demyelination, which occur in focal white matter lesions^{1} and also in normal-appearing white^{2,3} and gray matter.^{4} MRI-measured brain atrophy has been proposed as a marker of progressive axonal and myelin loss,^{5} and it is now often acquired as an outcome measure in phase III trials.^{6–8} If brain atrophy is to be used as a reliable outcome measure in clinical trials, power calculations are required not only to determine the sample sizes needed to show therapeutic efficacy, but also to help identify the most suitable atrophy outcome measures, which is our primary aim here. In this report, based on data acquired in a multicenter sample of placebo-treated subjects with secondary progressive MS (SPMS), we calculate and compare sample sizes required in a parallel-group, placebo-controlled trial for SPMS subjects, using three brain atrophy outcome measures: a semiautomated measure of a regional (central) cerebral volume that has previously been used in MS cohorts^{9–11} and two whole-brain automated measures, SIENA and SIENAX, that have also been used extensively.^{7,12,13} Two secondary aims are to contrast the sample sizes required for different trial durations and analyses and to examine the relationships between the three atrophy outcomes.
METHODS
Patients.
A substudy^{10} from five centers in a placebo-controlled trial of interferon beta-1b in SPMS acquired 6-monthly T1-weighted brain MRI over 3 years. There were 46 placebo-treated patients from the five centers (20 women, 26 men), 43 of whom provided usable data. The mean age at entry was 40.9 years (SD 7.9 years), the mean disease duration was 13.4 years (SD 7.5 years), the mean time since evidence of progression was 3.8 years (SD 3.4 years), and the mean Expanded Disability Status Scale score was 5.2 (SD 1.1, range 3–6.5). These patients underwent 6-monthly T1-weighted spin echo MRI (repetition time 500–700 msec, echo time 5–25 msec, 256 × 256 matrix, 24-cm field of view) for 3 years with 5-mm-thick contiguous axial slices acquired through the brain on each occasion.
Brain atrophy measures.
Central cerebral volume (CCV) was measured using a semiautomated technique that segments cerebral tissue from surrounding scalp and other extracerebral tissue using a four-step algorithm. The details of the methodology are described elsewhere.^{9,10} Four contiguous, axial, 5-mm-thick slices were studied, chosen so that the most caudal slice was at the level of the velum interpositum cerebri. This region of the cerebral hemispheres was chosen because in a previous study 1) there had been substantial atrophy seen over an 18-month period in subjects with SPMS^{9} and 2) the measure–reposition–rescan–remeasure coefficient of variability of the method was 0.56%.^{9}
SIENAX was used to measure normalized brain volume.^{14} SIENAX automatically segments brain from nonbrain matter, calculates the brain volume, and applies a normalization factor to correct for skull size. The normalization factor is obtained by registering the subject's scan to the Montreal Neurological Institute (MNI) 152 standard image using the skull to normalize spatially. Percentage brain volume change (PBVC) for each time point relative to baseline was measured using SIENA.^{14} SIENA registers the baseline and follow-up magnetic resonance images using the skull as a scale and skew constraint, and then estimates the displacement of the brain edge for each point of the brain edge between these two scans. The brain edge displacements of all edge points are used to calculate the “overall” PBVC, which is expressed as a single value. Because not all scans included the full brain, the SIENAX and SIENA analyses were restricted to a prespecified interval along the z-axis, ranging from −52 to +60 mm in standard MNI152 space. When necessary, errors in brain extraction were corrected manually by a single experienced observer; this has been shown previously^{13} to reduce unwanted variability in SIENA and SIENAX results without materially introducing interobserver/intercenter variability. All scans required manual correction to a varying extent. SIENAX and SIENA are part of the FMRIB Software Library (FSL).^{15} All SIENAX and SIENA analyses were performed using FSL version 3.1.
Statistical methods and issues.
Sample size estimates were calculated for trial durations of 12, 24, and 36 months to detect treatment effects of 30%, 40%, 50%, and 60% at 80% and 90% power, all with a two-tailed α (significance level) of 5%. Treatment is assumed to have an immediate and constant effect, and in the absence of a healthy control group treatment effects assume zero atrophy in healthy subjects, 100% equating with zero volume loss. For each duration, three standard statistical analysis methods were considered for the comparisons between active and placebo trial groups: 1) comparison of the mean change from baseline, using a t test; 2) comparison of baseline-adjusted mean change from baseline, using analysis of covariance (ANCOVA)^{16}; and 3) comparison of mean rates of change estimated from longitudinal linear mixed models,^{17} using either 6-monthly or annual time points. Relative efficiencies are used to summarize comparisons: the relative efficiency of procedure A vs B is the inverse of the ratio of the corresponding sample sizes required to achieve the same power. These methods are discussed further below, but technical details of the statistical models and calculations are given in appendix e-1 on the Neurology® Web site at www.neurology.org.
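As an illustration of the first analysis method, the per-arm sample size for a t test of mean change can be sketched with the usual normal approximation. This is a minimal sketch and not the calculation given in the appendix: the function names and example inputs are hypothetical, and the treatment effect is taken as a fraction of the mean placebo change, in line with the zero-atrophy assumption stated above.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(mean_change, sd_change, effect, power=0.80, alpha=0.05):
    """Normal-approximation sample size per arm for comparing mean change.

    mean_change: mean change from baseline in the placebo arm (assumed input)
    sd_change:   SD of the change from baseline (assumed equal in both arms)
    effect:      treatment effect as a fraction, e.g. 0.5 for 50%
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-tailed significance level
    z_power = z.inv_cdf(power)
    delta = effect * abs(mean_change)    # absolute difference between arms
    return ceil(2 * ((z_alpha + z_power) * sd_change / delta) ** 2)


def relative_efficiency(n_a, n_b):
    """Relative efficiency of procedure A vs B: the inverse of n_a / n_b."""
    return n_b / n_a
```

For example, with a hypothetical mean change of −1.0 and SD of 1.0 (any units), `n_per_arm(-1.0, 1.0, 0.5)` gives 63 per arm; an exact t-based calculation would be slightly larger. Applying `relative_efficiency` to two sample sizes obtained for the same power gives the summary measure defined above.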
A number of issues are relevant to the comparisons we present and to their potential impact on trial design. Chiefly, these relate to the choice of sample required to obtain valid comparisons between outcomes or between different trial durations or statistical analyses, and issues regarding outcome type.
Choice of samples for comparison.
For the primary comparison, between atrophy measures, best estimates come from subjects with all three measures available at a given time point, “all-three” samples. This ensures that differences between measures are not due to different subjects. For these comparisons, at different time points, sample sizes were calculated just for a 50% treatment effect (because the relative efficiency of the volume measures is approximately constant over different treatment effects for a given analysis method and duration). For any given trial duration and analysis method, this gives a valid comparison across the atrophy measures. For the simplest analysis method, the t test of changes, the nonparametric bias-corrected bootstrap^{18} (1,000 replicates) was used to assess the statistical significance of sample size differences between the measures: standard errors for the differences in sample size estimates are not theoretically available, but in this context the bootstrap method gives a valid test, estimating confidence intervals for the differences empirically by multiple resampling (replicates) of the data. (p value ranges are given because of the computationally intensive nature of the bootstrap.)
For best results within each individual measure and also for the secondary comparison between analysis methods and trial durations using a given measure, optimal estimates are given for each volume measure separately by fitting a longitudinal model using an “all-data” sample: the 36-month duration 6-monthly longitudinal model, which uses every available time point for that measure. Because the “all-three” samples have to drop a subject at a given time point if one of the three measures is missing, the “all-data” sample gives additional information on the robustness of the “all-three” comparisons to missing data. The estimated slope and variance parameters for the “all-data” model were then used to deduce the parameters relevant to the different statistical analyses and time points and thus generate the appropriate sample sizes. Thus, from the single set of “master” 36-month parameters, we obtain a valid comparison of the different analysis methods and durations in each measure, assuming constant atrophy over the period. Under this assumption, these parameters also allow estimation of the effect of altering observation times. It has been shown^{19} that the timing of observations is relevant to gains in power, e.g., adding a third observation midway between baseline and final follow-up provides no additional information with which to estimate linear change. Though our primary aim is to compare the volume measures rather than establish optimal design, for interest we report some efficiency gains from a theoretically more efficient concentration of observations toward the trial period extremes.
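The resampling idea can be sketched as follows. Note the substitution: this sketch uses a plain percentile bootstrap, which is simpler than the bias-corrected bootstrap used in the analysis, and a normal-approximation sample size for 80% power and two-tailed α of 5%. All names are hypothetical. Subjects are resampled as pairs so that both measures are always evaluated on the same resampled subjects.

```python
import random
from statistics import mean, stdev

# z for two-tailed alpha = 0.05 plus z for 80% power (normal approximation)
Z_SUM = 1.959964 + 0.841621


def n_from_changes(changes, effect=0.5):
    """Unrounded per-arm sample size implied by a list of observed changes."""
    delta = effect * abs(mean(changes))
    return 2 * (Z_SUM * stdev(changes) / delta) ** 2


def bootstrap_interval(changes_a, changes_b, reps=1000, seed=0):
    """Percentile 95% interval for n(A) - n(B), resampling subjects in pairs.

    changes_a[i] and changes_b[i] must belong to the same subject.
    """
    rng = random.Random(seed)
    n_subjects = len(changes_a)
    diffs = []
    for _ in range(reps):
        picks = [rng.randrange(n_subjects) for _ in range(n_subjects)]
        a = [changes_a[i] for i in picks]
        b = [changes_b[i] for i in picks]
        diffs.append(n_from_changes(a) - n_from_changes(b))
    diffs.sort()
    return diffs[int(0.025 * reps)], diffs[int(0.975 * reps) - 1]
```

An interval that excludes zero would suggest that the two measures require different sample sizes. The sketch assumes the mean change in every resample is nonzero, which is reasonable when atrophy is consistently observed.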
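The point about a midway observation can be checked with a simplified calculation. Assuming independent measurement errors with a common variance and an ordinary least-squares line through one subject's observations (a simplification of the mixed models used here), the variance of the estimated slope is inversely proportional to the sum of squared deviations of the observation times from their mean. The helper below is hypothetical and only illustrates that relationship.

```python
def slope_information(times):
    """Sum of squared deviations of observation times from their mean.

    Under independent, equal-variance errors, Var(slope) = sigma^2 / this value,
    so a larger value means a more precisely estimated linear rate of change.
    """
    center = sum(times) / len(times)
    return sum((t - center) ** 2 for t in times)


# Observation schedules in months for a 24-month trial
two_scans = [0, 24]
with_midpoint = [0, 12, 24]
evenly_spaced = [0, 8, 16, 24]
near_extremes = [0, 6, 18, 24]

for schedule in (two_scans, with_midpoint, evenly_spaced, near_extremes):
    print(schedule, slope_information(schedule))
```

In this simplified setting the midpoint scan contributes nothing to the slope estimate, while moving intermediate scans toward the start and end of the trial increases the information. The calculation ignores dropout and between-subject variation in atrophy rates.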
Volume measures.
The methodology of SIENA, calculating the percentage brain volume change (PBVC), is a “direct”^{20} measure of change, with theoretically less measurement error compared to indirect measures of change obtained by numerical subtraction between volumes calculated at separate time points, as is required for CCV and SIENAX. The superior precision of SIENA compared with indirect volume measures has been noted previously in cohorts with relapsing–remitting MS (RRMS).^{21–23} However, direct difference methods have a different error structure than absolute measures, and this was taken into account in constructing the longitudinal models to estimate SIENA parameters.^{20}
To examine the concordance between the three measures, the “all-three” sample was used, with CCV and SIENAX converted into PBVC units using 100 × (volume at time point − baseline volume)/baseline volume. Pearson correlation coefficients and Bland–Altman plots^{24} were obtained, and the standard deviations of the measures were statistically compared using the Pitman test^{25} for paired variances.
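The conversion to PBVC units and the quantities behind a Bland–Altman plot can be sketched as below. This is a minimal sketch with hypothetical function names; the 1.96 multiplier gives the conventional 95% limits of agreement and is an assumption about the plot settings, which are not specified here.

```python
from statistics import mean, stdev


def pbvc_from_volumes(baseline_volume, followup_volume):
    """Percentage volume change relative to baseline, in PBVC units."""
    return 100.0 * (followup_volume - baseline_volume) / baseline_volume


def bland_altman_limits(measure_a, measure_b):
    """Mean difference (bias) and 95% limits of agreement for paired values."""
    diffs = [a - b for a, b in zip(measure_a, measure_b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread
```

For example, a volume falling from 1000 to 990 (any units) corresponds to `pbvc_from_volumes(1000.0, 990.0)`, which is −1.0. Paired percentage changes from two measures in the same subjects can then be passed to `bland_altman_limits`.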
RESULTS
Of the 46 patients available, a maximum of 43 patients were used in the analyses: 2 subjects were excluded having only SIENAX baseline and no other valid measurements (both dropped out at 6 months), and 1 subject with only baseline measures in CCV and SIENAX (6-month scan electronic data rejected and then dropped out at 12 months) was also excluded. The patients provided a maximum of 246 data points for the analyses. From a theoretical maximum of 43 × 7 = 301 observations, 55 were missing: 25 because of patient dropout, 3 because of scan nonacquisition, 17 because of electronic data rejection, 1 because of hard copy (and therefore electronic data) rejection, and 9 because of unavailable electronic data. Table 1 shows the number of patients with all three measures available at any one time point, along with summary statistics of changes in volume from baseline and, for CCV and SIENAX only, absolute volumes and correlations between baseline and later volumes.
Concordance between the volume measures.
There was in general much better agreement between SIENA and CCV percentage changes than with SIENAX (table 1; figure). Concordance between the three measures is further detailed in appendix e2; figure e1, A–C; and figure e2, A–C.
Comparison of sample size estimates between the measures.
Table e-1 gives the parameter estimates on which the sample size calculations for the “all-three” comparisons are based. (Details of the longitudinal parameters are given in appendix e-1.) Longitudinal model residuals did not show any serious nonnormality. Table 2 shows sample size estimates for 50% treatment effect across the three measures, but the sample size ratios (relative efficiencies) within any single row would be the same for other treatment effects. SIENA has relative efficiencies between 2 (36-month t test) and 2.5 (24-month t test) compared with CCV and between 6.8 (36-month longitudinal) and 31.8 (12-month t test) compared with SIENAX. CCV has relative efficiency between 3.2 (36-month longitudinal) and 15.2 (12-month t test) compared with SIENAX. Bootstrap inference, for the pairwise differences in t test sample sizes between measures, showed that all sample size differences were p < 0.05: in particular, SIENA vs SIENAX gave p < 0.001 at all three durations; SIENA vs CCV gave 0.03 < p < 0.04 at 12 months, 0.004 < p < 0.005 at 24 months, and 0.01 < p < 0.02 at 36 months; and CCV vs SIENAX gave 0.001 < p < 0.002 at 12 months, 0.02 < p < 0.03 at 24 months, and 0.01 < p < 0.02 at 36 months.
Comparison of sample size estimates between analysis methods and trial durations.
Table e-2 gives the parameter estimates underlying these sample size calculations. Table e-3 shows the sample size estimates across the different analysis methods and trial durations, for each volume measure separately, allowing valid comparisons within the columns. For all measures, the most influential factor in determining sample sizes is trial duration. Minimum 2-year sample sizes per arm for 50% treatment effect at 80% power were 32 for SIENA, 69 for CCV, and 273 for SIENAX and were 44%, 55%, and 71% lower, respectively, than the corresponding 1-year sizes. Detailed comparisons between analysis methods and trial durations are presented in appendix e-3. Key points are that adding an observation at the midpoint of the follow-up period does not add relevant information to the baseline and final scans, while the effect of additional informative (noncentral) time points for a given duration is greater the more variable the measure. Thus, additional informative time points have an impact for SIENAX, with its greater variability and lower correlation between times; but for CCV, and particularly for SIENA, adding time points between baseline and last follow-up gives little theoretical gain, even if the scans are clustered at the period extremes, provided there is negligible patient dropout.
DISCUSSION
Sample sizes based on four volume measures including SIENA^{21} and SIENA precision^{23} have been estimated previously in RRMS cohorts; those studies reported the superior precision of SIENA compared with indirect measures of volume change.
Our results show generally better agreement between CCV and SIENA than between either of these and SIENAX. Differences between CCV and SIENA may be because the latter is a registration-based method directly measuring brain volume changes, whereas the former involves numerical subtraction. Additionally, these differences may be due to using a greater portion of the brain for SIENA. Nevertheless, there was good agreement between these two measures, particularly regarding longitudinal trajectory.
Comparing the three measures for the same analyses/durations gives highest sample sizes for SIENAX, followed by CCV and then SIENA, with the advantage of SIENA more pronounced at shorter durations. These results are explained by the comparative standard deviations of the three measures, relative to treatment effects. Although the variability of SIENAX absolute volumes, as a percentage of the volume, is actually lower than for CCV, the SIENAX changes have much higher variability than the other two measures, leading to higher SIENAX sample sizes for the analyses of changes. For the longitudinal models, sample sizes over shorter durations are dominated by the within-subject standard deviation, which was highest relative to treatment effect for SIENAX and lowest for SIENA. Over longer durations, sample sizes are influenced more by the between-subject atrophy rate standard deviation, which was again highest for SIENAX and lowest for SIENA. Although some patients were lost to the “all-three” sample underlying direct between-measure comparisons, the general similarity in sample sizes from the “all-three” and the “all-data” samples suggests the between-measure comparison is robust to patient loss.
Although in theory analyzing CCV with adjustment for baseline intracranial volume would only reduce the variability between subjects at baseline rather than of atrophy rates and, therefore, may not greatly enhance power in longitudinal studies, further work is required to assess the potential gains from such adjustment. Further work is also required to assess any change in power from calculating SIENA direct changes between consecutive time points, rather than from baseline as in these data; or from using ANCOVA to adjust SIENA for baseline SIENAX, though our data suggest little gain from this because ANCOVA results tend to approach but not improve on the corresponding longitudinal analysis with annual time points.
Detecting smaller treatment effects, or increasing test power, naturally increased the required sample sizes. Comparing analyses and durations, for all three measures, increasing the duration or the number of informative (i.e., not midway) time points reduced the required sample size, with increased duration generally having greater impact than number of time points. In general, “noisier” measures gain more than precise measures from an increase in the number of informative data points: thus, SIENAX gains the most from increasing the intrinsic power of the analysis by extending duration or adding points (particularly points toward the period extremes), followed by CCV, with the least gains for SIENA.
SIENA sample sizes for different trial durations have previously been estimated^{21} as 69 (1 year), 44 (2 years), and 40 (3 years), based on an RRMS cohort to be analyzed with t tests of change at 90% power and 50% treatment effect, close to our corresponding 77, 45, and 39 in an SPMS cohort (table e-3). This might suggest that, despite the use of different T1-weighted sequences on which atrophy was measured (three-dimensional in the RRMS group, two-dimensional in the SPMS group), the average rate of brain atrophy and its variance between subjects may be similar in RRMS and SPMS cohorts.^{26} The SPMS cohort in our European trial of interferon beta-1b had more ongoing relapses and a shorter disease duration than the SPMS cohort that took part in a North American trial of interferon beta-1b,^{27} and further work might investigate sample sizes in a nonrelapsing SPMS cohort with longer disease duration.
One assumption that may exaggerate the study power is that 100% treatment effect equates to zero volume loss. However, healthy controls experience some brain volume loss (0.1%–0.3% per year), and if disease-specific treatment effects do not affect the “normal” atrophy associated with aging, a larger sample size will be required to show the same disease-specific effect. If 0.1% “healthy” annual loss is assumed, the SIENA sample size of 28 required for a 50% treatment effect, 80% power 3-year longitudinal analysis increases to 33; if 0.3% is assumed, the new sample size is 50. This effect might be allowed for in analysis models where healthy controls are scanned using the same protocol.
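The direction of this effect can be sketched under a simplifying assumption made only for this illustration: the detectable difference shrinks from a fraction of the total atrophy rate to the same fraction of the disease-specific rate (total minus healthy), with variability unchanged, so the sample size scales with the squared ratio of the two rates. The function name and the rates passed in are hypothetical inputs, not estimates from this cohort, and the sketch is not intended to reproduce the figures quoted above.

```python
from math import ceil


def inflate_for_healthy_atrophy(n_per_arm, total_rate, healthy_rate):
    """Rescale a per-arm sample size when only disease-specific atrophy is treatable.

    n_per_arm:    sample size computed assuming zero atrophy in healthy subjects
    total_rate:   annual atrophy rate in untreated patients (% per year, assumed)
    healthy_rate: annual atrophy rate attributed to normal aging (% per year, assumed)
    """
    if not 0 <= healthy_rate < total_rate:
        raise ValueError("healthy_rate must be non-negative and below total_rate")
    scale = (total_rate / (total_rate - healthy_rate)) ** 2
    return ceil(n_per_arm * scale)
```

With no healthy atrophy the sample size is unchanged, and the inflation grows quickly as the healthy rate approaches the patient rate, which is why slowly progressing cohorts are most sensitive to this assumption.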
Determining optimal trial design has to take careful consideration of issues such as dropout rate and scanning burden on patients, and is outside the scope of this article; we can here only highlight relevant factors. It is important to note that the relatively small gain in power for SIENA and CCV shown by multi–time point longitudinal analyses compared with t tests and ANCOVA conceals an important advantage of the more sophisticated models: missing one data point at either baseline or final follow-up will remove a subject from the simpler analyses, whereas the longitudinal models can use all available data points efficiently and thus minimize the impact of missing data, in terms of both power and potential bias from differential dropout. Possible dropout toward the end of follow-up may also limit the power gains from timing scans near the trial end rather than spacing them regularly.^{19}
We assumed a linear volume change over time. Testing for nonlinearity, we found weak evidence of trajectories leveling off over time, consistent with a proportionate change, which is linear on a logarithmic volume scale. As a precaution, we repeated the sample size calculations on the log outcomes, but obtained sizes almost identical to those we report for SIENAX and SIENA and around 10% greater for CCV (probably because the changes tend to be larger as a proportion of absolute volumes for CCV than for the other measures). Further work on larger data sets would be required to assess possible nonlinearity satisfactorily.
For CCV and particularly for SIENA, extending the trial duration from 2 to 3 years reduces sample sizes relatively modestly. In contrast, extending the duration from 1 to 2 years can roughly halve the sample sizes required for these outcomes. A further disadvantage of 1-year duration is the possible short-term effect of biologic confounds tending to undermine sample size calculations, which, as here, assume immediate onset and constancy of treatment effect. First, any wallerian degeneration from axonal injury before the commencement of treatment may continue to evolve, and thus cause atrophy, for several months after the start of treatment, possibly delaying any treatment benefit from manifesting as reduced atrophy rate. Second, if the therapy has an antiinflammatory as well as a neuroprotective effect, it may cause an initial decrease in brain volume due to resolution of inflammation. Such an effect has been proposed to contribute to decreases in brain volume seen after treatment with IV methylprednisolone,^{28} beta interferon,^{6,29} and natalizumab.^{8} To avoid these confounds, baseline for analysis could be taken after an initial treatment “burn in” period. The appropriate interval is uncertain, but 3 or 6 months might be considered reasonable.^{29}
AUTHOR CONTRIBUTIONS
Statistical analysis was conducted by D.R.A.
ACKNOWLEDGMENT
The authors thank Stenmar van Steenbrugge for assisting in the SIENA and SIENAX analyses and Chris Frost and Jonathan Bartlett for their statistical advice.
Footnotes

Supplemental data at www.neurology.org
Editorial, page 586
ePub ahead of print on November 12, 2008, at www.neurology.org.
The Nuclear Magnetic Resonance Research Unit is partly supported by The Multiple Sclerosis Society of Great Britain and Northern Ireland. The Multiple Sclerosis Centre Amsterdam is supported by the Dutch Foundation for MS Research (grant 05538c).
Disclosure: Bayer Schering Pharma AG supported the data collection for this study. F.B., M.F., P.M., C.H.P., and D.H.M. have received honoraria from Bayer Schering Pharma AG (less than $10,000). K.W. and K.B. are current employees of Bayer Schering Pharma AG.
Received May 19, 2008. Accepted in final form August 20, 2008.
REFERENCES
Kutzelnigg A, Lucchinetti CF, Stadelmann C, et al. Cortical demyelination and diffuse white matter injury in multiple sclerosis. Brain 2005;128(pt 11):2705–2712.
Miller DH, Barkhof F, Frank JA, Parker GJM, Thompson AJ. Measurement of atrophy in multiple sclerosis: pathological basis, methodological aspects and clinical relevance. Brain 2002;125:1676–1695.
Rudick RA, Fisher E, Lee JC, Simon J, Jacobs L. Use of the brain parenchymal fraction to measure whole brain atrophy in relapsing-remitting MS. Multiple Sclerosis Collaborative Research Group. Neurology 1999;53:1698–1704.
Miller DH, Soon D, Fernando KT, et al. MRI outcomes in a placebo-controlled trial of natalizumab in relapsing MS. Neurology 2007;68:1390–1401.
Losseff NA, Wang L, Lai HM, et al. Progressive cerebral atrophy in multiple sclerosis: a serial MRI study. Brain 1996;119:2009–2019.
Molyneux PD, Kappos L, Polman C, et al. The effect of interferon beta-1b treatment on MRI measures of cerebral atrophy in secondary progressive multiple sclerosis. Brain 2000;123:2256–2263.
Goldstein H. Multilevel Statistical Models. Kendall's Library of Statistics Series
Sormani MP, Rovaris M, Valsasina P, Wolinsky JS, Comi G, Filippi M. Measurement error of two different techniques for brain atrophy assessment in multiple sclerosis. Neurology 2004;62:1432–1434.
Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307–310.
Armitage P, Berry G, Matthews J. Statistical Methods in Medical Research,
Panitch H, Miller A, Paty D, et al. Interferon beta-1b in secondary progressive MS: results from a 3-year controlled study. Neurology 2004;63:1788–1795.
Rao AB, Richert N, Howard T, et al. Methylprednisolone effect on brain volume and enhancing lesions in MS before and during IFNbeta-1b. Neurology 2002;59:688–694.
Hardmeier M, Wagenpfeil S, Freitag P, et al. Rate of brain atrophy in relapsing MS decreases during treatment with IFNbeta-1a. Neurology 2005;64:236–240.