Effect of training and different measurement strategies on the reproducibility of brain MRI lesion load measurements in multiple sclerosis

Abstract
In this study, we evaluated the intra- and interobserver variabilities in measuring lesion load of brain MRI abnormalities present on proton-density scans from patients with MS, using both manual outlining and a semiautomated local thresholding technique (LTT). We also evaluated how these variabilities were affected by the use of standard rules for lesion load measurements, training, and different measurement strategies. The intraobserver variabilities obtained after establishing rules for lesion load measurements and training were not significantly different from those obtained before any consensus among the observers, both for manual outlining and for the LTT. On the contrary, the interobserver variabilities obtained with manual outlining or the LTT were significantly reduced when rules for lesion load measurements were used. For manual outlining, the intraobserver variability did not significantly change when the measurements were performed after an experienced radiologist identified lesions or when using adjacent slices and the corresponding T2-weighted images as reference for lesion identification. On the contrary, for the LTT, the intraobserver variability was significantly reduced by the use of the radiologic marking. The interobserver variabilities for both manual outlining and the LTT were reduced compared with the free condition when these measurement strategies were used. Our findings demonstrate that both lesion identification and outlining are important sources of variation for MRI lesion load measurements in MS and that there are simple strategies to reduce such variation that might be useful when planning clinical trials.
Quantitative assessment of lesion load on brain MRIs from patients with MS is increasingly being used to monitor disease evolution, either natural or modified by treatment.1,2 The expected yearly change of lesion load in patients with MS has been estimated from the placebo arm of previous clinical trials to be around 10%.3 It has already been demonstrated that several factors markedly influence the reproducibility of lesion load measurements of brain MRI abnormalities from patients with MS, including intrapatient biological variations,4 the use of multiple MR scanners,5 different acquisition variables,6-8 segmentation techniques,9,10 and accuracy of repositioning.11,12 Because the magnitude of the variability introduced by all these factors may make it impossible to detect any reliable lesion load change over time, several strategies to reduce the effect of these sources of variations have also been suggested and developed.4,5,12
There are, however, at least two other factors, operator training and measurement strategy, that are likely to be major sources of variation when measuring lesion load present on MRIs from MS patients. These aspects have not yet been systematically investigated. Techniques to segment MS lesions are based on two steps: lesion identification and lesion delineation.1 However, no efforts have been made to establish common rules for measurements or to evaluate the effect on reproducibility of measurements due to the two steps.
In this study, we established some common rules for lesion load measurements of brain MRI abnormalities in patients with MS and evaluated the effect of training and different measurement strategies on intra- and interobserver reproducibility of such measurements using both manual outlining and a semiautomated local thresholding technique (LTT).
Methods. Effect of training. Five patients with clinically definite MS13 in a long-term natural history study in Milan were selected; they had a wide and representative range of brain MRI abnormalities. For each patient, three slices of the proton-density scans (TR/TE = 2000/50; 24 contiguous interleaved axial slices with a slice thickness of 5 mm; field of view [FOV] = 250 mm; matrix size = 256 × 256; 1 excitation) obtained using a magnet operating at 1.5 T were chosen, one showing the posterior fossa structures, one the supratentorial periventricular areas, and one the cortex and the subcortical white matter. Five observers (MF, MGC, CG, JHvW, and JG) in five different centers measured the abnormalities present on these slices on two occasions (separated by an interval of 1 month), using manual outlining and the LTT. These observers had considerable experience using both techniques to segment MS lesions but had not developed any consensus regarding their application. Thus, for the first set of measurements, each observer used a different approach based on personal experience. Then, all observers met for a full-day session to discuss any discrepancies among the measurements. Rules to reduce measurement variability were discussed and formalized by one of them (MF) (Appendix). These rules, which included the definition of a common strategy for measurements and guidelines indicating standard procedures to be followed for each step of the measurement process, were circulated among the five observers, who then repeated the two sets of measurements on the same material with the same time schedule and segmentation techniques used in the untrained situation.
Effect of different measurement strategies. Five patients with clinically definite MS,13 who were part of the placebo arm of a clinical trial carried out in London and Amsterdam, were selected with the same criteria used to select the first group of patients. For each patient, three slices (located approximately at the same anatomic levels as those of the first set) of the proton-density scans (TR/TE = 2000/34; 24 contiguous interleaved axial slices with a slice thickness of 5 mm; FOV = 250 mm; matrix size = 256 × 256; 1 excitation), obtained using a machine operating at 1.5 T, were measured. These slices were distributed to the same five observers together with the corresponding T2-weighted images (TR/TE = 2000/90; 24 contiguous interleaved axial slices with a slice thickness of 5 mm; FOV = 250 mm; matrix size = 256 × 256; 1 excitation) and the two adjacent proton-density and T2-weighted slices to increase confidence in lesion identification. The five observers were asked to measure the abnormalities present on these slices on two occasions (separated by an interval of 1 month), using the same two segmentation techniques. Three months later, the lesions present on these images were identified by a radiologist (MGC) using a small square region of interest (ROI) placed in the center of each lesion, and the same five observers were asked to repeat the entire measurement procedure.
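Both image sets share the same geometry (FOV = 250 mm, 256 × 256 matrix, 5-mm slices), so the volume represented by one pixel of an ROI can be worked out directly. The following is a minimal sketch of that arithmetic, assuming square pixels and that volume is taken as in-plane area times slice thickness; the paper does not state how the display software converts ROIs to volumes, and the function names are hypothetical.

```python
def voxel_volume_mm3(fov_mm, matrix, slice_thickness_mm):
    """Volume of one voxel, assuming square in-plane pixels (an assumption)."""
    pixel_side_mm = fov_mm / matrix
    return pixel_side_mm * pixel_side_mm * slice_thickness_mm


def lesion_volume_mm3(n_pixels, fov_mm=250.0, matrix=256, slice_thickness_mm=5.0):
    """Lesion volume for an ROI covering n_pixels pixels on a single slice.

    Defaults are the acquisition values quoted in the Methods.
    """
    return n_pixels * voxel_volume_mm3(fov_mm, matrix, slice_thickness_mm)


if __name__ == "__main__":
    # One pixel corresponds to roughly 4.77 mm^3 with this geometry.
    print(voxel_volume_mm3(250.0, 256, 5.0))
    print(lesion_volume_mm3(100))
```

With this geometry, a disagreement of a single ring of boundary pixels around a small lesion already amounts to tens of cubic millimeters, which is one way to see why outlining conventions matter.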
Statistical analysis. The effects of training and different measurement strategies on measured lesion loads, defined as the total lesion volumes present on the scans from each patient, were evaluated separately using the two different sets of images. Mean lesion loads (averaged from the measurements of the five observers) obtained before and after training and with different measurement strategies were compared with the Student's t-test for paired data. The analysis of the intraobserver variability of each condition was performed using a two-way ANOVA correcting for observer and patient effects. To evaluate the interobserver variability of each condition, the two measurements obtained by each observer in each condition were averaged. Then, the interobserver variability was estimated by a one-way ANOVA correcting for patient effect. All estimated variances were compared using the F-test. The ANOVA was applied to logarithmically transformed data to account for heteroscedasticity. Denoting by "s" the antilog of the SD of the transformed volumes, we defined our measure of variability, the proportional variation (PV), as PV = s - 1. Differences were considered statistically significant when p < 0.05.
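The PV calculation can be illustrated with a short sketch. It is one reading of the description above, not the authors' code: the intraobserver residual is taken as the pooled variance within each observer-patient pair of repeated measurements, and the interobserver residual as the pooled within-patient variance of the observers' averaged measurements. Averaging before the log transform, the data layout, and all function names are assumptions. The base of the logarithm does not affect PV as long as the antilog uses the same base.

```python
import math


def proportional_variation(sd_log):
    """PV = s - 1, where s is the antilog of the SD of log-transformed volumes."""
    return math.exp(sd_log) - 1.0


def intraobserver_pv(volumes):
    """volumes[patient][observer] = (first, second) lesion volumes in mm^3.

    Pools the variance of the two log-volumes within each
    observer-patient pair (one residual degree of freedom per pair).
    """
    ss, df = 0.0, 0
    for patient in volumes:
        for first, second in patient:
            a, b = math.log(first), math.log(second)
            mean = (a + b) / 2.0
            ss += (a - mean) ** 2 + (b - mean) ** 2
            df += 1
    return proportional_variation(math.sqrt(ss / df))


def interobserver_pv(volumes):
    """Averages each observer's two measurements (assumed: before the log),
    then pools the between-observer variance within each patient."""
    ss, df = 0.0, 0
    for patient in volumes:
        logs = [math.log((first + second) / 2.0) for first, second in patient]
        mean = sum(logs) / len(logs)
        ss += sum((x - mean) ** 2 for x in logs)
        df += len(logs) - 1
    return proportional_variation(math.sqrt(ss / df))


if __name__ == "__main__":
    # Made-up volumes for two patients and two observers, for illustration only.
    example = [
        [(1000.0, 1100.0), (1300.0, 1250.0)],
        [(400.0, 380.0), (520.0, 500.0)],
    ]
    print(intraobserver_pv(example), interobserver_pv(example))
```

With five patients, five observers, and two measurements each, this layout gives 25 residual degrees of freedom for the intraobserver estimate and 20 for the interobserver estimate, which matches the degrees of freedom of the F statistics reported in the Results.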
Results. Effect of training. As expected (see Appendix), the mean lesion load volume was significantly lower when the rules for lesion load measurements were used, both for manual outlining (p < 0.001) and the LTT (p < 0.001) (table 1).
Table 1 Mean lesion volumes (mm3) obtained with and without rules
In table 2, the intraobserver variabilities obtained before and after the use of the rules for lesion load measurements are reported. The differences found were not significant for either manual outlining (F25,25 = 1.53, p > 0.10) or the LTT (F25,25 = 1.10, p > 0.25).
Table 2 Intra- and interobserver proportional variation (PV) for lesion load measurements obtained with and without rules
On the other hand, the interobserver variabilities obtained with manual outlining or the LTT were reduced when rules for lesion load measurements were used: the PV for manual outlining was 34% without rules and 23% with rules (F20,20 = 2.12, p = 0.05) and the PV for the LTT was 24% without rules and 13% with rules (F20,20 = 2.05, p = 0.05) (see table 2).
Effect of different measurement strategies. The mean lesion volume obtained in the second set of scans when measurements were performed with the use of adjacent slices and the corresponding T2-weighted images was not significantly different from that obtained when measurements were performed with the use of radiologic marking (table 3).
Table 3 Mean lesion volumes (mm3) obtained with and without radiologic marking
For manual outlining, the intraobserver variability was similar for the two measurement strategies (F25,25 = 1.07, p > 0.25), whereas for the LTT, the intraobserver variability was significantly reduced by the use of the radiologic marking (F25,25 = 2.33, p < 0.05) (table 4).
Table 4 Intra- and interobserver proportional variations for lesion load measurements obtained with and without radiologic marking
The interobserver variabilities for both manual outlining (F20,20 = 1.12, p > 0.10) and the LTT (F20,20 = 1.24, p > 0.10) were not different when adjacent slices and the corresponding T2-weighted images or the radiologic marking were used (see table 4). However, the PVs for both measurement strategies were lower than those obtained in the untrained situation.
Discussion. In this study, we evaluated the intra- and interobserver variabilities in measuring brain MRI lesion loads in MS using manual outlining and the LTT. We also evaluated whether training and different measurement strategies reduced these variabilities.
Effect of training. The intraobserver variability we found in the free condition was slightly higher than expected from previous studies.2,9,10 There are three possible explanations for this finding. First, the first set of slices provided to the observers was particularly difficult to measure because heavily T2-weighted images and contiguous slices were not available. Both are available in routine practice and will undoubtedly increase the degree of certainty in lesion identification for equivocal areas of increased signal. Second, because only three slices per patient were measured, the total lesion loads were much lower than those normally detected in MS. This is another possible source that can artificially increase the variability of the measurements, because for high lesion loads, although not yet proven, it is likely that overestimates and underestimates of lesion sizes can counteract each other. Third, four of five observers were asked to measure MRI abnormalities on images from scanners different from those they were used to. This situation is realistic, because all phase III trials in MS are multicenter and thus use scanners from different manufacturers. As already suggested,5 it is possible that the degree of experience in measuring images from a single manufacturer is associated with better measurement reproducibility.
Nevertheless, the interobserver variability for both segmentation techniques was much higher than expected.2,9,10 Because the observers who took part in this study all have considerable experience in measuring MRI lesion load in MS, it is conceivable that, in addition to the abovementioned factors, the use of different measurement strategies should be considered an important source of variation.
During the training session, the main source of variation was lesion identification in patients with lower lesion loads and both lesion identification and outlining in those with higher lesion loads. Differences in lesion identification were especially noted in the posterior fossa (some observers missed "low intensity" lesions, whereas other observers included some areas of hyperintensity that reasonably could be considered as flow artifacts) and in cortical/subcortical areas (some observers interpreted possible lesions as partial volume from adjacent gray matter or enlarged sulci, whereas others included in the measurements areas that could reasonably be considered as Virchow-Robin spaces) (figure 1, a and b). On the basis of these observations, we identified possible ways to increase reproducibility (see Appendix). We are aware that some of these rules are arbitrary. However, the ultimate goal of such rules is to improve the reliability of measurements in settings, like clinical trials, where reproducibility is more important than accuracy if changes of lesion loads over time are to be detected.14
Figure 1. Axial proton-density images with lesions outlined by a single observer using the local thresholding technique before (A) and after (B) establishing rules for lesion load measurements and training. (A) A hyperintense area located subcortically in the right hemisphere was not considered as a lesion (blue arrow). It was considered as a lesion when rules for lesion identification (see Appendix) were used (B, blue arrow). The zoomed areas show that different (i.e., more or less conservative) approaches may be chosen to define the boundaries of the lesions.
For both the segmentation techniques, the use of rules for lesion load measurements in MS and the training performed had a clear effect on the interobserver variability but not on the intraobserver variability.
For reasons explained in the Appendix, the approach used to design the rules was "conservative" (i.e., false-negative ROIs were preferred to false-positive ROIs). Our results indicate that the observers applied the rules in the right way because the lesion volumes measured after training were significantly lower than those obtained in the free condition. This is also the most likely explanation for the dramatic reduction of interobserver variability.
The lack of a positive effect on intraobserver variability might be secondary to the fact that even if the rules and the training have an overall positive effect on interrater variability, they might also be confounding for some of the observers for several reasons: difficulty in changing operator strategies when operators are experienced, too short a training period, and a need for visual examples of the rules. As a consequence, one could argue that the use of these rules and a longer and more structured training period might have a stronger positive effect on less experienced observers. Another possible explanation for the absence of any effect on intrarater reproducibility is, for obvious reasons, that the rules pay more attention to lesion outlining rather than lesion identification. Therefore, our results confirm the conventional wisdom that both the steps used in lesion segmentation are crucial in determining the interobserver variability. Assuming that rules were more effective in reducing variability from lesion outlining, other strategies to reduce variability from lesion identification were tested.
Effect of different measurement strategies. For the second set of measurements, both radiologic marking and adjacent slices and corresponding T2-weighted images were aimed at reducing measurement variability secondary to lesion identification. The use of radiologic marking was particularly helpful in reducing intraobserver variability when the LTT was used. This is due to the fact that the outlining of lesions is less operator dependent in the LTT (although in a few cases more than one outline can be obtained by clicking the pointer within the lesions at different distances from the edges [figure 2]; see Appendix) and any improvement in lesion identification should result in better reproducibility. The lack of a similar effect when adjacent slices and corresponding T2-weighted images were used indicates that lesion identification strategies, at least in well-experienced observers, are not affected by such an approach. Again, we do not know whether this might be helpful in less-trained situations.
Figure 2. MS lesion zoomed from an axial proton-density image and segmented by a single observer using the local thresholding technique. Using this technique, it is possible, by changing the site where the pointer is clicked, to obtain two different outlines, one inner (lower part of the figure) and one outer (upper part of the figure). As indicated in the Appendix, the inner outline should be chosen.
However, both strategies were helpful in reducing interobserver variability compared with the free condition, but no further positive effects were observed after radiologic marking. Because radiologic marking is time consuming, it might be argued that it is not necessary. However, the interobserver PVs when this approach was used were slightly better than those obtained for both segmentation techniques when adjacent and T2-weighted images were used. In addition, again, we cannot rule out a possibly greater effect of such an approach in less-experienced observers, where a personal decision about what has to be considered as an MS lesion might result in very poor interobserver reproducibility. Also, in serial scans when the same observer measures all scans, the marking system should improve the reproducibility.
In conclusion, several of our findings demonstrate that lesion identification and outlining are important sources of variation for MRI lesion load measurements in MS and that the magnitude of their effect is similar. In this study, we identified some possible strategies to reduce such variation and demonstrated that they are effective even when experienced observers are involved, thus suggesting greater benefits to less-trained individuals. In this respect, we are currently making available on an internet site for public consultation our training guidelines and associated visual aids showing what we believe are good or poor examples of lesion identification and outlining. The set of rules presented here has already been adopted by all the groups participating in the European Magnetic Resonance Network in Multiple Sclerosis. However, if an operator-dependent segmentation technique is used in large multicenter trials in MS, it is still desirable to use a single observer to measure scans from the same individual patients. It is also advisable that an experienced radiologist identifies MS lesions before quantitative assessment is performed and that specific rules and training for lesion outlining are used.
Acknowledgments
We thank Dr. Mark A. Horsfield (Department of Medical Physics, University of Leicester, Leicester, UK) and Dr. Liqun Wang (NMR Research Unit, Institute of Neurology, London, UK) for their helpful contribution during the performance of the study and Mr. David Plummer (Department of Medical Physics, University College, London, UK) for providing the software for image display.
Appendix
General statements. For the possible applications of lesion load measurements in clinical settings (i.e., serial measurements for clinical trials), it is preferable to have false-negative rather than false-positive lesion detection (as regards both lesion identification and boundaries). The reason for this conservative approach is that most MS lesions in any stage of the disease are inactive and it is important to reliably measure any change.15 Therefore, only hyperintense areas considered lesions with a high degree of certainty should be included, and areas of vague hyperintensity around clearly visible lesions should not be part of the detected ROI.
Rules for image displaying. Possible disagreements in lesion identification and definition of lesion boundaries could be related to the way the observer displays images on the computer screen. The following approach should be used to standardize image display:
- The image should be displayed such that for a 256 by 256 image matrix, the image display window is 640 by 640 pixels and pro rata for different rectangular matrix sizes. This will leave enough space on the bottom and side of the display screen for control panels and another display for guidance.
- The best contrast between white matter, gray matter, and lesions should be obtained.
- The image should be zoomed to include a rectangular area containing only the brain.
- Lesions should be identified.
- Areas identified as lesions should be zoomed (see figures 1, a and b, and 2).
- ROIs should be defined using the conservative approach (i.e., for manual outlining, mildly hyperintense areas around visible lesions should not be included; for LTT, which can sometimes give different outlines according to the chosen starting point [see figure 2], the inner rather than the outer outline should be preferred).
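The first display rule amounts to a fixed magnification of 640/256 = 2.5. A minimal sketch follows, assuming that "pro rata" means the same factor is applied to each axis of a rectangular matrix; the function name and the rounding to whole pixels are illustrative choices, not part of the rules.

```python
def display_window(rows, cols, reference_matrix=256, reference_display=640):
    """Display window size in pixels for an image matrix of rows x cols.

    Applies the 640/256 magnification from the rule to each axis
    (assumed interpretation of "pro rata").
    """
    scale = reference_display / reference_matrix  # 2.5 for the values in the rule
    return round(rows * scale), round(cols * scale)


if __name__ == "__main__":
    print(display_window(256, 256))
    print(display_window(256, 192))
```

Under this reading, a 256 by 192 matrix would be shown in a 640 by 480 window.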
Rules for measurements. Posterior fossa. In the posterior fossa, flow-related artifacts can easily mimic MS lesions. Thus, inclusion of areas of hyperintensity in lesion volume measurements when they are close to clearly evident artifacts should be avoided, unless a high degree of certainty is met using adjacent slices and/or the corresponding T2-weighted images.
Periventricular regions. In periventricular regions, there are normal structures that appear hyperintense on proton density-weighted images. Nevertheless, MS lesions occur more frequently in these areas. Therefore, periventricular "caps" around frontal horns (unless they are very small and symmetric) and hyperintense rims around ventricles should be included. On the contrary, any subcallosal/septum pellucidum hyperintensity should not be included.
Cortical/subcortical areas (see figures 1, a and b, and 3). Equivocal areas of hyperintensity in or close to the cerebral cortex should be considered as lesions when they are as bright as gray matter if a rim of white matter is visible around them or when they are brighter than the gray matter if directly adjacent to it. In addition, the adjacent slices and the morphology of the area should be checked to minimize the likelihood that the region is an island of cortex within the subcortical white matter.
Figure 3. MS lesions zoomed from an axial proton-density image and segmented by a single observer using manual outlining (lower part of the figure) and the local thresholding technique (upper part of the figure). It is clear that lesions when segmented using the local thresholding technique tend to be larger than when segmented using manual outlining (see tables 1 and 3). In the left cerebral hemisphere, two cortical/subcortical lesions are identified according to the rules in the Appendix.
Definition of the number of lesions when two or more areas of increased signal are adjacent. More than one ROI should be included in the measurement if a complete rim of normal-appearing white matter separates two or more areas of hyperintensity (see figure 3); otherwise, only one ROI should be included.
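On a binary lesion mask, this rule resembles connected-component counting: areas that touch anywhere form one ROI, and areas fully separated by non-lesion pixels form separate ROIs. The sketch below is an illustration only, not the software used in the study. It assumes a mask given as nested lists of 0 and 1 and treats diagonal contact as touching (8-connectivity), which is one way to interpret an incomplete rim.

```python
from collections import deque


def count_rois(mask):
    """Count groups of lesion pixels (value 1) that touch each other.

    Uses 8-connectivity: pixels touching at a corner belong to the same ROI
    (an assumed interpretation of an incomplete white matter rim).
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # New, unvisited lesion pixel: flood-fill its whole component.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


if __name__ == "__main__":
    separated = [
        [1, 1, 0, 0, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0],
    ]
    print(count_rois(separated))
```

In the example mask, the two areas are separated by a full column of non-lesion pixels and are counted as two ROIs; if they touched, even at a corner, they would be counted as one.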
Presence of normal-appearing white matter in the middle of large confluent lesions. If there is no connection between normal-appearing white matter in the middle of large confluent lesions and normal-appearing white matter around the lesion, this should be included in the corresponding ROI.
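In image-processing terms, this rule resembles hole filling: non-lesion pixels that cannot be reached from outside the lesion are added to the ROI. The following sketch is a hypothetical illustration under stated assumptions: the mask is a nested list of 0 and 1, the edge of the (zoomed) image lies outside the lesion, and non-lesion pixels are connected only through shared edges (4-connectivity).

```python
from collections import deque


def fill_enclosed(mask):
    """Return a copy of mask in which non-lesion pixels with no connection
    to the image border are set to 1 (included in the ROI)."""
    rows, cols = len(mask), len(mask[0])
    outside = [[False] * cols for _ in range(rows)]
    queue = deque()
    # Seed the search with every non-lesion pixel on the image border.
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and not mask[r][c]:
                outside[r][c] = True
                queue.append((r, c))
    # Spread through edge-connected non-lesion pixels.
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            inside = 0 <= ny < rows and 0 <= nx < cols
            if inside and not mask[ny][nx] and not outside[ny][nx]:
                outside[ny][nx] = True
                queue.append((ny, nx))
    return [
        [1 if mask[r][c] or not outside[r][c] else 0 for c in range(cols)]
        for r in range(rows)
    ]


if __name__ == "__main__":
    ring = [
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 0, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    for row in fill_enclosed(ring):
        print(row)
```

A closed ring of lesion pixels has its center filled, whereas a ring with a gap is left unchanged because the central pixel remains connected to the surrounding tissue.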
Definition of lesion boundaries. For manual outlining, the line defining the ROI should pass through the pixels considered the outer part of the lesion. For the LTT, because usually by changing the site where the pointer is clicked it is possible to obtain two different outlines, one inner and one outer (see figure 2), the most conservative estimate (i.e., the inner outline) should be considered. In the case of periventricular lesions with equivocal mildly hyperintense abnormalities around them, the first attempt to define the corresponding ROI, when the LTT is used, should be done by clicking the pointer on the least intense part of the lesion boundary adjacent to the ventricle so as not to include part of the ventricles within the ROI. For discrete lesions, the pointer should be clicked on the brightest part of the lesion boundary to reduce the number of pixels within the ROI (i.e., conservative approach).
Footnotes
Supported by the EC-funded (ERBCHRXCT 940684) European Magnetic Resonance Network in Multiple Sclerosis (MAGNIMS).
Received March 13, 1997. Accepted in final form July 17, 1997.
References
1. Filippi M, Horsfield MA, Tofts PS, Barkhof F, Thompson AJ, Miller DH. Quantitative assessment of MRI lesion load in monitoring the evolution of multiple sclerosis. Brain 1995;118:1601-1612.
2.
3. Paty DW, Li DBK, Oger JJF, et al. Magnetic resonance imaging in the evaluation of clinical trials in multiple sclerosis. Ann Neurol 1994;36:S95-S96.
4. Stone L, Albert PS, Smith ME, et al. Changes in the amount of diseased white matter over time in patients with relapsing-remitting multiple sclerosis. Neurology 1995;45:1808-1814.
5. Filippi M, van Waesberghe JH, Horsfield MA, et al. Interscanner variation in brain MRI lesion load measurements in MS: implications for clinical trials. Neurology 1997;49:371-377.
6. Filippi M, Horsfield MA, Campi A, Mammi S, Pereira C, Comi G. Resolution-dependent estimates of lesion volumes in magnetic resonance imaging studies of the brain in multiple sclerosis. Ann Neurol 1995;38:749-754.
7. Filippi M, Yousry T, Baratti C, et al. Quantitative assessment of MRI lesion load in multiple sclerosis: a comparison of conventional spin-echo with fast-fluid-attenuated inversion recovery. Brain 1996;119:1349-1355.
8. Rovaris M, Gawne-Cain ML, Wang L, Miller DH. A comparison of conventional and fast spin-echo sequences for the measurement of lesion load in multiple sclerosis using a semi-automated contour technique. Neuroradiology 1997;39:161-165.
9. Filippi M, Horsfield MA, Bressi S, et al. Intra- and interobserver agreement of brain MRI lesion volume measurements in multiple sclerosis: a comparison of techniques. Brain 1995;118:1593-1600.
10. Grimaud J, Lai M, Thorpe JW, et al. Quantification of MRI lesion load in multiple sclerosis: a comparison of three computer-assisted techniques. Magn Reson Imag 1996;14:495-505.
11. Gawne-Cain ML, Webb S, Tofts P, Miller DH. Lesion volume measurement in multiple sclerosis: how important is accurate repositioning? JMRI 1996;6:705-713.
12. Filippi M, Marcianò N, Capra R, et al. The effect of imprecise repositioning on lesion volume measurements in patients with multiple sclerosis. Neurology 1997;49:274-276.
13.
14. Evans A, Frank JA, Antel J, Miller DH. The role of MRI in clinical trials of multiple sclerosis. Comparison of image processing techniques. Ann Neurol 1997;41:125-132.
15. Barkhof F, Filippi M, Miller DH, Tofts P, Kappos L, Thompson AJ. Strategies for optimizing MRI techniques aimed at monitoring disease activity in multiple sclerosis treatment trials. J Neurol 1997;244:76-84.