Evidence-based measurement
Which disability scale for neurologic rehabilitation?

Abstract
Objective: To compare the 10-item Barthel Index (BI), 18-item Functional Independence Measure (FIM), and 30-item Functional Independence Measure + Functional Assessment Measure (FIM+FAM) as measures of disability outcomes for neurologic rehabilitation.
Methods: A total of 149 inpatients from two rehabilitation units in South England specializing in neurologic disorders were studied. Traditional psychometric methods were used to evaluate and compare acceptability (score distributions), reliability (internal consistency, intrarater reproducibility), validity (concurrent, convergent and discriminant construct), and responsiveness (standardized response mean).
Results: All three rating scales satisfied recommended criteria for reliable and valid measurement of disability, and were acceptable and responsive in this study sample. The FIM and FIM+FAM total scales are psychometrically similar measures of global disability. The BI, FIM, and FIM+FAM motor scales are psychometrically similar measures of physical disability. The FIM and FIM+FAM cognitive scales are psychometrically similar measures of cognitive disability.
Conclusions: In the sample studied, the BI, FIM, and FIM+FAM have similar measurement properties when examined using traditional psychometric analyses. Although instruments with more items and item response categories generate more qualitative information about an outcome, they may not improve its measurement. Results highlight the importance of using recognized techniques of scale construction to develop health outcome measures.
Changes in health policy have underlined the importance of measuring patient-oriented outcomes. Most of these measures are summed rating scales: several items are summed to generate a total score to quantify the outcome of interest. This measurement method, developed in the social sciences,1 generates rigorous health measures.2
The history of disability measurement in neurologic rehabilitation has assumed that longer instruments are superior measures. The 10-item Barthel Index (BI) was developed in 1955.3 It remains a cornerstone of disability measurement.4 In 1983, the 18-item Functional Independence Measure (FIM)5 was developed because the BI was considered too restricted and poorly responsive.6 A US Medicare prospective payment scheme is now based on the FIM.7 In 1989, the FIM+FAM (Functional Independence Measure + Functional Assessment Measure)8 was developed by adding 12 items to the FIM, because clinicians believed the FIM too limited to measure the complex disabilities of brain injury. The FIM+FAM is gaining popularity9 and now is recommended for all patients undergoing neurologic rehabilitation. Although intuitively sound rationales underpin the development of the FIM and FIM+FAM, the policy to promote their widespread use has significant implications for clinical practice because the longer measures require more time and trained raters, and should be rated by team consensus after several days’ observation.
One study has examined, using Rasch analysis,10 the extent to which the 12 FAM items extend the range of item calibrations of the FIM.11 However, no study has compared the measurement properties of the BI, FIM, and FIM+FAM. This was the aim of our study.
Methods.
Participants and treatment.
Participants were recruited at two clinical sites: the Neurorehabilitation Unit (NRU) at the National Hospital for Neurology and Neurosurgery in London, and the Rehabilitation Research Unit (RRU) at Southampton General Hospital. Both units provide intensive, multidisciplinary, goal-oriented, inpatient rehabilitation that is tailored to individual patients according to the neurologic diagnosis, disabilities, handicaps, and needs. The recruitment strategy differed at each unit because of the varying rates of patient turnover and was predetermined to limit selection bias. The NRU recruited the first two admissions each week for 18 months, whereas the RRU recruited consecutive admissions for 12 months. Ethical approval and informed consent were obtained.
Outcome measures.
The BI, FIM, and FIM+FAM are observer-rated, multi-item, summed rating scales to evaluate disability in terms of dependency (table 1). As recommended,12 the version of the BI by Collin et al. was used in this study. It includes items with 2-point (two items), 3-point (six items), and 4-point (two items) response options that are summed to generate a total score. In contrast, FIM and FIM+FAM items have 7-point response options and three summed scores can be generated for each measure: total, motor, and cognitive subscales. For all measures, low scores indicate greater disability. Reliability, validity, and responsiveness have been demonstrated for the BI and FIM (for reviews, see McDowell and Newell12), but the FIM+FAM has had limited psychometric evaluation. Staff raters received formal training in rating the FIM and FIM+FAM.
Table 1. Items of the Barthel Index (BI), Functional Independence Measure (FIM), and Functional Independence Measure + Functional Assessment Measure (FIM+FAM)
Other validated measures were collected to evaluate disability (Office of Population Censuses and Surveys Disability Scales, OPCS4); handicap (London Handicap Scale, LHS13); physical and mental health status (Medical Outcomes Study 36-Item Short-Form Health Survey [SF-36]14); psychological well-being (General Health Questionnaire [GHQ-28]12); global cognitive function (four verbal subtests of the revised Wechsler Adult Intelligence Scale [WAIS-R Verbal IQ]15); reasoning (Verbal and Spatial Reasoning Test [VESPAR]15); and verbal memory (California Verbal Learning Test [CVLT]15).
Data collection.
The following data were collected within 4 days of admission: demographic and diagnostic information; the BI, FIM, and FIM+FAM, rated by consensus opinion of the treating multidisciplinary team; the LHS, OPCS, SF-36, and GHQ-28, administered by the study coordinators; and WAIS-R Verbal IQ, VESPAR, and CVLT, administered by a neuropsychologist (D.W.L.) at the NRU. Within 2 days of discharge, the multidisciplinary team rated the BI, FIM, and FIM+FAM again.
Analyses.
Standard traditional psychometric analyses16-18 were undertaken to evaluate whether the items of the BI, FIM, and FIM+FAM could be summed to generate total scores, and whether these resulting total scores satisfied recommended criteria for acceptability, reliability, validity, and responsiveness. Three comparisons were made. The total scores of the FIM and FIM+FAM were compared as measures of global disability, the BI and the motor subscale scores of the FIM and FIM+FAM were compared as measures of physical disability, and the cognitive subscale scores of the FIM and FIM+FAM were compared as measures of cognitive disability.
The basic assumption underlying summated rating scales is that items can be summed without weighting or standardization if they are internally consistent.1,19 Three indicators of internal consistency were examined: corrected item-total correlations, mean interitem correlations, and alpha coefficients. Corrected item-total correlations are product-moment correlations between each item and the sum of the remaining items in the scale; they indicate the extent to which each item relates to the construct measured by the total score. “Correcting” the total score by removing the item of interest prevents spuriously high values due to item overlap.20 Recommended minimum values include 0.20,17 0.30,16 and 0.40.21 We adopted the most stringent of these criteria. Interitem correlations indicate the extent to which the individual items of a rating scale are related. It is recommended that the mean interitem correlation should exceed 0.30.22 Cronbach’s alpha coefficients23 estimate the internal consistency of an item group. Alpha coefficients exceeding 0.80 are considered acceptable for scales used to make group comparisons.16
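The three internal consistency indicators can be illustrated with a minimal Python sketch. It is not the analysis code used in this study: the ratings are hypothetical, and the function names are chosen for illustration only.

```python
from statistics import mean, variance


def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(items):
    """Correlate each item with the sum of the remaining items.

    items: list of items, each a list holding one rating per patient.
    """
    n_patients = len(items[0])
    result = []
    for i, item in enumerate(items):
        rest = [
            sum(other[p] for j, other in enumerate(items) if j != i)
            for p in range(n_patients)
        ]
        result.append(pearson(item, rest))
    return result


def mean_interitem(items):
    """Mean of the correlations between all distinct pairs of items."""
    k = len(items)
    pairs = [pearson(items[i], items[j]) for i in range(k) for j in range(i + 1, k)]
    return mean(pairs)


def cronbach_alpha(items):
    """Cronbach's alpha from item variances and the variance of total scores."""
    k = len(items)
    totals = [sum(ratings) for ratings in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical ratings: three items, each rated for five patients.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 4, 4, 6],
    [1, 3, 3, 5, 5],
]
print("corrected item-total r:", corrected_item_total(items))
print("mean interitem r:", mean_interitem(items))
print("alpha:", cronbach_alpha(items))
```

Each value would then be compared with the criteria adopted here (0.40, 0.30, and 0.80, respectively).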
Acceptability, the extent to which the score distributions of a rating scale adequately represent the distribution of health in a sample, was determined by examining the distributions of admission total scores.18 It is recommended that scores span the full scale range, mean scores be near the scale midpoint,22 and floor and ceiling effects (percentage of sample scoring minimum and maximum scores, respectively) do not exceed 20%.24
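The acceptability criteria can be checked in a few lines of Python, as in the sketch below. The scores are hypothetical. The 0 to 20 range assumed for the BI follows from the item response options described under Outcome measures; the function name is illustrative.

```python
from statistics import mean


def acceptability(scores, scale_min, scale_max):
    """Summarize a score distribution against the acceptability criteria.

    Returns the observed range, the mean, the scale midpoint, and the
    floor and ceiling effects as percentages of the sample.
    """
    n = len(scores)
    floor_pct = 100 * sum(s == scale_min for s in scores) / n
    ceiling_pct = 100 * sum(s == scale_max for s in scores) / n
    return {
        "observed_range": (min(scores), max(scores)),
        "mean": mean(scores),
        "midpoint": (scale_min + scale_max) / 2,
        "floor_pct": floor_pct,
        "ceiling_pct": ceiling_pct,
    }


# Hypothetical admission BI scores (assumed scale range 0 to 20).
bi_scores = [0, 4, 7, 9, 10, 11, 13, 15, 18, 20]
summary = acceptability(bi_scores, scale_min=0, scale_max=20)
print(summary)
# Floor and ceiling effects are flagged if they exceed the 20% criterion.
print("floor ok:", summary["floor_pct"] <= 20, "ceiling ok:", summary["ceiling_pct"] <= 20)
```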
Reliability is the extent to which a rating scale is free from random error.16 Two types of reliability were examined. Internal consistency was determined for admission scores as described previously. Intrarater reproducibility was estimated in a subsample of patients by determining the agreement, reported as an intraclass correlation coefficient (ICC), between ratings made by the same multidisciplinary team for the same patients on two different occasions. There are numerous versions of the ICC.25 Because the members of the rating multidisciplinary teams varied, the estimates of variance were obtained from repeated-measures analysis of variance under a random effects model.26 Values should exceed 0.80.16 The reproducibility of the BI was not examined because this has been shown to be high in several previous studies (for review, see McDowell and Newell12).
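Because there are several versions of the ICC, the following Python sketch should be read as one possible choice rather than the exact estimator used here. It assumes a two-way random effects, single-rating ICC computed from the mean squares of a repeated-measures analysis of variance (patients by rating occasions). The ratings and the function name are hypothetical.

```python
def icc_two_way_random(ratings):
    """Two-way random effects, single-rating ICC from ANOVA mean squares.

    ratings: one row per patient, each row holding k repeated ratings.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares into patients, occasions, and error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical total scores for six patients rated on two occasions.
ratings = [
    [62, 64],
    [85, 83],
    [40, 45],
    [110, 108],
    [73, 70],
    [95, 99],
]
print("ICC:", icc_two_way_random(ratings))
```

The resulting value would be compared with the 0.80 criterion.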
Two types of validity were examined. Concurrent validity, the extent to which rating scales measure the same construct, was quantified using Pearson’s product-moment correlations and an ICC.27 ICCs exceeding 0.75 have been suggested to indicate “excellent” agreement.28 However, because the magnitude of both Pearson’s correlations and ICCs is influenced by the range of scores and presence of extreme values,29 scatterplots of scores also were examined. Because each scale has a different scale range, they were transformed to a 0 to 100 range using the formula suggested by Stewart and Ware2:
\[ \text{transformed score} = \frac{\text{observed score} - \text{lowest possible score}}{\text{highest possible score} - \text{lowest possible score}} \times 100 \]
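The rescaling to a common 0 to 100 range can be sketched in Python as follows. The sketch assumes the standard linear transformation (observed score minus lowest possible score, divided by the possible score range, multiplied by 100). The scale ranges are derived from the item counts and response options described under Outcome measures, not quoted from the scale manuals, and the names are illustrative.

```python
def to_0_100(score, scale_min, scale_max):
    """Linearly rescale a raw score to a 0 to 100 range."""
    return 100 * (score - scale_min) / (scale_max - scale_min)


# Assumed possible score ranges for the total scales:
# BI: 10 items scored from 0, maximum 2*1 + 6*2 + 2*3 = 20
# FIM: 18 items scored 1 to 7; FIM+FAM: 30 items scored 1 to 7
SCALE_RANGES = {
    "BI": (0, 20),
    "FIM": (18, 126),
    "FIM+FAM": (30, 210),
}

# Hypothetical raw total scores for one patient.
raw_scores = {"BI": 10, "FIM": 72, "FIM+FAM": 120}
for scale, raw in raw_scores.items():
    low, high = SCALE_RANGES[scale]
    print(scale, to_0_100(raw, low, high))
```

Once on a common metric, scores from competing scales can be plotted against one another and compared for agreement.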
Convergent and discriminant construct validity, the extent to which a scale’s correlations with measures of similar and dissimilar constructs conform with a priori hypotheses,30 was determined by examining the magnitude and pattern of product-moment correlations with OPCS, LHS, SF-36, GHQ-28, and WAIS-R.
Responsiveness was determined by calculating effect sizes31 from admission and discharge total scores. There are many effect size calculations.32 Standardized response means,33 the mean change score divided by the SD of the change scores, were chosen because these are the most relevant to clinical studies. A multiple of this statistic (the standardized response mean multiplied by the square root of the sample size gives the paired t statistic) is used to determine the statistical significance of within-group change.
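A minimal Python sketch of the standardized response mean (SRM) is given below, together with its relation to the paired t statistic (t equals the SRM multiplied by the square root of the sample size). The admission and discharge scores are hypothetical, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_response_mean(admission, discharge):
    """Mean change score divided by the SD of the change scores."""
    change = [d - a for a, d in zip(admission, discharge)]
    return mean(change) / stdev(change)


def paired_t_from_srm(srm, n):
    """Paired t statistic implied by an SRM in a sample of n patients."""
    return srm * sqrt(n)


# Hypothetical admission and discharge total scores for four patients.
admission = [10, 12, 14, 16]
discharge = [12, 15, 15, 20]
srm = standardized_response_mean(admission, discharge)
print("SRM:", srm)
print("paired t:", paired_t_from_srm(srm, len(admission)))
```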
Results.
There were 149 participants (table 2). Men and women are evenly represented and a wide range of ages and lengths of stay is included. Stroke and MS are the largest diagnostic groups. In the stroke subsample (n = 45), 27 (60%) were admitted within 3 months of the stroke, 6 (13%) between 3 and 6 months poststroke, and 4 (9%) between 6 and 12 months poststroke. A further eight (18%) people were admitted more than 1 year poststroke. The range was 3 weeks to 12 years poststroke, and 13 (29%) were admitted within 1 month of their stroke. In the MS subsample (n = 64), 52 (81%) had secondary progressive MS, 7 (11%) had primary progressive MS, and 5 (8%) had relapsing remitting MS. Samples from the two neurologic rehabilitation units have similar sex and age distributions, but different size, case mix, and length of stay.
Table 2. Characteristics of participants
Internal consistency.
Corrected item-total correlations exceed 0.40, mean interitem correlations exceed 0.30, and alphas exceed 0.80 (table 3). These findings support the generation of summed scores for the BI, FIM, and FIM+FAM.
Table 3. Internal consistency, acceptability, reproducibility, convergent and discriminant construct validity, and responsiveness of the FIM, FIM+FAM, and BI
Acceptability.
All scales demonstrate good score variability, have mean scores near the scale midpoints, and small floor and ceiling effects (see table 3). Therefore, all scales satisfy recommended criteria for acceptability. There appear to be no clear advantages for any one scale over corresponding measures of global, motor, or cognitive disability.
Reliability.
All scales satisfy recommended criteria for internal consistency reliability (see earlier) and intrarater reproducibility (see table 3). Estimates for competing scales are similar, suggesting no clear advantage in reliability of one scale over corresponding scales.
Validity.
Scales purporting to measure the same aspect of disability were highly related (Pearson’s r = 0.96 to 0.996) and had high levels of agreement (ICC = 0.95 to 0.995). The scatterplots (not shown) demonstrate that the strong relationships between competing scales hold throughout the full range of scores, and that the results are not biased by extreme values. These results suggest that the FIM and FIM+FAM total scores; BI, FIM, and FIM+FAM motor scores; and the FIM and FIM+FAM cognitive scores measure very similar constructs.
Table 3 shows correlations between the BI, FIM, and FIM+FAM and six measures of similar and different constructs. For each of the scales and subscales, the direction, magnitude, and pattern of correlations provide evidence for their convergent and discriminant validity. For example, the BI correlates highly with the OPCS disability measure (evidence for convergent validity) and has low to moderate correlations with measures of handicap, health status, psychological distress, and neuropsychological functioning (evidence for discriminant validity). More important, the magnitudes of correlation between corresponding scales and subscales of the BI, FIM, and FIM+FAM and the six measures of similar and dissimilar constructs are very comparable, suggesting that they have similar convergent and discriminant construct validity.
Responsiveness.
Standardized response means for BI, FIM, and FIM+FAM scales measuring global, motor, and cognitive disability are similar (see table 3), suggesting that there is no advantage in responsiveness of one measure over another.
Discussion.
This study demonstrates that the BI, FIM, and FIM+FAM satisfy criteria as rigorous measures of neurologic disability. More important, however, findings highlight their similar psychometric properties when evaluated using traditional psychometric methods. The FIM and FIM+FAM are psychometrically similar measures of global, physical, and cognitive disability, and the BI, FIM, and FIM+FAM motor scales are psychometrically similar measures of physical disability.
These findings are surprising, and have far-reaching implications for disability measurement in neurologic rehabilitation. They suggest that, in this sample of patients and using these psychometric methods of rating scale analysis, the newer and longer FIM and FIM+FAM appear to offer few advantages as measurement instruments over the more practical and economical BI. This is important because the newer measures are time consuming and have become the standard by which efficacy is measured by clinicians, researchers, and third-party reimbursement institutions.
This study has three important implications for outcome measurement in general. First, results indicate that choice of measurement instruments should be empirically led. Assumptions that the FIM and FIM+FAM are superior measures because they have greater numbers of items and item response categories are not supported by empirical evidence from this study. Therefore, comprehensive head-to-head comparisons of new with existing outcome measures are important before new measures are introduced into clinical practice. Although comparisons of individual measurement properties are becoming increasingly common, there are few comprehensive comparisons of the full range of psychometric properties between similar measures. Current guidelines for instrument evaluation18,34,35 should be amended to include this recommendation.
Second, this study highlights an important difference in emphasis between clinical assessment and measurement. There is no doubt that the greater number of items and item response categories in the FIM and FIM+FAM provides more comprehensive clinical assessments of disability than the BI. It is, therefore, understandable why clinicians who manage the day-to-day problems of individual people may prefer the longer instruments and may believe them to be superior measures. However, measurement concerns the quantification of an attribute and, as this and other studies2 demonstrate, multi-item measures need only a few carefully chosen items to generate reliable and valid estimates. Furthermore, when used for measurement, rating scales such as the BI, FIM, and FIM+FAM are recommended for group comparison studies and not for individual patient clinical decision making.36 This is because the 95% confidence intervals (95% CI) around individual patient scores, which are calculated from the SEM (= SD × √[1 − reliability]; 95% CI = ±1.96 SEM37), are wide (e.g., ±11.5 points for the FIM total score).
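The SEM calculation can be sketched in Python as follows. The SD and reliability values are hypothetical placeholders chosen for easy arithmetic; they are not the study estimates and are not intended to reproduce the ±11.5 points quoted for the FIM total score.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD x sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


def ci95_half_width(sd, reliability):
    """Half-width of the 95% CI around an individual score (1.96 x SEM)."""
    return 1.96 * standard_error_of_measurement(sd, reliability)


# Hypothetical values: total score SD of 20 points, reliability of 0.91.
sd, reliability = 20.0, 0.91
print("SEM:", standard_error_of_measurement(sd, reliability))
print("95% CI: observed score +/-", ci95_half_width(sd, reliability))
```

Even with high reliability, the interval around a single patient's score spans several points, which is why such scales are better suited to group comparisons.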
Third, this study highlights the importance of applying psychometric methods to health measurement. The FIM and FIM+FAM, along with many measures used in clinical practice, were developed clinically. That is, items were selected on the basis of their clinical relevance. Although intuitively sound, this method of scale development assumes that these items have good measurement properties and that all items are required in a scale. Traditional psychometric methods of scale development do not make these assumptions: items are selected from a large pool of items on the basis of their psychometric performance in empirical field tests. Consequently, the minimum number of items that achieve rigorous measurement can be chosen to avoid item redundancy. Because the number of items in a measure influences its clinical usefulness and scientific rigor, psychometric methods minimize the trade-off between these competing demands.
The psychometric similarity of the BI, FIM, and FIM+FAM suggests considerable item redundancy in the longer measures. Some of this redundancy is predictable. For example, the four transfer items of the FIM+FAM are highly correlated (r = 0.81 to 0.96), implying that, despite their clinical relevance to the assessment of disabled people, the individual items contribute little unique information for measurement. However, strong relationships between other items are less predictable. For example, transferring into a shower/bath is highly correlated with dressing, bathing, and toileting (r = 0.75 to 0.79), implying that this item represents a wide range of functional activities. This finding illustrates why item selection for measurement purposes should be based on knowledge of their empirical relationships rather than on their clinical relevance.
Although one apparent advantage of the FIM and FIM+FAM over the BI is that they directly address cognitive disability, the extent of this advantage is uncertain because the validity of the cognitive subscales has not been comprehensively examined. Furthermore, FIM motor and not cognitive scores have been shown to be the strongest predictor of patients’ ability to return home after rehabilitation.38
This study included only people undergoing neurologic rehabilitation, predominantly with MS and stroke. Therefore, results may not be generalizable to other clinical settings and samples or different times. Although the FIM+FAM has not been widely evaluated, studies of the BI12 and FIM7 demonstrate psychometric stability across different samples. Nevertheless, because psychometric properties are sample dependent,16 and because clinical trials typically are carried out in more homogeneous samples than we have studied, it is essential that further comparisons of their performance in different settings are undertaken to understand fully the similarities and differences. Because of this fact, we re-ran the psychometric analyses separately for four subgroups that differed significantly in case mix, disability level, and length of stay. Results of these analyses support the similarity of the three instruments in these four samples (data not reported).
It also is important that the BI, FIM, and FIM+FAM should be compared using newer psychometric methods, such as Rasch10 and Item Response Theory39 models. These methods of analyzing rating scale data were developed in education and are being used increasingly in health measurement as alternatives to traditional psychometric approaches. However, their role in health measurement has yet to be clearly defined.40-44
Acknowledgments
Supported by a grant from the National Health Service Central Audit Fund; a grant from the North Thames Regional Health Authority Research and Development Responsive Funding Program; and a grant from The Wellcome Trust (Research Training Fellowship in Health Services Research [J.H.]).
- Received June 23, 2000.
- Accepted April 19, 2001.
References
1. Likert RA. A technique for the measurement of attitudes. Arch Psychol. 1932; 140: 5–55.
2. Stewart AL, Ware JE Jr, eds. Measuring functioning and well-being: the Medical Outcomes Study approach. Durham, NC: Duke University Press, 1992.
4. Wade DT. Measurement in neurological rehabilitation. Oxford: Oxford University Press, 1992.
5. Granger CV, Hamilton BB, Keith RA, Zielezny M, Sherwin FS. Advances in functional assessment for medical rehabilitation. Top Geriatr Rehabil. 1986; 1: 59–74.
9. Hawley CA, Taylor R, Hellawell DJ, Pentland B. Use of the functional assessment measure (FIM+FAM) in head injury rehabilitation: a psychometric analysis. J Neurol Neurosurg Psychiatry. 1999; 67: 749–754.
10. Rasch G. Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press, 1960.
11. Linn R, Blair R, Granger C, et al. Does the Functional Assessment Measure extend the Functional Independence Measure (FIM) instrument? A Rasch analysis of stroke inpatients. J Outcomes Meas. 1999; 3: 339–359.
12. McDowell I, Newell C. Measuring health: a guide to rating scales and questionnaires, 2nd ed. Oxford: Oxford University Press, 1996.
13. Harwood RH, Ebrahim S. Manual of the London Handicap Scale. Nottingham, United Kingdom: Department of Health Care of the Elderly, University of Nottingham, 1995.
14. Ware JE Jr, Snow KK, Kosinski M, Gandek B. SF-36 Health Survey manual and interpretation guide. Boston, MA: Nimrod Press, 1993.
15. Lezak MD. Neuropsychological assessment, 4th ed. New York: Oxford University Press, 2000.
16. Nunnally JC, Bernstein IH. Psychometric theory, 3rd ed. New York: McGraw-Hill, 1994.
17. Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use, 2nd ed. Oxford: Oxford University Press, 1995.
19. Spector PE. Summated rating scale construction: an introduction. Newbury Park, CA: Sage, 1992.
26. Fleiss JL. The design and analysis of clinical experiments. New York: Wiley, 1986.
28. Rosner B. Fundamentals of biostatistics. Toronto: Duxbury Press, 1995.
31. Cohen J. Statistical power analysis for the behavioural sciences. Hillsdale, NJ: Lawrence Erlbaum, 1969.
35. Fitzpatrick R, Davey C, Buxton MJ, Jones DR. Evaluating patient-based outcome measures for use in clinical trials. Health Technol Assess. 1998; 2(14).
37. Guilford JP. Psychometric methods, 2nd ed. New York: McGraw-Hill, 1954.
39. Lord FM, Novick MR. Statistical theories of mental test scores. Reading, MA: Addison-Wesley, 1968.
40. Cella D, Chang C-H. A discussion of item response theory and its application in health status measurement. Med Care. 2000; 38: II-66–II-72.
41. Hambleton RK. Item response theory modeling in instrument development and data analysis. Med Care. 2000; 38: II-60–II-65.
42. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care. 2000; 38: II-28–II-42.
43. Ware JE Jr, Bjorner JB, Kosinski M. Practical implications of item response theory and computer adaptive testing: a brief summary of ongoing studies of widely used headache impact scales. Med Care. 2000; 38: II-73–II-82.
44. Divgi D. Does the Rasch model really work for multiple choice items? Not if you look closely. J Educ Meas. 1986; 23: 283–298.