Establishing validity and reliability

Steve Simon


Dear Professor Mean, I need to establish validity and reliability of a new measurement. How do I do this?

Dear Reader, validity and reliability are two very dangerous words to use, because they mean different things to different people. Here are some issues to consider.

Face validity

Face validity is the extent to which your measurement appears valid to the general public. Does the measurement make intuitive sense? A closely related concept is content validity, the extent to which your measurement makes intuitive sense to a group of experts in the area.

Construct validity

Construct validity is establishing that you are measuring what you think you are measuring. You do this in two ways. First, you establish that your measure correlates well with what it should correlate well with (convergent validity). Second, you establish that it is uncorrelated with what it should be uncorrelated with (discriminant validity).
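
As a rough illustration of the two checks, here is a minimal Python sketch. The three score lists are made-up numbers, and the names new_scale, related_scale, and unrelated_trait are hypothetical; nothing here comes from a real study. It computes a Pearson correlation by hand, which is only one of several ways you might quantify "correlates well."

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for eight people (invented for illustration only).
new_scale = [12, 15, 9, 20, 17, 11, 14, 18]        # the new measurement
related_scale = [30, 34, 25, 41, 38, 27, 33, 39]   # should track the new scale
unrelated_trait = [6, 5, 5, 4, 7, 4, 7, 6]         # should not track it

convergent_r = pearson_r(new_scale, related_scale)
discriminant_r = pearson_r(new_scale, unrelated_trait)
print(f"convergent r   = {convergent_r:.2f}")
print(f"discriminant r = {discriminant_r:.2f}")
```

With these invented numbers the first correlation is close to 1 and the second is close to 0, which is the pattern you would hope to see. How large or small a correlation has to be before you call it convincing is a judgment call that depends on your field.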

Criterion validity

Criterion validity is the extent to which your measurement correlates with a gold standard. You can establish this in two ways. First, you can see if your measure predicts a future measure that indicates the same thing (predictive validity). Second, you can see if your measure correlates well with other concurrent measures that have been established as gold standards (concurrent validity).


Two large hospitals in the Netherlands studied four risk assessment scales for pressure ulcers (BMJ 2002; 325: 797 (12 October)). Four separate scales were applied to patients without pressure ulcers who were expected to stay at least five days in the hospital. These patients were then followed for up to 12 weeks to see whether a pressure ulcer developed. The study showed that all four scales had poor predictive validity: they were unable to make accurate predictions of which patients would develop pressure ulcers. The authors noted that there had been minimal evaluation of these scales beyond expert opinion and literature review.

Prospective cohort study of routine use of risk assessment scales for prediction of pressure ulcers. Lisette Schoonhoven, Jeen R E Haalboom, Mente T Bousema, Ale Algra, Diederick E Grobbee, Maria H Grypdonck, and Erik Buskens. BMJ 2002; 325: 797. Available in html format or pdf format.
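
To make the idea of predictive validity concrete, here is a minimal Python sketch of how you might summarize a risk scale against what later happened to each patient. The scores, the outcomes, and the cutoff of 6 are all invented, and the sketch assumes that a higher score means higher risk (some real scales run the other way). It is not the analysis used in the study above.

```python
def predictive_summary(scores, developed_ulcer, cutoff):
    """Compare 'at risk' flags (score >= cutoff) with what actually happened."""
    tp = fp = fn = tn = 0
    for score, outcome in zip(scores, developed_ulcer):
        flagged = score >= cutoff
        if flagged and outcome:
            tp += 1          # flagged, and an ulcer developed
        elif flagged and not outcome:
            fp += 1          # flagged, but no ulcer developed
        elif not flagged and outcome:
            fn += 1          # not flagged, but an ulcer developed
        else:
            tn += 1          # not flagged, and no ulcer developed

    def ratio(num, den):
        # Return None rather than dividing by zero when a group is empty.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of ulcer patients who were flagged
        "specificity": ratio(tn, tn + fp),  # share of ulcer-free patients not flagged
        "ppv": ratio(tp, tp + fp),          # share of flagged patients who got an ulcer
    }

# Hypothetical data for ten patients (invented for illustration only).
risk_scores = [3, 8, 5, 9, 2, 7, 6, 4, 8, 1]
developed_ulcer = [False, True, True, False, False, False, True, False, False, False]

summary = predictive_summary(risk_scores, developed_ulcer, cutoff=6)
for name, value in summary.items():
    print(name, "n/a" if value is None else f"{value:.2f}")
```

In this made-up example only two of the five flagged patients go on to develop an ulcer, so a scale can look sensible on paper and still predict poorly.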

Further reading

Asmussen, L., L. M. Olson, et al. (1999). “Reliability and validity of the Children’s Health Survey for Asthma.” Pediatrics 104(6): e71. Available in pdf format.

Barry, D. (1996). “Differential recall bias and spurious associations in case/control studies.” Statistics in Medicine 15(23): 2603-16. Article is behind a paywall.

Bartko, J. J. (1966). “The intraclass correlation coefficient as a measure of reliability.” Psychological Reports 19(1): 3-11. Available in pdf format.

Bartko, J. J. (1976). “On Various Intraclass Correlation Reliability Coefficients.” Psychological Bulletin 83(5): 762-765. Article is behind a paywall.

Bartko, J. J. (1994). “Measures of agreement: a single procedure.” Statistics in Medicine 13(5-7): 737-45. Article is behind a paywall.

Beckett, M., M. Weinstein, et al. (2000). “Do health interview surveys yield reliable data on chronic illness among older respondents?” American Journal of Epidemiology 151(3): 315-23. Available in pdf format.

Block, G. (1982). “A review of validations of dietary assessment methods.” American Journal of Epidemiology 115(4): 492-505. Article is behind a paywall.

Bulloch, B. and M. Tenenbein (2002). “Validation of 2 pain scales for use in the pediatric emergency department.” Pediatrics 110(3): e33. Available in pdf format.

Caan, B., M. Slattery, et al. (1998). “Comparison of the Block and the Willett self-administered semiquantitative food frequency questionnaires with an interviewer-administered dietary history.” AJE 148(12): 1137-47. Available in pdf format.

Carey, R. G. and J. H. Seibert (1993). “A patient survey system to measure quality improvement: questionnaire reliability and validity.” Med Care 31(9): 834-45. Article is behind a paywall.

Carroll, R. T. The Forer effect (a.k.a. the P.T. Barnum effect and subjective validation). Available in html format.

Carroll, R. T. The Mozart Effect. The Mozart Effect is a term coined by Alfred A. Tomatis for the alleged increase in brain development that occurs in children under age 3 when they listen to the music of Wolfgang Amadeus Mozart. Available in html format.

Carroll, R. T. Myers-Briggs Type Indicator®. Available in html format.

Collins, S. L., R. A. Moore, et al. (1997). “The visual analogue pain intensity scale: what is moderate pain in millimetres?” Pain 72(1-2): 95-7. Article is behind a paywall.

Coughlin, S. S. (1990). “Recall bias in epidemiologic studies.” J Clin Epidemiol 43(1): 87-91. Article is behind a paywall.

Cronbach, L. J. and P. E. Meehl (1955). “Construct Validity in Psychological Tests.” Psychological Bulletin 52: 281-302. Available in pdf format.

Crume, T. L., C. DiGuiseppi, et al. (2002). “Underascertainment of child maltreatment fatalities by death certificates, 1990-1998.” Pediatrics 110(2 Pt 1): e18 (1 - 6). Article is behind a paywall.

Day, N., N. McKeown, et al. (2001). “Epidemiological assessment of diet: a comparison of a 7-day diary with a food frequency questionnaire using urinary markers of nitrogen, potassium and sodium.” Int J Epidemiol 30(2): 309-17. Available in html format.

Ellman, M. S., C. M. Viscoli, et al. (1997). “A new index of prognostic severity for chronic asthma.” Chest 112(3): 582-90.

Grace, J. Research Fables from the Sisters Grinn, No. 2. Snow White and the Seven Threats to Validity.

Gray-Donald, K., J. O’Loughlin, et al. (1997). “Validation of a short telephone administered questionnaire to evaluate dietary interventions in low income communities in Montreal, Canada.” Journal of Epidemiology and Community Health 51(3): 326-331.

Hanley, J., A. Capewell, et al. (2001). “Validity study of the severity index, a simple measure of urinary incontinence in women.” British Medical Journal 322(7294): 1096-7.

Jacobs, J., L. M. Jimenez, et al. (1994). “Treatment of acute childhood diarrhea with homeopathic medicine: a randomized clinical trial in Nicaragua.” Pediatrics 93(5): 719-25.

Jacobson, S. W., L. M. Chiodo, et al. (2002). “Validity of Maternal Report of Prenatal Alcohol, Cocaine, and Smoking in Relation to Neurobehavioral Outcome.” Pediatrics 109(5): 815-825.

Kipnis, V., D. Midthune, et al. (2001). “Empirical Evidence of Correlated Biases in Dietary Assessment Instruments and Its Implications.” Am. J. Epidemiol. 153(4): 394-403.

Labouvie, E., M. E. Bates, et al. (1997). “Age of First Use: Its Reliability and Predictive Utility.” Journal of Studies on Alcohol 58(6): 638-643.

Leffondre, K., M. Abrahamowicz, et al. (2002). “Modeling Smoking History: A Comparison of Different Approaches.” Am. J. Epidemiol. 156(9): 813-823.

Lemaitre, R. N., I. B. King, et al. (1998). “Assessment of trans-fatty acid intake with a food frequency questionnaire and validation with adipose tissue levels of trans-fatty acids.” Am J Epidemiol 148(11): 1085-93.

Levine, D. (1994). “True scores, error, reliability, and unit of analysis in environment and behavior research.” Environment and Behavior 26(2): 261-92.

Levine, D. W., D. F. Kripke, et al. (2003). “Reliability and validity of the Women’s Health Initiative Insomnia Rating Scale.” Psychol Assess 15(2): 137-48.

Lewin, R. J., D. R. Thompson, et al. (2002). “Validation of the Cardiovascular Limitations and Symptoms Profile (CLASP) in chronic stable angina.” J Cardiopulm Rehabil 22(3): 184-91.

Lilienfeld, S. O. (2001). “What’s Wrong with This Picture? (Inkblot Test).” Scientific American: 81-87.

Lilienfeld, S. O., J. M. Wood, et al. (2001). “What’s Wrong with This Picture?” Scientific American: 80-87.

Malviya, S., T. Voepel-Lewis, et al. (2002). “Depth of sedation in children undergoing computed tomography: validity and reliability of the University of Michigan Sedation Scale (UMSS).” Br J Anaesth 88(2): 241-5.

Matheson, D. M., K. A. Hanson, et al. (2002). “Validity of Children’s Food Portion Estimates: A Comparison of 2 Measurement Aids.” Arch Pediatr Adolesc Med 156(9): 867-71.

Moussa, M. A., M. Z. Shafie, et al. (1990). “Reliability of death certificate diagnoses.” J Clin Epidemiol 43(12): 1285-95.

Muller, C. (2000). “Rationale, interpretation, validation, and uses of sperm function tests.” Journal of Andrology 21(1): 10-30.

Nekolaichuk, C. L., E. Bruera, et al. (1999). “A comparison of patient and proxy symptom assessments in advanced cancer patients.” Palliat Med 13(4): 311-23.

No authors listed (1993). “The CRIB (clinical risk index for babies) score: a tool for assessing initial neonatal risk and comparing performance of neonatal intensive care units. The International Neonatal Network.” Lancet 342(8865): 193-8.

Penetar, D., U. McCann, et al. (1993). “Caffeine reversal of sleep deprivation effects on alertness and mood.” Psychopharmacology (Berl) 112(2-3): 359-65.

Richardson, D. K., J. E. Gray, et al. (1993). “Score for Neonatal Acute Physiology: a physiologic severity index for neonatal intensive care.” Pediatrics 91(3): 617-23.

Sanders, C., M. Egger, et al. (1998). “Reporting on quality of life in randomised controlled trials: bibliographic study.” BMJ 317(7167): 1191-4.

Schoonhoven, L., J. R. Haalboom, et al. (2002). “Prospective cohort study of routine use of risk assessment scales for prediction of pressure ulcers.” BMJ 325(7368): 797.

Shrout, P. E. and J. L. Fleiss (1979). “Intraclass Correlations: Uses in Assessing Rater Reliability.” Psychological Bulletin 86(2): 420-28.

Simpson, D. and R. Fincher (1999). “Making a Case for the Teaching Scholar.” Academic Medicine 74(12): 28-31.

Steele, K. M., K. E. Bass, et al. (1999). “The Mystery of the Mozart Effect: Failure to Replicate.” Psychological Science 10(4): 366-369.

Sunmola, A. M. (2001). “Developing a scale for measuring the barriers to condom use in Nigeria.” Bull World Health Organ 79(10): 926-32.

Taylor, J. K. (1983). “Validation of Analytical Methods.” Analytical Chemistry.

Thomas, E., D. Studdert, et al. (2002). “The reliability of medical record review for estimating adverse event rates.” Ann Intern Med 136(11): 812-816.

Thompson, F., H. Metzner, et al. (1990). “Characteristics of individuals and long term reproducibility of dietary reports: the Tecumseh Diet Methodology Study.” J Clin Epidemiol 43(11): 1169-78.

Trochim, W. M. K. Types of Reliability. Excerpt: You learned in the Theory of Reliability that it’s not possible to calculate reliability exactly. Instead, we have to estimate reliability, and this is always an imperfect endeavor. Here, I want to introduce the major reliability estimators and talk about their strengths and weaknesses.

Walsh, D. A. and D. A. Gentile (2001). “A validity test of movie, television, and video-game ratings.” Pediatrics 107(6): 1302-8.

Wells, A., P. English, et al. (1998). “Misclassification rates for current smokers misclassified as nonsmokers.” American Journal of Public Health 88(10): 1503-09.

Werler, M. M., B. R. Pober, et al. (1989). “Reporting accuracy among mothers of malformed and nonmalformed infants.” Am J Epidemiol 129(2): 415-21.

Wirfalt, A., R. Jeffery, et al. (1998). “Comparison of food frequency questionnaires: the reduced Block and Willett questionnaires differ in ranking on nutrient intakes.” AJE 148(12): 1148-56.

Wright, S. P. (1999). “Reporting on quality of life in RCTs.” British Medical Journal 318(7191): 1142.

Instrument Validity. The Royal Windsor Society for Nursing Research. Accessed on November 4, 2002.

The Examination Chapter of the Neurology section of the Family Practice Notebook has some interesting tests, like the mini-mental state exam, that could serve as good examples of developing validity.

Psychology Learning Resources. Internal Validity Tutorial. Available in html format. The Cochrane Group defines internal validity as “the extent to which the observed effects are true for the people in a study.”

J Psychosom Res 2002 Feb;52(2):69-77. The validity of the Hospital Anxiety and Depression Scale. An updated literature review. Bjelland I, Dahl AA, Haug TT, Neckelmann D.

J Cardiopulm Rehabil 2002 May-Jun;22(3):184-91. Validation of the Cardiovascular Limitations and Symptoms Profile (CLASP) in chronic stable angina. Lewin RJ, Thompson DR, Martin CR, Stuckey N, Devlen J, Michaelson S, Maguire P.

Psychosomatics 2001 Sep-Oct;42(5):423-8. Sensitivity and specificity of observer and self-report questionnaires in major and minor depression following myocardial infarction. Strik JJ, Honig A, Lousberg R, Denollet J.

Disabil Rehabil 2001 Nov 10;23(16):737-44. Screening for anxiety, depressive and somatoform disorders in rehabilitation–validity of HADS and GHQ-12 in patients with musculoskeletal disease. Harter M, Reuter K, Gross-Hardt K, Bengel J.

You can find an earlier version of this page on my original website.