The biggest statistics papers of all time

Steve Simon


I’m giving a short talk about the Kaplan-Meier curve and found out an interesting fact about the 1958 paper by Edward Kaplan and Paul Meier that introduced this curve. It represents the 11th most cited research paper of all time. There’s a nice graphic in a Nature paper that allows you to review the top 100 most cited papers of all time. There are a few other statistics papers on this list as well.

The Kaplan-Meier reference is Kaplan EL, Meier P. “Nonparametric estimation from incomplete observations.” J. Am. Stat. Assoc. 53, 457–481 (1958). It was cited over 38,000 times as of the publication date of the Nature article.

The next article is on the Cox regression model. The reference is Cox DR. “Regression models and life-tables” J. R. Stat. Soc., B 34, 187–220 (1972). It was cited over 28,000 times and represents the 24th most cited research paper.

The next statistics paper on the list is one that I’ve cited myself. Bland JM, Altman DG. “Statistical methods for assessing agreement between two methods of clinical measurement” Lancet 327, 307–310 (1986). This paper introduces the Bland-Altman plot and explains why a simple t-test or a simple correlation is the wrong way to compare two clinical measurements. It is ranked #29 and was cited over 23,000 times.
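Here is a minimal sketch of the calculation behind a Bland-Altman plot, assuming paired readings from two methods on the same subjects. The function name and the sample numbers are made up for illustration, and the 1.96 multiplier assumes the differences are roughly normally distributed.

```python
from statistics import mean, stdev

def bland_altman_limits(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    # Work with the within-subject differences, not the raw values.
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)   # average disagreement between the methods
    sd = stdev(diffs)    # sample standard deviation of the differences
    # Approximate 95% limits of agreement.
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical readings: the methods track each other closely,
# yet method B reads about 10 units higher on average.
a = [100, 110, 120, 130]
b = [112, 118, 133, 137]
bias, lower, upper = bland_altman_limits(a, b)
```

The plot itself graphs each difference against the average of the two readings, with horizontal lines at the bias and at the two limits. A high correlation between the methods would not reveal the 10 unit offset in this made-up example, but the bias does.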

The next paper is one that a Psychology colleague recommended to me. Baron RM, Kenny DA. “The moderator–mediator variable distinction in social psychological-research — conceptual, strategic, and statistical considerations” J. Pers. Soc. Psychol. 51, 1173–1182 (1986). The terms “moderator variable” and “mediator variable” sound very similar, but they are quite different and require quite different statistical models. This paper is ranked #33 and was cited over 23,000 times.
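To make the difference concrete, here is a rough sketch of the two kinds of models in their simplest linear form. The symbols are my own choices for illustration rather than notation taken from the paper: X is the predictor, Y the outcome, Z a moderator, and M a mediator.

```latex
% Moderation: Z changes the strength of the X -> Y relationship,
% which shows up as an interaction term in a single regression.
Y = \beta_0 + \beta_1 X + \beta_2 Z + \beta_3 XZ + \varepsilon

% Mediation: X affects Y at least partly through M,
% which takes two linked regressions.
M = a_0 + a X + \varepsilon_1
Y = c_0 + c' X + b M + \varepsilon_2
```

In the moderation model the coefficient of interest is the interaction term. In the mediation model the interest is in the indirect path from X to M to Y, usually summarized by the product of a and b.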

The next paper is more of an applied paper, but it prominently displays the bootstrap method, so it is worth mentioning here. Felsenstein J. “Confidence limits on phylogenies: an approach using the bootstrap” Evolution 39, 783–791 (1985). It is ranked #41 and was cited over 21,000 times.

You might argue about whether this is a statistics paper or not, but ranked #46 with over 18,000 citations is Zadeh LA. “Fuzzy sets” Inform. Control 8, 338–353 (1965).

Ranked #57 with almost 16,000 citations is Dempster AP, Laird NM, Rubin DB. “Maximum likelihood from incomplete data via the EM algorithm” J. R. Stat. Soc., B 39, 1–38 (1977). This paper documents a commonly used iterative approach for computing maximum likelihood estimates when some of the data are missing, and it applies to a wide range of statistical models.

Next is a paper that is (in my humble opinion) not deserving of that many citations. The paper, Duncan, D. B. “Multiple range and multiple F tests” Biometrics 11, 1–42 (1955), talks about a post hoc test used frequently in analysis of variance, but unlike other post hoc tests, it does not protect the overall Type I error rate. It appears that Duncan’s test is popular anyway. It is ranked #63 with over 15,000 citations.

Landis JR, Koch GG. “The measurement of observer agreement for categorical data.” Biometrics 33, 159–174 (1977) presents the kappa statistic. It is ranked #68 with almost 15,000 citations.
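As a quick illustration, here is a minimal sketch of the simplest version of kappa, Cohen’s kappa for two raters who each assign one category per subject. The function name and the ratings are hypothetical, and the paper itself covers more general settings than this.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on the same subjects."""
    n = len(rater1)
    # Observed proportion of subjects on which the raters agree.
    observed = sum(x == y for x, y in zip(rater1, rater2)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts1, counts2 = Counter(rater1), Counter(rater2)
    expected = sum(counts1[k] * counts2[k] for k in counts1) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings: the raters agree on 3 of 4 subjects.
kappa = cohen_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])
```

With these made-up ratings the observed agreement is 0.75 and the chance agreement is 0.5, so kappa works out to 0.5.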

Akaike H. “A new look at statistical-model identification” IEEE Trans. Automat. Contr. 19, 716–723 (1974) presents the AIC (Akaike Information Criterion) statistic. It is ranked #73 with over 14,000 citations.
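The usual formula for AIC is twice the number of estimated parameters minus twice the maximized log-likelihood. Here is a minimal sketch, assuming you already have the log-likelihood from a fitted model. The function name and the numbers are made up for illustration.

```python
def aic(log_likelihood, n_params):
    """AIC = 2k - 2 ln(L), where k is the number of estimated parameters."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits to the same data: the larger model fits slightly
# better, but not by enough to offset its two extra parameters.
small_model = aic(log_likelihood=-120.0, n_params=3)   # 246.0
large_model = aic(log_likelihood=-118.5, n_params=5)   # 247.0
```

Lower values are better, so in this made-up comparison the smaller model would be preferred.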

The first paper on the list that I was totally unfamiliar with is Kirkpatrick S, Gelatt CD, Vecchi MP. “Optimization by simulated annealing” Science 220, 671–680 (1983). Simulated annealing is used a lot in Statistics, but not by me. This paper is ranked #86 with over 13,000 citations.

Marquardt DW. “An algorithm for least-squares estimation of nonlinear parameters” J. Soc. Ind. Appl. Math. 11, 431–441 (1963) documents a fundamental algorithm for nonlinear least squares regression. It is ranked #88 with over 13,000 citations.

Ronquist F, Huelsenbeck JP. “MrBayes 3: Bayesian phylogenetic inference under mixed models” Bioinformatics 19, 1572–1574 (2003) is another paper I am totally unfamiliar with. Apparently this MrBayes 3 program is quite commonly used because it was cited over 12,000 times. It is #100 on the list.

You can find an earlier version of this page on my blog.