A recent paper by officers of the Gordon and Betty Moore Foundation, A Preliminary Review of Influential Works in Data-Driven Discovery, identifies the works that have been most influential to applicants to the Moore Foundation’s Data-Driven Discovery (DDD) program. As described in the paper, the foundation asked applicants (in the competition's pre-application stage) to list up to five influential works that they think have "helped define the field of data science". From the 1,100 applicants, the authors collected 5,000 references, 53 of which were cited at least six times.
Twenty-two papers were cited ten or more times and, of these, nine were written or co-written by statisticians (the number of times each was cited appears at the end of its citation below):
[3] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning, Springer, 2009, cited 43 times.
[8] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996, cited 19 times.
[9] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” The Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003, cited 19 times.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 1–38, 1977, cited 17 times.
[11] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995, cited 17 times.
[12] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001, cited 15 times.
[16] B. Efron, “Bootstrap methods: another look at the jackknife,” The Annals of Statistics, pp. 1–26, 1979, cited 11 times.
[18] J. W. Tukey, Exploratory data analysis. Pearson, 1977, cited 11 times.
[19] J. Pearl, Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, 1988, cited 11 times.
[21] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin, Bayesian data analysis. CRC Press, 2013, cited 10 times.
Six other papers by statisticians made it to the list of 53 papers cited six or more times:
E. R. Tufte, The visual display of quantitative information, 2nd ed. Graphics Press, 2001, cited 9 times.
D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006, cited 7 times.
Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 289–300, 1995, cited 7 times.
L. Breiman et al., “Statistical modeling: The two cultures (with comments and a rejoinder by the author),” Statistical Science, vol. 16, no. 3, pp. 199–231, 2001, cited 6 times.
J. K. Pritchard, M. Stephens, and P. Donnelly, “Inference of population structure using multilocus genotype data,” Genetics, vol. 155, no. 2, pp. 945–959, 2000, cited 6 times.
R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, 2008, cited 6 times.
In addition, the Reverend Bayes's posthumous paper was cited eight times:
T. Bayes and R. Price, “An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFRS,” Philosophical Transactions (1683-1775), pp. 370–418, 1763, cited 8 times.
The title of the Moore Foundation’s paper describes the review as preliminary, and its concluding remarks list, among other caveats, that their "competition was for efforts in the natural sciences and methodologies, and therefor references important to social sciences are underrepresented in this sample". Even so, the number of papers by statisticians is consistent with the recent ASA statement on The Role of Statistics in Data Science, which points out that statistics is central to data science. I also find it interesting that being cited by just under 1% of the applicants (10 of 1,100) is enough to put a paper among the 22 most cited, and that six citations suffice for the top 53, roughly the top 1% of the 5,000 references. This is indicative of the diversity of applications, methods, researchers, and reading lists within data science. Perhaps there will be more agreement about what constitutes data science and what works best define it as the field matures.
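For readers who want to check the proportions mentioned in the preceding paragraph, here is a minimal Python sketch. It uses only the counts quoted in this post (1,100 applicants, 5,000 references, 53 works cited six or more times, 22 cited ten or more times). Treating the 5,000 references as the pool of distinct works is an assumption made for illustration, and the helper name `share` is hypothetical.

```python
# Counts quoted in the post. Treating the 5,000 references as the pool of
# distinct works is an assumption made only for this back-of-envelope check.
applicants = 1100
references = 5000
works_cited_6_plus = 53
works_cited_10_plus = 22


def share(part, whole):
    """Return part/whole as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


summary = {
    "applicants needed for 10 citations": share(10, applicants),
    "applicants needed for 6 citations": share(6, applicants),
    "works cited 10+ times": share(works_cited_10_plus, references),
    "works cited 6+ times": share(works_cited_6_plus, references),
}

for label, pct in summary.items():
    print(f"{label}: {pct}%")
```

Under that assumption, ten citations correspond to a little under 1% of applicants, while the 53 works cited six or more times make up roughly 1% of the references.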
The findings of the Moore Foundation paper are in line with the overall importance of statistics and statisticians to advancing science, as discussed in the blog entry Statisticians Prominent in Top 100 Cited Articles List. That entry notes that, according to a recent Nature article (The top 100 papers: Nature explores the most-cited research of all time), nine of the 100 most-cited articles are by statisticians and another two are heavily statistical.