I like to think of myself as "current" regarding statistical programming, but when I read blogs and attend talks by younger researchers, I am amazed by the number of newer computer languages that are in vogue.
Of course, "newer" depends on how old you are! For many programmers, "newer" means either "since I left school" or "since I arrived at my current job." For me, newer languages include Groovy, Haskell, Julia, and Lua, just to name a few.
Crista Videira Lopes, an academic researcher in programming languages, recently wrote a long but interesting essay on recent programming languages. She makes several claims:
- Lopes claims that "a considerable percentage of [popular] new languages... were designed by ... kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun."
- Lopes argues that "there appears to be no correlation between the success of a programming language" and the "deep thoughts, consistency, rigor" that come from a language designed by a professional researcher.
- Lopes states that "one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them! Without exception, ... they all feel like mashups of concepts that already existed in programming languages in 1979, wrapped up in their own idiosyncratic syntax."
- Lopes decries language proponents who claim "improved software development productivity...without providing any evidence for it whatsoever." In particular, she rails against claims such as "Haskell programs have fewer bugs because Haskell is...."
Lopes does concede that a language that "addresses an important practical need" can become popular, regardless of whether it is professionally designed.
There are interesting parallels between people's attitudes about new programming languages and people's attitudes about new statistical methods. I sometimes hear statisticians rail against newer data mining methods as "black boxes" that are produced by the computer science or machine learning communities. What are the complaints? Well, in analogy with Lopes's arguments, here are some arguments against some newer predictive techniques:
- They are created by people without statistical research backgrounds.
- They can become successful in spite of the fact that they are not the product of "deep thoughts" and "rigor."
- They are mashups of ideas that existed previously, but with their own idiosyncratic terminology.
- They claim improved prediction or classification without providing rigorous proofs.
The opposite argument (that statistics need not be constrained by rigor) is presented in a 2001 article, "Statistical Modeling: The Two Cultures," in which Leo Breiman (famous for his work on classification and regression trees, bagging, and random forests) criticizes the statistical community for its commitment to data models. Breiman states that "this commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems." Breiman says that "statisticians need to be more pragmatic. Given a statistical problem, find a good solution, whether it is a data model, an algorithmic model...or a completely different approach." With minor modification, his arguments also apply to new programming languages: given a programming problem, find a language that helps you solve it easily.
The Breiman article is followed by criticisms by Sir David Cox and Brad Efron, who defend traditional statistics. Efron's comments begin: "At first glance Leo Breiman’s stimulating paper looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle. At second glance it still looks that way." To me, this sounds like an argument Lopes might favor.
Where do you fall on this spectrum? Are you a fervent proponent of new programming languages, or has it been a while since you last learned a new language? Do you gravitate to new data mining techniques, or do you favor the statistical rigor of logistic regression and mixed models? What arguments do you use to justify your choices?