ASA Connect


Elephant

  • 1.  Elephant

    Posted 06-30-2017 11:36

    P-values, operating characteristics, loss functions, and posterior probabilities:

    We need the whole animal.

     

    Statistics is surely one of the most weighty and regal creations ever to range across the landscape of science.  Sadly, our subject is too often a target of short-sighted partisans, averse to subtlety and intent on abstract absolutes. "This piece of the beast that I currently embrace is all there is." (Yes-or-never to p-values; Bayesian posteriors only; it's all just decision theory, or just effect sizes, or robust methods, or mainly about power).  These advocates would have us carve our magnificent elephant into parts whose sum is less than the whole.  Some would reduce statistics to just one of its supporting legs.  Others would amputate the trunk and claim that what remains is still a whole elephant.

    Was Fisher naïve?  Were Neyman, Pearson, and Wald misguided?  What about Laplace, Good, and Savage?  Who among us can claim to see farther than they did?

    I remain convinced that analysis of data is a place where abstract theory and interpretation-in-context rub uneasily together.  The first, abstract theory, takes mathematics as its model; the second, interpretation-in-context, is our version of exegesis, with numerical data as text.  Pascal was right to distinguish two complementary forms of reasoning, the "spirit of geometry" and the "spirit of subtlety".  Their essential tension is what energizes and empowers our progress.



  • 2.  RE: Elephant

    Posted 07-05-2017 09:34
    As usual, George is both insightful and eloquent. We should all copy this and post it.

    — Paul




  • 3.  RE: Elephant

    Posted 07-05-2017 09:35
    There are plenty of issues with science and statistics. I gave a talk a few weeks ago about all the misrepresentations of statistics as taught in a typical chemistry curriculum. (Basically, everything is wrong or poorly used.)

    But part of the problem is us, the statisticians.

    We teach many of our classes assuming that computers don't exist. We want to force theory and mathematical "proof" into lots of places where it is not needed.

    We assume that the software we do use relies on the same formulas we find in our textbooks. (WRONG!)

    We make silly assumptions like everything is normally distributed (or approximately so) until proven otherwise. From a hand-calculation point of view, this makes sense: most of our formulas are based on that idea. But we all carry calculators with us, and some of us have stats apps on our phones too. Even when we know the data are NOT from a normal distribution, we will still pretend they are. (How many of us don't automatically use a transform for Poisson-distributed data? Logit data? Why?) Can we agree that we will use the APPROPRIATE DISTRIBUTION when possible?
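    As a minimal sketch of what that looks like in practice (simulated data and hypothetical variable names, not anyone's real analysis): fit a count outcome directly with a Poisson GLM instead of log-transforming it and running ordinary least squares.

    # Hypothetical example: "count" is a Poisson-distributed response, "dose" a predictor.
    set.seed(1)
    dose  <- runif(200, 0, 10)
    count <- rpois(200, lambda = exp(0.2 + 0.3 * dose))

    fit_glm <- glm(count ~ dose, family = poisson(link = "log"))  # appropriate distribution
    fit_ols <- lm(log(count + 1) ~ dose)                          # the hand-calc-era workaround

    summary(fit_glm)  # coefficients on the log-mean scale; no ad hoc +1 constant needed
    summary(fit_ols)

    The GLM respects the mean-variance relationship of the counts instead of hiding it behind a transformation.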

    We make our students learn complicated programming languages to do stats when drop-down, menu-driven programs do just as well, if not better. (WHY?) How much more material could we cover in a class if we decided to use SPSS, STATA, JMP, or Minitab instead of R or SAS and dedicated only five minutes per class to getting results?

    We make models with our data. Do we ever check how good they are?
    How many of us rely on a Durbin-Watson test and a K-S normality test rather than a visual plot when checking a regression?
    How many of us find the "optimal" solution for our model as a check of how well it performs? (A few textbook models said smoking decreases disease rates.)
    What about going back and using some of the data to build a model and the rest of the data to check it (training and testing data), like we do in data mining? (Lots of textbook logistic regression models fail to predict the "event of interest" well. Tweaking the model can help a lot... but we'd give a student a failing grade if they did that in our classes. The models only "look good" because we don't subject them to scrutiny!)
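    For what it's worth, here is a minimal sketch of that train/test check (simulated data and hypothetical variable names), so the point isn't left abstract:

    # Hold out 30% of the data and see how well a logistic regression fit on the
    # rest predicts the held-out "event of interest".
    set.seed(2)
    n   <- 1000
    x1  <- rnorm(n); x2 <- rnorm(n)
    y   <- rbinom(n, 1, plogis(-1 + 0.8 * x1 - 0.5 * x2))
    dat <- data.frame(y, x1, x2)

    train_rows <- sample(seq_len(n), size = 0.7 * n)
    train <- dat[train_rows, ]
    test  <- dat[-train_rows, ]

    fit  <- glm(y ~ x1 + x2, family = binomial, data = train)
    pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)

    table(predicted = pred, observed = test$y)  # out-of-sample confusion matrix
    mean(pred == test$y)                        # out-of-sample accuracy

    Even this crude holdout check exposes models that merely "look good" in-sample.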

    Most of us are not very knowledgeable about topics outside of math/stats. So, when someone tells us about the data, we have to assume the person collecting it knows what they are doing. How many scientists know the difference between a replication and a repeated measure? (Do a web search for "design of experiments" and "designing a scientific experiment." How often do "scientific experiments" claim you can't change more than one thing at a time?)

    Perhaps worst of all, many of us are afraid to find things out. If you have 40 independent variables and you're told to use only __, __, __, and ___, how many of you do that and never look at the other variables? For those of you looking into medical data, how often do you use the hospital as a blocking factor or look at the attending doctor(s)? If you are looking at how to improve student performance, do you ever look at the professor? (HINT: The professor tends to have a larger effect on student outcomes than student preparation!)

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 4.  RE: Elephant

    Posted 07-06-2017 09:13
    A word in defense of "approximately normally distributed." This seems out of character for me, since I agree wholeheartedly with the concerns raised in this thread and even teach a class with a very brief introduction to 20+ different regression methods, because most people only use a few.

    But first, a word in defense of "roughly mound-shaped" distributions. Of course, the Normal falls off exponentially on either side of the mean. As a result, in actual practice, the difference in model results and predicted outcomes between an approximately normal and a perfectly normal distribution - which only exists in homework problems - can be smaller than real-world measurement error.

    The point here is that we need to remember nearly all of our processes are approximations - a different way of saying Box's maxim that "all models are wrong but some are useful". We should carefully avoid the error of thinking all those digits generated in our Results Window actually mean anything. If the difference is smaller than can be measured and reproduced by experiment, it doesn't exist. 

    Normal or Poisson? ARIMA or GARCH? Yankees or Red Sox? (Of course, this only refers to the different statistical methods they use!) Seriously, Company A's favorite software and methods versus Company B's? All are approximations and, as mentioned earlier, the bilateral exponential decay of the Normal distribution, if abused, can hide a multitude of sins. In deciding which methods to use, we need to consider characteristics such as bias, missing data, variance properties such as heteroscedasticity, and other factors. Multiple methods can be tested against each other, always relying on experimental outcomes to guide us to the best of all sub-optimal choices.

    All of this can only happen when we learn and actively employ a wide variety of methods in actual practice. 





  • 5.  RE: Elephant

    Posted 07-09-2017 21:49

    Regarding the comments made, I agree with most of them.

    The one point I would like to raise is that drop-down menus in statistical software (by the nature of the user interface) don't meet the legal requirements for visual impairment and universal design standards. Universities are very strictly controlled regarding the ADA, and it has been a real struggle for the past three years to get our university to allow us to use Minitab while providing R as an alternative (it took demanding that a programming course be a prerequisite to introductory statistics for those decision makers to back down). I think we are doing a disservice to students by not allowing drop-down-menu-style statistics in entry-level classes, because it reduces the time we have to point out complexities and leaves students with the impression that "linear regression works on everything" or "t-tests are all we need." But, because we are a general education area, other fields are allowed the freedom to use these software programs where we are not.


    Joseph Reid






  • 6.  RE: Elephant

    Posted 07-10-2017 09:32
    I think there is a much simpler answer than that. 
    R is free. (MRO is free and far, far better.)
    SAS has a free "University Edition".


    SPSS, STATA, JMP, Minitab, Design Expert, etc, cost money. 

    I doubt any university would ever say the ADA makes us use R or the SAS University Edition... unless you can use that argument to get free software.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 7.  RE: Elephant

    Posted 07-11-2017 09:25
    I have considerable experience with R and think it is valuable software for researchers within the health sciences (I tried to find MRO to compare; is there a link?). Simple use of R does not require advanced programming skills, particularly when using the free RStudio interface. There is also menu-driven functionality in some R packages, including Rcmdr and its increasing number of plug-in packages. Concerning the commercial programs, there is a cost, but the cost varies a lot. Stata, a nice program, is moderately priced and has more functionality than some more expensive programs. R and Stata together will cover most needs, I think. Stata has a menu interface available but is mostly used via the command line, and then the learning thresholds for R and Stata are similar (and moderate). I don't think one should teach students the free university version of a program whose full version has a high price; after their studies, they will need the full version, which their employer may not be able to afford.

    ------------------------------
    Tore Wentzel-Larsen
    Norwegian Centre for Violence and Traumatic Stress Studies
    Regional Center for Child and Adolescent Mental Health, Eastern and Southern Norway





  • 8.  RE: Elephant

    Posted 07-11-2017 11:53
    MRO is Microsoft R Open. Available here.

    https://mran.microsoft.com/open/

    Something else to consider:
    I was talking with my advisor about the results of multiple data analyses. I decided to use JMP and XLMiner; the others used Python, R, and MATLAB. During the discussions I was able to reanalyze the data with XLMiner (which has drop-down menus) while talking to everyone, and I made about a dozen changes too. It took me seconds to redo the analyses in front of everyone. Coding all of those changes, even knowing in advance what was going to be asked, would take minutes, require lots of redundant lines of code, and take a lot more time to produce the same plots and graphics, if R or Python even allows it.

    But fast and good costs money out of pocket, while R and SAS are "free" with a lot of up-front investment in time and frustration.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 9.  RE: Elephant

    Posted 07-12-2017 13:15
    I have noticed some comments about MRO and am especially puzzled when I read that MRO is far better than R. I looked on the MSFT site https://mran.microsoft.com/rro/ but could not really see what the advantage of MRO over R would be. MSFT is usually good at advertising their products, but the extra explanation gives no indication that MRO is better than R.
    Any experiences that you care to share about using MRO would be most helpful.

    Filiep





  • 10.  RE: Elephant

    Posted 07-13-2017 12:43
    MRO has an upgraded BLAS and LAPACK via the Intel MKL. These are the math libraries that R uses for computations. Intel hired software engineers and computational mathematicians to develop these libraries to run parallel computations. The standard R math libraries are pretty bad by comparison.

    If you want to have some fun and see how good MRO is compared to R, create a 10,000 x 10,000 matrix and multiply it by itself, then look at the system time for the calculation under MRO and under R. MRO can do in seconds what R does in minutes to hours.

    If you use Microsoft R Server, you can run analyses in seconds that are simply impossible with standard R.

    I ran this little script in RStudio using MRO just to show the difference (results included):

    > require("expm")
    > a = 5000
    > b = 0.000000001
    > c = matrix(0, a, a)
    > for (i in 1:a) {c[i, i] = 1 - b}   # diagonal entries
    > for (i in 2:a) {c[i-1, i] = b}     # superdiagonal; start at 2 (a zero index is silently ignored in R)
    > c[a, a] = 1
    > c[a, a-1] = 0
    > system.time(c %^% 2)
    user system elapsed
    13.25 0.18 3.68
    > setMKLthreads(1)  **** This function only exists in MRO. It tells the BLAS and LAPACK to run on 1 thread (core); by default the MKL uses all cores, e.g. 2 on an i3 or i5 processor and 4 on an i7 processor.
    >
    > system.time(c %^% 2) **** This is the result of using 1 core, just like standard R.
    user system elapsed
    10.70 0.13 10.58

    When I ran the code in standard R, I got this:

    > system.time(c%^%2)
    user system elapsed
    80.67 0.17 80.85

    So, the MKL restricted to one core runs the code about 8 times faster than standard R. When it uses all 4 cores, it's about 21 times faster. That's how good the MKL BLAS and LAPACK are... or how bad the standard R BLAS and LAPACK are.

    On one of my home computers, I run this bit of code inside a larger program, but with a = 70,000 to 80,000, and perform the matrix multiplication 60 times. It takes my home computers, with 128 GB of RAM and 10 to 16 cores, about a day. I won't even try doing this with standard R. If you set a = 10,000 in standard R, you'll see how long that takes.

    Why does this matter? If you took a class in algorithm design, you learned about big-O notation. Squaring an N x N matrix takes a little over N^3 operations, so a 10,000 x 10,000 matrix takes roughly 10,000^3 (10^12) operations. If you have 100,000 tuples by 40 columns of data, the software will make over 100,000 x 40 x 100,000 (4.0 x 10^11) computations. Since matrix multiplication is at the core of regression computation, using the best LAPACK and BLAS is a good idea. Sometimes it's a necessity.
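    Here is a minimal sketch of that cubic scaling in plain base R (sizes kept small so it runs anywhere; exact times depend heavily on which BLAS your R is linked against). Doubling the matrix dimension should multiply the multiplication time by roughly 8.

    # Time a dense matrix product at two sizes and compare.
    time_mult <- function(n) {
      m <- matrix(rnorm(n * n), n, n)
      unname(system.time(m %*% m)["elapsed"])
    }
    t1 <- time_mult(1000)
    t2 <- time_mult(2000)
    c(t1 = t1, t2 = t2, ratio = t2 / t1)  # a ratio near 8 is the N^3 scaling showing up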

    Since MRO is R, just better, I won't even touch standard R (unless I need to run a benchmark to show how bad it is ;-). If you do analytics with large data sets, MRO is it.

    Another nice thing about MRO: you can use it as a substitute for MATLAB and outperform MATLAB. Oddly enough, MATLAB uses the same math libraries as MRO.


    Hope this helps. 




    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 11.  RE: Elephant

    Posted 07-07-2017 17:13
    Whether or not the normal approximation is "good enough" often depends on the question. Are we interested in estimating the "true value" of a population parameter, or are we interested in the proportion of individuals who fall outside some limit(s)? If the latter, are we talking about modest proportions or tiny proportions?

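    To make that distinction concrete, here is a minimal sketch with simulated skewed data and a hypothetical spec limit (nothing here comes from a real application): the normal approximation serves the mean well but badly understates the small tail proportion beyond the limit.

    # Skewed but "roughly mound-shaped" data; the limit and parameters are made up.
    set.seed(3)
    x <- rlnorm(500, meanlog = 0, sdlog = 0.5)
    limit <- 3

    t.test(x)$conf.int                             # normal-theory CI for the mean: reasonable here

    p_normal <- 1 - pnorm(limit, mean(x), sd(x))   # normal-approximation tail estimate
    p_true   <- 1 - plnorm(limit, 0, 0.5)          # the actual tail proportion in this simulation
    c(normal_approx = p_normal, truth = p_true)    # the approximation is off by an order of magnitude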
    ------------------------------
    Emil M Friedman, PhD
    emilfriedman@gmail.com
    http://www.statisticalconsulting.org
    ------------------------------



  • 12.  RE: Elephant

    Posted 07-05-2017 09:46
    Beautifully put, George. I'd only demur on the middle passage, which has a bit too much idolatry of past giants for my taste: We have to be especially alert to the oversights and mistakes of Fisher, Neyman, Pearson, Wald, Laplace, Good, Savage, other favorites, and of course ourselves most of all (Fisher and Neyman were hardly shy about pointing out each other's mistakes!). We can indeed see farther than they did thanks to them and all that has come since, if we don't let ourselves be blinded by sanctification of their work or get mired in the confused foundational commitments that characterized too much of 20th-century statistics. 

    Here's my Monday-morning (2017) quarterbacking on that confusion:

    Common elementary discussions of Bayesian vs. frequentist (including my own) deceptively imagine just two types of statistician, a "subjective" or "betting" Bayesian and an "objective" or "algorithmic" frequentist. But of course there is a vast range and variety of statisticians, so these two idealized types are simply natural reference points that should not be taken as absolutes; in fact the statisticians I respect fall away from either extreme.

    The ecumenic (AKA 'toolkit' or 'syncretic' or 'Boxian') view is that Bayesian and frequentist methods are just classes of tools for data analysis, the main classes in use but not the only ones; they are designed to answer different questions which can be asked at the same time of the same method. Bayesian tools are supposed to be oriented toward building models that capture background information, while frequentist tools are supposed to be oriented toward calibrating models against data frequencies. Thus they are actually complementary toolkits, not in conflict as most literature from the last century made it seem. [I wrote "supposed to be oriented" because there seems to be much confusion about how to use and interpret these tools and their outputs properly. So again: a Bayesian can view frequentist methods as calibration tools and frequentists can view Bayesian methods as model-specification tools.]

    Furthermore, the two toolkits can be merged using hierarchical models with both Bayesian and frequentist interpretations. From such models, the analyst can generate both one-off bets (Bayesian inferences) and admissible frequentist decisions for "long-run" error control (subject to the coherency violations needed to ensure the latter in high dimensions). Thus one can be Bayesian and frequentist at the same time, as long as one is not too insistent on unattainable goals like complete frequency robustness or perfect Bayesian coherency. Paradoxes in which the two toolkits seem to conflict can then be viewed as stemming from particular interpretations and assumptions rather than from the toolkits themselves, especially from failure to appreciate that the toolkits answer different questions.
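    As a rough illustration of that merger (a sketch only, on simulated data; it assumes the lme4 and rstanarm packages and is not anyone's prescribed workflow), the same random-intercept model can be read as a likelihood-based frequentist fit or as a Bayesian posterior:

    library(lme4)      # frequentist / likelihood fit of the hierarchy
    library(rstanarm)  # Bayesian fit of the same hierarchy

    set.seed(4)
    g_id <- rep(1:20, each = 10)                         # 20 groups, 10 observations each
    x    <- rnorm(200)
    y    <- 1 + 0.5 * x + rnorm(20)[g_id] + rnorm(200)   # group-level intercepts plus noise
    d    <- data.frame(y, x, group = factor(g_id))

    fit_freq  <- lmer(y ~ x + (1 | group), data = d)                    # shrinkage estimates via REML
    fit_bayes <- stan_lmer(y ~ x + (1 | group), data = d, refresh = 0)  # posterior via MCMC

    fixef(fit_freq)                  # frequentist point estimates of the fixed effects
    summary(fit_bayes, pars = "x")   # posterior summary of the same coefficient

    The same hierarchy supplies calibrated long-run behavior under one reading and posterior probabilities under the other.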

    Furthermore, both toolkits face the deep problem of model misspecification: statistical computations are always predicated on a sampling (observation) model whose status is uncertain (outside of some physics applications), so formal statistical inferences and decisions (regardless of their 'school') cannot capture the full uncertainty needed in an application. But teaching in both traditions has relied on models (both for sampling and for priors) which in soft-science applications cannot be very correct, so at best they supply only rough hypotheticals as inputs ('statistical inferences') for informal synthetic judgments.

    I think the 'toolkit' view just outlined is much like the views advocated by frequentist-leaning statisticians like Cox and Efron, as well as by Bayesian-leaning statisticians like Box and Rubin, who have at times commented on harms from excessive formalization or philosophical rigidity in approaching a genuine applied goal.

    ------------------------------
    Sander Greenland
    Department of Epidemiology and Department of Statistics
    University of California, Los Angeles
    ------------------------------



  • 13.  RE: Elephant

    Posted 07-06-2017 09:14
    Nicely put, George! And good responses from everyone! I love to read discussions among statisticians when there are no equations involved. The elephant is alive and well, although still at risk of being poached for its valuable but misunderstood parts.

    ------------------------------
    Susan Spruill
    Statistical Consultant
    ------------------------------