ASA Connect


Department Final is riddled with errors and forces bad stats

  • 1.  Department Final is riddled with errors and forces bad stats

    Posted 12-30-2021 12:47
    Hey everyone,

    Last term I taught an intro to stats class where someone in the dept made a "comprehensive final" for all the class sections. Upon looking at the original final draft, I caught about a dozen errors in the types of methods used to analyze the given data and the answers themselves. I brought up these mistakes to the exam writer. They went back and rewrote several of the problems. But, by the time I got the new final draft, it was too late to make changes a third time. 

    With the second final draft, there were still many errors in how to analyze the data and methods to use. 

    While grading, I noticed that my students were doing their best to analyze the data properly. Often that meant they knew what they should do but, because the author had different ideas, the exam didn't supply the information they needed. So, they got stuck. Others looked at the problems and wrote comments like, "We should use method X with this data. But, we don't have all the info we need." Some students were able to look at what the author wanted, deduced that they were supposed to use the wrong method, and did so. 

    How would you grade an exam like that? 


    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------


  • 2.  RE: Department Final is riddled with errors and forces bad stats

    Posted 12-31-2021 09:34
    Aaaargh! Images of sow's ears and silk purses immediately come to mind. 
    But errors do happen and students have to be given scores, so if throwing out the whole thing and retesting is not an option, what do you do? Errors sometimes (happily rarely) do occur even in professionally built tests despite relentless and careful quality checks.  So what do the pros do?  See Wainer, H. (1983). Pyramid power: Searching for an error in test scoring with 830,000 helpers. The American Statistician, 37, 87-91 for one example. Sadly, there are others.
    So here's one formula:
    1.  On most tests just eliminating one (or a few) flawed items still leaves enough others to get an accurate score. That doesn't compensate students for the time that they spent on it but you have to do something. If the error is an ambiguity you can allow several answers. This gives students compensation for their time. Of course, if you essentially allow all answers and so everyone gets the item correct, that is the same as eliminating the item since it no longer discriminates among the examinees, but that is more ethically palatable.  
    2. Alternatively, if there aren't enough valid items left after eliminating the bad ones to yield reliable scores, you're stuck (even David Donoho's denoising methods may be stuck here, although you might ask him). Give everyone the unreliable grade they got on the shortened exam and only count it toward their final grade as you would a quiz of similar length. A standard here is that in making up a final grade you should weigh each of its components by its reliability  (long tests count more than short quizzes, which in turn count more than such unreliable measures as the professor's impression of the student from class participation).
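
    One way to read "weigh each component by its reliability" is as a weighted average with weights proportional to reliability. The sketch below assumes exactly that; the function name, the scores, and the reliability values are hypothetical and chosen only to illustrate the arithmetic.

```python
def reliability_weighted_grade(components):
    """Weighted average of percent scores, weights proportional to reliability.

    components: list of (score_percent, reliability) pairs, where the
    reliability values are ASSUMED estimates between 0 and 1.
    """
    total_reliability = sum(reliability for _, reliability in components)
    if total_reliability <= 0:
        raise ValueError("at least one component needs positive reliability")
    weighted_sum = sum(score * reliability for score, reliability in components)
    return weighted_sum / total_reliability


# Hypothetical student: long exam, shortened flawed final, participation.
components = [
    (82.0, 0.9),  # long midterm, assumed reliability 0.9
    (70.0, 0.5),  # shortened final, assumed reliability 0.5
    (95.0, 0.2),  # class participation impression, assumed reliability 0.2
]
final_grade = reliability_weighted_grade(components)
print(f"Reliability-weighted grade: {final_grade:.3f}")
```

    With these made-up numbers the shortened final counts for less than the midterm but still more than the participation impression.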

    This problem provides three lessons: (i) more QC, (ii) use many more short test items and fewer long ones - you get more informational bang for the buck (student time) with well-made multiple choice questions than you do for longer constructed response items, and there's a reason why major exams are mostly multiple choice: with many short items, even a few flawed ones can be eliminated without having an undue effect on score accuracy, and (iii) more QC.

    I hope this helps. There is a substantial literature on dealing with testing errors. It is unfortunate that such research is needed, but there it is -- to err is ...

    Howard Wainer







    ------------------------------
    Howard Wainer
    Extinguished Research Scientist
    ------------------------------



  • 3.  RE: Department Final is riddled with errors and forces bad stats

    Posted 12-31-2021 15:09
    Whenever I make an exam, if there is a mistake, I will tell my in-person students, "Do the best you can." Then I give full credit if they put down something that looks reasonable. If it's an online exam, I will send out an email with the correction and allow students who already turned it in to resubmit it. (I don't believe I should punish my students for my mistakes.) 

    With this exam, each answer was worth a certain number of points (1-4), and skipping those was not much of an option. 


    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 4.  RE: Department Final is riddled with errors and forces bad stats

    Posted 01-10-2022 09:23
    Rather than remove the question from scoring or giving everyone full credit, I would lean toward grading those questions straight-up and then adding four points to all scores (or maybe 3 if everyone gets at least 1). That way the few students that somehow managed to give good answers to overly hard questions still get a higher score on that item corresponding to their exemplary performance.
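
    A small sketch of how this differs from giving everyone full credit on a flawed 4-point item. The three students and their scores are hypothetical; only the "grade straight-up, then add four points" rule comes from the suggestion above.

```python
ITEM_MAX = 4   # points the flawed item is worth
BONUS = 4      # flat bonus added to every student's total

# Hypothetical students: points on all other items, and points earned on
# the flawed item when it is graded straight-up.
other_points = [30, 35, 36]
flawed_item_points = [0, 1, 4]

# Policy A: everyone gets full credit on the flawed item.
full_credit = [other + ITEM_MAX for other in other_points]

# Policy B: grade the item as written, then add the same bonus to everyone.
straight_plus_bonus = [
    other + earned + BONUS
    for other, earned in zip(other_points, flawed_item_points)
]

print("Full credit for all:  ", full_credit)
print("Straight-up + bonus:  ", straight_plus_bonus)
```

    Under policy B nobody ends up below their policy A total, and the student who answered the hard item well keeps that advantage. Totals can exceed the nominal exam maximum, so whether to cap them is a separate choice.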

    ------------------------------
    Robert Pearson
    Associate Professor of Statistics
    Grand Valley State University
    ------------------------------



  • 5.  RE: Department Final is riddled with errors and forces bad stats

    Posted 01-10-2022 10:31
    Is there any way that you could, without breaching confidentiality, show us at least one of the questions, with both correct and incorrect solutions?

    Giles Warrack

    ------------------------------
    Giles Warrack
    Retired
    NC A&T State University
    ------------------------------



  • 6.  RE: Department Final is riddled with errors and forces bad stats

    Posted 01-11-2022 14:39
    I can't upload the actual questions. But, I can give some examples of what we saw and had. 

    When it comes to the ANOVA test, there were some small issues. There were 4 groups: A, B, C, D. Each group had 6 "samples". One of the groups had a much larger variability than the others. (We didn't cover how to handle this. Neither did the other sections.) A question with this data was, "Which group is different from the others?" (1 pt) In my class, we discussed something like FWER and PWER. We discussed the need to change our cutoff values for, say, a p-value. The other sections didn't.  They were taught to "just look at the data". No post-hoc test needed.  

    When my students took into account the change in P-values to find significance, there were no "statistically significant" differences. They also did F-tests to see if variances for the groups were too different. (They were.) If they used a pooled T-test to find a difference between these groups, and didn't make any p-value corrections, they would actually find a "significant difference". If they used the non-pooled test, there was no significant difference. (They would use t-tests here because they could only use what was on their calculators.) They also commented on the small sample sizes of the groups and how small sample sizes can lead to replication issues. (All things we spoke about.)
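
    To make the p-value correction concrete: with 4 groups there are 6 pairwise comparisons. The sketch below assumes a per-test cutoff of 0.05 and a Bonferroni-style correction; the post does not say which correction the class used, and the family-wise formula treats the tests as independent, which pairwise comparisons on shared groups are not, so read it as an approximation.

```python
from math import comb


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests tests,
    ASSUMING independent tests each run at level alpha."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test cutoff that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


n_groups = 4                   # groups A, B, C, D
n_pairs = comb(n_groups, 2)    # number of pairwise comparisons
alpha = 0.05                   # assumed nominal cutoff

fwer = familywise_error_rate(alpha, n_pairs)
per_test_alpha = bonferroni_alpha(alpha, n_pairs)

print(f"Pairwise comparisons:       {n_pairs}")
print(f"Uncorrected family-wise rate: {fwer:.4f}")
print(f"Bonferroni per-test cutoff:   {per_test_alpha:.5f}")
```

    A pairwise p-value that clears 0.05 but not the corrected cutoff is the kind of result that reads as "significant" without a correction and "not significant" with one.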




    There was a question where the author of the test wanted students to use a Paired T-test. But, the data was proportions. Later, the author wanted the students to use Chi^2 test. And the data was proportions. 

    We spoke about how, if we want to test multiple proportions, we should use a Chi^2 test. We also discussed how, if someone wants to create a new test for medical testing, we usually have a "Gold Standard" method that gives what we should "expect", and we look at what the new method shows as "observed". We also discussed how this wasn't the best way to do this. But, given what we cover in class, it would be the method to use, should it come up. So, when they saw multiple proportions, they went for Chi^2. 

    Then there were some follow-up questions on the Chi^2 test. One asked, "Which groups seem to deviate the most from expected outcomes?" Some of the groups had a Chi^2 of, say, 19, 14, 13. These were the "different" groups. The group with a Chi^2 of 12 was apparently "not different enough". (Not sure where this comes from.) The other 





    When it comes to linear regression, my class discussed "Statistical Significance" and confidence intervals for coefficients. Using the standard hypothesis tests, H0: B0, B1 = 0; Ha: B0 and/or B1 =/= 0, they fail to reject H0.

    Now, when they have Y = B0 + B1X, or whatever it was, and B0 and B1 are not statistically significant, I taught them to treat them as 0. So, no matter the X, Y = 0... +/- error. 

    On top of that, the range of X was (1,000 , 5,000). The range of Y was (50 , 200). Now, interpret Y = B0 + B1X when X = 0. (I told my students using a regression model outside the range of X-values was not a good idea and showed them a couple of examples of why that can be a really bad idea.) The data also looked quadratic. (We discussed that too. Other sections didn't.)



    There were some other issues too. These were the ones that confused my students the most.  

    The original final exam was even worse.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 7.  RE: Department Final is riddled with errors and forces bad stats

    Posted 12-31-2021 10:08
    I wouldn't have given the test. A bad measurement tool would have given you a bad measure of the students' comprehension of the class material. As a teacher you have a responsibility to your students; as a professional you have a responsibility to yourself. I understand there are a lot of departmental politics at play here, but you were failed. A lot of professionals have gotten in trouble for knowingly letting another person's failure be inherited by the next person. After the fact, grade what is measurable on the test and commend the students who recognized the errors.

    ------------------------------
    Ben Barnard
    Data Scientist
    Wells Fargo
    ------------------------------



  • 8.  RE: Department Final is riddled with errors and forces bad stats

    Posted 12-31-2021 15:22
    I was tempted to skip the dept final too. But, I am required to give it. The dept's admin asst prints it out. I pick it up maybe a day or 2 before the final. Not much time or ability in there to have them print out my exam for my students elsewhere. 

    We spend most of the class discussing fairly real-world situations, what we might see, and how we might go about answering those questions. We discuss what happens when we get a wrong answer or one that runs contrary to what, say, a boss wants us to find. Most importantly, we discuss that sometimes there are many ways to look at the data and each view might require a different technique. So, being able to explain why you did 'test X' vs 'test Y' is an important step. They learned those lessons well. 


    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 9.  RE: Department Final is riddled with errors and forces bad stats

    Posted 12-31-2021 12:08

    Why not treat erroneous questions as missing, and prorate the grade based on the answers to the non-erroneous questions so that the total grade is out of 100 (or whatever the examination total is)?

    You are paid to grade exams. But erroneous questions are decorations, like the lines in the paper, and not an actual part of the exam. 

    This of course addresses obvious errors, not differences of opinion.
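
    The proration suggested above can be sketched as follows, assuming each question is flagged as valid or erroneous by the grader. The point values and flags are hypothetical.

```python
def prorated_score(questions, exam_total=100.0):
    """Rescale points earned on valid questions to the full exam total.

    questions: list of (earned, possible, is_valid) tuples; erroneous
    questions (is_valid=False) are treated as missing.
    """
    earned = sum(e for e, _, valid in questions if valid)
    possible = sum(p for _, p, valid in questions if valid)
    if possible == 0:
        raise ValueError("no valid questions left to prorate on")
    return exam_total * earned / possible


# Hypothetical student: five questions, two of them judged erroneous.
questions = [
    (3, 4, True),
    (2, 2, True),
    (0, 3, False),  # erroneous question, ignored
    (1, 1, True),
    (1, 4, False),  # erroneous question, ignored
]
score = prorated_score(questions)
print(f"Prorated score: {score:.1f} out of 100")
```

    The fewer valid points remain, the more each one moves the prorated score, so this works best when the erroneous questions are a small share of the exam.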

    In a case where there is insufficient information to answer the question, the correct answer is to identify what information is needed to solve the problem. And this is a good exam question, not a bad one. Life's exam questions are of this kind. Life has very few problems where all needed information is provided up front and one just calculates. A realistic exam is a better exam than a non-realistic one. 



    ------------------------------
    Jonathan Siegel
    Director Clinical Statistics
    ------------------------------



  • 10.  RE: Department Final is riddled with errors and forces bad stats

    Posted 12-31-2021 15:52
    The number of bad questions was great. Rescaling based upon the good questions would be difficult. 

    Not to mention that some of the good questions are what I call "traps". Doing a 2x2 table gets the point across that you can do a Chi Square test. A 4x4 table is asking you to know how to do it, and be perfect, when you put in your numbers. 

    I use expected outcomes and binomial calcs to demonstrate this. "Suppose I give you 4 numbers and ask you to give me the mean and std dev. If you put 99% of your entries in correctly, how many errors should I expect you to make? What is the probability of perfection? Now redo that with say 25 numbers.... if I enjoy taking off points and making you feel bad, what should I do? What if I want to see that you can DO that type of problem, what should I do?" 
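
    The binomial calculation in that classroom question can be sketched as below, assuming each entry is keyed in correctly with probability 0.99 and that entries are independent (the independence part is an assumption added here).

```python
def expected_errors(n_entries, p_correct):
    """Expected number of entry errors under a binomial model."""
    return n_entries * (1 - p_correct)


def prob_perfect(n_entries, p_correct):
    """Probability that every one of n_entries is entered correctly."""
    return p_correct ** n_entries


P_CORRECT = 0.99  # per-entry accuracy from the example

for n in (4, 25):
    print(
        f"n = {n:2d}: expected errors = {expected_errors(n, P_CORRECT):.2f}, "
        f"P(perfect) = {prob_perfect(n, P_CORRECT):.4f}"
    )
```

    With 4 numbers a perfect entry is very likely; with 25 numbers roughly one attempt in five contains at least one slip, even for a student who knows the method.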

    My exams are long and tough enough. No one gets a perfect score... without extra credit.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 11.  RE: Department Final is riddled with errors and forces bad stats

    Posted 01-03-2022 08:21
    I know this is off-topic, but what is an HPC Abuser?

    ------------------------------
    Giles Warrack
    Retired
    NC A&T State University
    ------------------------------



  • 12.  RE: Department Final is riddled with errors and forces bad stats

    Posted 01-03-2022 12:21
    I have several dual-CPU motherboards that I use to create my own "at home" supercomputer (HPC), along with multiple co-processors to boost my ability to do math computations. 

    On several occasions, I ran so many computations through my computers that they would overheat, and I cooked a few CPUs and co-processors..... 

    When I spoke with a pair of computer technicians about ways to keep them cool, or at least cooler, they said, "Dude, that's computer abuse!" So, HPC abuser.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 13.  RE: Department Final is riddled with errors and forces bad stats

    Posted 01-12-2022 09:39
    What I would do is not drop anyone's grade based on the exam, and use my best judgement to see if students were demonstrating knowledge of the material by identifying the bad test questions and should have their grade raised based on what they did and how they handled the bad questions.  I would err on the side of generosity.

    ------------------------------
    Laura Kapitula
    Associate Professor
    Grand Valley State University
    ------------------------------



  • 14.  RE: Department Final is riddled with errors and forces bad stats

    Posted 01-13-2022 14:35
    Whenever I teach, I try not to needlessly harm the students. 

    I ended up doing almost exactly what you said.

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 15.  RE: Department Final is riddled with errors and forces bad stats

    Posted 01-13-2022 16:00
    That's a tough one.

    I would make a rubric for each question where there is insufficient information. It's not perfect, but it might help you grade fairly. Like -- what would you answer? What would you consider a 'fair' answer or a 'thoughtful' answer? What's a 'poor' answer? Personally, I would grade lightly on those questions and err on the side of the students.

    Robyn

    ------------------------------
    Robyn Ball
    Computational Scientist
    The Jackson Laboratory
    ------------------------------