I can't upload the actual questions, but I can give some examples of what we saw.

When it comes to the ANOVA question, there were some issues. There were 4 groups, A, B, C, D, with 6 "samples" each. One of the groups had much larger variability than the others. (We didn't cover how to handle this; neither did the other sections.) One question with this data was, "Which group is different from the others?" (1 pt) In my class, we discussed things like FWER and PWER and the need to adjust our cutoff values for, say, a p-value. The other sections didn't. They were taught to "just look at the data". No post-hoc test needed.

When my students applied the p-value corrections to test for significance, there were no "statistically significant" differences. They also ran F-tests to see whether the group variances were too different. (They were.) If they used a pooled t-test to compare these groups without any p-value corrections, they would actually find a "significant difference". If they used the non-pooled test, there was no significant difference. (They used t-tests here because they could only use what was on their calculators.) They also commented on the small sample sizes of the groups and how small samples can lead to replication issues. (All things we spoke about.)
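To make the setup concrete, here is a minimal sketch in Python of the kind of analysis described above: four small groups, one with much larger spread, pairwise t-tests both pooled and non-pooled (Welch), and a Bonferroni-adjusted cutoff. All the numbers are made up for illustration; the exam's actual data is not reproduced here.

```python
# Hypothetical data mimicking the exam setup: 4 groups, n = 6 each,
# with group D showing much larger variability than the others.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10, 1, 6)
b = rng.normal(10, 1, 6)
c = rng.normal(10, 1, 6)
d = rng.normal(12, 6, 6)   # shifted mean AND much larger spread

# One-way ANOVA assumes roughly equal variances across groups.
f_stat, p_anova = stats.f_oneway(a, b, c, d)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")

# Pairwise comparisons: 4 groups -> 6 pairs, so a Bonferroni correction
# tests each pair against alpha / 6 instead of alpha.
groups = {"A": a, "B": b, "C": c, "D": d}
alpha, m = 0.05, 6
for (n1, g1), (n2, g2) in combinations(groups.items(), 2):
    _, p_pooled = stats.ttest_ind(g1, g2, equal_var=True)    # pooled
    _, p_welch = stats.ttest_ind(g1, g2, equal_var=False)    # non-pooled (Welch)
    print(f"{n1} vs {n2}: pooled p = {p_pooled:.3f}, "
          f"Welch p = {p_welch:.3f}, cutoff = {alpha / m:.4f}")
```

With data like this, the pooled and Welch p-values can disagree for the pairs involving the high-variance group, which is exactly the discrepancy the students ran into.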

There was a question where the author of the test wanted students to use a paired t-test, but the data was proportions. Later, the author wanted the students to use a Chi^2 test, and that data was also proportions.

We spoke about how, if we want to test multiple proportions, we should use a Chi^2 test. We also discussed how, if someone wants to create a new method for medical testing, there is usually a "Gold Standard" method that gives us what we should "expect", and we look at what the new method shows as "observed". We also discussed how this wasn't the best way to do this but, given what we cover in class, would be the method to use should it come up. So, when they saw multiple proportions, they went for Chi^2.
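The observed-vs-expected comparison described above is a chi-square goodness-of-fit test. A short sketch, with invented counts standing in for the new method ("observed") and the gold standard ("expected"):

```python
# Hypothetical counts: a new diagnostic method's outcomes per category
# ("observed") vs. what the gold-standard method predicts ("expected").
from scipy import stats

observed = [42, 18, 25, 15]    # new method's counts (made up)
expected = [40, 20, 28, 12]    # gold-standard counts (made up, same total)

# Goodness-of-fit: sum of (O - E)^2 / E over categories, df = k - 1.
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi^2 = {chi2:.2f}, p = {p:.3f}")
```

Note that `scipy.stats.chisquare` requires the observed and expected totals to match, which is consistent with treating the gold standard as the source of expected counts.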

Looking at some follow-up questions with the Chi^2 test, there was a question asking, "Which groups seem to deviate the most from expected outcomes?" Some of the groups had a Chi^2 contribution of, say, 19, 14, 13. These were the "different" groups. The group with a Chi^2 of 12 was apparently "not different enough". (Not sure where this comes from.)
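For what it's worth, the usual way to answer "which groups deviate most" is to decompose the chi-square statistic into its per-cell contributions, (O - E)^2 / E, and rank them; there is no standard cutoff (like "12") below which a contribution stops counting. A sketch with invented counts:

```python
import numpy as np

# Hypothetical per-group counts; "expected" would come from the gold standard.
observed = np.array([30, 48, 51, 46])
expected = np.array([45.0, 40.0, 40.0, 50.0])

# Each cell contributes (O - E)^2 / E; the chi^2 statistic is their sum.
contrib = (observed - expected) ** 2 / expected
order = np.argsort(contrib)[::-1]    # groups ranked by contribution
for i in order:
    print(f"group {chr(65 + i)}: contribution = {contrib[i]:.2f}")
print(f"total chi^2 = {contrib.sum():.2f}")
```

The ranking identifies the biggest contributors, but whether any of them is "different enough" is a question about the overall test and its degrees of freedom, not about an arbitrary per-cell threshold.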

When it comes to linear regression, my class discussed "statistical significance" and confidence intervals for coefficients. Using the standard hypothesis tests, H0: B0 = 0 and B1 = 0 vs. Ha: B0 and/or B1 =/= 0, they fail to reject H0.

Now, when they have Y = B0 + B1X, or whatever it was, and B0 and B1 are not statistically significant, I taught them to treat them as 0. So, no matter the X, Y = 0... +/- error.

On top of that, the range of X was (1,000, 5,000) and the range of Y was (50, 200). Now, interpret Y = B0 + B1X when X = 0. (I told my students that using a regression model outside the range of the X-values is not a good idea, and showed them a couple of examples of why that can be a really bad idea.) The data also looked quadratic. (We discussed that too. Other sections didn't.)
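A minimal sketch of the regression issue, with fabricated data: X lives in (1,000, 5,000), Y in (50, 200), there is no real relationship, and "interpreting" the fit at X = 0 means extrapolating at least 1,000 units below any observed X.

```python
# Made-up data mimicking the exam's ranges; no true X-Y relationship.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(1000, 5000, 20)
y = rng.uniform(50, 200, 20)    # pure noise over the stated Y range

res = stats.linregress(x, y)
print(f"slope = {res.slope:.4f}, slope p-value = {res.pvalue:.3f}")

# Evaluating the fitted line at X = 0 is far outside the data's support:
y_at_zero = res.intercept + res.slope * 0
print(f"'prediction' at X = 0: {y_at_zero:.1f}  (extrapolation, not meaningful)")
```

With noise like this the slope is typically not significant, and the intercept "prediction" at X = 0 is an artifact of the line, not a statement about data that was never observed anywhere near X = 0.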

There were some other issues too. These were the ones that confused my students the most.

The original final exam was even worse.

------------------------------

Andrew Ekstrom

Statistician, Chemist, HPC Abuser;-)

------------------------------

Original Message:

Sent: 01-10-2022 10:31

From: Anthony Warrack

Subject: Department Final is riddled with errors and forces bad stats

Is there any way that you could, without broaching confidentiality, show us at least one of the questions, with both correct and incorrect solutions?

Giles Warrack

------------------------------

Giles Warrack

Retired

NC A&T State University

Original Message:

Sent: 01-10-2022 09:23

From: Robert Pearson

Subject: Department Final is riddled with errors and forces bad stats

Rather than remove the question from scoring or giving everyone full credit, I would lean toward grading those questions straight-up and then adding four points to all scores (or maybe 3 if everyone gets at least 1). That way the few students that somehow managed to give good answers to overly hard questions still get a higher score on that item corresponding to their exemplary performance.

------------------------------

Robert Pearson

Associate Professor of Statistics

Grand Valley State University

Original Message:

Sent: 12-31-2021 15:08

From: Andrew Ekstrom

Subject: Department Final is riddled with errors and forces bad stats

Whenever I make an exam, if there is a mistake, I will tell my in-person students, "Do the best you can," and then give full credit if they put down something that looks reasonable. If it's an online exam, I will send out an email with the correction and allow students who have already turned it in to resubmit. (I don't believe I should punish my students for my mistakes.)

With this exam, each answer was worth a certain number of points (1-4), and skipping those was not much of an option.

------------------------------

Andrew Ekstrom

Statistician, Chemist, HPC Abuser;-)

Original Message:

Sent: 12-31-2021 09:34

From: Howard Wainer

Subject: Department Final is riddled with errors and forces bad stats

Aaaargh! Images of sow's ears and silk purses immediately come to mind.

But errors do happen, and students have to be given scores, so if throwing out the whole thing and retesting is not an option, what do you do? Errors sometimes (happily rarely) occur even in professionally built tests, despite relentless and careful quality checks. So what do the pros do? See Wainer, H. (1983). Pyramid power: Searching for an error in test scoring with 830,000 helpers. *The American Statistician*, *37*, 87-91 for one example. Sadly, there are others.

So here's one formula:

1. On most tests, just eliminating one (or a few) flawed items still leaves enough others to get an accurate score. That doesn't compensate students for the time that they spent on it, but you have to do something. If the error is an ambiguity, you can allow several answers. This gives students compensation for their time. Of course, if you essentially allow all answers and so everyone gets the item correct, that is the same as eliminating the item, since it no longer discriminates among the examinees, but it is more ethically palatable.

2. Alternatively, if there aren't enough valid items left after eliminating the bad ones to yield reliable scores, you're stuck (even David Donoho's denoising methods may be stuck here, although you might ask him). Give everyone the unreliable grade they got on the shortened exam and only count it toward their final grade as you would a quiz of similar length. A standard here is that in making up a final grade you should weigh each of its components by its reliability (long tests count more than short quizzes, which in turn count more than such unreliable measures as the professor's impression of the student from class participation).

This problem provides three lessons: (i) more QC, (ii) use many more short test items and fewer long ones -- you get more informational bang for the buck (student time) with well-made multiple choice questions than you do with longer constructed-response items -- there's a reason why major exams are mostly multiple choice; in this situation even a few flawed items can be eliminated without having an undue effect on score accuracy, and (iii) more QC.

I hope this helps. There is a substantial literature on dealing with testing errors. It is unfortunate that such research is needed, but there it is -- to err is ...

Howard Wainer

------------------------------

Howard Wainer

Extinguished Research Scientist

Original Message:

Sent: 12-30-2021 12:47

From: Andrew Ekstrom

Subject: Department Final is riddled with errors and forces bad stats

Hey everyone,

Last term I taught an intro stats class where someone in the dept made a "comprehensive final" for all the class sections. Upon reviewing the original final draft, I caught about a dozen errors in the types of methods used to analyze the given data and in the answers themselves. I brought these mistakes up with the exam writer. They went back and rewrote several of the problems, but by the time I got the new final draft, it was too late to make changes a third time.

With the second final draft, there were still many errors in how to analyze the data and methods to use.

While grading, I noticed that my students were doing their best to analyze the data properly... which often meant they knew what they should do but, because the author had different ideas, weren't given the correct information. So, they got stuck. Others looked at the problems and wrote comments like, "We should use method X with this data. But, we don't have all the info we need." Some students were able to look at what the author wanted, deduced that they, the student, were supposed to use the wrong method, and did so.

How would you grade an exam like that?

------------------------------

Andrew Ekstrom

Statistician, Chemist, HPC Abuser;-)

------------------------------