Stephen Simon wrote:
> My topic would be "Errors and Negligence Handling Data." ... My lecture would be more about sloppy research practices than fraudulent research practices.... I am particularly interested in
> 1. colorful anecdotes about errors or negligence.
I have plenty of anecdotes about the more mundane applications of statistical data handling, which might or might not be considered "research," and so will offer just two as examples.
A consultant for an industrial facility (for GTE, a large well-known US manufacturer) set up a computer spreadsheet in which quarterly environmental monitoring data could be entered. This spreadsheet conducted statistical analyses (according to regulatory guidelines) and included tests to determine whether the facility remained in compliance with its pollution discharge limits. It was duly turned over to the plant manager, who delegated a receptionist to enter the data as they came in. She would print it out; he would sign it, attesting under penalty of law that it was true, complete, and accurate, and submit it to the US EPA.
I was contacted when a new environmental consultant became suspicious: he had seen data suggesting the pollution permit was being violated, yet all the quarterly reports said things were fine. After doing some spreadsheet forensics on the previous five years of reports, I discovered that someone had inadvertently corrupted some of the calculation cells (which were not protected or hidden) early on, most likely through a stray keystroke during data entry. Because the spreadsheet for each quarter was created by copying the previous quarter's file, the corruption was propagated to all subsequent analyses. The plant had to correct and re-file years of reports, as well as take immediate action to respond to its violations. I do not know whether it was otherwise penalized.
(Although spreadsheets are implicated in the majority of sloppy-data-handling anecdotes, that is partly because they were long the only tool most people used to store and manage their data. The modern tendency to use database software hasn't eliminated sloppiness, however: it just papers it over with a veneer of sophistication.)
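A minimal sketch of one way such corruption might be caught early: compare the formula cells of each quarterly copy against a trusted template and flag any that differ. This is only an illustration, not the forensics described above. It assumes a spreadsheet can be modelled as a plain dictionary of cell addresses, and the function names and cell contents are hypothetical.

```python
# Hypothetical sketch: a sheet is modelled as a dict mapping a cell address
# to its contents. Formulas are strings that start with "=".

def formula_cells(sheet):
    """Return only the cells that hold formulas."""
    return {addr: text for addr, text in sheet.items()
            if isinstance(text, str) and text.startswith("=")}


def changed_formulas(template, quarterly):
    """List (address, expected, found) for each template formula cell
    whose content differs in the quarterly copy."""
    problems = []
    for addr, expected in sorted(formula_cells(template).items()):
        found = quarterly.get(addr)  # None if the cell was deleted
        if found != expected:
            problems.append((addr, expected, found))
    return problems


# Invented example data: B2 has been overwritten with a literal value.
template = {
    "A1": 12.0,
    "A2": 15.5,
    "B1": "=AVERAGE(A1:A2)",
    "B2": '=IF(B1>20,"VIOLATION","OK")',
}
quarterly = {
    "A1": 31.0,
    "A2": 27.5,
    "B1": "=AVERAGE(A1:A2)",
    "B2": "OK",
}

problems = changed_formulas(template, quarterly)
for addr, expected, found in problems:
    print(f"{addr}: expected {expected!r}, found {found!r}")
```

Running such a check on every new quarterly file, before it is signed, would at least stop a corrupted cell from being copied forward unnoticed. Protecting the calculation cells in the spreadsheet itself addresses the same risk at its source.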
Some stories I can tell may straddle the line between sloppiness (or negligence) and willful perversion of the data. These tend to come from legal cases. Many result from the ploys used by lawyers to make data analysis difficult for the opposing party. However, in one situation concerning a $4.5 billion claim for damages, Plaintiff's experts literally made up the data on which their entire case hinged (asserting they were performing a kind of imputation of groundwater contamination data). Although those "experts" (it was actually a Master's student in geostatistics who did the work) were backed up by a well known and respected statistician, his testimony could not rescue them. Within weeks of this finding, and partly on its basis, the court issued a summary judgment releasing the Defendant (the US Department of Justice) from liability in this case. (This is mentioned in The Rise of Natural Resource Damage Claims..., p. 8: "the State had failed to prove the existence of any recoverable damages associated with the groundwater contamination. ... The court also noted that there was no evidence of deep contamination...")
> 2. written guidelines on good practices (e.g., CDISC, Reproducible Research)
> 3. red flags for things that warrant special attention (e.g., missing values, dates on both sides of Y2K)
For more anecdotes, a direct response to (3), and some suggestions concerning (2), please see my posts "Essential Data Checking Tests" and "QA/QC Guidelines for a Database" on the StackExchange statistics site, http://stats.stackexchange.com/.
-------------------------------------------
William Huber
Quantitative Decisions
-------------------------------------------