Help with lecture on "Errors and Negligence Handling Data"

  • 1.  Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 12:54
    I've volunteered to give a three hour lecture as part of a Responsible Conduct of Research class. My topic would be "Errors and Negligence Handling Data." You could say that I've seen a lot of negligent data handling. Heck, I've committed enough errors myself to qualify as an expert on this topic.

    I thought that others in the Statistical Consulting Section, though, would have even more to share with me. I am particularly interested in

    1. colorful anecdotes about errors or negligence.

    2. written guidelines on good practices (e.g., CDISC, Reproducible Research)

    3. red flags for things that warrant special attention (e.g., missing values, dates on both sides of Y2K)

    4. actual case studies or published examples of bad data handling.

    I should point out that there is already a lecture on falsification and fabrication. My lecture would be more about sloppy research practices than fraudulent research practices.

    Any help you can provide would be greatly appreciated.

    -------------------------------------------
    Stephen Simon
    Independent Statistical Consultant
    P. Mean Consulting
    -------------------------------------------


  • 2.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 13:39
    Stephen:

      RE your first topic, how about the NASA probe that crashed onto Mars (rather than soft landing on it) because of a mix-up in units of measurement (English vs. metric) somewhere in the navigational control system instructions?

      Just a suggestion that seems to fit what you might be looking for here. 
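
    A minimal Python sketch of how a units mix-up like this can slip through when quantities travel as bare numbers. The class and function names are hypothetical and this is not a reconstruction of any NASA software; impulse is used only as an example quantity, and the one outside fact relied on is the conversion 1 lbf*s = 4.4482216152605 N*s.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # newton-seconds per pound-force second


@dataclass(frozen=True)
class Impulse:
    """A value that carries its unit with it (hypothetical helper type)."""
    value: float
    unit: str  # "N*s" or "lbf*s"

    def to_newton_seconds(self) -> float:
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        # Refuse to guess: an unknown unit is an error, not a default.
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


def total_impulse_n_s(readings):
    """Sum readings after converting every one to newton-seconds."""
    return sum(r.to_newton_seconds() for r in readings)


readings = [Impulse(10.0, "lbf*s"), Impulse(5.0, "N*s")]

# Adding the bare numbers silently mixes two unit systems.
naive_total = sum(r.value for r in readings)

# Converting first gives a total that actually means something.
checked_total = total_impulse_n_s(readings)
```

    The point of the design is that the unit is stored next to the number, so a file or interface that drops it fails loudly instead of producing a plausible-looking total.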

    -------------------------------------------
    Lance Heilbrun
    Karmanos Cancer Institute
    -------------------------------------------








  • 3.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 14:41
    I just did a little browsing on the web to confirm the story that NASA had not detected the ozone hole for 9 years because their computer software was deleting as outliers the low ozone values from their satellite detectors over Antarctica. This web page indicates that the well-known story is apocryphal. NASA had flagged the low ozone values nearly as soon as they appeared, was carefully checking them, and had submitted an abstract reporting low ozone concentrations when another group published just ahead of them:
    http://www.statsci.org/data/general/ozonehol.html

    There are other examples of outliers that may be equally apocryphal. Fisher pointed out that Mendel's data are too close to expectation, indicating that he or his assistant had fudged the data, perhaps throwing out data from crosses that did not fit the 1:2:1 ratios. Bruce Weir in his book on quantitative genetic data analysis was even more severe in saying Mendel misanalyzed his data egregiously. But, Mendel has his defenders who discount Fisher's analysis.

    The controversy is summarized by Pires & Branco (2010):
    http://arxiv.org/pdf/1104.2975.pdf

    Which reminds me of the other classic case of data being too close to expectation: Cyril Burt's identical twin studies. Burt produced correlations that were nearly the same from study to study. He had coauthors that supposedly did the experimental work for his analyses, but it turns out that the co-authors didn't exist and the results were probably fabricated. Google scholar reveals quite a bit of literature pro- and con on Burt's alleged fraud.

    There is another perhaps apocryphal story from my own field of biological oceanography. The story has it that estimates of food flux to the deep sea using particle interceptors (essentially suspended trash baskets) were underestimated because of outlier removal. Sometimes a fish will hover near a trap and defecate or a zooplankter will die in the particle collector. These high organic carbon values were deleted from the values used to calculate the mean organic matter flux. Later, the story goes, it was discovered that a substantial portion of the deep sea food flux comes in pulses and that by eliminating the extreme right tail of the food flux distribution, biological and chemical oceanographers were greatly underestimating the food flux. Now, like the ozone hole outlier story, this story too may be apocryphal.
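
    To make the arithmetic of that story concrete, here is a small Python sketch. The flux readings and the cutoff are invented for illustration only (they are not oceanographic data); the sketch just shows how much of a mean can live in a right tail that a screening rule throws away.

```python
from statistics import mean

# Invented flux readings (arbitrary units): mostly low, with one pulse.
fluxes = [1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 30.0, 1.0]
OUTLIER_CUTOFF = 10.0  # hypothetical screening rule

kept = [f for f in fluxes if f <= OUTLIER_CUTOFF]
dropped = [f for f in fluxes if f > OUTLIER_CUTOFF]

mean_all = mean(fluxes)   # mean of every reading, pulse included
mean_kept = mean(kept)    # mean after the "outlier" is deleted

# Fraction of the total flux carried by the deleted readings.
share_dropped = sum(dropped) / sum(fluxes)

print(f"mean with pulse:    {mean_all:.2f}")
print(f"mean without pulse: {mean_kept:.2f}")
print(f"share of flux discarded: {share_dropped:.0%}")
```

    With these made-up numbers the single deleted reading carries most of the total, which is the mechanism the story describes: if the process really is pulsed, the extreme values are the signal rather than contamination.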

    -------------------------------------------
    Eugene Gallagher
    Associate Professor
    Univ of Massachusetts
    -------------------------------------------








  • 4.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 18:08
    Hi All,

    Here is another example of an error in handling units: you may have heard about the Gimli Glider, e.g., see
    http://en.wikipedia.org/wiki/Gimli_Glider
    You'll find a lot of stuff on the Web by searching for "Gimli Glider".

    Cheers,
    Sergei
    -------------------------------------------
    Sergei Leonov
    AstraZeneca
    -------------------------------------------








  • 5.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 18:49
    I know of a statistical agency that published some estimates of 0 +/- margin of error.  That had to be a programming error.

    -------------------------------------------
    Charles Coleman
    -------------------------------------------



  • 6.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-22-2013 12:35
    Some years ago I inherited the follow-up to a project from a colleague who had left the company.  I found I was having difficulty reproducing the earlier results even though I had the original data files. 

    Finally I realized that my colleague was reading in the data using comma-delimited input and had forgotten to specify DSD in the SAS infile statement (or maybe didn't realize that the data set had missing values).  That option indicates to SAS that, if there are consecutive delimiters, the value between the delimiters is treated as missing. 

    So each time a value was missing, as indicated by consecutive delimiters, SAS was instead reading the value of the next non-missing value.  So if they were reading x1 to x10 and x5 was missing, then x1-x4 were correct, x5 took the value of x6, x6 took the value of x7, etc.  It was a frequent enough occurrence to change their results, but not so frequent as to become obvious.
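
    The same failure can be reproduced outside SAS. The Python sketch below is only an analogy under stated assumptions: the list comprehension stands in for a reader that treats consecutive delimiters as one, and the sample line is invented. It is not a model of the SAS infile statement itself.

```python
import csv
import io

# Ten expected fields x1..x10, with x5 missing (two commas in a row).
line = "1,2,3,4,,6,7,8,9,10"
EXPECTED_FIELDS = 10

# A reader that collapses consecutive delimiters drops the empty field,
# so every later value slides one position to the left.
collapsed = [field for field in line.split(",") if field != ""]

# csv.reader keeps the empty field, so positions stay aligned.
preserved = next(csv.reader(io.StringIO(line)))
values = [float(f) if f != "" else None for f in preserved]


def check_field_count(fields, expected):
    """Cheap guard: a short record is a warning sign of shifted columns."""
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")


check_field_count(preserved, EXPECTED_FIELDS)   # passes
print("collapsed x5 =", collapsed[4])           # holds the value meant for x6
print("preserved x5 =", values[4])              # None, i.e. missing
```

    A field-count check on every record is a cheap safeguard here, since a shifted record ends up one value short.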

    -------------------------------------------
    Michael Morton
    -------------------------------------------








  • 7.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 13:42
    The canonical source for crazy stories about negligent data handling is the Retraction Watch blog, where they document and follow retractions of scientific papers.

    I have an encyclopedia entry on Reproducible Research located at:

    http://goo.gl/hs7eb

    This discusses the Potti case, the Baltimore case, and several others. We are living in a golden age of stuff that should not happen.

    -------------------------------------------
    Paul Thompson
    Director, Methodology and Data Analysis Center
    Sanford Research/USD
    -------------------------------------------








  • 8.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 16:52

    Thank you Paul for a really interesting article. I think this topic brings up a huge ethical issue that I have been struggling with throughout my years of teaching statistics to non-majors. I work in natural resources, not pharma or medicine, and I regularly read articles in which the statistical models used were obviously inappropriate, e.g. an estimate of population abundance of raptors of 11 +/- 173 (not sure what a negative raptor is...) yet they are published in well-respected journals, and then assumed by other scientists to be correct by virtue of having been published. The problem is that we have somehow come to believe that misuse of statistical tools is not a serious problem. We would never send a novice off to measure protein using the Kjeldahl method without knowing which catalyst to use and that different correction factors are needed for different proteins to account for different amino acid sequences. Yet we easily send novices off to run statistical models not having a clue which "buttons need to be pushed," e.g. whether they might need a random effect, an offset or an overdispersion parameter, etc. Even simple errors like failing to account for different sources of variation at different scales (e.g. data for which mixed models would be appropriate) are extremely easy to commit and can have radical effects on the results. Yet failure to assure that the statistical approach used is appropriate and justifiable is not seen as a breach of ethics.

    It appears that we have come to a point where egregious mishandling of statistics, whether intentional or not, is seen as an acceptable way of doing science. I'm not talking about falsifying data or dropping important outliers. I'm talking about the well-respected researchers who confess to me a poor understanding of statistics at the same time as they conduct their own analyses and base all their inference on their "statistical" results. How can it not be a breach of ethics to apply a tool whose proper use you admit to not knowing?


    -------------------------------------------
    Manuela Huso
    Research Statistician
    US Geological Survey
    -------------------------------------------








  • 9.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 22:52
    Manuela:

    You bring up a serious concern for many statisticians. Many scientists would never allow an uncertified and untrained person to perform a biological assay, but they insist on running their own data analysis. Plus they use GraphPad Prism, which uses incorrect methods for repeated measures data (with more than 2 repeated measurements).

    I work with many persons who are basic scientists. They do their own lab work, and many do not ask me about their statistical methods. I also know that there is a huge problem with reproducibility in biology, and partly this is due to the use of tools that are "user friendly" but also "statistically idiotic".

    -------------------------------------------
    Paul Thompson
    Director, Methodology and Data Analysis Center
    Sanford Research/USD
    -------------------------------------------








  • 10.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 13:52
    hi

    I actually just taught my annual session on the topic of research integrity and statistical ethics this morning (finishing less than 1 hour ago) to our new residents.

    the single most useful exercise is to break into small groups of 2-4 people and discuss the cases contained in this Brigham et al. 2004 publication entitled Ethical Dilemmas in Research Integrity
    https://www.google.com/url?q=http://ori.hhs.gov/education/products/metalinker_round1/Dilemmas.doc&sa=U&ei=jfwUUo7UFYSbrAGc1oHoBA&ved=0CAoQFjAB&client=internal-uds-cse&usg=AFQjCNHoyv4PKqiLy6yS_qKZuxzxihSnZg

    you can review cases reported on the US DHHS Office of Research Integrity website looking for ones that concern the issues you seek
    http://ori.dhhs.gov/case_summary
    for example, here is one of someone who fabricated enrollment data in an annual report to NIH
    http://ori.dhhs.gov/content/case-summary-zach-calleen-s

    also here is an ASA link to some relevant info.
    http://www.amstat.org/committees/ethics/links.cfm

    hth
    Stuart

    -------------------------------------------
    Stuart Gansky
    John C. Greene Professor of Primary Care Dentistry
    University of California, San Francisco
    -------------------------------------------








  • 11.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 14:09
    FYI, even though it's an "example" from another discipline, you might check out the issue concerning some economics work where an error in processing data with Excel ignited some pretty heated discussions.  Here's a web link (http://www.newyorker.com/online/blogs/johncassidy/2013/04/the-rogoff-and-reinhart-controversy-a-summing-up.html)
    to just one of many that come up from a Google search under "Reinhart-Rogoff Error".

    I expect you've probably heard about this issue anyway; but it's definitely an error that had more than within-discipline impacts.


    -------------------------------------------
    Bruce Wetzel
    -------------------------------------------

     Bruce M. Wetzel, Forecasting & Analysis Advisor
    Southern California Gas Company
    Regulatory Affairs/Gas Demand Forecasting & Analysis,
    Phone: (213)-244-3857; ML: GT-14D6
    EMail: bwetzel@semprautilities.com






  • 12.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 16:52
    Try the horror stories from the European spreadsheet risks special interest group.
    I like to have my colleagues who insist on using Excel read the links below:

    http://www.eusprig.org/horror-stories.htm

    and McCullough's paper on errors in some Excel statistical procedures:

    http://www.forecastingprinciples.com/files/McCullough.pdf

    -------------------------------------------
    Chris Barker, Ph.D.
    Consultant and
    Adjunct Associate Professor of Biostatistics
    www.barkerstats.com

    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy
    -------------------------------------------








  • 13.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 17:23
    A long time ago, I dealt with disease data that had been filled in by 4 or 5 different people, each of whom used their own coding scheme.

    The disease test was SUPPOSED to be marked P (for positive), N (for negative), and I (Indecisive). All the coders were told this. Simple, right?

    Nooooooo!

    I got P and p and N and n and I and i.  Not so bad so far.  I also got + and - and pos and neg  (in various capitalizations). BUT sometimes the test was done twice. Rather than tell the coders what to do, it was left to them. Which means I also got all those codes doubled. Sometimes with slashes in between.

    AND this was at the beginning of my career (first job after grad school). I was working in SAS, but didn't know much.... I didn't know about string functions, in particular.

    That was a lot of IF THEN coding!
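
    For comparison, here is roughly what that cleanup looks like with string functions, as a Python sketch. The lookup table contains only the codes listed above; the function name and the decision to return a list (one entry per test) are assumptions, and how to reconcile two disagreeing results is deliberately left to the study team.

```python
import re

# Only the variants mentioned in the post; anything else is rejected.
CANONICAL = {
    "p": "P", "+": "P", "pos": "P",
    "n": "N", "-": "N", "neg": "N",
    "i": "I",
}
TOKEN = re.compile(r"pos|neg|[pni+-]")
SEPARATORS = re.compile(r"[\s/]")


def normalize_result(raw):
    """Map a free-form entry to a list of canonical codes (one per test)."""
    text = raw.strip().lower()
    if text == "":
        return []  # nothing recorded; keep it distinct from any result
    tokens = TOKEN.findall(text)
    # Whatever is left after removing codes and separators is unrecognized.
    leftover = SEPARATORS.sub("", TOKEN.sub("", text))
    if leftover or not 1 <= len(tokens) <= 2:
        raise ValueError(f"unrecognized test code: {raw!r}")
    return [CANONICAL[t] for t in tokens]


for entry in ["P", "neg", "+", "Pos/NEG", "pp", "i / N"]:
    print(f"{entry!r:10} -> {normalize_result(entry)}")
```

    Raising on anything unrecognized is the important part: a silent default would hide exactly the coding drift that caused the trouble.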

    Peter


    -------------------------------------------
    Peter Flom
    -------------------------------------------








  • 14.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-21-2013 17:41
    My favorite example was a researcher that inherited a project where the previous leader had relied upon medical students to extract the study data based on chart review. The students naturally used Excel, and used multiple files/sheets to capture a few different types of data. The problem is that the students didn't realize that they needed to have some form of ID/name/something to link the various sheets, they just lined things up by each row as they went along. Naturally, this new researcher was looking at the data, threw in a couple of sort operations, and not surprisingly couldn't reproduce any of the previous summaries that had come out of this dataset. Thankfully it wasn't a huge study, because they had to go back to the charts for every single subject.
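
    A small Python sketch of why the missing ID mattered. The subject IDs, ages, and lab values are invented for illustration, and the function names are hypothetical; the sketch only contrasts pairing rows by position with pairing them by a shared key.

```python
# Two "sheets" entered row by row for the same three subjects.
ages = [("S01", 54), ("S02", 61), ("S03", 47)]
ldl = [("S01", 130), ("S02", 98), ("S03", 155)]


def join_by_position(sheet_a, sheet_b):
    """Pair rows by order alone, as the unkeyed sheets effectively did."""
    return [(a[1], b[1]) for a, b in zip(sheet_a, sheet_b)]


def join_by_id(sheet_a, sheet_b):
    """Pair rows by subject ID and fail loudly on duplicates or gaps."""
    lookup_b = dict(sheet_b)
    if len(lookup_b) != len(sheet_b):
        raise ValueError("duplicate subject IDs in second sheet")
    missing = [sid for sid, _ in sheet_a if sid not in lookup_b]
    if missing:
        raise ValueError(f"IDs missing from second sheet: {missing}")
    return {sid: (value_a, lookup_b[sid]) for sid, value_a in sheet_a}


# Someone sorts one sheet (here by age) and not the other.
ages_sorted = sorted(ages, key=lambda row: row[1])

by_position = join_by_position(ages_sorted, ldl)  # rows no longer match
by_id = join_by_id(ages_sorted, ldl)              # unaffected by the sort

print(by_position)
print(by_id)
```

    With a key column, sorting is harmless; without one, the only record of which row belongs to which subject is the row order itself.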

    Nick

    -------------------------------------------
    Nicholas Pajewski
    Assistant Professor
    Wake Forest University School of Medicine
    -------------------------------------------








  • 15.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-22-2013 08:14
    The classic study of errors and negligence in the handling of data is Stuart Hurlbert's (1984) Ecological Monographs paper:

    Pseudoreplication and the design of ecological field experiments

    This article documents that the majority of ecological field experiments involved either inadequate descriptions of the statistical analyses or erroneous analyses. Pseudoreplication is Hurlbert's term for model misspecification when an inappropriate error mean square is used to test hypotheses, usually with ANOVA. Hurlbert's review caused a revolution in how ecological studies are conducted and in which ones can be published in top journals. The Ecological Society of America introduced a policy where a statistician had to review designs in studies suspected of pseudoreplication. Hurlbert's article has been cited more than 5800 times, and there have been similar analyses published recently in molecular biology and climate change research where the statistical errors are, if anything, worse than in ecology. Just do a Google Scholar search under "pseudoreplication climate change" for a recent review by Wernberg et al. on statistical errors in climate change research. A Google Scholar search under pseudoreplication will reveal a huge array of reviews of flawed statistical analyses. Hurlbert (1984) and Tony Underwood in their reviews of the literature named names, citing the most egregious abusers of statistical analysis by name. Others just compile the frequency of errors of different types.

    -------------------------------------------
    Eugene Gallagher
    Associate Professor
    Univ of Massachusetts
    -------------------------------------------








  • 16.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-22-2013 11:02
    Chris,

    Those are both great references, thanks for sharing them.

    It looks like in the McCullough paper the latest version of Excel they looked at was 2003. Given that there have been several newer iterations since then (2007, 2010, and 2013), I wonder if any of those issues have since been addressed.

    Gabe Farkas


    -------------------------------------------
    Gabriel Farkas
    -------------------------------------------








  • 17.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-22-2013 10:23
    Stephen Simon wrote:

    > My topic would be "Errors and Negligence Handling Data." ... My lecture would be more about sloppy research practices than fraudulent research practices.... I am particularly interested in
    > 1. colorful anecdotes about errors or negligence.

    I have plenty of anecdotes about the more mundane applications of statistical data handling, which might or might not be considered "research," and so will offer just two as examples.

    A consultant for an industrial facility (for GTE, a large well-known US manufacturer) set up a computer spreadsheet in which quarterly environmental monitoring data could be entered.  This spreadsheet conducted statistical analyses (according to regulatory guidelines) and included tests to determine whether the facility remained in compliance with its pollution discharge limits.  It was duly turned over to the plant manager, who delegated a receptionist to enter the data as they came in.  She would print it out, he would sign it--attesting that it was true, complete, and accurate under penalty of law--, and submit it to the US EPA.

    I was contacted when a new environmental consultant became suspicious because he had seen data that suggested the pollution permit was being violated but all the quarterly reports said things were fine.  After doing some spreadsheet forensics for the previous five years of reports, I discovered that someone had inadvertently corrupted some of the calculation cells (which were not protected or hidden) early on, most likely from a stray keystroke during data entry.  Because the spreadsheet for each quarter was created by copying the previous quarter's file, the corruption was propagated to all subsequent analyses.  The plant had to correct and re-file years of reports, as well as take immediate action to respond to its violations.  I do not know whether it was otherwise penalized.

    (Although spreadsheets are implicated in the majority of sloppy-data-handling anecdotes, that is partly because they are the only tool most people used to use to store and manage their data.  The modern tendency to use database software hasn't eliminated sloppiness, however: it just papers it over with a veneer of sophistication.)

    Some stories I can tell may straddle the line between sloppiness (or negligence) and willful perversion of the data.  These tend to come from legal cases.  Many result from the ploys used by lawyers to make data analysis difficult for the opposing party.  However, in one situation concerning a $4.5 billion claim for damages, Plaintiff's experts literally made up the data on which their entire case hinged (asserting they were performing a kind of imputation of groundwater contamination data).  Although those "experts" (it was actually a Master's student in geostatistics who did the work) were backed up by a well known and respected statistician, his testimony could not rescue them.  Within weeks of this finding, and partly on its basis, the court issued a summary judgment releasing the Defendant (the US Department of Justice) from liability in this case.  (This is mentioned in The Rise of Natural Resource Damage Claims..., p. 8: "the State had failed to prove the existence of any recoverable damages associated with the groundwater contamination. ... The court also noted that there was no evidence of deep contamination...")


    >2. written guidelines on good practices (e.g., CDISC, Reproducible Research)
    >3. red flags for things that warrant special attention (e.g., missing values, dates on both sides of Y2K)

    For another anecdote, a direct response to (3), and some suggestions concerning (2), please see my posts at "Essential Data Checking Tests" and "QA/QC Guidelines for a Database" on the StackExchange statistics site, http://stats.stackexchange.com/.

    -------------------------------------------
    William Huber
    Quantitative Decisions
    -------------------------------------------



  • 18.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-22-2013 11:01
    Stephen:

    What a great discussion you have started.  It just goes to prove that no matter how hard one works to make any system 'mistake proof', there is always someone out there who can beat your system in ways you simply would not imagine.

    I was called in to review a funded project (big dollars) where the preliminary data was not showing anything near what was expected.  At the presentation the data experts (who turned out to be graduate students in their first 'statistics' course, hired by the PI, who turned out to be "too busy to handle this simple data") presented their findings using Excel bar graphs.  As I tend to be a data person, I naturally opened one of the data sets for one of the bar graphs and found one of their basic problems.  The 'experts' had entered their data in another database, exported it to Excel, and then run their initial analysis and graphs on the Excel sheet (the underlying problems with this will be another conversation).  However, in their conversion, the data export had changed all of the blanks in the database to zeros on the Excel sheet.  When I first brought up this problem the comment was "How did you get access to this data?  It's proprietary."  I said, how about clicking on the bar graphs?  Then one of the experts stated that "Since we specified in the original data set that these were blanks then Excel would know that they are blanks.  And besides, Excel ignores zeros."  I can't imagine what the consequences would have been if the initial findings were close to what was expected.  The anomaly in the results would likely have been ignored and a lot of money spent on nonsense data whose flaws might never have been discovered.

    Bill

    -------------------------------------------
    William Grant
    Professor, Emergency Medicine
    SUNY Upstate Medical University
    -------------------------------------------








  • 19.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-22-2013 11:17
    Yes that thing with blanks and zeros is a disaster with Excel. The basic issue is that MANY people, including data professionals, do not know that there is a difference between a zero (a number indicating no things) and no number (an indication that a measurement was not made). I work with a group that extracts data from a larger database as an "honest broker". I got into a HUGE fight with them because on Round 1, I got blanks, and on Round 2, I got 0s in all the blank spots. Not only was this wrong, but it totally screwed up my conversion routines.
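
    A short Python sketch of the difference, using invented cell values. The parsing helper is hypothetical; the only point is that a blank should become an explicit missing marker (None here) rather than 0, because the two give different summaries.

```python
from statistics import mean

# Cells as they might arrive from an export; "" means no measurement made.
raw_cells = ["4.0", "", "6.0", "", "5.0"]


def parse_cell(cell):
    """Return a float for a recorded value, None for a blank."""
    text = cell.strip()
    return None if text == "" else float(text)


parsed = [parse_cell(c) for c in raw_cells]
measured = [v for v in parsed if v is not None]

# What a blank-to-zero conversion effectively does to the same cells.
blanks_as_zero = [0.0 if v is None else v for v in parsed]

mean_measured = mean(measured)        # based on the 3 real measurements
mean_zeroed = mean(blanks_as_zero)    # dragged down by 2 fake zeros
n_missing = len(parsed) - len(measured)

print(f"n measured = {len(measured)}, n missing = {n_missing}")
print(f"mean of measurements:   {mean_measured}")
print(f"mean with blanks as 0:  {mean_zeroed}")
```

    Reporting the count of missing values alongside the mean makes a silent blank-to-zero conversion much easier to spot between data deliveries.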

    -------------------------------------------
    Paul Thompson
    Director, Methodology and Data Analysis Center
    Sanford Research/USD
    -------------------------------------------








  • 20.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-22-2013 11:43
    I received a note from my old friend Al Best indicating that my goo.gl link did not work. This is possibly a firewall issue. The unshortened link is:

    http://nationalethicscenter.org/content/article/175

    Thanks for letting me know, Al. If people have problems accessing this link, let me know.

    -------------------------------------------
    Paul Thompson
    Director, Methodology and Data Analysis Center
    Sanford Research/USD
    -------------------------------------------








  • 21.  RE:Help with lecture on "Errors and Negligence Handling Data"

    Posted 08-22-2013 11:00
    Stephen, under the category of "colorful anecdotes about errors or negligence," I respectfully submit the following tale of woe.

    Several years ago, I was asked to serve as the independent program evaluator for a county-wide public health program for a small but high-risk patient population. The program was grant funded, with a complex evaluation plan supplied by Ph.D. researchers from a major research university. I was the second evaluator on the program. I wasn't the last one. Here's why.

    The wizards who developed the evaluation plan required fairly sensitive performance measurement, with ongoing funding tied to statistically significant changes in certain health outcomes. Stuff like shifts in mean birthweight based on specific psychosocial interventions, when those interventions demonstrated trends in serial pre/post t-test results.

    Long story short ... responsibility for data collection rested with community health workers. These personnel are highly influential at "moving the needle" with the at-risk population we served, BUT they're typically from the same socioeconomic backgrounds as the client population. Few had anything beyond a high-school diploma. Many spoke English as a second language. They earned little more than minimum wage.

    ... and, the evaluation plan required that each of these workers carry around a floppy disk, each of which contained an SPSS data file that each worker had to manually update every week.

    As you might imagine, this situation wasn't really sustainable. With several hundred clients and nearly two dozen workers -- and one desktop PC with an antiquated version of SPSS loaded -- we ran into considerable difficulty in getting clean data. Workers lost disks (there weren't backups), they invented their own variables, they coded data incorrectly, etc.

    So every six months, when I had to collect all the floppies and stitch together a master SPSS data file, I had to then confront program leadership about anomalies and missing records.

    At one point, the program leadership and I got into a screaming match -- as in, literal screaming -- when I refused to run a collection of t tests whose net missing value rate exceeded 98 percent. They wanted me to certify in the evaluation narrative that certain outcomes obtained over a client population of nearly 200, when certain statistical tests mandated by the grant agreement had an n count of 5 or fewer.
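
    A check like the following Python sketch can make that refusal mechanical rather than personal. The thresholds (min_n, max_missing_rate) and the sample values are assumptions chosen for illustration, not anything from the grant agreement; appropriate limits depend on the analysis plan.

```python
def completeness(values):
    """Return (observed count, missing rate) where None marks a missing value."""
    n_total = len(values)
    n_observed = sum(v is not None for v in values)
    missing_rate = 1.0 if n_total == 0 else 1.0 - n_observed / n_total
    return n_observed, missing_rate


def fit_for_testing(values, min_n=30, max_missing_rate=0.20):
    """Gate a planned test on sample size and missingness (assumed limits)."""
    n_observed, missing_rate = completeness(values)
    return n_observed >= min_n and missing_rate <= max_missing_rate


# Invented example: 200 clients on the roster, 3 with a recorded outcome.
outcome = [3200.0, 2950.0, 3410.0] + [None] * 197

n_observed, missing_rate = completeness(outcome)
print(f"n observed = {n_observed}, missing = {missing_rate:.1%}")
print("run the test?", fit_for_testing(outcome))
```

    Printing the observed n next to every result also keeps a test on a handful of records from being described as an outcome for the whole client population.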

    It wasn't that anyone wanted to defraud the funder. The problem was that the university brain trust that developed the evaluation plan had no real-world experience with community health programs, and the program leadership had no understanding whatsoever about even basic statistical thinking. I think the program design was itself negligent, leading to systemic errors in reporting by my predecessor who didn't object to hinky methodology.

    My successor finally said: Ban those floppies. She got the program leadership to hire an abstractor. Problem solved. :)

    -------------------------------------------
    Jason Gillikin
    Medical Informatics Consultant
    Priority Health
    -------------------------------------------