SSPA Blog: Clean Data, how clean is clean?

By Steve Yao posted 09-21-2011 11:53

  

I have participated in several study closeout debrief meetings.  Data quality is one of the hot topics that generates a lot of discussion.  Everyone has a different perspective on how to clean the data, depending on which functional group you come from.

The data management group will ensure that critical fields are 100% clean and will spot-check the remaining fields.  Statisticians would like to see all data fields 100% clean.  Their argument is that if the data are not critical, why collect them?  As statistical programmers, we usually find and report data issues during the data manipulation process, regardless of whether they are in critical fields.
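To make the difference between the two approaches concrete, here is a minimal sketch in Python. The field names, the "missing value" rule, and the spot-check rate are all made up for illustration; a real study would take its critical fields and edit checks from its own data management plan.

```python
import random

# Hypothetical list of critical fields, for illustration only.
CRITICAL_FIELDS = {"subject_id", "visit_date"}


def is_missing(value):
    """Assumed edit check: None and blank strings count as a data issue."""
    return value is None or str(value).strip() == ""


def find_issues(records, spot_check_rate=0.1, seed=0):
    """Check every critical field; spot-check a random sample of the rest.

    spot_check_rate=1.0 checks every field (the statistician's preference);
    spot_check_rate=0.0 checks critical fields only.
    """
    rng = random.Random(seed)
    issues = []
    for row_num, record in enumerate(records):
        for field, value in record.items():
            critical = field in CRITICAL_FIELDS
            # Critical fields are always checked; others only when sampled.
            if not critical and rng.random() >= spot_check_rate:
                continue
            if is_missing(value):
                issues.append((row_num, field))
    return issues


# Made-up records: one blank non-critical field, one blank critical field.
records = [
    {"subject_id": "001", "visit_date": "2011-09-01", "comment": ""},
    {"subject_id": "", "visit_date": "2011-09-02", "comment": "ok"},
]

all_fields = find_issues(records, spot_check_rate=1.0)
critical_only = find_issues(records, spot_check_rate=0.0)
```

With a rate between 0 and 1, the blank non-critical field may or may not be caught, which is exactly the trade-off the two groups argue about.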

There is no right or wrong approach to ensuring data quality.  Over-checking the data is always better than under-checking, but the return on the time invested may be small.  So the question is: how clean is clean?  My simple answer is that when we run out of issues to query, the data must be clean enough.

Comments

09-27-2011 23:47

“How clean is clean?” is a great question. I've worked on studies where a 70-page health questionnaire was co-designed by five or more epidemiologists and other researchers, each with their own hypothesis to test. When doing data management, I suspected that data were not 100% clean. However, they were clean enough to move forward, as prioritized by the lead investigator, on a particular analysis leading to a particular publication. Once that analysis was underway, I began focusing on the next hypothesis and analysis and, therefore, the fields that needed to be clean for that analysis. By the time I got to the last priority (i.e., often, the junior researcher's hypothesis), there was little data management left to do, so I'd start analysis immediately. While the last researcher had the longest wait, the last researcher benefited by having the cleanest data and the most informed thinking when it came to analysis and interpretation of results. (I must admit that after 4 to 5 analyses stemming from the same data, I still can't say data were 100% clean....just clean enough.)