ASA Connect

  • 1.  Data checking and review opinions requested

    Posted 04-15-2020 10:30
    Hi, all:

    I suspect that when we are given a data set by a client, we all start by checking it logically. We look for out-of-range values (like a 7 on a 1-5 Likert scale), very unlikely values (a 5-inch-tall individual), and impossible combinations (like a 3-year smoking duration in a never-smoker).
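    For concreteness, a minimal pandas sketch of those three checks (file and column names are hypothetical):

        import pandas as pd

        df = pd.read_csv("client_data.csv")  # hypothetical file and columns

        # Out-of-range: Likert items should be integers 1-5
        bad_likert = df[~df["likert1"].isin([1, 2, 3, 4, 5])]

        # Very unlikely: adult heights outside a plausible window (inches)
        odd_height = df[(df["height_in"] < 48) | (df["height_in"] > 90)]

        # Impossible combination: never-smokers reporting a smoking duration
        impossible = df[(df["ever_smoker"] == "No") & (df["smoke_years"] > 0)]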

    Question:  How many errors of that sort (or what percentage of the testable parts) would lead you to ask your client to check the entire data set top to bottom?  Do you have a rule of thumb?

    Of course, the checking we are doing is NOT random and does not directly indicate an error rate, so this is more challenging to discuss than if I knew there were 3% errors.

    My clients usually tell me they did a careful check!

    Would you require a spot check of random results if you found *any* errors?

    Ed

    ------------------------------
    Edward Gracely
    Drexel University
    ------------------------------


  • 2.  RE: Data checking and review opinions requested

    Posted 04-15-2020 11:30
    Edited by Glen Colopy 04-15-2020 11:30
    Hey Ed,

    Typically, if I can identify a pattern in the errors for a particular variable, I apprise the "data giver" of the error's nature, its frequency, and any confounding issues with other variables (that I just happened to come across in the exploratory phase). I do this regardless of frequency.
    I always inform them of the error, since it might be indicative of a larger collection problem that they, as closer domain/dataset experts, would pick up on but that I am unable to appreciate.

    To answer your question about the cut-off:
    I request a "redo" or correction if the errors are sufficiently frequent that I'd need to create separate models or analyses: one model for the data with correct values, and another for the data with errors.

    Obviously this request is not a problem if there's no burden on the client to execute the correction.
    If the burden is large, then such a request indicates that the data analyst does not appreciate the data generating/collection process.

    A caveat example
    So on that note, I'll give one example where a huge number of data points need correction but the burden is low: SQL errors. If your client misunderstands how they want their data to come together and how different features inter-relate, then even a very competent database manager or SQL user can get their query wrong. Someone could be the world's best "SQL joiner," but if they misconceptualize the data, then the data can come out wrong even though the SQL command executes exactly as they think it does.

    Sometimes this can manifest as missing data. Other times it can manifest as incorrect data and wrong feature values, as in the sketch below.
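    To make that concrete, here is a small pandas sketch (made-up tables; the same failure mode occurs in raw SQL). The join runs without error, but a misunderstood key relationship silently duplicates and drops rows:

        import pandas as pd

        # One row per patient vs. one row per visit; the analyst
        # believes "id" is unique in visits - it is not
        patients = pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 47]})
        visits = pd.DataFrame({"id": [1, 1, 2], "lab": [0.9, 1.1, 0.7]})

        merged = patients.merge(visits, on="id")  # executes "correctly"
        print(len(merged))  # 3 rows: patient 1 duplicated, patient 3 gone

        # how="left" keeps patient 3 with a missing lab value instead -
        # the "missing data" flavor of the same misconception
        print(len(patients.merge(visits, on="id", how="left")))  # 4 rows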


    ------------------------------
    Glen Wright Colopy
    DPhil Oxon
    Data Scientist at Cenduit LLC, Durham, NC
    ------------------------------



  • 3.  RE: Data checking and review opinions requested

    Posted 04-16-2020 10:44
    Hi all,

    Our group specializes in submissions to FDA CDRH for diagnostic devices, so I require "good stuff in." This way, and through internal QC, I can stand behind every number in every document that I approve for submission to FDA.

    What do we look for?
    Our data quality review looks at unexpected missing data (as mentioned by Glen), values in allowable ranges (like Ed), and logical flow of data within each participant - every variable, every participant, because our results are going to FDA.
    For example:
    • We expect everyone to have an AGE in years; for a particular study we might query ages < 40 or > 80.
    • We expect to have RACE for every participant; we expect Other: Specify to be present when RACE = Other and missing when RACE is not Other.
    Our focus is mainly on variables that will be used in analyzing study endpoints, but we look at things like AGE and RACE for the tables of demographic and clinical characteristics that are required in all of our studies. (Both example checks are sketched below.)
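    As a minimal sketch (the file and variable names here are hypothetical, not our actual specifications), those two checks might look like:

        import pandas as pd

        df = pd.read_csv("study_data.csv")  # hypothetical file and variables

        # AGE expected for everyone; query values outside the study's range
        age_queries = df[df["AGE"].isna() | (df["AGE"] < 40) | (df["AGE"] > 80)]

        # Other: Specify should be present exactly when RACE = Other
        missing_specify = df[(df["RACE"] == "Other") & df["RACE_SPECIFY"].isna()]
        stray_specify = df[(df["RACE"] != "Other") & df["RACE_SPECIFY"].notna()]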

    When do we go back to the client?
    Like Glen, we have zero tolerance: every query goes to the client, even if there are only a few. Based on the timing of data collection and data transfer, we may be able to alert clients to issues in time to correct them going forward; also, we need to stand behind the results we provide to FDA. Clients must then respond either confirming that the data match source documentation (Yes, that person was 82 years old) or with updated values (Oops, Other: Specify was self-reported as Martian and we dropped it because we are all Earthlings; we will put it back into the database for you).

    If there is a group of related errors (as mentioned in Glen's caveat example) that do actually reflect source (Well, we didn't collect Other: Specify because it won't be a row in the table) we make an executive decision (Note this in a table footnote). Another example is date of informed consent, which can be off by a year around December and January - if procedure dates indicate this, we can make an executive decision to correct them in our analysis datasets (and comment on this in our programs for making analysis datasets).

    When do we require an updated data transfer?
    This is based on importance of the variables with errors to the analysis, likelihood of receiving corrected data (corrected years of consent very unlikely; corrected errors in variables needed to evaluate endpoints required), project timelines, and experience with similar types of data and/or this client's earlier studies.

    Alicia

    ------------------------------
    Alicia Toledano
    President
    Biostatistics Consulting, LLC
    ------------------------------



  • 4.  RE: Data checking and review opinions requested

    Posted 04-16-2020 11:40
    Thanks to Alicia and Glen for your comments. But you are both focusing on a slightly different question from the one I am asking (although I can glean a bit of what I want from what you wrote).

    Let me rephrase and offer an example.

    First of all, I also have zero tolerance for errors and impossible values. The client must either provide me the corrected value or admit that it cannot be corrected. An example of the latter would be an anonymous data collection form on which the subject checked "No" for "Ever smoking" but then gave a duration. This requires a decision, which *might* be to defer to the duration, or might be to declare both missing. But none of these is left alone. Or a subject who wrote in "150" under age: that becomes missing data if there is no way to determine the correct value.

    So, an example: 200 subjects, each with 5 Likert scales. Of the 1,000 observations, 4 are out of range (6, 9, 7, and 0 on a 1-5 scale). Obviously I either fix those or remove them. But what does this say about the rest of the values? These are only the errors I can see! If people are typing in impossible values and not noticing them, how many 5's are really 2's? How many 1's are really, well, 2's, 3's, 4's, or 5's? Would you ask for a recheck of the entire 1,000 observations if that could be done easily (say, from readily available paper forms)? Would you ask for a spot check of 200 values, to see how many errors there were, with a complete recheck if more than (???)% were found? Any at all? What rule do you apply?
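    One way to put numbers on the spot-check option (a sketch, not a recommendation): treat the spot check as a binomial sample and compute an upper confidence bound on the error rate it could still be hiding.

        from scipy.stats import beta

        def upper_bound(errors_found, n_checked, conf=0.95):
            # Exact (Clopper-Pearson) one-sided upper bound on the error rate
            return beta.ppf(conf, errors_found + 1, n_checked - errors_found)

        print(upper_bound(0, 200))  # ~0.015: the "rule of three" (about 3/n)
        print(upper_bound(2, 200))  # ~0.031

    So even a perfectly clean spot check of 200 values still leaves room for an error rate of about 1.5%.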

    I realize that the rule will depend on non-data-driven features, like how critical some percentage of errors would be in the data. In many Likert sets like this, some small but non-trivial number of respondents are probably answering carelessly and giving meaningless responses even if they are entered correctly. So does a 1% outright error rate make a difference? But what about ages in a study where age is a key predictor? Duration of smoking in a study of lung diseases?

    I also recognize, as Glen emphasized, that the difficulty of correction may matter. If the data are buried in hard-to-access clinical paper records where each value takes an hour to find, spot checking hundreds may not be possible.

    Thanks again!

    Ed

    ------------------------------
    Edward Gracely
    Drexel University
    ------------------------------



  • 5.  RE: Data checking and review opinions requested

    Posted 04-16-2020 19:17
    With all surveys, there are issues of what you asked, what you think you asked and what the survey taker thinks you asked. 

    For example, if you asked my mom if she smoked, she would tell you she didn't... at the time of the survey. If you asked her how long she smoked, she would tell you 25 years. So, I would not throw out "wrong" data like that just because it doesn't make sense under your interpretation of the question.

    During a survey I gave, I had a rating scale of 1 to 10. A 1 was "so horrible you vomit" (we had some 2's), all the way up to a 10, "the best thing you ever had." That 10 was interpreted as, "This sample was as good as the best thing I ever had." So, when a sample was better than the best thing, it got a higher rating.

    I ended up using each rater as a block. That helped eliminate rater-to-rater variability and differences in interpretation of the rating scale.
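    For what it's worth, a minimal sketch of the rater-as-block idea (made-up data; rater entered as a fixed block effect in statsmodels):

        import pandas as pd
        import statsmodels.formula.api as smf

        # Hypothetical: three raters each score the same two samples
        df = pd.DataFrame({
            "rater":  ["A", "A", "B", "B", "C", "C"],
            "sample": ["x", "y", "x", "y", "x", "y"],
            "rating": [7, 9, 3, 6, 6, 8],
        })

        # C(rater) absorbs each rater's overall level, so sample effects
        # are compared within raters rather than across their scale use
        fit = smf.ols("rating ~ C(sample) + C(rater)", data=df).fit()
        print(fit.params)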

    At another job, where I had no control over how questions were asked or how data was recorded, we had a similar issue to what you are describing. Our scale was 1 to 5 stars, 5 being best. We would get "I HATE EVERYTHING AT YOUR FACILITY!!!! - 5 stars" and "Thank you for making the day wonderful for me and my family. - 1 star". If you eliminated those "obviously wrong" answers from all the surveys we had, you might have only 5-6 valid responses per month per category, out of 100+ responses per category overall.

    Even with all those issues, we were still able to keep track of how well we were doing in the various categories. And, at least, I loved the results: they showed that the areas I worked in a lot that month had 1-2 point improvements in satisfaction scores.

    Some of the other surveys I worked with had even more annoyances than those. I ended up telling the PI, "Here are your results. Here are a lot of reasons why I think they are bogus. Here is what I would do to make it better next time. Sign here so I can get paid."

    One other thing that you touched on: "How valid are the values?" Well... I worked as a chemist for about 10 years. I was the one in the lab, running the instruments that produced results like concentrations of chemicals in solution. With every instrument I have ever used, and I do mean EVERY, there is an issue where the baseline increases as more samples are run.

    For example, on one of my instruments, the internal standard (IS) would increase in reported concentration from the first sample to the last. The IS had to be between 70% and 130%, and no corrections were made to modify the reported concentrations. So, if my instrument reported that the IS was 70%, 72%, 75%, 78%, 80%, ... 120%, 123%, 124%, 126%, 129%, in that order, they "passed" the QC check. The regression model for those values was something like: (Reported Conc) = 68% + (3%)*(run position).

    After discussing the issue I saw with the QC mangler... I mean manager, I was fired. I saw those same issues on another 16+ instruments I used. When I was in charge of QC, I made corrections. When I wasn't, I eventually quit, because I didn't want to be the one responsible for making it possible to poison people. And these are the labs companies turn to when they need to know if there are toxins (real toxins, like PCBs, PAHs, cyanide, lead, mercury, etc.) in a sample.
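    A quick sketch of the drift check described above (simulated recoveries following that fitted model, roughly 68% + 3% per run position):

        import numpy as np
        from scipy.stats import linregress

        rng = np.random.default_rng(0)
        run_position = np.arange(1, 21)
        # Simulated IS recoveries: individually in-window, trending upward
        is_recovery = 68 + 3 * run_position + rng.normal(0, 1.5, 20)

        drift = linregress(run_position, is_recovery)
        print(f"slope = {drift.slope:.2f}% per run, p = {drift.pvalue:.2g}")
        # Each value can pass the 70-130% window while the regression
        # exposes a systematic drift that pass/fail QC never sees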

    Based upon my experiences with chemical analysis equipment, and knowing the right questions to ask to get the truth out of chemists, I would think most of the values you get have a large, unknown, and uncared-about bias. Suppose there is a cutoff of, say, 1 ppb: above 1 ppb, the sample is toxic and remediation is needed; below 1 ppb, it is "safe." If a sample read, say, 0.9 ppb to 1 ppb, we might write to the company and tell them it was "close but safe." Some of the instruments I used would take a 0.80 ppb sample and report it as 1.03 ppb. We would then repeat that sample the next day and find it was 0.55 ppb (amazing, I know). That 0.55 ppb result would then be reported. A sample that should read 1.2 ppb comes out as 0.876 ppb and is reported as safe; since it was not above our 0.90 ppb threshold for a mild warning, nothing was reported. And sadly, there are many QC manglers at a lot of the companies like those I worked for.

    In fact, one pharmaceutical company local to me was so bad that the FDA shut them down twice! Their QC was that BAD!!!!! AND THEY COULDN'T TELL!!!! They fought and lost both times. Then they got bought out by another company with real QC managers. The company was then sold elsewhere and finally shut down permanently. The former QC mangler now has a job elsewhere, mangling QC for a different chemical company.


    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------



  • 6.  RE: Data checking and review opinions requested

    Posted 04-17-2020 07:54
    Thanks, Andrew:

    Don't think I've ever heard horror stories of bad data quite like yours.  5-6 valid out of 100?  Hm.

    Fortunately, I don't think any of my datasets are that bad!

    The question of what to do when you know MOST of the data is garbage is different from what to do when you think most of the data is good and want to minimize the errors in coding it.

    Ed

    ------------------------------
    Edward Gracely
    Drexel University
    ------------------------------



  • 7.  RE: Data checking and review opinions requested

    Posted 04-17-2020 13:01
    If you knew how the data was entered, it made sense. The manager of the facility would call, say, 40-50 people per month and ask them, "How was your trip to our facility? (On a scale of 1-5.) How was the locker room?" ... and so on.

    He would then put that info into a spreadsheet. It had headers like:

    Customer | Overall | Comment | Locker Room | M/F | Comment | Spa | Comment | ...


    When the manager "entered" the data, he would routinely choose the wrong cells in the spreadsheet. So, you might find that customer 17 used the men's locker room and 2 other areas of the gym; suppose this person gave the men's locker room a 2. Customer 18 used the women's locker room (gave it a 4) and, say, the spa. But there would be a comment next to the women's locker room rating that the urinals were disgusting.

    For purposes of NPS (Net Promoter Score) calculations, this type of sloppy record keeping was good enough. The comments didn't line up, but NPS scores don't use comments. So, 1 star with "Things were AWESOME!! Keep up the good work." vs. 5 stars with "There was feces smeared on the walls!" gives you the same NPS score as if they were lined up properly.
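    (For readers who haven't computed one: NPS is derived from the numeric scores alone, which is why the misaligned comments were harmless for it. A sketch with made-up ratings on the standard 0-10 NPS scale:)

        import pandas as pd

        scores = pd.Series([10, 9, 8, 6, 10, 3, 9, 7])  # made-up 0-10 ratings
        promoters = (scores >= 9).mean()
        detractors = (scores <= 6).mean()
        print(100 * (promoters - detractors))  # 25.0; comments never enter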

    Sometimes we would have, say, 40 Overall ratings and 10 to 20 ratings in each other column, making it look like 10 to 20 people did everything at the gym: spa, weight room, cardio area, men's locker room, women's locker room, day care, wall climbing, children's parties, pool, sauna, etc. Of course, nothing lines up. But we would still get our NPS scores, and they tended to be accurate. For purposes of digging deeper... well...

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------