
  • 1.  Data Cleaning Tasks

    Posted 03-28-2013 11:15
    This message has been cross-posted to the following eGroups: Statistical Consulting Section and Statistical Programmers and Analysts Section.
    -------------------------------------------
    Hello,

    We have a custom MS Access-based data-entry application that has some detected and presumably undetected errors.  I'm interested in preparing this dataset to be shared outside our organization.

    I was wondering what specific "data-cleaning" tasks you'd employ once you've exported the double-entered dataset from the application?  

    Also, I'm curious where you would draw the line between my organization's responsibilities in providing the dataset and the responsibilities that belong to the researchers who use it.

    -------------------------------------------
    Emmeline Sangeorzan
    Biostatistician
    Arthritis Research Institute of America
    Clearwater, FL
    -------------------------------------------


  • 2.  RE:Data Cleaning Tasks

    Posted 03-28-2013 13:53
    Hi Emmeline,

    Interesting question.  I've seen many statisticians handle this differently.  I worked in big pharma for quite a while, and I always tried to stay close to the data management and data entry folks.  I used to tell them that they do a wonderful job of cleaning the database, but we statisticians look at the data in a different way.  My explanation is to think of the data as a forest.  The data management folks often do a great job of pruning each individual tree.  But when they are done, we statisticians have the opportunity to stand on a hill, overlook the whole forest, and spot outliers or values that look different.

    So what do or should you do to help clean the data?

    Some companies have SOPs that address this, but here is what I have found to be helpful.  It can change from study to study, depending on the length and size of the study: some checks work better for smaller studies and others for larger ones.

    1. Look at the listings.
        Look for gaps in the data.
        Look for strange findings.
    2. I like to use PROC UNIVARIATE on all the numeric variables, with the PLOT option on, to check for outliers or strange values.
    3. Plots of one variable against another, at least for the important variables.
    4. Frequency tables for categorical variables.  See if the values all make sense.
    5. Finally - I like to send all the data through the planned analyses prior to unblinding to see if anything kills the analyses.  I would hate to see this happen after datalock.
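
    For readers without SAS, steps 2 and 4 above can be roughly sketched in plain Python. This is only an illustration, not a reproduction of PROC UNIVARIATE output: the 1.5 x IQR fence is an assumed screening rule, and the function names and example data are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside Tukey's fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def frequency_table(values):
    """Count each distinct category, most common first."""
    return Counter(values).most_common()


# Hypothetical example data, invented for illustration only.
weights_kg = [61, 64, 66, 68, 70, 71, 73, 75, 78, 700]
sex = ["F", "M", "F", "f", "M", "F", "M", "M", "F", "X"]

suspect_weights = iqr_outliers(weights_kg)  # values worth a second look
sex_counts = frequency_table(sex)           # reveals stray codes such as "f" and "X"
```

    A flagged value is only a prompt to go back to the source documents; whether it is an error is a judgment call for someone who knows the study.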

    These are just some ideas on how a statistician can help in the data cleaning.

    Now your last statement is interesting.  My feeling is that your organization is only responsible for providing the data as it was intended to be used in the protocol and SAP.  Nothing more.

    I hope this helps,

    -------------------------------------------
    Rocco Brunelle
    Senior Statistician
    Bowsher Brunelle Smith LLC
    -------------------------------------------


  • 3.  RE:Data Cleaning Tasks

    Posted 03-28-2013 17:24
    Of course a lot depends on the nature of the data and the nature of the audience to whom you are communicating the data.

    If you have access to SPSS, I would suggest starting with it.  SPSS files contain more extensive metadata than most other stat package formats.  It is then easy to save the data in other formats, although some will lose some of the metadata.

    If you do not have SPSS, the notes below should suggest what you need to put in the documentation and what to do to clean the data.

    Below are some notes I made for clients a few years ago.

    When you have all of the data in the data view, make sure that the variable view is communicative to your audience.  Use meaningful variable names.  Provide meaningful labels for each variable.  Define display formats that improve readability, e.g., no excess decimal places, date formats for dates, currency formats for money, etc.  Define which values indicate missing data.  Provide labels for the different missing value codes.  For variables where they apply, provide labels for values.  Fill in the level of measurement.

    Have a cold reader look over your dictionary.

    Use the menus to write syntax to find duplicate cases.  Even if the data is to be anonymized, be sure that each case has a unique ID so that users can report problems with the data.
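
    Outside SPSS, the same two checks (repeated IDs, and cases that are identical apart from the ID) might look like the minimal Python sketch below. The field name `case_id` and the example records are hypothetical, and records are assumed to be plain dictionaries with hashable values.

```python
from collections import Counter


def duplicate_ids(records, id_field="case_id"):
    """Return IDs that appear on more than one record."""
    counts = Counter(r[id_field] for r in records)
    return sorted(i for i, n in counts.items() if n > 1)


def exact_duplicates(records, id_field="case_id"):
    """Return groups of IDs whose records are identical apart from the ID."""
    groups = {}
    for r in records:
        key = tuple(sorted((k, v) for k, v in r.items() if k != id_field))
        groups.setdefault(key, []).append(r[id_field])
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical example records, invented for illustration only.
records = [
    {"case_id": 1, "age": 34, "sex": "F"},
    {"case_id": 2, "age": 51, "sex": "M"},
    {"case_id": 2, "age": 47, "sex": "F"},
    {"case_id": 3, "age": 34, "sex": "F"},
]

reused_ids = duplicate_ids(records)        # IDs assigned to more than one case
repeated_cases = exact_duplicates(records) # cases possibly entered twice
```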

    Use menus to write syntax to find unusual cases.

    If you have string variables that should have limited numbers of values, but the data was entered by people, use AUTORECODE to find variations in capitalization, misspellings, and spacing.
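
    If SPSS is not available, a similar review of string values can be approximated in Python. The sketch below is not what AUTORECODE does internally; it simply groups raw values that differ only in case or spacing, and uses the standard library's `difflib` to suggest possible misspellings. The 0.8 similarity cutoff and the example values are assumptions.

```python
from collections import defaultdict
import difflib


def normalize(text):
    """Collapse runs of whitespace, trim, and ignore case."""
    return " ".join(text.split()).casefold()


def variant_groups(values):
    """Group raw values that differ only in capitalization or spacing."""
    groups = defaultdict(set)
    for v in values:
        groups[normalize(v)].add(v)
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}


def near_matches(values, cutoff=0.8):
    """Pair normalized values that look alike (possible misspellings)."""
    distinct = sorted({normalize(v) for v in values})
    pairs = []
    for i, v in enumerate(distinct):
        for m in difflib.get_close_matches(v, distinct[i + 1:], n=5, cutoff=cutoff):
            pairs.append((v, m))
    return pairs


# Hypothetical example values, invented for illustration only.
sites = ["Clearwater", "clearwater", "Clearwater ", "Clearwatr", "Tampa", "Tampa"]

case_and_spacing = variant_groups(sites)
possible_typos = near_matches(sites)
```

    The suggested pairs still need a human decision: two similar strings may be legitimately different categories.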

    Use menus to write syntax to create data validation rules.


    Write syntax for validation checks, e.g., whether attitude items are all given the same answer.
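
    As an illustration of the "same answer to every item" check, here is a minimal Python sketch. The item names, the case IDs, and the choice to skip cases with any missing item are assumptions made for the example.

```python
def straight_liners(responses, item_fields):
    """Return IDs of cases that gave the identical answer to every item.

    Cases with any missing item (None or absent) are not flagged.
    """
    flagged = []
    for case_id, answers in responses.items():
        given = [answers[f] for f in item_fields if answers.get(f) is not None]
        if len(given) == len(item_fields) and len(set(given)) == 1:
            flagged.append(case_id)
    return flagged


# Hypothetical example responses on a 1-5 scale, invented for illustration only.
responses = {
    101: {"q1": 3, "q2": 3, "q3": 3, "q4": 3},
    102: {"q1": 1, "q2": 4, "q3": 2, "q4": 5},
    103: {"q1": 5, "q2": 5, "q3": None, "q4": 5},
}

flagged_cases = straight_liners(responses, ["q1", "q2", "q3", "q4"])
```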


    Run frequencies and descriptives to make sure that values are legitimate, e.g., that there are no unlabelled or illegitimate codes; that heights, weights, IQs, etc. are reasonable; that there are no unusual gaps or heaps in the distributions, etc.


    Run crosstabs, correlations, and visualizations such as parallel coordinate plots, rotatable 3D scatterplots, etc. to look for cases with suspicious values given the values of other variables: 72-inch-tall 9-year-olds, pregnant males, etc.
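
    Once a suspicious combination has been spotted, it can be written down as an explicit rule and rerun on every new export. The Python sketch below shows one way to do that; the field names and the 60-inch threshold for children under 10 are hypothetical choices for the example, not recommended cut-offs.

```python
# Each rule returns True when a record looks implausible.
# Field names and thresholds are assumptions for illustration.
RULES = {
    "pregnant male": lambda r: r["sex"] == "M" and r["pregnant"],
    "child under 10 taller than 60 in": (
        lambda r: r["age_years"] < 10 and r["height_in"] > 60
    ),
}


def rule_violations(records, rules=RULES):
    """Return (case_id, rule name) for every rule a record breaks."""
    return [
        (r["case_id"], name)
        for r in records
        for name, is_broken in rules.items()
        if is_broken(r)
    ]


# Hypothetical example records, invented for illustration only.
records = [
    {"case_id": 1, "sex": "F", "pregnant": True, "age_years": 29, "height_in": 65},
    {"case_id": 2, "sex": "M", "pregnant": True, "age_years": 41, "height_in": 70},
    {"case_id": 3, "sex": "M", "pregnant": False, "age_years": 9, "height_in": 72},
]

violations = rule_violations(records)
```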
     
    If you have items for summative scales, use the RELIABILITY procedure to check on the scoring keys.
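
    One statistic that helps with this kind of check is the corrected item-total correlation: each item is correlated with the sum of the remaining items, and a negative value suggests the item may be keyed in the wrong direction. The Python sketch below computes it by hand as an illustration only; it does not reproduce the full RELIABILITY output, and the example scores are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(rows):
    """Correlate each item with the sum of all the other items."""
    n_items = len(rows[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in rows]
        rest = [sum(row) - row[j] for row in rows]
        result.append(pearson(item, rest))
    return result


# Hypothetical 1-5 scale scores (one row per respondent), invented for
# illustration; the last item is deliberately reverse-keyed.
rows = [
    (1, 2, 1, 5),
    (2, 2, 3, 4),
    (3, 3, 3, 3),
    (4, 5, 4, 2),
    (5, 4, 5, 1),
]

correlations = item_rest_correlations(rows)
possibly_reversed = [j for j, r in enumerate(correlations) if r < 0]
```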

     

     YMMV, but in my experience,  the vast majority of suspicious values are due to data entry.


    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------


  • 4.  RE:Data Cleaning Tasks

    Posted 03-30-2013 07:46
    Specific data cleaning tasks really depend on the study. The organization is responsible for providing the data as it was intended to be used, unless otherwise specified in the contract.
    But basically, you have to:
    -Write a data quality evaluation report to release with the data (the report generally includes the survey methodology, sampling, measures, possible sources of error, treatments applied to the data, and methods used);
    -Identify missing and inconsistent data;
    -Identify outliers for almost every variable of interest;
    -Make sure the confidentiality of respondents is not violated before releasing the data.



    -------------------------------------------
    Judith Kom Nguiffo
    Independent Consultant
    -------------------------------------------