ASA Connect

  • 1.  Automate Data Cleaning using Machine Learning or Artificial Intelligence

    Posted 01-18-2018 09:27
    Hello

    Can anyone share any information on the use of AI or ML in data cleaning? To clarify, I am referring to the cleaning of data (missing values, outliers, nonsensical values, etc.) prior to model development. The model itself can be anything: statistical (say, a regression-based model) or mathematical.

    Not sure if the question posed above is too generic but any inputs will be much appreciated!!! Have a great day!!

    Thanks, Sayan


  • 2.  RE: Automate Data Cleaning using Machine Learning or Artificial Intelligence

    Posted 01-19-2018 08:17
    Hi Sayan,

    That question makes perfect sense to me!

    One obvious example is biomedical data, where you have several inferential steps on the "rawer" forms of the data prior to using anything for clinical inference.

    If you imagine a vital sign time series, for example from a bedside monitor, the data display is rife with artefacts. You may wish to remove these artefacts before inference on clinical condition. On the simplest end of the spectrum, your "ML" algorithm could just be a running mean/median which smooths out vital sign artefacts. On the more complex end of the spectrum, it could be a Gaussian process regressor (or choose your favorite ML classifier) that compares the variance/covariance of each measurement to its neighbours in the time series (and across multiple features) to detect novelty or unusual dynamics.
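
    As a minimal sketch of the simple end of that spectrum (an illustration only, not the method from any of the papers linked in this thread): a centered running median, with any point that sits far from it flagged as a possible artefact. The function names, the window of 5 samples, the threshold of 15 bpm, and the heart-rate values are all made-up assumptions.

```python
from statistics import median


def rolling_median(values, window=5):
    """Centered running median; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed


def flag_artefacts(values, window=5, threshold=15.0):
    """Flag points further than `threshold` from the running median.

    `threshold` is a hypothetical tuning value, not a clinical standard.
    """
    smoothed = rolling_median(values, window)
    return [abs(v - s) > threshold for v, s in zip(values, smoothed)]


# Hypothetical heart-rate samples (beats per minute) with two artefacts.
heart_rate = [72, 74, 73, 190, 75, 74, 76, 0, 75, 73]
flags = flag_artefacts(heart_rate)

# Replace flagged samples with None so a later step can treat them as missing.
cleaned = [None if bad else v for v, bad in zip(heart_rate, flags)]
print(cleaned)
```

    The median is used rather than the mean because a single spike barely moves it, so the spike stands out against the smoothed series instead of dragging the baseline along with it.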

    So in this example, you've done quite a bit of ML prior to the primary clinical inference. On top of that, you could also examine the ML techniques used before (e.g., in the signal processing to derive vital sign estimates from the raw waveforms) and after (to detect conditions like atrial fibrillation, hypoxia, general deterioration, etc.).

    Here are some links to continue your search. A Google search for "novelty detection" is also a good idea.

    GPs for artefact identification:
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.352.8305&rep=rep1&type=pdf

    Plenty of ML here to deal with artefact and noise:
    https://www.ncbi.nlm.nih.gov/pubmed/25570111
    https://physionet.org/challenge/

    A low computation example for embedded devices:
    http://www.robots.ox.ac.uk/~sjrob/Pubs/embc2017.pdf

    A great summary article on novelty detection (which may or may not be used for artefacts):
    http://www.robots.ox.ac.uk/~davidc/pubs/NDreview2014.pdf

    Hope this helps!

    ------------------------------
    Glen Wright Colopy
    DPhil Student
    University of Oxford
    ------------------------------



  • 3.  RE: Automate Data Cleaning using Machine Learning or Artificial Intelligence

    Posted 01-21-2018 09:11
    Hi Glen

    Many thanks for the response!!! The information and the links you have shared look very interesting, although I confess I haven’t really had a chance to reply properly (I really appreciate your taking the time for this!!) or go through the materials in detail. Once I do that, I will probably have follow-up queries. Have a great weekend!!! Cheers, Sayan




  • 4.  RE: Automate Data Cleaning using Machine Learning or Artificial Intelligence

    Posted 01-21-2018 10:12
    Hey Sayan,

    No problem!

    For what it's worth, there are also plenty of examples of ML being used to filter/smooth/interpolate missing data so that it can be passed to a more "traditional" statistical analysis.

    For example, consider a system where the front-end "must" use an AR model (for some reason, e.g., legacy software design). The AR model requires data at consistent time points going back and therefore must find a way to handle missingness. The researchers then use their favorite predictive interpolator on the back-end to fill in those missing values. So the development of that interpolator can go through all the rigor** of traditional ML training/testing/validation. But ultimately, the ML was just one step in a process that creates the data. The final use of that data may or may not use ML techniques at all.
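
    A rough sketch of that two-stage idea, under loud assumptions: plain linear interpolation stands in for the "favorite predictive interpolator" (a real one would be trained and validated), and the AR model is a bare AR(1) with no intercept, fitted by least squares. The function names and the data are hypothetical.

```python
def interpolate_gaps(series):
    """Fill interior None values by linear interpolation.

    Stand-in for a learned interpolator. Leading or trailing gaps are
    left as None because there is nothing to interpolate between.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise."""
    numerator = sum(cur * prev for cur, prev in zip(series[1:], series[:-1]))
    denominator = sum(prev * prev for prev in series[:-1])
    return numerator / denominator


# Hypothetical regularly sampled series with two missing time points.
raw = [8.0, None, 2.0, 1.0, None, 0.25]

filled = interpolate_gaps(raw)   # stage 1: produce a complete series
phi = fit_ar1(filled)            # stage 2: the "traditional" AR fit
print(filled, round(phi, 3))
```

    The AR fit never sees how the gaps were filled, which is the point: the interpolator can be swapped for something fancier without touching the final analysis.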

    I've seen examples where the final method was something as familiar as a Cox PH regression, and others where the cleaned data were fed into LSTM neural nets. So the idea is that using some fancy predictive techniques in one stage really doesn't force people to use anything fancy for their final analysis.

    Hope that helps,
    Glen


    ** or sometimes lack thereof!

    ------------------------------
    Glen Wright Colopy
    DPhil Student
    University of Oxford
    ------------------------------



  • 5.  RE: Automate Data Cleaning using Machine Learning or Artificial Intelligence

    Posted 01-22-2018 11:23

    Google for "anomaly detection" and you may find lots of interesting material. Modeling what is "normal", and chasing down "abnormal" points that the model spits out, is what works for me. 

     

    Patterns within the anomalies that are associated with recording errors or extraneous sources may be handled by filters. Otherwise, the process generating the data might have quality control issues, or the "normal" state may not be completely understood.
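
    One small way to sketch "model what is normal, then chase the abnormal points" (an assumption-laden illustration, not a description of any particular workflow): treat the median and the median absolute deviation (MAD) as the model of "normal" and flag points with a large modified z-score. The 0.6745 scaling and the 3.5 cut-off are a commonly used convention; the function names and the readings are made up.

```python
from statistics import median


def modified_z_scores(values):
    """Robust z-scores based on the median and MAD of the sample."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; this simple model cannot score the data")
    return [0.6745 * (v - center) / mad for v in values]


def find_anomalies(values, cutoff=3.5):
    """Return indices of points whose |score| exceeds `cutoff`.

    `cutoff` is a conventional default and should be tuned per data set.
    """
    scores = modified_z_scores(values)
    return [i for i, s in enumerate(scores) if abs(s) > cutoff]


# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.0, 10.3, 9.9, 25.0, 10.2, 10.0]
suspects = find_anomalies(readings)
print(suspects)
```

    The flagged indices are only candidates: each one still needs chasing down to decide whether it is a recording error to filter out or a sign that "normal" was modelled too narrowly.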






  • 6.  RE: Automate Data Cleaning using Machine Learning or Artificial Intelligence

    Posted 01-22-2018 12:42
    Thanks, Eric!!!
