ASA Connect

  • 1.  Splitting the Dataset for Model Building and Testing

    Posted 01-18-2024 08:53

    Hi,

    The way I learned it, when you build a model (e.g., regression), you split the data into two parts, one for training and one for testing. But I see that in the machine learning world, they distinguish between "test dataset" and "validation dataset", as in the following (taken from https://machinelearningmastery.com/difference-test-validation-datasets/):

    • Training Dataset: The sample of data used to fit the model.
    • Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
    • Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

    (some sources, including one on the same webpage, define "validation" and "test" the other way around).

    I can understand the idea behind this, but I think there is room for judgment in deciding whether to make the split 2-way or 3-way. For example, if the number of samples is small, a 2-way split could be more appropriate. In other cases, when there are a lot of "tuning parameters" and the number of samples is large, a 3-way split might be the way to go. I'd like to hear the opinion of others on this issue.
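    For concreteness, here is a minimal sketch of a 3-way split in Python with scikit-learn (my choice of tool for illustration; nothing above assumes any particular software). The test set is held out first, and a validation set is then carved out of the remainder:

        # Minimal 3-way split sketch (scikit-learn chosen for illustration).
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=1000, random_state=0)  # toy data

        # Hold out 20% of the data as the final test set.
        X_trainval, X_test, y_trainval, y_test = train_test_split(
            X, y, test_size=0.20, random_state=0)

        # Carve 25% of the remainder (20% overall) off as the validation set.
        X_train, X_val, y_train, y_val = train_test_split(
            X_trainval, y_trainval, test_size=0.25, random_state=0)

        print(len(X_train), len(X_val), len(X_test))  # 600 200 200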

    With thanks and regards,
    David



    ------------------------------
    David Zucker
    Department of Statistics and Data Science
    Hebrew University of Jerusalem
    ------------------------------


  • 2.  RE: Splitting the Dataset for Model Building and Testing

    Posted 01-19-2024 11:43

    I gave a talk last week to the Detroit ASA section on "Is Reproducibility Even a Possibility?". I discussed what happens when you use different random seeds to build your models. I chose to look at logistic regression and decision trees. Some terms come out as "significant" or "important" in every model you build, no matter the random seed. Other terms pop up as significant or important for only one or a few of those seeds. So trying to tune hyperparameters, use "deep learning", etc., is pretty useless, because each model is more of an opinion than a precise account of what the data tell us. Since each model finds different terms important, you'll "tune" or "deep learn" some "right" terms and some "wrong" terms.

    The error rates of the models built with different random seeds are usually about the same. But the opinions each model gives about what is truly important change with each seed.

    Forgo the parameter tuning. Use an ensemble of results from whatever type of model you use. Just make sure you change the random seed each time.   
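    Here is a minimal sketch of the idea (Python/scikit-learn on synthetic data, both of which are stand-ins; the talk used its own models and data): fit the same decision tree under many random seeds, then look at which features keep showing up as important and how stable the error rate is.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        # Synthetic data: 3 truly informative features out of 10.
        X, y = make_classification(n_samples=500, n_features=10,
                                   n_informative=3, random_state=42)

        importances, error_rates = [], []
        for seed in range(30):  # one model per random seed
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.3, random_state=seed)
            tree = DecisionTreeClassifier(max_depth=4, random_state=seed)
            tree.fit(X_tr, y_tr)
            importances.append(tree.feature_importances_)
            error_rates.append(1.0 - tree.score(X_te, y_te))

        imp = np.array(importances)
        print("error rate: mean %.3f, sd %.3f"
              % (np.mean(error_rates), np.std(error_rates)))
        # Stable features: high mean importance, low spread across seeds.
        # "Opinion" features: importance that swings wildly with the seed.
        for j in range(imp.shape[1]):
            print("feature %2d: mean %.3f, sd %.3f"
                  % (j, imp[:, j].mean(), imp[:, j].std()))

    Averaging the 30 importance vectors is the ensemble summary I mean; the per-seed spread shows how much of any single model's story is just the seed.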

    https://www.youtube.com/watch?v=sYPvCE_au4Q



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------

    Attachment(s)

    D-ASA Meeting 012024.pptx (pptx, 174 KB)