ASA Connect

  • 1.  Splitting the Dataset for Model Building and Testing

    Posted 01-18-2024 08:53

    Hi,

    The way I learned it, when you build a model (e.g., regression), you split the data into two parts, one for training and one for testing. But I see that in the machine learning world, they distinguish between "test dataset" and "validation dataset", as in the following (taken from https://machinelearningmastery.com/difference-test-validation-datasets/):

    • Training Dataset: The sample of data used to fit the model.
    • Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
    • Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

    (some sources, including one on the same webpage, define "validation" and "test" the other way around).

    I can understand the idea behind this, but I think there is room for judgment in deciding whether to make the split 2-way or 3-way. For example, if the number of samples is small, a 2-way split could be more appropriate. In other cases, when there are a lot of "tuning parameters" and the number of samples is large, a 3-way split might be the way to go. I'd like to hear the opinion of others on this issue.
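    For concreteness, here is a minimal sketch of a 3-way split in Python with scikit-learn (my choice of tool for illustration; nothing above assumes any particular software). The test set is held out first, and a validation set is then carved out of the remainder:

        # Minimal 3-way split sketch (scikit-learn chosen for illustration).
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=1000, random_state=0)  # toy data

        # Hold out 20% of the data as the final test set.
        X_trainval, X_test, y_trainval, y_test = train_test_split(
            X, y, test_size=0.20, random_state=0)

        # Carve 25% of the remainder (20% overall) off as the validation set.
        X_train, X_val, y_train, y_val = train_test_split(
            X_trainval, y_trainval, test_size=0.25, random_state=0)

        print(len(X_train), len(X_val), len(X_test))  # 600 200 200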

    With thanks and regards,
    David



    ------------------------------
    David Zucker
    Department of Statistics and Data Science
    Hebrew University of Jerusalem
    ------------------------------


  • 2.  RE: Splitting the Dataset for Model Building and Testing

    Posted 01-19-2024 11:43

    I gave a talk last week to the Detroit ASA section on "Is Reproducibility Even a Possibility?". I discussed what happens when you use different random seeds to build your models. I chose to look at logistic regression and decision trees. Some terms come out as "significant" or "important" in every model you build, no matter the random seed. Other terms pop up as significant or important for only one or a few of those seeds. So trying to tune hyperparameters, use "deep learning", etc., is pretty useless, because each model is more of an opinion than a precise account of what the data tell us. Since each model finds different terms important, you'll "tune" or "deep learn" some "right" terms and some "wrong" terms.

    The error rates of the models built with different random seeds are usually about the same. But the opinions each model gives about what is truly important change with each seed.

    Forgo the parameter tuning. Use an ensemble of results from whatever type of model you use. Just make sure you change the random seed each time.   
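    Here is a minimal sketch of the idea (Python/scikit-learn on synthetic data, both of which are stand-ins; the talk used its own models and data): fit the same decision tree under many random seeds, then look at which features keep showing up as important and how stable the error rate is.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        # Synthetic data: 3 truly informative features out of 10.
        X, y = make_classification(n_samples=500, n_features=10,
                                   n_informative=3, random_state=42)

        importances, error_rates = [], []
        for seed in range(30):  # one model per random seed
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.3, random_state=seed)
            tree = DecisionTreeClassifier(max_depth=4, random_state=seed)
            tree.fit(X_tr, y_tr)
            importances.append(tree.feature_importances_)
            error_rates.append(1.0 - tree.score(X_te, y_te))

        imp = np.array(importances)
        print("error rate: mean %.3f, sd %.3f"
              % (np.mean(error_rates), np.std(error_rates)))
        # Stable features: high mean importance, low spread across seeds.
        # "Opinion" features: importance that swings wildly with the seed.
        for j in range(imp.shape[1]):
            print("feature %2d: mean %.3f, sd %.3f"
                  % (j, imp[:, j].mean(), imp[:, j].std()))

    Averaging the 30 importance vectors is the ensemble summary I mean; the per-seed spread shows how much of any single model's story is just the seed.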

    https://www.youtube.com/watch?v=sYPvCE_au4Q



    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)
    ------------------------------

    Attachment(s)

    D-ASA Meeting 012024.pptx (pptx, 174 KB)