ASA Connect

  • 1.  Predicted R-Squared

    Posted 01-02-2017 13:46
    I have data with 100 observations and 1000 variables. Using glmnet in R, I am able to reduce the number of variables to fewer than 100. I get R-squared and adjusted R-squared values of about 55%, but the predicted R-squared is about 98%. I have also tried ncvreg and SIS in R, but I am unable to get a reasonable predicted R-squared. Any advice?

    Thank you
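    For reference, "predicted R-squared" is usually computed from the PRESS statistic, i.e. from leave-one-out residuals e_i / (1 - h_ii). A minimal sketch follows (in Python with NumPy; the poster worked in R, and the function name `predicted_r2` is illustrative, not from the thread):

    ```python
    import numpy as np

    def predicted_r2(X, y):
        """Predicted R-squared via the PRESS statistic for an OLS fit.

        Uses leave-one-out residuals e_i / (1 - h_ii), where h_ii are the
        diagonal entries of the hat matrix H = X (X'X)^{-1} X'.
        """
        X1 = np.column_stack([np.ones(len(y)), X])         # add intercept
        H = X1 @ np.linalg.pinv(X1.T @ X1) @ X1.T          # hat matrix
        resid = y - H @ y                                  # ordinary residuals
        press = np.sum((resid / (1.0 - np.diag(H))) ** 2)  # PRESS statistic
        sst = np.sum((y - y.mean()) ** 2)
        return 1.0 - press / sst
    ```

    For an OLS fit, each PRESS residual is at least as large in magnitude as the corresponding ordinary residual, so predicted R-squared cannot exceed the ordinary R-squared; a predicted R-squared *above* R-squared (as reported in the question) suggests the quantity is being computed on something other than a straightforward OLS fit.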


  • 2.  RE: Predicted R-Squared

    Posted 01-03-2017 13:19

    Without knowing what you are trying to accomplish, I suggest investigating the use of principal components.






  • 3.  RE: Predicted R-Squared

    Posted 01-03-2017 14:47

    I am interested in identifying which variables are significant for the response and in predicting their influence. Hence principal component analysis is not going to help in this case.

    Thank you

    ------------------------------
    Uday Jha
    Rochester Institute of Technology



  • 4.  RE: Predicted R-Squared

    Posted 01-04-2017 17:17

    With 1000 potential "predictor" variables and 100 observations you can often "perfectly fit" any response, even if the "response" comes from a random number generator.  In other words, you can "predict" history, but those predictions will be completely useless for predicting the results of future experiments.
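    The point above is easy to demonstrate with a simulation (a Python/NumPy sketch, not from the thread): with p = 1000 pure-noise predictors and only n = 100 observations, least squares reproduces a purely random response exactly, yet the fitted coefficients are worthless on fresh data.

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    n, p = 100, 1000
    X = rng.normal(size=(n, p))   # 1000 pure-noise "predictors"
    y = rng.normal(size=n)        # response from a random number generator

    # With p > n the least-squares system is underdetermined: the
    # minimum-norm solution reproduces y exactly (in-sample R^2 = 1).
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2_in_sample = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

    # The same coefficients are useless on a fresh "experiment".
    X_new = rng.normal(size=(n, p))
    y_new = rng.normal(size=n)
    resid_new = y_new - X_new @ beta
    r2_out = 1 - resid_new @ resid_new / ((y_new - y_new.mean()) @ (y_new - y_new.mean()))
    ```

    Here `r2_in_sample` is essentially 1 while `r2_out` is near zero or negative: the model has "predicted" history perfectly and learned nothing.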

    If you are able to reduce the number of predictor variables to something "reasonable", you might have something useful, but you'll need to work closely with an experienced consultant.  R.I.T. should have people like that, but consultants don't work for free.

    ------------------------------
    Emil M Friedman, PhD
    emilfriedman@gmail.com
    http://www.statisticalconsulting.org



  • 5.  RE: Predicted R-Squared

    Posted 01-05-2017 05:48

    Before constructing any model for prediction purposes it is vital that you do an Exploratory Data Analysis.

    This involves understanding the distributions of all your variables and whether there are missing values, outliers and particularly influential points. You should also understand the correlations between your variables; this would include an exploratory principal components analysis. It may well lead to the conclusion that your variables fall into a number of distinct groups. If so, you should ask whether you need to have all of them present in your model.

    In the first instance you should try to use subject matter expertise to reduce the number of variables. If you have no subject matter expertise there are a few empirical things you can try. You might need to choose a random sample of the variables to get a model that can be estimated with only 100 observations.

    I would do a random forests prediction, as that produces a variable importance score that tells you which variables are the key ones from the point of view of predictive accuracy.

    A CART model would also be worth looking at - it will throw away most of the variables and use just a few of them.

    The LASSO method also picks out a subset of variables and rejects the rest.

    One way or another you need to get the number of variables down to something that can be estimated with 100 observations.
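    The random-forest and LASSO suggestions above can be sketched as follows (a Python/scikit-learn analogue of the R tools discussed in the thread; the synthetic data, where only the first 5 of 1000 variables carry signal, is an illustrative assumption):

    ```python
    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    n, p = 100, 1000
    X = rng.normal(size=(n, p))
    # Assumed setup: only the first 5 variables actually drive the response.
    y = X[:, :5] @ np.array([3.0, -2.0, 2.0, 1.5, -1.0]) + rng.normal(size=n)

    # LASSO: a cross-validated penalty picks out a sparse subset of variables
    # and sets the rest of the coefficients exactly to zero.
    lasso = LassoCV(cv=5, random_state=0).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)

    # Random forest: variable-importance scores rank predictors by their
    # contribution to predictive accuracy.
    rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
    top10 = np.argsort(rf.feature_importances_)[::-1][:10]
    ```

    In this setup both `selected` and `top10` recover most of the truly active variables, which is the sense in which these methods "throw away most of the variables and use just a few of them".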

    ------------------------------
    Blaise Egan
    Lead Data Scientist
    British Telecommunications PLC



  • 6.  RE: Predicted R-Squared

    Posted 01-05-2017 10:54

    The usual recommendation for a multiple-regression study is that you have 20 cases per predictor. That means eliminating 995 of your variables, which is unrealistic. Stepwise and other predictor-selection procedures would merely capitalize on chance. If you can select a small number of predictors based on theory (independently of the data), you could do an analysis, but it appears that your study is purely exploratory. There is, of course, the issue of how your sample of cases was chosen, which raises a whole new set of concerns.

    ------------------------------
    Chauncey Dayton



  • 7.  RE: Predicted R-Squared

    Posted 01-07-2017 12:30

    Hello,

    During the exploratory data analysis, I checked for missing values, replaced outliers with the mean/median, and found that the distributions are near Gaussian in most cases.

    I have managed to keep the VIFs below about 10 in most cases by removing variables whose VIF exceeds 30-50. Since I am interested in the significant variables themselves, principal components analysis may not be useful.
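    For readers unfamiliar with the screen described above: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing variable j on all the others. A minimal sketch (Python/NumPy; the poster worked in R, and the function name `vif` is illustrative):

    ```python
    import numpy as np

    def vif(X):
        """Variance inflation factors for the columns of X.

        VIF_j = 1 / (1 - R^2_j), where R^2_j is from regressing
        column j on the remaining columns (with an intercept).
        """
        n, p = X.shape
        out = np.empty(p)
        for j in range(p):
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
            resid = X[:, j] - others @ beta
            r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
            out[j] = 1.0 / (1.0 - r2)
        return out
    ```

    A common workflow, matching the post, is to drop the variable with the largest VIF above the chosen cutoff (30-50 here), recompute, and repeat until all remaining VIFs are below about 10.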

    I have managed to reduce the number of variables to fewer than 100 using glmnet. Since the data are continuous, with a large number of variables even after reduction and a small number of observations, random forests and decision trees have not yielded good results. Reducing the variables to a very small number yields a very small R-squared.

    Thank you