ASA Connect

  • 1.  Predicted R-Squared

    Posted 01-02-2017 13:46
    I have data with 100 observations and 1000 variables. Using glmnet in R, I am able to reduce the number of variables to fewer than 100. I get R-squared and adjusted R-squared values of about 55%, but the predicted R-squared is about 98%. I have also tried ncvreg and SIS in R, but I am unable to get a reasonable predicted R-squared. Any advice?

    Thank you
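    For reference, "predicted R-squared" is usually computed from the PRESS statistic, i.e. from leave-one-out residuals e_i / (1 - h_ii). A minimal sketch follows (in Python with NumPy; the poster worked in R, and the function name `predicted_r2` is illustrative, not from the thread):

    ```python
    import numpy as np

    def predicted_r2(X, y):
        """Predicted R-squared via the PRESS statistic for an OLS fit.

        Uses leave-one-out residuals e_i / (1 - h_ii), where h_ii are the
        diagonal entries of the hat matrix H = X (X'X)^{-1} X'.
        """
        X1 = np.column_stack([np.ones(len(y)), X])         # add intercept
        H = X1 @ np.linalg.pinv(X1.T @ X1) @ X1.T          # hat matrix
        resid = y - H @ y                                  # ordinary residuals
        press = np.sum((resid / (1.0 - np.diag(H))) ** 2)  # PRESS statistic
        sst = np.sum((y - y.mean()) ** 2)
        return 1.0 - press / sst
    ```

    For an OLS fit, each PRESS residual is at least as large in magnitude as the corresponding ordinary residual, so predicted R-squared cannot exceed the ordinary R-squared; a predicted R-squared *above* R-squared (as reported in the question) suggests the quantity is being computed on something other than a straightforward OLS fit.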


  • 2.  RE: Predicted R-Squared

    Posted 01-03-2017 13:19

    Without knowing what you are trying to accomplish, I suggest investigating the use of principal components.






  • 3.  RE: Predicted R-Squared

    Posted 01-03-2017 14:47

    I am interested in identifying which variables are significant for the response and in predicting their influence. Hence principal component analysis is not going to help in this case.

    Thank you

    ------------------------------
    Uday Jha
    Rochester Institute of Technology



  • 4.  RE: Predicted R-Squared

    Posted 01-04-2017 17:17

    With 1000 potential "predictor" variables and 100 observations you can often "perfectly fit" any response, even if the "response" comes from a random number generator.  In other words, you can "predict" history, but those predictions will be completely useless for predicting the results of future experiments.
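    The point above is easy to demonstrate with a simulation (a Python/NumPy sketch, not from the thread): with p = 1000 pure-noise predictors and only n = 100 observations, least squares reproduces a purely random response exactly, yet the fitted coefficients are worthless on fresh data.

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    n, p = 100, 1000
    X = rng.normal(size=(n, p))   # 1000 pure-noise "predictors"
    y = rng.normal(size=n)        # response from a random number generator

    # With p > n the least-squares system is underdetermined: the
    # minimum-norm solution reproduces y exactly (in-sample R^2 = 1).
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2_in_sample = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

    # The same coefficients are useless on a fresh "experiment".
    X_new = rng.normal(size=(n, p))
    y_new = rng.normal(size=n)
    resid_new = y_new - X_new @ beta
    r2_out = 1 - resid_new @ resid_new / ((y_new - y_new.mean()) @ (y_new - y_new.mean()))
    ```

    Here `r2_in_sample` is essentially 1 while `r2_out` is near zero or negative: the model has "predicted" history perfectly and learned nothing.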

    If you are able to reduce the number of predictor variables to something "reasonable", you might have something useful, but you'll need to work closely with an experienced consultant.  R.I.T. should have people like that, but consultants don't work for free.

    ------------------------------
    Emil M Friedman, PhD
    emilfriedman@gmail.com
    http://www.statisticalconsulting.org



  • 5.  RE: Predicted R-Squared

    Posted 01-05-2017 05:48

    Before constructing any model for prediction purposes it is vital that you do an Exploratory Data Analysis.

    This involves understanding the distributions of all your variables and whether there are missing values, outliers and particularly influential points. You should also understand the correlations between your variables; this would include an exploratory principal components analysis. It may well lead to the conclusion that your variables fall into a number of distinct groups. If so, you should ask whether you need to have all of them present in your model.

    In the first instance you should try to use subject matter expertise to reduce the number of variables. If you have no subject matter expertise there are a few empirical things you can try. You might need to choose a random sample of the variables to get a model that can be estimated with only 100 observations.

    I would do a random forests prediction, as that produces a variable importance score that tells you which variables are the key ones from the point of view of predictive accuracy.

    A CART model would also be worth looking at - it will throw away most of the variables and use just a few of them.

    The LASSO method also picks out a subset of variables and rejects the rest.

    One way or another you need to get the number of variables down to something that can be estimated with 100 observations.
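    The random-forest and LASSO suggestions above can be sketched as follows (a Python/scikit-learn analogue of the R tools discussed in the thread; the synthetic data, where only the first 5 of 1000 variables carry signal, is an illustrative assumption):

    ```python
    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    n, p = 100, 1000
    X = rng.normal(size=(n, p))
    # Assumed setup: only the first 5 variables actually drive the response.
    y = X[:, :5] @ np.array([3.0, -2.0, 2.0, 1.5, -1.0]) + rng.normal(size=n)

    # LASSO: a cross-validated penalty picks out a sparse subset of variables
    # and sets the rest of the coefficients exactly to zero.
    lasso = LassoCV(cv=5, random_state=0).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)

    # Random forest: variable-importance scores rank predictors by their
    # contribution to predictive accuracy.
    rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
    top10 = np.argsort(rf.feature_importances_)[::-1][:10]
    ```

    In this setup both `selected` and `top10` recover most of the truly active variables, which is the sense in which these methods "throw away most of the variables and use just a few of them".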

    ------------------------------
    Blaise Egan
    Lead Data Scientist
    British Telecommunications PLC



  • 6.  RE: Predicted R-Squared

    Posted 01-05-2017 10:54

    The usual recommendation for a multiple-regression study is that you have 20 cases per predictor. That means eliminating 995 of your variables, which is unrealistic. Stepwise and other predictor-selection procedures would merely capitalize on chance. If you can select a small number of predictors based on theory (independently of the data), you could do an analysis, but it appears that your study is purely exploratory. There is, of course, the issue of how your sample of cases was chosen, which raises a whole new set of concerns.

    ------------------------------
    Chauncey Dayton



  • 7.  RE: Predicted R-Squared

    Posted 01-07-2017 12:30

    Hello,

    During the exploratory data analysis, I checked for missing values, replaced outliers with the mean/median, and found that the distributions are near Gaussian in most cases.

    I have managed to keep the VIFs below about 10 in most cases by removing variables whose VIF exceeds 30-50. Since I am interested in the significant variables themselves, principal components analysis may not be useful.
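    For readers unfamiliar with the screen described above: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing variable j on all the others. A minimal sketch (Python/NumPy; the poster worked in R, and the function name `vif` is illustrative):

    ```python
    import numpy as np

    def vif(X):
        """Variance inflation factors for the columns of X.

        VIF_j = 1 / (1 - R^2_j), where R^2_j is from regressing
        column j on the remaining columns (with an intercept).
        """
        n, p = X.shape
        out = np.empty(p)
        for j in range(p):
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
            resid = X[:, j] - others @ beta
            r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
            out[j] = 1.0 / (1.0 - r2)
        return out
    ```

    A common workflow, matching the post, is to drop the variable with the largest VIF above the chosen cutoff (30-50 here), recompute, and repeat until all remaining VIFs are below about 10.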

    I have managed to reduce the number of variables to fewer than 100 using glmnet. Since the data are continuous, with a large number of variables even after reduction and a small number of observations, random forests and decision trees have not yielded good results. Reducing the variables to a very small number yields a very small R-squared.

    Thank you