ASA Connect

  • 1.  Least-Squares Regression modeling alternative

    Posted 05-25-2020 13:40

    This is my first post here

    I have a set of data in which 70% of the model's errors are between 0 and 1, and the rest are between 1 and about 9. The loss function implied by least squares has an undesirable quality for my application: squaring shrinks the errors below 1 and greatly inflates the errors above 1.
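    For illustration, with made-up error magnitudes on either side of 1 (Python here, purely to show the arithmetic):

```python
# Squared vs. absolute loss for errors on either side of 1
errors = [0.2, 0.5, 0.9, 2.0, 5.0, 9.0]

# Below 1, squaring understates the error; above 1, it overstates it
shrunk = [e for e in errors if e * e < e]    # errors that squaring shrinks
inflated = [e for e in errors if e * e > e]  # errors that squaring inflates
```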

    In my application a linear loss function (min Σ|y − ŷ|) seems more appealing.


    I have never tried to estimate a model using a linear loss function, which as I recall (from many, many years ago) is a linear programming problem. We currently use R but have not investigated this particular approach in it; we are open to using other software. Is R a good tool for this kind of modeling?

    The data set will have between 2,000 and 4,000 observations on around 15 variables.

    I would appreciate any recommended readings on the subject that are not too technical.

    Also, I would like to ask a very naive question. I am curious about non-regression-based machine learning approaches: are they primarily heuristics? Do these methods have implied loss functions?

    Thanks
    Jim Hawkes
    jhawkes@hawkeslearning.com

    ------------------------------
    James Hawkes
    Retired
    ------------------------------


  • 2.  RE: Least-Squares Regression modeling alternative

    Posted 05-26-2020 07:16

    Hi Jim

    R is the most comprehensive repository of statistical software that exists anywhere. It's good for pretty much any type of data analysis where the data set can reside in the computer's memory. (Even if not, you can rent cloud-based 'virtual machines' with as much memory as you want.) A few thousand observations is no trouble at all.

    What you're describing is Least Absolute Deviations (LAD) regression. A bit of Googling suggests that the L1pack package does what you want.
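    For intuition, here is a brute-force sketch of the one-predictor case (illustrative Python, not what L1pack does internally). It uses the fact that some optimal L1 line passes through at least two of the data points, so for tiny data sets you can simply enumerate pairs:

```python
from itertools import combinations

def lad_line(xs, ys):
    """Brute-force least-absolute-deviations fit for small 2-D data.
    Some optimal L1 line passes through at least two data points,
    so enumerating all pairs finds a global minimum (tiny data only)."""
    best_loss, best_a, best_b = float("inf"), 0.0, 0.0
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue                              # vertical line, skip
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])     # slope through points i, j
        a = ys[i] - b * xs[i]                     # intercept
        loss = sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))
        if loss < best_loss:
            best_loss, best_a, best_b = loss, a, b
    return best_a, best_b

# y roughly equals x, with one wild response at x = 5
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 2.1, 2.9, 4.2, 30.0, 5.9, 7.1, 8.0]
a, b = lad_line(xs, ys)   # slope stays near 1; the outlier barely matters
```

    Note how the L1 fit shrugs off the wild response, where least squares would be dragged toward it.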

    However, you might want to consider other approaches. You haven't said what model generates these skewed residuals. I would be tempted to first try a Generalised Linear Model with an exponential or gamma error term. For that you don't need to call any packages, as it's all in base R via the glm() function.

    If the covariates relate to the response in a non-linear way, then I would go for a Generalised Additive Model, which uses spline functions to model the right-hand side instead of a linear predictor. I have had good success with the mgcv package and the associated book on Generalised Additive Models by Simon N. Wood.

    "I am curious about non-regression based machine learning approaches,  are they primarily heuristics?  Do these methods  have implied loss functions?"


    I am not sure what you mean by "non-regression based". To my mind, if there's a response variable that you're predicting with a bunch of covariates, that's a regression, even if it doesn't look anything like a linear model on the right-hand side.

    You might want to look at CART tree models, in the package rpart. They recursively partition the data and build a tree model. If the tree ends up not being too complicated that can be very helpful in understanding your data. The big issue with them is instability. Add some more data and you may get a completely different tree.

    Very good predictive accuracy is achieved by combining the results of many trees, as in the randomForest package. However, the weighted average of 100 or so different trees defies interpretation, so it is not easy to explain the prediction in simple terms.

    So no, ML methods are not simple heuristics.

    Blaise



    ------------------------------
    Blaise Egan
    Lead Data Scientist
    British Telecommunications PLC
    ------------------------------



  • 3.  RE: Least-Squares Regression modeling alternative

    Posted 05-26-2020 11:50
    Hi Jim. 

    It will help the discussion if you can please tell us more about the problem you are working on and the nature of the data. At this point, any suggestions I make are speculative.  

    I do agree with the sound advice @Blaise has given.

    When faced with problems like the one you describe, I go back to the beginning and examine the data set more carefully using EDA methods. For example, I would try to get a better feel for how the variables are distributed. Also, are categorical variables coded in a sensible manner, or is some work required? For numeric variables, does a Box-Cox transformation help?
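    For reference, the Box-Cox transform is just the power family (y^λ − 1)/λ, with log(y) as the λ = 0 limit; a minimal sketch (Python for illustration; in R, MASS::boxcox() helps choose λ):

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform for a positive observation y."""
    if lam == 0:
        return math.log(y)            # the lambda -> 0 limit
    return (y ** lam - 1) / lam

# lambda = 0.5 pulls in a long right tail: 9.0 maps to 4.0
transformed = [round(box_cox(y, 0.5), 3) for y in [0.25, 1.0, 4.0, 9.0]]
```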

    Another thing you should consider is expanding the diagnostics you have used on the fitted values. Have you done a residual plot, an influence plot, etc.? Perhaps your problem is heteroscedasticity of the residuals? R has lots of tools to help you with this.

    The exact steps will depend on what you find in your initial examination of your data and model results. Some creativity may be required. 

    Hope my comments help. Steve

    ------------------------------
    Stephen Elston
    Principal Consultant
    Quantia Analytics, LLC
    ------------------------------



  • 4.  RE: Least-Squares Regression modeling alternative

    Posted 05-26-2020 11:53
    For a linear loss function, loss is minimized by the median. So simply calculating a median will work as a univariate statistic, and quantile regression (at the median) as a model.
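    A quick numerical check of this, with made-up numbers shaped roughly like the skewed error distribution Jim describes (Python, purely for illustration):

```python
import statistics

# Skewed sample: most values below 1, a few much larger
ys = [0.2, 0.4, 0.5, 0.7, 0.9, 1.5, 4.0, 9.0]

def abs_loss(c):
    return sum(abs(y - c) for y in ys)

def sq_loss(c):
    return sum((y - c) ** 2 for y in ys)

# Scan candidate constants c on a fine grid over the data's range
grid = [i / 100 for i in range(0, 1001)]   # 0.00, 0.01, ..., 10.00
best_abs = min(grid, key=abs_loss)   # lands in the median interval
best_sq = min(grid, key=sq_loss)     # lands at the mean

# For even n, any c between the two middle order statistics minimizes
# absolute loss; statistics.median(ys) sits inside that interval.
med = statistics.median(ys)
```

    The absolute-loss minimizer stays down among the bulk of the small values, while the squared-loss minimizer (the mean) is pulled up by the few large ones.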

    ------------------------------
    Jonathan Siegel
    Director Clinical Statistics
    ------------------------------



  • 5.  RE: Least-Squares Regression modeling alternative

    Posted 05-26-2020 12:02
    I was also going to recommend quantile regression. There is the quantreg package in R, or my advisor's package randomForestSRC, which implements quantile regression forests.

    ------------------------------
    Robert O'Brien
    ------------------------------