Re: Testing for correlation in continuous and categorical variables with missing values

    Posted 10-18-2011 13:01


    -------------------------------------------
    Wayne Haythorn
    -------------------------------------------

    For relationships between continuous variables, you can use the gamma statistic, described at 
    http://smoothregression.com/PDFs/NewTools.pdf

    Gamma estimates the mean squared error of the output relative to an unknown function of the inputs, provided that function is differentiable with bounded first and second derivatives.

    As an estimator of mean squared error relative to a function, gamma is analogous to the residual variance in linear regression.  The difference is that with gamma you don't have to make any assumptions about the algebraic form of the relationship: it measures noise relative to any function that is continuous and smooth.  A low gamma for a set of inputs tells you those inputs are useful for predicting the output, and it does this in a sort of absolute way, since the measure holds for any smooth model rather than for one chosen model family.
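
    To make the idea concrete, here is a minimal sketch of the Gamma test as it is commonly described (near-neighbour statistics plus a straight-line fit whose intercept is the noise estimate).  It is not necessarily the exact procedure in the linked PDF.  The function name gamma_statistic, the parameter p (number of near neighbours) and the test data are my own assumptions for illustration.

```python
import numpy as np

def gamma_statistic(x, y, p=10):
    """Hypothetical helper: estimate noise variance of y given inputs x.

    For k = 1..p, delta[k] is the mean squared distance to the k-th nearest
    neighbour in input space, and gamma[k] is half the mean squared difference
    between the corresponding outputs. The intercept of the line fitted to
    (delta, gamma) is returned as the noise-variance estimate.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D array as a single input column
    m = len(y)
    # Brute-force pairwise squared distances (fine for a few thousand points).
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    order = np.argsort(d2, axis=1)[:, :p]  # indices of the p nearest neighbours
    rows = np.arange(m)[:, None]
    delta = d2[rows, order].mean(axis=0)
    gamma = 0.5 * ((y[order] - y[:, None]) ** 2).mean(axis=0)
    slope, intercept = np.polyfit(delta, gamma, 1)
    return intercept

# Made-up data: y depends smoothly (and nonlinearly) on x_relevant only.
rng = np.random.default_rng(0)
n = 1000
x_relevant = rng.uniform(0.0, 1.0, n)
x_irrelevant = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x_relevant) + rng.normal(0.0, 0.1, n)  # noise variance 0.01

print("relevant input:  ", gamma_statistic(x_relevant, y))
print("irrelevant input:", gamma_statistic(x_irrelevant, y))
```

    With the relevant input the estimate should come out near the true noise variance, and with the irrelevant input it should come out near the variance of y itself, which is how gamma flags useful predictors.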

    Gamma doesn't work with categorical data, and if you know that the relationship is linear, you are better off using linear regression.  Given a linear relationship, linear regression gives a good estimate of the error with about 30 data points, whereas gamma needs about 60 data points to get the same result.  Of course, if you are wrong in your assumption of linearity, gamma will still give you an accurate estimate of the noise, and linear regression will not.
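
    A small sketch of the linear-regression side of that trade-off, again with made-up data and hypothetical names (linear_residual_variance is my own helper, not from the linked PDF).  When the true relationship is linear, the residual variance of a straight-line fit recovers the noise from 30 points; when the relationship is a sine curve, the same fit reports the misfit as if it were noise.

```python
import numpy as np

def linear_residual_variance(x, y):
    """Hypothetical helper: residual variance of a straight-line fit y ~ a*x + b."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    # Divide by n - 2 because two parameters (slope, intercept) were estimated.
    return (residuals ** 2).sum() / (len(y) - 2)

rng = np.random.default_rng(1)
noise_sd = 0.1  # true noise variance is 0.01 in both cases

# Case 1: the linearity assumption is correct, 30 data points.
x_small = rng.uniform(0.0, 1.0, 30)
y_linear = 2.0 * x_small + 1.0 + rng.normal(0.0, noise_sd, 30)
var_linear = linear_residual_variance(x_small, y_linear)

# Case 2: the linearity assumption is wrong (smooth but nonlinear relationship).
x_large = rng.uniform(0.0, 1.0, 1000)
y_curved = np.sin(2 * np.pi * x_large) + rng.normal(0.0, noise_sd, 1000)
var_curved = linear_residual_variance(x_large, y_curved)

print("linear truth, linear fit:", var_linear)
print("curved truth, linear fit:", var_curved)
```

    In the second case the straight-line residual variance is far above the true noise level, because it also contains everything the line failed to fit.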