
Negative Binomial regression: discrepancy between SAS and R

  • 1.  Negative Binomial regression: discrepancy between SAS and R

    Posted 01-09-2014 16:41
    This message has been cross-posted to the following eGroups: Statistical Consulting Section and Young Professionals Group.
    -------------------------------------------
    Hello:

    Could you help me understand why SAS (PROC GENMOD) and R (glm.nb) give very different results on this painfully simple dataset:

    Type    Unrounded    RoundedDown    RoundedUp
    T         1.4322                1                 2
    T         1.0785                1                 2
    T         1.6196                1                 2
    N         2.4950                2                 3
    N         2.0104                2                 3
    N         2.3321                2                 3

    This is a two-sample test (T vs N). The last three columns are versions of response. The non-integer response is not a problem because both SAS and R can handle it. The relationship between mean and variance for Negative Binomial is encoded as:

    SAS: Var(mu) = mu * (1 + k * mu)

    R: Var(mu) = mu + (mu^2) / theta,

    where theta = 1 / k.

    The problem is that regardless of which of the three possible responses I use, R estimates theta as a large positive number, whereas SAS generates a negative value of k. For instance, regressing Unrounded on Type gives theta = 993077 in R and k = -0.4008 in SAS. Correspondingly, the p-values and AIC are quite different.
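
    A minimal Python sketch of how the two parameterizations above relate, using only the formulas and estimates quoted in this post. The function names are illustrative and are not part of SAS or R.

```python
# Sketch: the two negative binomial variance forms quoted above.
# var_sas and var_r are hypothetical helper names, not SAS or R functions.

def var_sas(mu, k):
    """SAS-style form: Var(mu) = mu * (1 + k * mu)."""
    return mu * (1.0 + k * mu)

def var_r(mu, theta):
    """R-style form: Var(mu) = mu + mu^2 / theta."""
    return mu + mu ** 2 / theta

# With theta = 1 / k the two forms give the same variance.
mu = 2.0
k = 0.5
theta = 1.0 / k
same = abs(var_sas(mu, k) - var_r(mu, theta)) < 1e-12

# R's reported theta = 993077 corresponds to k very close to zero.
k_from_r = 1.0 / 993077
# SAS's reported k = -0.4008 would correspond to a negative theta.
theta_from_sas = 1.0 / -0.4008

print(same, k_from_r, theta_from_sas)
```

    On the k scale, the R estimate is therefore essentially zero (the Poisson limit), while the SAS estimate is clearly negative.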

    I don't know whether SAS deliberately allows for negative k (underdispersion) or whether all such results should be discarded. SAS also generates the warning "Negative of Hessian is not positive definite".

    Thanks in advance.
    Regards,
    Nik


  • 2.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-09-2014 23:02

    Nik-

    Very interesting question.  Is your objective to understand the difference in the 2 software packages or to analyze the data set?
    It seems clear to me that converting the "continuous" data to integers loses so much information that neither the negative binomial nor any other discrete distribution can be fit, since the values are then identical within each type. The negative binomial parameters therefore cannot be estimated, and both software packages run into problems.

    If your goal is to compare the results between the two types, then it is not clear to me what contextual information about the problem pushes you to round the results instead of simply analyzing the observed data.

    Hope this helps. 

    Best regards,

    -Walt
    -------------------------------------------
    Walter Flom
    -------------------------------------------








  • 3.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-10-2014 10:13


    As I already mentioned, the problem is very similar regardless of whether one does or does not round the response: R generates a very large positive theta, and SAS generates a negative k.

    Nik




  • 4.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-10-2014 10:22
    Nik: Fitting the NBD requires data that are "over-dispersed", i.e. variance > mean. In your example the data are under-dispersed, and the two programs end up with different extreme solutions. Try a simple example with over-dispersed data (e.g. simulated data from a geometric distribution) and I think you will find almost perfect agreement.
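
    A rough Python sketch of the kind of over-dispersed test data suggested above. It simulates geometric counts (failures before the first success) and checks that the variance exceeds the mean. The seed, p, and sample size are arbitrary assumptions, and no negative binomial model is fitted here; the data would still need to be run through SAS and R to compare them.

```python
import math
import random
from statistics import mean, variance

random.seed(1)   # arbitrary seed, for reproducibility only
p = 0.3          # assumed success probability
n = 20000        # assumed sample size

def geometric_failures(p):
    """Number of failures before the first success, by inverse transform."""
    u = 1.0 - random.random()  # uniform on (0, 1], avoids log(0)
    return math.floor(math.log(u) / math.log(1.0 - p))

sample = [geometric_failures(p) for _ in range(n)]
m = mean(sample)
v = variance(sample)

# Moment estimate of k in Var = mu * (1 + k * mu).
# A geometric distribution is a negative binomial with k = 1.
k_hat = (v - m) / m ** 2
print(m, v, k_hat)
```

    Here the sample variance is well above the mean, so the moment estimate of k is positive, which is the situation the negative binomial model is designed for.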

    -------------------------------------------
    J.Keith Ord
    Professor
    Georgetown Univ
    -------------------------------------------








  • 5.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-10-2014 11:07
    Prof. Ord:

    I tried to estimate the parameter k for each sample by hand, using the unrounded response. The formula is k = (Var - mu) / (mu^2), where mu and Var are the usual sample estimates of the mean and variance for each sample. This results in k = -0.68 and -0.42. SAS reports k = -0.4008, which looks plausible, but it gives se(k) = 0, and the CI for k has zero length. Essentially, SAS allows for underdispersion; otherwise it's not clear why it delivers p-values and AIC for k < 0. Why, then, can't it go further and estimate se(k) properly as well?
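
    The hand calculation above can be sketched in Python as follows. This assumes the sample variance uses the n - 1 denominator; the helper name moment_k is illustrative.

```python
from statistics import mean, variance

# Unrounded response from the first post, by Type.
unrounded = {
    "T": [1.4322, 1.0785, 1.6196],
    "N": [2.4950, 2.0104, 2.3321],
}

def moment_k(values):
    """Moment estimate k = (Var - mu) / mu^2 for one sample."""
    mu = mean(values)
    var = variance(values)  # sample variance, n - 1 denominator (assumption)
    return (var - mu) / mu ** 2

k_hat = {group: moment_k(values) for group, values in unrounded.items()}
for group, k in k_hat.items():
    print(group, round(k, 3))
```

    Both estimates come out negative and close to the hand-computed values, which is consistent with underdispersion in both samples.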

    I also looked at the R source, and it has a built-in restriction on theta. As soon as theta < 0, it is reset to abs(theta). When there is underdispersion, this results in a large positive theta, corresponding to k close to zero. In particular, for the unrounded response:

              Theta:  993077
              Std. Err.:  164701639

    which is not very nice either because of that huge standard error.

    What model do I use for underdispersed data? Is there a single model that handles both overdispersed and underdispersed data properly?

    Regards,
    Nik





  • 6.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-10-2014 14:54
    Hi Nik,

    A few thoughts:

    1. If you are concerned about the legitimacy of the computational algorithms in general, do as Keith Ord has suggested, and manipulate the data so that there is clearly overdispersion.  For example, just spread out the numbers in your upper group and multiply all the results by 10.  See if the results largely agree.  If not, there is a problem.

    2. From a more practical point of view, is this a small toy example or is it (close to) exactly what you want to analyze?  I ask, because you may be demanding a lot to get accurate variance structure estimates from two groups of three observations.  Just for fun, assume normality and compute the usual chi-squared-based confidence intervals for the variance in each of the two groups.  I did this and got intervals of (0.01, 1.49)  for T and (0.01, 1.20) for N.  

    3. In response to your question, "Is there a single model that handles both overdispersed and underdispersed data properly?" the answer is: Yes, it's called ANOVA. :)  In particular, if these are really what your data look like, you have no reason to believe that the variances are different, and you have no evidence of extreme nonnormality.  Keep It Simple...

    -Tom.


    -------------------------------------------
    Thomas Loughin
    Simon Fraser University
    -------------------------------------------








  • 7.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-10-2014 15:23
    Hello Tom:

    ANOVA assumes no relationship between mean and variance. Assuming that there is a relationship, the problem with NB is that

    Var(mu) = mu + k * (mu^2)

    which can generate a negative variance if k < 0.

    What I had in mind was a model like

    Var(mu) = mu * exp(k * mu^2)

    which should have no problem with k < 0.

    Is something like that available?
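
    A small Python sketch comparing the two variance functions written above, evaluated at the negative k reported by SAS earlier in the thread. The function names are illustrative, and the grid of mu values is an arbitrary assumption.

```python
import math

def var_nb(mu, k):
    """Negative binomial form: Var(mu) = mu + k * mu^2."""
    return mu + k * mu ** 2

def var_exp(mu, k):
    """Proposed form: Var(mu) = mu * exp(k * mu^2)."""
    return mu * math.exp(k * mu ** 2)

k = -0.4008            # SAS estimate quoted earlier in the thread
threshold = -1.0 / k   # var_nb is positive only for mu below this value

for mu in (1.0, 2.0, 3.0, 4.0):
    print(mu, round(var_nb(mu, k), 4), round(var_exp(mu, k), 4))
print("var_nb changes sign at mu =", round(threshold, 4))
```

    With k < 0 the negative binomial form becomes negative once mu exceeds -1 / k, while the exponential form stays positive for every positive mu.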




  • 8.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-10-2014 16:24
    Hi Nik,

    I take it that your analysis goal is more to model the variance relationship than to model the mean relationship?  Also, I assume that you have other data sets that are large enough that you can actually estimate variance-model parameters precisely enough to distinguish them from a null model of equal variance.  Otherwise, the ANOVA approach allows you to model means without having to specify the variance relationship, and using a simple unequal variance F-test would allow you to test means and to have different variances without trying to model their differences.  
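
    For two groups, an unequal-variance comparison of means can be sketched with Welch's t statistic, whose square is the corresponding unequal-variance F statistic. This is an assumed reading of the test mentioned above, applied to the Unrounded response from the first post. It computes only the statistic and the Welch-Satterthwaite degrees of freedom, not a p-value.

```python
from statistics import mean, variance

# Unrounded response from the first post, by Type.
t_vals = [1.4322, 1.0785, 1.6196]
n_vals = [2.4950, 2.0104, 2.3321]

def welch(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch(t_vals, n_vals)
print(t_stat, df, t_stat ** 2)  # t, approximate df, and F = t^2
```

    No variance model is specified anywhere in this calculation; each group simply contributes its own sample variance.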

    In theory, you could specify any variance model you like in conjunction with a normal-based analysis.  In practice, fitting that model to data might require some special steps in existing software.  I know that PROC MIXED in SAS can allow you to specify certain exponential (i.e. loglinear) variance models similar to what you propose.  Whether they can be made to do EXACTLY what you want to propose, I'm not sure.  

    -Tom.

    -------------------------------------------
    Thomas Loughin
    Simon Fraser University
    -------------------------------------------








  • 9.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-10-2014 16:32
    Actually, GENMOD is probably the place to look for variance modeling.


    -------------------------------------------
    Thomas Loughin
    Simon Fraser University
    -------------------------------------------








  • 10.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-10-2014 10:35
    The fact that you are getting a negative Hessian usually means that the estimates are worthless.
    I also don't understand why you are using continuous data with a discrete model.  Both programs ought to
    give you an error right there and refuse to run.  When you say they can handle it, it suggests that
    you have exceeded the assumptions of the model (and the program), and I am not surprised that
    you have meaningless results. You should be using a lognormal or a gamma for the continuous data.
    You did not include the results, and I am rather concerned about the goodness of fit to the NB.  The data may be simple,
    but they do not seem to be Negative Binomial at all, especially if you conceptualize it as an overdispersed Poisson
    or a Poisson-Gamma mixture.  Specifically, there are no 0's, yet there should be.
       I would strongly suggest that you think through what your assumptions are and choose a model that conforms to
    them.  E.g., your discrete responses are binary. And examine the diagnostics carefully.

    Ray

    -------------------------------------------
    Raymond Hoffmann, PhD
    Professor
    Medical College of Wisconsin
    -------------------------------------------








  • 11.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 03-12-2014 13:59
    For your second and third response variables, RoundedDown and RoundedUp, the variables are binomial, not negative binomial, so NB is already the wrong distribution to use.  What's worse, if you do a 2x2 cross-tabulation of each response variable versus Type, you will find that both of the off-diagonal cells are zero.  What that means for all maximum-likelihood methods is that your likelihood function will not have a maximum.  Not having a maximum is consistent with the huge estimate and doubly huge standard error you get from R.  What's interesting is that SAS is able to give a negative value of k when the distribution is mis-specified.
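
    The cross-tabulation described above can be sketched in Python using the data from the first post. The crosstab helper is illustrative.

```python
from collections import Counter

# Data from the first post.
types = ["T", "T", "T", "N", "N", "N"]
rounded_down = [1, 1, 1, 2, 2, 2]
rounded_up = [2, 2, 2, 3, 3, 3]

def crosstab(groups, response):
    """Count observations for each (Type, response value) pair."""
    return Counter(zip(groups, response))

down = crosstab(types, rounded_down)
up = crosstab(types, rounded_up)

# Print the 2x2 table for RoundedDown: rows are response values, columns T and N.
for value in (1, 2):
    print(value, down[("T", value)], down[("N", value)])
```

    Each Type maps to a single response value, so the off-diagonal cells are empty for both rounded responses.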

    For your first response variable, Unrounded, I note that the values are not integers.  For negative-binomial regression, they should be integers, and both your software packages are expecting to see integers.  If your software is not giving you error messages when response=Unrounded, then I suspect that it is truncating the values at the decimal point.  If that is what it is doing, then it is re-creating the RoundedDown response variable.

    -------------------------------------------
    Eric Siegel
    Biostatistician
    Univ of Arkansas for Medical Sciences of Biostatistics
    -------------------------------------------





  • 12.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-13-2014 10:31
    Hello Eric:

    How do you know that RoundedUp and RoundedDown are Binomial?

    Neither SAS nor R expect integers. There is no truncation. If you run the three responses, you will see that the regression coefficients are different in all three cases.

    Nik





  • 13.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-13-2014 10:51
    Just because the software allows non-integers does not make it correct.
    Binomial and negative binomial only apply to integer outcomes.
    The continuous data mean that you are fitting a different statistical process/distribution altogether from those two.
    I also strongly agree with one of the previous replies that there is too little data to expect convergence,
    so what you are comparing is how they diverge when there is insufficient data to be able to fit the model.
    Ray

    -------------------------------------------
    Raymond Hoffmann
    Professor
    Medical College of Wisconsin
    -------------------------------------------








  • 14.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-13-2014 11:33
    Do you have only 6 cases? Or is this just to show a point where the procedures break down?

    What happens if you run your data in SPSS?

    What happens when you run the data in this tutorial in SA and SAS?


    With the data you posted, your DV becomes dichotomous when you round it.
    That is, there are only 2 values. Various statistical dialects call this binary, binomial, quantal, flag, or indicator.

    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------








  • 15.  RE:Negative Binomial regression: discrepancy between SAS and R

    Posted 01-13-2014 13:47

    The link should be
    http://www.ats.ucla.edu/stat/spss/dae/neg_binom.htm


    Correction: I meant to ask:

    What happens if you run your data in SPSS?

    What happens when you run the data in this tutorial in R and SAS?
    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------