logistic regression without a cross-sectional design

  • 1.  logistic regression without a cross-sectional design

    Posted 06-14-2016 16:23
    I am reviewing a manuscript and struggling with whether or not I agree with the authors' statistical methods.  Data were retrospectively abstracted from a very large dataset containing about 5 million subjects.  About 2000 of the subjects have the outcome of interest (a binary outcome); these are the cases.  For each case, 100 control subjects were randomly selected (though not matched).  The purpose of the study is to model the odds of the outcome using a variety of factors.

    1): The design changes the prevalence of death from the true prevalence of 2,000 in 5,000,000 (0.0004) to 2,000 in 200,000 (0.01).  So it seems that regular logistic regression and odds ratios may misrepresent the true odds of death due to a factor.  Is there a better way to handle this in the analysis?

    2) The authors compare two models. Both models yield an odds ratio for the outcome that is large (over 80), and both odds ratios are statistically significant (p<0.001).  They conclude that the model with the larger OR is better simply because the OR is larger (no goodness-of-fit statistics, etc.).

    Any thoughts?

           

    --
    Nancy Buderer, MS
    Biostatistician and Research Consultant
    419-297-9682


  • 2.  RE: logistic regression without a cross-sectional design

    Posted 06-14-2016 16:30
    As memory serves, the advantage of an odds ratio is that it does not depend on prevalence, so it can be used for retrospective or case-control studies.

    But I'd be interested in seeing AUC/C-statistic and AIC for the models.  Also some estimate of optimism/overfitting.
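
    For readers less familiar with it, the C-statistic is the probability that a randomly chosen case gets a higher predicted score than a randomly chosen control. A minimal Python sketch follows; the function name `c_statistic` and the scores are made up for illustration and have nothing to do with the manuscript under review.

```python
def c_statistic(case_scores, control_scores):
    """Fraction of case/control pairs ranked correctly; ties count as half."""
    concordant = 0.0
    for s_case in case_scores:
        for s_ctrl in control_scores:
            if s_case > s_ctrl:
                concordant += 1.0
            elif s_case == s_ctrl:
                concordant += 0.5
    return concordant / (len(case_scores) * len(control_scores))


# Hypothetical predicted probabilities, for illustration only.
cases = [0.9, 0.6, 0.4]
controls = [0.5, 0.4, 0.2, 0.1]
auc = c_statistic(cases, controls)
print(auc)
```

    A value of 0.5 means the model ranks no better than chance, and 1.0 means every case outranks every control. The pairwise loop is fine for a sketch but would be slow for 2,000 cases against 200,000 controls; a rank-based formula would be the usual choice there.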




    --
    Sent from Gmail Mobile




  • 3.  RE: logistic regression without a cross-sectional design

    Posted 06-14-2016 17:26

    Really need to know some more information.  If you can mention it, it would help to know what the outcome of interest is.  How wide are the confidence intervals around the odds ratio you mention? Was any justification provided for not matching?  Is it likely that the outcomes occur in some particular subpopulation?  What characteristics of the population are included in the model?

    I am very suspicious.  I very rarely expect to see that large an odds ratio.  It is of the magnitude one might expect looking at lung cancer and smoking.

    Bob 

    ------------------------------
    Bob Gerzoff, MS PStat®
    Applied Statistical Consulting
    bob@bobgerzoff.com



  • 4.  RE: logistic regression without a cross-sectional design

    Posted 06-14-2016 18:30

    Just to elaborate on the previous comments, what you have is a classic case-control design. Because the prevalence is artificially controlled, certain statistics like the relative risk are inappropriate for this setting. But Cornfield showed in 1950 that as long as the outcome is fairly rare, the odds ratio is a fine statistic to use. The case-control study was instrumental in first identifying a possible link between smoking and cancer. It has also helped identify that aspirin use was associated with Reye's syndrome, and that HIV was a sexually transmitted disease. The CDC recently sent a team of epidemiologists to Brazil to conduct a case-control study that more firmly established the link between the Zika virus and microcephaly.

    I'm a bit surprised that the authors of the paper you are reviewing did not properly identify their study as a case-control study. Maybe it's because case-control studies have a (largely undeserved) bad reputation.

    Anyway, I would be a bit nervous about an odds ratio of 80. As someone else has already noted, smoking and lung cancer have an odds ratio of around 10 (or maybe 20). The odds ratio for smoking and heart disease is around 2, and that is considered a fairly strong association.

    There are several things that might cause such an abnormally large odds ratio. You might be seeing separation or quasi-separation of the data. If you have a large number of independent variables, you might be seeing overfitting. I like the suggestion of looking at the standard errors. It's hard to imagine that a poor goodness of fit could produce an artificially large odds ratio, though.

    ------------------------------
    Stephen Simon, blog.pmean.com
    Independent Statistical Consultant
    P. Mean Consulting



  • 5.  RE: logistic regression without a cross-sectional design

    Posted 06-14-2016 18:41

    An odds ratio of 80 (eighty?)?  A long time ago (longer than I care to admit) I memorized the odds ratio that Mantel-Haenszel estimated for the risk of lung cancer from smoking in their classic paper: that was about 8. That was ultimately enough for the Surgeon General to add a warning to cigarettes.  As you say, you are reviewing a manuscript. You might consider recommending the Firth correction and asking the authors to check convergence and to check for "separable data".
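
    To make the Firth suggestion concrete, here is a hand-rolled Python sketch of Firth's penalized likelihood for the simplest possible model: an intercept plus one binary exposure, fitted to grouped counts. The counts are invented so that every exposed subject is a case, which is the kind of separation where the ordinary maximum-likelihood odds ratio is infinite. The function name `firth_logistic_2x2` is hypothetical, and the authors would presumably use an established implementation rather than this.

```python
import math


def firth_logistic_2x2(groups, max_iter=100, tol=1e-10):
    """Firth-penalized fit of logit(p) = b0 + b1*x for grouped binary data.

    groups: list of (x, n_subjects, n_events) with x coded 0 or 1.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Fisher information X'WX for the two parameters.
        i00 = i01 = i11 = 0.0
        probs = []
        for x, n, s in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            probs.append((p, w))
            i00 += n * w
            i01 += n * w * x
            i11 += n * w * x * x
        det = i00 * i11 - i01 * i01
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det  # inverse information

        # Firth-modified score: sum over subjects of (y - p + h*(0.5 - p)) * (1, x).
        u0 = u1 = 0.0
        for (x, n, s), (p, w) in zip(groups, probs):
            h = w * (v00 + 2.0 * v01 * x + v11 * x * x)  # hat value per subject
            resid = (s - n * p) + n * h * (0.5 - p)
            u0 += resid
            u1 += resid * x
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Invented counts: unexposed 5 events in 100, exposed 20 events in 20.
groups = [(0, 100, 5), (1, 20, 20)]
b0, b1 = firth_logistic_2x2(groups)
firth_or = math.exp(b1)
print(firth_or)
```

    For this saturated two-group model the Firth estimate works out to the same thing as adding 0.5 to each cell of the 2x2 table, so the estimate is finite but still very large. A finite number from a penalized fit does not by itself make the association believable; the cell counts still need to be reported.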

    ------------------------------
    Chris Barker, Ph.D.
    Consultant and
    Adjunct Associate Professor of Biostatistics


    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy



  • 6.  RE: logistic regression without a cross-sectional design

    Posted 06-14-2016 19:35
    Graphical examination of the data always should be the first step.  Perhaps this will reveal the reasons behind the strange behavior.
     
    Elizabeth Newton, Ph.D.
    Newton Statistical Consulting
    www.newtonstats.com
    info@newtonstats.com




  • 7.  RE: logistic regression without a cross-sectional design

    Posted 06-14-2016 20:36
    I agree with the feedback you have received.

    Is the exposure variable with an odds ratio of 80 continuous?  If so, the variable may need to be rescaled so that a 1-unit increase is meaningful. If the exposure variable is binary, I suspect that there is quasi-complete or complete separation of the data.  The bivariate associations of the exposures with the outcome should be reported in a table or figure in the manuscript.  The authors should also report the confidence intervals for each odds ratio estimated from the logistic regression analysis.
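
    The rescaling point is just arithmetic on the log-odds scale: if the coefficient is beta per unit, the odds ratio for a change of k units is exp(k * beta). A small Python sketch with illustrative numbers (the function name and the example units are assumptions, not details from the manuscript):

```python
import math


def rescale_odds_ratio(or_per_unit, units):
    """Odds ratio for a change of `units`, given the odds ratio per 1 unit."""
    beta = math.log(or_per_unit)
    return math.exp(beta * units)


# An OR of 80 per unit is an OR of about 1.55 per tenth of a unit.
or_per_tenth = rescale_odds_ratio(80.0, 0.1)

# Conversely, a modest OR of 1.05 per unit compounds to roughly 80 over 90 units.
or_over_90_units = rescale_odds_ratio(1.05, 90)

print(or_per_tenth, or_over_90_units)
```

    So an odds ratio of 80 could be unremarkable if it describes the full range of a continuous variable, which is why the units and coding of the exposure need to be stated.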

    Amy

    *******************************
    Amy Storfer-Isser, Ph.D.
    Owner and Principal Statistician
    Statistical Research Consultants, LLC







  • 8.  RE: logistic regression without a cross-sectional design

    Posted 06-14-2016 18:57
    That is a nice question, Nancy, thanks for sharing it. 

    As someone pointed out, the OR for a case-control design like yours does not depend on the prevalence. The intercept of the fitted model does depend on the prevalence. 
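
    A quick way to convince yourself of this is to thin the controls in a 2x2 table and recompute. The Python sketch below uses invented exposure counts chosen only so that the totals match the thread (2,000 cases in 5,000,000 subjects) and keeps 1 in 25 controls, using expected counts rather than a random draw.

```python
from fractions import Fraction


def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Cross-product odds ratio, kept exact with Fraction."""
    return Fraction(exp_cases * unexp_controls, unexp_cases * exp_controls)


# Hypothetical full population (exposure split is invented).
pop = dict(exp_cases=400, unexp_cases=1600,
           exp_controls=100_000, unexp_controls=4_898_000)

# Case-control sample: keep all cases, keep 1 in 25 controls.
sample = dict(pop)
sample["exp_controls"] = pop["exp_controls"] // 25
sample["unexp_controls"] = pop["unexp_controls"] // 25

or_pop = odds_ratio(**pop)
or_sample = odds_ratio(**sample)

case_share_pop = Fraction(2000, sum(pop.values()))
case_share_sample = Fraction(2000, sum(sample.values()))

print(float(or_pop), float(or_sample))
print(float(case_share_pop), float(case_share_sample))
```

    The sampling fraction multiplies both control cells and cancels in the cross-product, so the odds ratio is unchanged while the share of cases in the data moves from 0.0004 to about 0.01. In a real random sample the two odds ratios would agree only up to sampling error.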

    Certainly an OR of 80 is "far out." We have had those occur, but usually it comes with a huge confidence interval (like 1 to 1600!). Something has to be wrong. A confidence interval by the bootstrap would be good. Some diagnostics (and descriptives) on variables and variable relationships seem to be called for. At the very least, the authors should be asked to create a bivariate display (e.g., boxplot) of the outcome and the particular independent variable that is smoking the rafters with the 80 OR.

    Also, with this nice, large sample size, doing a training set to develop the model and then testing it on a test set is a really good way to go. I know the paper is already written, but you could mention that. 

    Good luck!

    Nayak


    Nayak L Polissar, PhD
    The Mountain-Whisper-Light Statistics
    1827 23rd Ave. East
    Seattle, WA 98112
    Tel. 206-329-9325
    Fax 206-324-5915
    polissar@u.washington.edu (for university affairs only) 








  • 9.  RE: logistic regression without a cross-sectional design

    Posted 06-14-2016 19:03

    One technical comment: in properly designed / sampled population-based case-control studies, the logistic regression coefficients are consistently estimated either by the weighted model (which gives the controls their proper probability-of-selection weight; a very inefficient method when the sampling rates differ by orders of magnitude, as in this case, 1:25) or by the quasi-ML that ignores the weights. The only problem for the latter, as Nancy pointed out, is the overall prevalence parameter, which is controlled by the intercept, and that one is of course biased if you don't use weights (but you can figure the bias out if you know the sampling rate for the controls). So odds ratios of the factors should be fine. But 80 is indeed a bizarre number.
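
    The intercept bias mentioned in the parenthesis has a closed form: the unweighted intercept is too large by ln(case sampling fraction / control sampling fraction). A Python sketch under the assumption that all 2,000 cases and 200,000 of the roughly 4,998,000 non-cases were kept (the function names are made up for the example):

```python
import math


def correct_intercept(b0_sample, case_fraction, control_fraction):
    """Population-scale intercept from an unweighted case-control fit."""
    return b0_sample - math.log(case_fraction / control_fraction)


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


# Intercept-only fit on the sampled data: log odds of 2,000 cases vs 200,000 controls.
b0_sample = math.log(2000 / 200_000)

# Assumed sampling fractions: every case kept, 200,000 of 4,998,000 controls kept.
b0_population = correct_intercept(b0_sample, 1.0, 200_000 / 4_998_000)

baseline_risk = inv_logit(b0_population)
print(baseline_risk)
```

    With these inputs the corrected baseline risk comes back to the population figure of 2,000 in 5,000,000. The slope coefficients need no such adjustment, which is the point of the paragraph above.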

    I would second the calls for the goodness of fit testing to be done properly.

    Also, I would probably say that an OR of 40 with a CI of 25 to 65 is more convincing to me than the OR of 80 with a CI of 5 to 1000. In other words, you can just as well look at the lower confidence limit in order to decide which is a "better " model.
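
    To see how the same data structure can give a large odds ratio with either a tight or a very wide interval, here is a Python sketch of the standard Wald (Woolf) interval for a 2x2 table. Both tables are invented; the first has only a handful of exposed subjects, the second has plenty.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald (Woolf) interval.

    a, b = exposed and unexposed cases; c, d = exposed and unexposed controls.
    """
    or_hat = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log_or)
    upper = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lower, upper


# Sparse exposure: 4 exposed cases, 5 exposed controls (invented counts).
or_sparse, lo_sparse, hi_sparse = odds_ratio_ci(4, 1996, 5, 199_995)

# Well-populated exposure (invented counts).
or_dense, lo_dense, hi_dense = odds_ratio_ci(500, 1500, 1650, 198_350)

print(or_sparse, lo_sparse, hi_sparse)
print(or_dense, lo_dense, hi_dense)
```

    The sparse table has the larger point estimate but the smaller lower limit, because the standard error of the log odds ratio is dominated by the smallest cells. The formula breaks down when any cell is zero, which is one way separation shows up in practice.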

    Did they describe how exactly they subsampled the controls? If that was a random sample, it's probably OK. If that was a carefully matched sample, that's probably OK, too. If these were all last names starting with A (and I did have a sample like that in one of my projects... somebody underestimated the size of the list by a couple of orders of magnitude when they asked to pull every 100th name), that's problematic -- misses all those Russians whose last names tend to start with K ;).

    ------------------------------
    Stanislav Kolenikov
    Principal Survey Scientist
    Abt SRBI
    Education Officer, Survey Research Methods Section



  • 10.  RE: logistic regression without a cross-sectional design

    Posted 06-14-2016 19:04

    Meant to give a reference on case-control stuff: http://www.statcan.gc.ca/pub/12-001-x/2006002/article/9546-eng.pdf

    ------------------------------
    Stanislav Kolenikov
    Principal Survey Scientist
    Abt SRBI
    Education Officer, Survey Research Methods Section



  • 11.  RE: logistic regression without a cross-sectional design

    Posted 06-15-2016 09:25

    If your goal is to predict the outcome, I would use Random Forests on all the data. The authors can find a suitable cut off value then run several of them and see how well the models predict the outcome. 

    Each Random Forest will show the authors which factors have an important impact on the outcome. It also shows how well the model predicts each type of outcome.

    There are two big issues with logistic regression that we usually don't think about.

    One is the odds ratio. Suppose the odds ratio shows a 5000% increase in the likelihood of disease. If the odds of disease are one in a billion anyway, then even after a 5000% increase, dramatic as that sounds, the disease is still extremely rare. If you use your logistic regression model to predict the outcome, that person would still have little chance of getting the disease. That means a 5000% increase is pretty meaningless.
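
    The arithmetic behind this point is easy to check. The Python sketch below converts a baseline risk and an odds ratio into an absolute risk; it treats a "5000% increase" as multiplying the odds by 51, and it also plugs in the thread's own figures (baseline 0.0004, odds ratio 80) for comparison. The function name is hypothetical.

```python
def risk_after_odds_ratio(baseline_risk, odds_ratio):
    """Absolute risk after multiplying the baseline odds by `odds_ratio`."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# One-in-a-billion baseline with a 5000% increase in the odds (a factor of 51).
tiny_risk = risk_after_odds_ratio(1e-9, 51)

# The thread's figures: population prevalence 0.0004 and an odds ratio of 80.
thread_risk = risk_after_odds_ratio(0.0004, 80)

print(tiny_risk, thread_risk)
```

    In the second case the exposed group's risk is about 3%, so most exposed subjects would still not have the outcome. Whether that makes the odds ratio "meaningless" depends on the goal: for etiology a large, well-estimated odds ratio matters even when absolute risk stays low.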

    The other issue is that logistic regressions try to minimize the error. I can predict everyone in the group will not get disease and be correct 99.96% of the time. If you use the logistic regression model to predict the outcome, most of the time it will fail to predict the correct outcome for those with the disease. 

    With random forests, you tune the model to correctly predict that the diseased have disease and the healthy are healthy. You also end up breaking your data into 2-3 groups. So, you end up with internal validation of the model. By running multiple random forests, you improve the likelihood of finding truly important factors. It's also like replicating the analysis with "new" data. So, you can get some very robust results.

    Between Random Forests and Logistic regressions, I'll take RF all day every day. 

    ------------------------------
    Andrew Ekstrom

    Statistician, Chemist, HPC Abuser;-)



  • 12.  RE: logistic regression without a cross-sectional design

    Posted 06-15-2016 09:48
    But if you recommend random forests and the predictors are correlated, you have to be careful about interpreting the importance measures, which can be biased when predictors are correlated or are a mix of categorical and continuous. See Strobl et al. 2007.




    Trent D. Buskirk, Ph.D.
    Marketing Systems Group
    314-695-1378
    *******************************
    sent from a Galaxy Note 4 
    connected to my Atari 64



  • 13.  RE: logistic regression without a cross-sectional design

    Posted 06-15-2016 22:29

    As described, the design appears to be that of a case-control study where there are multiple controls for each case.  It is unlikely with such a design that prediction of the event is the ultimate aim of the study. More likely the aim was to assess whether certain variables impact the odds of disease. The use of logistic regression in such cases would be standard practice.

     







  • 14.  RE: logistic regression without a cross-sectional design

    Posted 06-15-2016 09:39

    This is almost certainly a simple issue of over-fitting due to model selection that the final inference ignores.  The conditions are ripe:  your "positive" population is small, and more importantly model selection occurred.  Possibly a good deal of it.  I note your sentence:  "The purpose of the study is to model the odds of the outcome using a variety of factors."  So they went fishing.  And with a large data set with many factors, they don't have to fish much to get horribly biased results.

    See Ambroise & McLachlan, 2002, "Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data."  Proceedings of the National Academy of Sciences, available for free download from pnas.org.  The problem is different but the principles are the same.

    The final inference needs to encompass model selection.  As a rule of thumb for article reviews, I would suggest that any article reporting a statistical result in which modeling considered five or more variables should specifically describe how model-selection is addressed in inference, and make a convincing case that their methods prevent "selection bias", else they don't get published.  I'm so glad you wrote!  You're doing a wonderful job as reviewer!

    There are multiple ways to make inference that includes model selection.  One possibility is to avoid model selection altogether:  use all variables, and (possibly) apply regularization methods (e.g., the Firth adjustment mentioned earlier) to make the problem computationally tractable.  Another possibility is to apply resampling methods (bootstrapping or cross-validation) to a fitting process that includes variable selection.  This is complicated because one doesn't get the same set of variables for every iteration...but that's exactly the point.  Still another possibility is to use Bayesian Model Averaging, such as with R's BMA package; there are now additional packages, and I couldn't comment on which is best.  Anyway, it's the authors who need to figure out the solution.
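
    A toy illustration of why the selection step has to sit inside whatever validation is done: the Python sketch below generates pure-noise predictors, picks the one with the largest apparent case/control difference on one half of the subjects, and then re-estimates that same difference on the other half. This is a simple hold-out comparison, not the bootstrap or cross-validation procedure described above, and all sizes and names are arbitrary choices for the example.

```python
import random


def mean_diff(values, labels):
    """Mean of `values` among label 1 minus mean among label 0."""
    ones = [v for v, y in zip(values, labels) if y == 1]
    zeros = [v for v, y in zip(values, labels) if y == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


def subset(values, idx):
    return [values[i] for i in idx]


rng = random.Random(0)
n_subjects, n_features = 200, 500
labels = [1] * 100 + [0] * 100

# Pure-noise predictors: none is truly related to the outcome.
features = [[rng.gauss(0, 1) for _ in range(n_subjects)]
            for _ in range(n_features)]

# Fixed split with 50 cases and 50 controls in each half.
select_idx = list(range(0, 50)) + list(range(100, 150))
holdout_idx = list(range(50, 100)) + list(range(150, 200))
y_select = subset(labels, select_idx)
y_holdout = subset(labels, holdout_idx)

# "Fishing": keep the feature with the largest apparent effect.
apparent = [abs(mean_diff(subset(f, select_idx), y_select)) for f in features]
best = max(range(n_features), key=lambda j: apparent[j])

naive_effect = apparent[best]
honest_effect = abs(mean_diff(subset(features[best], holdout_idx), y_holdout))
print(best, naive_effect, honest_effect)
```

    The effect measured on the selection half is the maximum over 500 noise variables, so it will typically look impressive, while the held-out estimate for the same variable is just one more draw from noise. Reporting the first number as if no selection had happened is the selection bias being described.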

    When I was in graduate school, in the days before data mining got big, we would discuss selecting a model and then interpreting the model.  There was always an acknowledgement that p-values and confidence intervals were biased due to model selection, but we sort of waved at the issue as we passed by, and comforted ourselves that with a modest degree of prior analysis, the effect would not be severe.  Fast forward many years, and people are routinely analyzing massive datasets with tens, hundreds, or thousands of variables, and selection bias can be enormous.  I wouldn't be entirely surprised if the true OR your authors find for their factor, once they control properly for model selection, is 1.0, i.e., completely null.  I can tell you I learned of this issue the hard way, by having a problem where I thought the data supported almost complete separation when in fact there was nothing reproducible there.

    The classification and machine learning world has been badly burned by this (just as I was), and the community is now largely sensitized to the issue (I hope!).  The biostatistics community is not as broadly sensitized, presumably because it's less common there to see a large number of predictor variables.  But I think this example is exactly one of those occasions.

    If anyone wants to discuss details, I would suggest spawning a new thread, because that discussion could utterly swamp this one.

    Good luck!

    ------------------------------
    Jim Garrett, PhD
    Sr. Assoc. Dir. of Biostatistics
    Novartis



  • 15.  RE: logistic regression without a cross-sectional design

    Posted 06-15-2016 12:16

    Nancy,

    Here are my thoughts to your questions.


    1):  A case-control design provides unbiased and efficient estimation of the odds ratio in a binary logistic regression model. In particular, the slope term of the model represents the log odds ratio and the intercept term corresponds to the disease prevalence. Because the data artificially contain a higher proportion of cases, theoretical work shows that the intercept term will be biased, and statistical software such as SAS allows manual adjustment of the intercept term using the "offset" option.  This will be useful if one wishes to compute the prevalence at a given set of predictor values.

    However, if operating outside the binary logistic model, the theoretical foundation for a case-control design collapses. For example, the common probit or cumulative-logit model cannot be used on case-control data. And I am deeply concerned about the validity of random forests in such a design.

    2): Two models could be compared  based on chi-square statistics (with proper DOF) as well as other model criteria. The value of a parameter estimate should not be used to support a model.
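
    As a sketch of what such a comparison looks like, the Python below computes AIC for two fitted models and a likelihood-ratio chi-square test for the case where the models are nested and differ by exactly one parameter. The log-likelihood values and parameter counts are placeholders, not numbers from the manuscript, and the one-degree-of-freedom restriction is a simplification so that the p-value can be computed with the standard library.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion; smaller is better."""
    return 2 * n_params - 2 * log_lik


def lr_test_1df(log_lik_small, log_lik_big):
    """Likelihood-ratio statistic and p-value for nested models, 1 df."""
    stat = 2.0 * (log_lik_big - log_lik_small)
    if stat < 0:
        raise ValueError("the larger model should not have a lower log-likelihood")
    # Upper tail of a chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Placeholder fits: model A with 5 parameters, model B adds one more.
log_lik_a, k_a = -9800.0, 5
log_lik_b, k_b = -9790.0, 6

aic_a, aic_b = aic(log_lik_a, k_a), aic(log_lik_b, k_b)
lr_stat, lr_p = lr_test_1df(log_lik_a, log_lik_b)
print(aic_a, aic_b, lr_stat, lr_p)
```

    If the two models in the manuscript are not nested, the likelihood-ratio test does not apply, but AIC (or a validated C-statistic) can still be compared. Either way the comparison rests on fit, not on which model reports the larger odds ratio.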

    Qing Kang
    The Statistical Intelligence Group

