weighting in logistic regression

1. weighting in logistic regression

Herbert Weisberg
Posted 10-24-2011 14:31
I would like to run a logistic regression model, but to apply results to a somewhat different sample. Essentially, I want to re-weight the data so the fitted model will be theoretically applicable to a population with a different proportion of females from the analytic sample. My main concern is the model itself, not the sampling variation. That is, the resulting model should be unbiased, although the reported coefficient standard errors might be misleading. Will applying the WEIGHT option in SAS accomplish this? If so, how does it actually do the weighting within the procedure? (I mean mathematically, not how to program). If not, is there a better approach?

-------------------------------------------
Herbert Weisberg
President
Correlation Research, Inc.

-------------------------------------------
2. RE:weighting in logistic regression

Stephen Simon
Posted 10-24-2011 17:24
If this is individual patient data, get a predicted value for males and a predicted value for females (each at the average value for the remaining covariates), and then calculate a weighted average by hand. For example, if there are 9 males for every female in your new sample, take 0.9 times the male predicted value plus 0.1 times the female predicted value. If there are interactions involving sex, it gets a bit trickier, but you shouldn't have to use the WEIGHT option in SAS for any of this.
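A minimal Python sketch of that hand calculation may help. The coefficient values and the names `intercept`, `b_female`, and `other_terms` are hypothetical placeholders, not output from any fitted model, and the sketch assumes a model with no interactions involving sex.

```python
import math

def inv_logit(z):
    """Convert a log odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def reweighted_prediction(intercept, b_female, other_terms, share_female):
    """Weighted average of the male and female predicted probabilities.

    other_terms is the sum of coefficient * average covariate value over the
    remaining covariates (an assumption of this sketch: no sex interactions).
    """
    p_male = inv_logit(intercept + other_terms)
    p_female = inv_logit(intercept + b_female + other_terms)
    return (1.0 - share_female) * p_male + share_female * p_female

# Hypothetical coefficients; 9 males for every female means 10% female.
prediction = reweighted_prediction(intercept=-1.0, b_female=0.4,
                                   other_terms=0.25, share_female=0.1)
print(round(prediction, 4))
```

Changing `share_female` is all that is needed to apply the same fitted model to a different sex ratio.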

If you want the slope coefficients for the remaining covariates after you adjust the sex ratio, those should stay the same unless there is an interaction present. So a log odds ratio of 0.5 for an exposure variable means a log odds ratio of 0.5 for the exposure variable among males and a log odds ratio of 0.5 for the exposure variable among females, unless there is an interaction. So there is no need to weight anything here.

-------------------------------------------
Stephen Simon
Independent Statistical Consultant
P. Mean Consulting
-------------------------------------------
3. RE:weighting in logistic regression

Jonathan Shuster
Posted 10-25-2011 08:57
Sometimes we spike the sample, as in case-control studies. If we have an estimate of prevalence, we can use Bayes' theorem to estimate the probability of disease given the value of the covariate. The estimate can take into account the errors in estimating the prevalence rate and the coefficients in the logistic model. The output is the probability of disease given the values of the covariates, via point and interval estimates.
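As a rough illustration of the point-estimate part only, the Bayes' theorem adjustment can be sketched in Python as a rescaling of the odds. This assumes cases and controls were each sampled at random within their group; the function and argument names are hypothetical, and the interval estimates mentioned above are not covered.

```python
def prevalence_adjusted_probability(p_sample, sample_case_fraction, prevalence):
    """Rescale a predicted probability from a spiked sample to a population.

    p_sample: probability of disease predicted by the model fitted to the sample.
    sample_case_fraction: proportion of cases in the analytic sample.
    prevalence: assumed proportion of cases in the target population.
    """
    sample_odds = p_sample / (1.0 - p_sample)
    # Ratio of the population prior odds to the sample prior odds.
    correction = (prevalence / (1.0 - prevalence)) / (
        sample_case_fraction / (1.0 - sample_case_fraction))
    adjusted_odds = sample_odds * correction
    return adjusted_odds / (1.0 + adjusted_odds)

# A 1:1 case-control sample applied to a population with 10% prevalence.
print(prevalence_adjusted_probability(0.5, 0.5, 0.1))
```

On the log odds scale this amounts to shifting the intercept by a constant while leaving the slope coefficients alone.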

-------------------------------------------
Jon Shuster
University of Florida
-------------------------------------------

4. RE:weighting in logistic regression

Emil Friedman
Posted 10-25-2011 09:57
Weighting the regression does not sound legitimate here. My understanding of weighting in OLS is that one uses it when one has reason to believe some of the data are more precise than other data. I would presume that the same logic applies to logistic regression.

Steve Simon's approach sounds more reasonable but it might be better if you describe the model more explicitly.

-------------------------------------------
Emil M Friedman, PhD
Principal Scientist (Statistician)
MannKind Corporation
Danbury, CT 06810
Disclaimer: The views expressed are mine alone and do not necessarily reflect the views of my employer.
-------------------------------------------
5. RE:weighting in logistic regression

Stanislav Kolenikov
Posted 10-26-2011 11:27
Weighted estimation (in SAS or in other packages) is performed by summing up the log-likelihood contributions (for logistic regression, log{P(y_i=1|X_i)^y_i * [1-P(y_i=1|X_i)]^(1-y_i)}) with user-supplied weights, rather than with the implicit weights of 1 appropriate for i.i.d. data. The different kinds of weights are frequency weights (when you decided to save on computer memory and collapse identical observations together; I don't think it happens any more in applied work unless you have a handful of categorical variables where this could be relevant), analytic weights (inversely proportional to the measurement error variance, as suggested by Emil Friedman), and probability weights (inverse probabilities of selection, correcting for mismatch between the unweighted sample and the population of interest); see http://www.ats.ucla.edu/stat/Stata/faq/weights.htm (which by and large repeats the Stata help file, http://www.stata.com/help.cgi?weight). SAS only interprets weights as frequency weights, except in the PROC SURVEY* procedures, where it interprets them as probability weights. (In Stata, you can specify whatever interpretation you like with most commands; the help for each command indicates what kind of weights it supports. You need a different calculation of the standard errors with probability weights.)
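To make the mathematics concrete, here is a minimal Python sketch of a weighted logistic log-likelihood. It is an illustration of the formula above, not the code any package actually runs, and the function name and toy data are hypothetical.

```python
import math

def weighted_loglik(beta, rows, y, w):
    """Sum of w_i * log{p_i^y_i * (1 - p_i)^(1 - y_i)} over observations."""
    total = 0.0
    for x_i, y_i, w_i in zip(rows, y, w):
        eta = sum(b * v for b, v in zip(beta, x_i))   # linear predictor
        p = 1.0 / (1.0 + math.exp(-eta))              # P(y_i = 1 | X_i)
        total += w_i * (y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p))
    return total

beta = [-0.5, 1.0]
rows = [[1, 0], [1, 1]]        # first column is the intercept
y = [1, 0]

# A weight of 2 on the first observation gives the same log-likelihood
# as listing that observation twice with weight 1.
weighted = weighted_loglik(beta, rows, y, [2, 1])
duplicated = weighted_loglik(beta, [[1, 0], [1, 0], [1, 1]], [1, 1, 0], [1, 1, 1])
print(abs(weighted - duplicated) < 1e-12)
```

The point estimates maximize this weighted sum whatever the weights mean; the interpretation of the weights only matters for how the standard errors should be computed.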

Weighting affects a few things in the results, as has been demonstrated in many expository articles in survey statistics (see, e.g., Korn & Graubard (1995) articles in JRSS-A (http://www.jstor.org/stable/2983292) and TAS (http://www.jstor.org/stable/2684203), or more technical Pfeffermann (ISR 1993, http://www.stat.iastate.edu/seminars/abstracts/seminars2006-2007/VIGRE_Survey_16Apr07.pdf; Stat Methods in Medicine, http://smm.sagepub.com/content/5/3/239.short)). Of course the point estimates themselves change. Also, the estimates become less efficient compared with the i.i.d. case. Korn & Graubard tend to make a fuss of it; many survey statisticians are prepared to pay this price in exchange for other features of a survey, such as lower costs or gains in important aspects of stratification.

To get an intuition for how your point estimates might (or might not) change, consider a bivariate linear regression model first. If the weights are only associated with the x variable, then you are essentially adding more of the existing points in your design, shifting the weight to the left or to the right. The regression line itself cannot change, though. You can in fact gain a little bit of efficiency if you increase the weights of the extreme observations (http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?catno=12-001-X200800110615&lang=eng): as you might recall from your design of experiments class, the optimal design for the linear regression is to put half of the mass at the leftmost point of the range of x, and the other half at the rightmost point. You can move in that direction by reweighting the x's.

However, more interesting things begin to happen when you have weights associated with the y's. If, say, larger values of y are associated with larger weights, then the regression line will be pulled towards these larger observations, so the slope will increase in magnitude, and the intercept will adjust accordingly, so that the regression line passes through the weighted mean of the data.
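A small Python sketch can illustrate the contrast between the two kinds of weights. The toy data are made up so that the mean of y at each x lies exactly on the line y = x, and both weighting schemes are arbitrary choices for illustration.

```python
def wls_line(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope   # the line passes through the weighted means

# Two observations at each x, placed symmetrically around the line y = x.
x = [0, 0, 1, 1, 2, 2, 3, 3]
y = [-1, 1, 0, 2, 1, 3, 2, 4]

unweighted = wls_line(x, y, [1.0] * len(x))
# Weights that depend only on x: the fitted line stays where it was.
weighted_on_x = wls_line(x, y, [1.0 + 2.0 * xi for xi in x])
# Weights that grow with y: both the intercept and the slope move.
weighted_on_y = wls_line(x, y, [yi + 2.0 for yi in y])

print(unweighted, weighted_on_x, weighted_on_y)
```

In this toy example the x-only weights reproduce the unweighted line exactly because the conditional means are exactly linear; with noisy data the agreement would hold only approximately. How far and in which direction the line moves under y-related weights depends on the particular weights chosen.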

Things won't be terribly different with logistic regression when you reweight for gender. Since the weights will only be related to the x variable rather than the outcome, I would expect some efficiency changes, but I would not expect huge swings in the coefficients themselves, unless your original model suffers from omitted variables that are correlated with both gender and the outcome. So you might want to run PROC SURVEYLOGISTIC with the new weights and see if the coefficients are, say, within a standard error of one another (taking the largest standard error from the weighted and unweighted runs).
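As a rough check of this expectation outside SAS, here is a minimal Python sketch that fits a logistic model with an intercept and a female indicator by Newton's method, with and without gender-based weights. The counts and the target share of females are hypothetical, and the fitting routine is a bare-bones illustration rather than a substitute for PROC SURVEYLOGISTIC (it produces no standard errors).

```python
import math

def fit_weighted_logistic(x, y, w, iters=25):
    """Maximize the weighted log-likelihood for logit(p) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)            # weighted score contribution
            v = wi * p * (1.0 - p)       # weighted information contribution
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score vector.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical sample: 60 males (20 events) and 40 females (10 events).
x = [0] * 60 + [1] * 40
y = [1] * 20 + [0] * 40 + [1] * 10 + [0] * 30

unweighted = fit_weighted_logistic(x, y, [1.0] * len(x))
# Reweight to a hypothetical population that is 50% female:
# weight = target share / sample share within each gender.
w = [0.5 / 0.6 if xi == 0 else 0.5 / 0.4 for xi in x]
reweighted = fit_weighted_logistic(x, y, w)

print(unweighted, reweighted)
```

With gender as the only covariate the two fits agree exactly, because the weights are constant within each gender and the model reproduces each group's observed proportion. With additional covariates the agreement would be approximate, which is what the comparison against a standard error is meant to assess.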

-------------------------------------------
Stanislav Kolenikov
University of Missouri
-------------------------------------------
6. RE:weighting in logistic regression

William Huber
Posted 10-26-2011 13:13
> I want to re-weight the data so the fitted model will be theoretically applicable to a population with a different proportion of females from the analytic sample. ... the resulting model should be unbiased, although the reported coefficient standard errors might be misleading. Will applying the WEIGHT option in SAS accomplish this? If so, how does it actually do the weighting within the procedure? (I mean mathematically, not how to program). If not, is there a better approach?

I hope I have not misunderstood, because this seems like it has such a simple, natural solution. Isn't this question asking about how to make a prediction for a future population based on a model estimated from a sample? Why, then, should the procedure be any different from any other regression-based prediction? In particular, why should any special weighting be necessary? Just include gender as a variable in the model and use that for the prediction.

To understand what this does, consider the simplest case where the original logistic regression model (without gender) merely fits a constant beta to a binary response. With k 1's and n-k 0's in the dataset, the likelihood is maximized for beta = log(k) - log(n-k), corresponding to an estimated probability of k/n. Applying this to a population of N people, we would estimate that N(k/n) 1's occur in the population.

Including gender in this simple example is tantamount to dividing the data into k_m 1's for n_m males and k_f = k-k_m 1's for n_f = n-n_m females. The likelihood is maximized when the male probability equals k_m/n_m and the female probability equals k_f/n_f. To apply this to a population with f females and m males, we would estimate the number of 1's to be f(k_f/n_f) + m(k_m/n_m). This is straightforward, easy to interpret, and flexible (because it can be applied to any future population without any refitting of the data). Standard errors of prediction are just as easily propagated, especially if f and m are known and not estimated with any error.
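The arithmetic in this simple example can be sketched in a few lines of Python. The counts used here are hypothetical and the function names are only for illustration.

```python
import math

def intercept_only_beta(k, n):
    """Maximum likelihood estimate of the constant: log(k) - log(n - k)."""
    return math.log(k) - math.log(n - k)

def expected_ones(k_m, n_m, k_f, n_f, m, f):
    """Predicted number of 1's in a population with m males and f females."""
    return m * (k_m / n_m) + f * (k_f / n_f)

# Hypothetical sample: 20 ones among 60 males, 10 ones among 40 females.
beta = intercept_only_beta(k=30, n=100)
p_hat = 1.0 / (1.0 + math.exp(-beta))   # back-transforms to k/n

# Apply the gender-specific rates to a population of 500 males and 500 females.
print(p_hat, expected_ones(20, 60, 10, 40, m=500, f=500))
```

Because the sample rates enter only through k_m/n_m and k_f/n_f, the same two numbers serve any future mix of males and females without refitting.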

In the more complex case with additional explanatory variables, some assumptions must be made about their distributions within the future population. Nevertheless, the concept and method still work: apply the prediction from the fitted model to the future population. Including gender as one of the explanatory variables automatically performs the desired "weighting." There does not appear to be any need to weight the fitting procedure beforehand.

Best,

Bill Huber
Quantitative Decisions
-------------------------------------------
7. RE:weighting in logistic regression

Herbert Weisberg
Posted 10-27-2011 19:00
Thanks for all the responses, especially Stanislav's very detailed and informative one. I now realize that my question was ambiguous and this has given me a better handle on how to approach the problem.

-------------------------------------------
Herbert Weisberg
President
Correlation Research, Inc.
-------------------------------------------
