Here are my two cents:
As mentioned in previous answers, you cannot run logistic regression directly on proportions; the response variable (let's call it "Answer") must be encoded as a binary variable (say 0 = No, 1 = Yes). Disregarding interdependence of questions for now, I would set up the dataset in long format with 300 × 500 = 150,000 rows, each row representing one subject's answer to one question. The columns would be: Answer, Subject_ID, Question_ID, Age, Sex, Ethnicity, Education. The logistic regression command:
`glm(Answer ~ Age + Sex + Ethnicity + Education, data = data_long, family = "binomial")`
Logit is the default link for the binomial family, so it does not need to be specified. In modified versions of the model, you can add Question_ID and/or Subject_ID as predictors and check whether they show confounding effects.
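To make the setup concrete, here is a minimal sketch with simulated data standing in for the real dataset (all column names such as Q1...Q500, Age, Sex, Ethnicity, Education are placeholders for whatever your data actually uses):

```r
library(tidyr)

# Simulated stand-in for the real data: 300 subjects x 500 questions.
set.seed(1)
n_subj <- 300; n_q <- 500
data_wide <- data.frame(
  Subject_ID = factor(seq_len(n_subj)),
  Age        = sample(18:80, n_subj, replace = TRUE),
  Sex        = factor(sample(c("F", "M"), n_subj, replace = TRUE)),
  Ethnicity  = factor(sample(LETTERS[1:4], n_subj, replace = TRUE)),
  Education  = factor(sample(c("HS", "BA", "Grad"), n_subj, replace = TRUE))
)
answers <- matrix(rbinom(n_subj * n_q, 1, 0.4), nrow = n_subj,
                  dimnames = list(NULL, paste0("Q", seq_len(n_q))))
data_wide <- cbind(data_wide, answers)

# Reshape to one row per subject-question pair: 300 x 500 = 150,000 rows.
data_long <- pivot_longer(data_wide, cols = starts_with("Q"),
                          names_to = "Question_ID", values_to = "Answer")

fit <- glm(Answer ~ Age + Sex + Ethnicity + Education,
           data = data_long, family = binomial)
summary(fit)
```

The same `data_long` object is what the per-question analyses below would operate on.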
If you are concerned about interdependence of questions (similar questions in the questionnaire with potentially correlated answers), I'd suggest doing a bit of exploratory analysis first. You need to reshape the dataset to wide format (`tidyr::pivot_wider(data, names_from = Question_ID, values_from = Answer)` should do it; note that the older `spread()` function lives in tidyr, not dplyr). This will give you one variable per question. Now you can do several things: for example, hierarchical clustering via the `heatmap()` function in R, with questions in rows and subjects in columns. Label the columns by the metadata variables (age, sex, ethnicity, education) and explore the patterns, looking for blocks of correlated questions. You can also create the heatmap from the 500 × 500 question correlation matrix.
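The correlation-heatmap idea can be sketched like this (simulated 0/1 answers, shrunk to 50 questions so the plot stays readable; sizes and names are placeholders):

```r
# Simulated wide matrix of binary answers: subjects in rows, questions in columns.
set.seed(1)
ans <- matrix(rbinom(300 * 50, 1, 0.5), nrow = 300,
              dimnames = list(NULL, paste0("Q", 1:50)))

# Question-by-question correlation matrix; blocks of high correlation
# would suggest groups of interdependent questions.
qcor <- cor(ans)

# heatmap() applies hierarchical clustering to rows and columns by default,
# so correlated questions end up adjacent.
heatmap(qcor, symm = TRUE, main = "Question correlation")
```

With real data you would expect visible blocks along the diagonal if some questions are near-duplicates of each other.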
For hypothesis testing to check the effect of metadata variables on probability of answering Yes or No, you have a few options:
1. Use the long format and the `by()` function in R (or `subset()`) to run the logistic regression separately for each question. Then examine the correlations among the regression coefficients and the log p-values.
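A minimal sketch of the per-question approach, using simulated long-format data (shrunk to 10 questions and 100 subjects; all names are placeholders):

```r
# Simulated long-format data: one row per subject-question pair.
set.seed(1)
data_long <- expand.grid(Subject_ID = 1:100, Question_ID = paste0("Q", 1:10))
data_long$Age    <- rep(sample(18:80, 100, replace = TRUE), times = 10)
data_long$Answer <- rbinom(nrow(data_long), 1, 0.4)

# One logistic regression per question via by().
fits <- by(data_long, data_long$Question_ID, function(d)
  glm(Answer ~ Age, data = d, family = binomial))

# Collect the coefficients into a matrix, questions in rows,
# ready for correlation analysis across questions.
coefs <- t(sapply(fits, coef))
head(coefs)
```

With the full model you would also extract p-values (e.g. from `summary(fit)$coefficients`) and compare them across questions.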
2. Use the wide format and run a PCA on the 500 answer variables (each variable holding the answers to one question). Keep the first few PCs that retain ~80% of the variation. The PCs are orthogonal, so collinearity is not a concern. Check the distributions of the retained PCs; if they are unimodal and reasonably bell-shaped, use MANOVA with these PCs as the response variables and the metadata variables as predictors.
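Option 2 can be sketched as follows (simulated data, 50 questions instead of 500; the 80% cutoff and all names are placeholders, not a fixed rule):

```r
# Simulated wide-format answers and two metadata variables.
set.seed(1)
n <- 300; q <- 50
ans <- matrix(rbinom(n * q, 1, 0.5), nrow = n)
age <- sample(18:80, n, replace = TRUE)
sex <- factor(sample(c("F", "M"), n, replace = TRUE))

# PCA on the answer matrix.
pca <- prcomp(ans, center = TRUE, scale. = TRUE)

# Smallest number of PCs whose cumulative variance reaches ~80%.
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.8)[1]

# MANOVA with the retained PC scores as a multivariate response.
scores <- pca$x[, 1:k]
fit <- manova(scores ~ age + sex)
summary(fit)  # Pillai's trace test per predictor, by default
```

Checking the PC score distributions first (e.g. with `hist()` per column) is worthwhile, since MANOVA's tests lean on approximate multivariate normality.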
Hope this helps!
Hossein
------------------------------
Hosseinali Asgharian
Postdoctoral Scholar
Genentech Hall, Room S312D
University of California San Francisco
600 16th St, San Francisco, CA 94158
------------------------------
Original Message:
Sent: 10-06-2021 18:10
From: Harmon Jordan
Subject: logistic regression
I have 300 individuals who rated binary responses (1 or 0) on the same set of 500 questions. I want to fit a logistic regression model to explore the association between the odds of rating (1 or 0) and demographic characteristics (gender, age group, race, and education level).
Currently, I am fitting glm() in R:
`glm(proportion ~ Gender + Age + Race + Education, data = data, weights = rep(500, 300), family = binomial(link = "logit"))`
Here, proportion = the proportion of questions where each person marked 1. For example, if an individual marked 1 for 200 questions and 0 for the remaining 300 questions, the proportion would be 0.4
For logistic regression, R's glm() usually takes a binary variable (1 or 0) as the response. However, glm() also accepts a proportion as the response if I specify the 'weights' argument as the number of total trials, which is 500 (questions) for everyone.
Is this the right approach? I'm concerned that there might be an issue of non-independence of observations, given that everyone answers the same set of 500 questions. The individuals are independent of each other, though.
Should I fit mixed models or anything else?
Thanks!
Harmon Jordan, ScD
Health Research Consultant
------------------------------
Harmon Jordan
------------------------------