Here are my two cents:
As mentioned in previous answers, you cannot run logistic regression directly on proportions; the response variable (let's call it "Answer") must be encoded as a binary variable (say 0 = No, 1 = Yes). Disregarding interdependence of questions for now, I would set up the dataset in long format with 300 × 500 = 150,000 rows, each row representing one subject's answer to one question. The columns would be: Answer, Subject_ID, Question_ID, Age, Sex, Ethnicity, Education. The logistic regression command:
`glm(Answer ~ Age + Sex + Ethnicity + Education, data = data_long, family = "binomial")`
Logit is the default link for the binomial family, so it does not need to be specified. In modified versions of the model, you can add Question_ID and/or Subject_ID as predictors and check whether they show confounding effects.
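To make the setup concrete, here is a minimal sketch with simulated data standing in for the real dataset (all column names such as Q1...Q500, Age, Sex, Ethnicity, Education are placeholders for whatever your data actually uses):

```r
library(tidyr)

# Simulated stand-in for the real data: 300 subjects x 500 questions.
set.seed(1)
n_subj <- 300; n_q <- 500
data_wide <- data.frame(
  Subject_ID = factor(seq_len(n_subj)),
  Age        = sample(18:80, n_subj, replace = TRUE),
  Sex        = factor(sample(c("F", "M"), n_subj, replace = TRUE)),
  Ethnicity  = factor(sample(LETTERS[1:4], n_subj, replace = TRUE)),
  Education  = factor(sample(c("HS", "BA", "Grad"), n_subj, replace = TRUE))
)
answers <- matrix(rbinom(n_subj * n_q, 1, 0.4), nrow = n_subj,
                  dimnames = list(NULL, paste0("Q", seq_len(n_q))))
data_wide <- cbind(data_wide, answers)

# Reshape to one row per subject-question pair: 300 x 500 = 150,000 rows.
data_long <- pivot_longer(data_wide, cols = starts_with("Q"),
                          names_to = "Question_ID", values_to = "Answer")

fit <- glm(Answer ~ Age + Sex + Ethnicity + Education,
           data = data_long, family = binomial)
summary(fit)
```

The same `data_long` object is what the per-question analyses below would operate on.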
If you are concerned about interdependence of questions (similar questions in the questionnaire with potentially correlated answers), I'd suggest doing a bit of exploratory analysis first. You need to reshape the dataset to wide format (`tidyr::pivot_wider(data, names_from = Question_ID, values_from = Answer)` should do it; note that the older `spread()` function lives in tidyr, not dplyr). This will give you one variable per question. Now you can do several things: for example, hierarchical clustering via the `heatmap()` function in R, with questions in rows and subjects in columns. Label the columns by the metadata variables (age, sex, ethnicity, education) and explore the patterns, looking for blocks of correlated questions. You can also create the heatmap from the 500 × 500 question correlation matrix.
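The correlation-heatmap idea can be sketched like this (simulated 0/1 answers, shrunk to 50 questions so the plot stays readable; sizes and names are placeholders):

```r
# Simulated wide matrix of binary answers: subjects in rows, questions in columns.
set.seed(1)
ans <- matrix(rbinom(300 * 50, 1, 0.5), nrow = 300,
              dimnames = list(NULL, paste0("Q", 1:50)))

# Question-by-question correlation matrix; blocks of high correlation
# would suggest groups of interdependent questions.
qcor <- cor(ans)

# heatmap() applies hierarchical clustering to rows and columns by default,
# so correlated questions end up adjacent.
heatmap(qcor, symm = TRUE, main = "Question correlation")
```

With real data you would expect visible blocks along the diagonal if some questions are near-duplicates of each other.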
For hypothesis testing to check the effect of metadata variables on probability of answering Yes or No, you have a few options:
1. Use the long format and the `by()` function in R (or `subset()`) to run the logistic regression separately for each question. Then examine the correlations among the regression coefficients and the log p-values.
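A minimal sketch of the per-question approach, using simulated long-format data (shrunk to 10 questions and 100 subjects; all names are placeholders):

```r
# Simulated long-format data: one row per subject-question pair.
set.seed(1)
data_long <- expand.grid(Subject_ID = 1:100, Question_ID = paste0("Q", 1:10))
data_long$Age    <- rep(sample(18:80, 100, replace = TRUE), times = 10)
data_long$Answer <- rbinom(nrow(data_long), 1, 0.4)

# One logistic regression per question via by().
fits <- by(data_long, data_long$Question_ID, function(d)
  glm(Answer ~ Age, data = d, family = binomial))

# Collect the coefficients into a matrix, questions in rows,
# ready for correlation analysis across questions.
coefs <- t(sapply(fits, coef))
head(coefs)
```

With the full model you would also extract p-values (e.g. from `summary(fit)$coefficients`) and compare them across questions.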
2. Use the wide format and run a PCA on the 500 answer variables (each variable holding the answers to one question). Keep the first few PCs that retain ~80% of the variation. The PCs are orthogonal, so collinearity is not a concern. Check the distributions of the retained PCs; if they are unimodal and reasonably bell-shaped, use MANOVA with these PCs as the response variables and the metadata variables as predictors.
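Option 2 can be sketched as follows (simulated data, 50 questions instead of 500; the 80% cutoff and all names are placeholders, not a fixed rule):

```r
# Simulated wide-format answers and two metadata variables.
set.seed(1)
n <- 300; q <- 50
ans <- matrix(rbinom(n * q, 1, 0.5), nrow = n)
age <- sample(18:80, n, replace = TRUE)
sex <- factor(sample(c("F", "M"), n, replace = TRUE))

# PCA on the answer matrix.
pca <- prcomp(ans, center = TRUE, scale. = TRUE)

# Smallest number of PCs whose cumulative variance reaches ~80%.
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.8)[1]

# MANOVA with the retained PC scores as a multivariate response.
scores <- pca$x[, 1:k]
fit <- manova(scores ~ age + sex)
summary(fit)  # Pillai's trace test per predictor, by default
```

Checking the PC score distributions first (e.g. with `hist()` per column) is worthwhile, since MANOVA's tests lean on approximate multivariate normality.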
Hope this helps!
Hossein
------------------------------
Hosseinali Asgharian
Postdoctoral Scholar
Genentech Hall, Room S312D
University of California San Francisco
600 16th St, San Francisco, CA 94158
------------------------------
Original Message:
Sent: 10-06-2021 18:10
From: Harmon Jordan
Subject: logistic regression
I have 300 individuals who rated binary responses (1 or 0) on the same set of 500 questions. I want to fit a logistic regression model to explore the association between the odds of rating (1 or 0) and demographic characteristics (gender, age group, race, and education level).
Currently, I am fitting glm() in R:
`glm(proportion ~ Gender + Age + Race + Education, data = data, weights = rep(500, 300), family = binomial(link = "logit"))`
Here, proportion = the proportion of questions where each person marked 1. For example, if an individual marked 1 for 200 questions and 0 for the remaining 300 questions, the proportion would be 0.4
For logistic regression, R's glm() usually takes a binary variable (1 or 0) as the response. However, glm() also accepts a proportion as the response if I specify the 'weights' argument as the number of total trials, which is 500 (questions) for everyone.
Is this the right approach? I'm concerned that there might be an issue of non-independence of observations, given that everyone answers the same set of 500 questions. The individuals are independent of each other, though.
Should I fit mixed models or anything else?
Thanks!
Harmon Jordan, ScD
Health Research Consultant
------------------------------
Harmon Jordan
------------------------------