ASA Connect

  • 1.  logistic regression

    Posted 10-06-2021 18:11

    I have 300 individuals who rated binary responses (1 or 0) on the same set of 500 questionnaires. I want to fit a logistic regression model to explore the association between the odds of rating (1 or 0) and demographic characteristics (gender, age group, race, and education level).

    Currently, I am fitting glm() in R

    glm(proportion ~ Gender + Age + Race + Education, data = data, 
        weights=rep(500, 300), family=binomial(link = "logit"))
    

    Here, proportion = the proportion of questions where each person marked 1. For example, if an individual marked 1 for 200 questions and 0 for the remaining 300 questions, the proportion would be 0.4

    For logistic regression, R's glm() usually takes a binary variable (1 or 0) as a response variable. However, glm() also takes a proportion as a response variable if I specify the 'weights' argument, which is the number of total trials, which is 500 (questionnaires) for everyone.
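
    As a minimal sketch of the two equivalent ways glm() accepts grouped binomial data, using simulated stand-in data (the column `ones`, the factor levels, and the 0.4 probability are assumptions for illustration, not the real data set):

```r
# Simulated stand-in data (hypothetical): 300 people, 500 questions each.
set.seed(1)
n_people <- 300
n_items  <- 500
data <- data.frame(
  Gender = factor(sample(c("F", "M"), n_people, replace = TRUE)),
  Age    = factor(sample(c("18-34", "35-54", "55+"), n_people, replace = TRUE))
)
data$ones       <- rbinom(n_people, size = n_items, prob = 0.4)  # count of 1s per person
data$proportion <- data$ones / n_items

# Form 1: proportion as the response, number of trials as weights.
fit_prop <- glm(proportion ~ Gender + Age, data = data,
                weights = rep(n_items, n_people), family = binomial)

# Form 2: two-column response of (successes, failures).
fit_cbind <- glm(cbind(ones, n_items - ones) ~ Gender + Age, data = data,
                 family = binomial)

# Compare the coefficient estimates from the two forms.
all.equal(coef(fit_prop), coef(fit_cbind), tolerance = 1e-6)
```

    Both forms treat each person's 500 answers as independent trials with a common probability, which is exactly the assumption in question here.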

    Is this the right approach? I'm concerned that there might be an issue of non-independence of observations, given that everyone is given the same set of 500 questions. The individuals are independent among each other, though.

    Should I fit mixed models or anything else?

    Thanks!

    Harmon Jordan, ScD
    Health Research Consultant



    ------------------------------
    Harmon Jordan
    ------------------------------


  • 2.  RE: logistic regression

    Posted 10-07-2021 10:00
    Your approach is incorrect. 
    Logistic regression is appropriate for Bernoulli {0,1} random variables (RVs).  Your outcomes are not Bernoulli, but sums of Bernoulli RVs.  They are not even binomial because the sums are not independent, being correlated within persons.

    As I see it, there are two approaches.
    One is to use a normalizing transformation, such as the arcsine transformation, on your outcome proportions.
    Two is to use beta regression.  There is a beta regression package in R.
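
    A minimal sketch of both options on simulated data (the data frame `dat`, its factor levels, and the simulation settings are assumptions for illustration). The arcsine option uses base R only; the beta regression part assumes the `betareg` package is the one meant and runs only if it is installed:

```r
# Simulated stand-in data (hypothetical): one row per person.
set.seed(2)
n_people <- 300
n_items  <- 500
dat <- data.frame(
  Gender    = factor(sample(c("F", "M"), n_people, replace = TRUE)),
  Education = factor(sample(c("HS", "BA", "Grad"), n_people, replace = TRUE))
)
# Person-specific probability, then the observed proportion of 1s.
p_true <- plogis(rnorm(n_people, mean = -0.4, sd = 0.5))
dat$proportion <- rbinom(n_people, n_items, p_true) / n_items

# Option 1: arcsine square-root transformation, then ordinary least squares.
dat$asin_prop <- asin(sqrt(dat$proportion))
fit_asin <- lm(asin_prop ~ Gender + Education, data = dat)
print(summary(fit_asin))

# Option 2: beta regression via the 'betareg' package, if it is installed.
# betareg() needs responses strictly inside (0, 1), so exact 0s and 1s are dropped here.
if (requireNamespace("betareg", quietly = TRUE)) {
  inside   <- dat$proportion > 0 & dat$proportion < 1
  fit_beta <- betareg::betareg(proportion ~ Gender + Education,
                               data = dat[inside, ])
  print(summary(fit_beta))
}
```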

    Joe Lucke
    Retired,
    Formerly Senior Statistician, Research Institute on Addictions, SUNY Buffalo

    ------------------------------
    Joseph Lucke
    SUNY Research Professor Emeritus
    Retired
    ------------------------------



  • 3.  RE: logistic regression

    Posted 10-09-2021 16:30
    I imagine it depends a little on what these 500 questionnaires are, but if they are exchangeable enough to evaluate as a proportion then there is this: 

    HOW DOES ONE DO REGRESSION WHEN THE DEPENDENT VARIABLE IS A PROPORTION? https://stats.idre.ucla.edu/stata/faq/how-does-one-do-regression-when-the-dependent-variable-is-a-proportion/

    The analogy would be that each question is correlated within person much like children are correlated within school. Thus, violation of independence on each single answer, but not in the summary proportion. They also use logit/binomial.

    I would be curious about your use case in which 500 questions can be summarized as a proportion of 1/0 responses. Also: no interest in interactions among predictors?

    ------------------------------
    Andrew Brown
    Assistant Professor
    Indiana University Bloomington
    ------------------------------



  • 4.  RE: logistic regression

    Posted 10-09-2021 16:33
    One other hiccup that just entered my head after hitting "post": unlike the school analogy (different children across schools), the same questions are asked of all of your participants. I'm not sure of the ramifications of that...
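
    One common way to represent a design in which every participant answers the same questions is a mixed logistic model with crossed random intercepts for person and question. The following is only a sketch under stated assumptions: the data are simulated at a reduced size, the variable names are hypothetical, and the model is fitted only if the `lme4` package is installed.

```r
# Simulated stand-in data (hypothetical, reduced size so the fit is quick).
set.seed(3)
n_people <- 60
n_items  <- 40
long <- expand.grid(Subject_ID = seq_len(n_people),
                    Question_ID = seq_len(n_items))
gender      <- sample(c("F", "M"), n_people, replace = TRUE)
long$Gender <- factor(gender[long$Subject_ID])

# Person and question effects on the logit scale.
u_person <- rnorm(n_people, sd = 0.5)
u_item   <- rnorm(n_items, sd = 1)
eta <- -0.4 + 0.3 * (long$Gender == "M") +
  u_person[long$Subject_ID] + u_item[long$Question_ID]
long$Answer <- rbinom(nrow(long), 1, plogis(eta))

long$Subject_ID  <- factor(long$Subject_ID)
long$Question_ID <- factor(long$Question_ID)

# Crossed random intercepts: answers share a person effect and a question effect.
if (requireNamespace("lme4", quietly = TRUE)) {
  fit_mixed <- lme4::glmer(
    Answer ~ Gender + (1 | Subject_ID) + (1 | Question_ID),
    data = long, family = binomial
  )
  print(summary(fit_mixed))
}
```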

    ------------------------------
    Andrew Brown
    Assistant Professor
    Indiana University Bloomington
    ------------------------------



  • 5.  RE: logistic regression

    Posted 10-14-2021 03:08

    Here are my two cents:

    As mentioned in previous answers, you cannot run logistic regression on proportions; the response variable (let's call it "Answer") must be encoded as a binary variable (let's say 0 = No, 1 = Yes). Disregarding interdependence of questions for now, I would set up the dataset in long format with 300 x 500 = 150,000 rows, each row representing one subject's answer to one question. Columns would be: Answer, Subject_ID, Question_ID, Age, Sex, Ethnicity, Education. The logistic regression command would then be:


    `glm(Answer ~ Age + Sex + Ethnicity + Education, data = data_long, family = "binomial")`

    Logit is the default link for the binomial family, so it does not have to be specified. In modified versions, you can add Question_ID and/or Subject_ID to the model and check if they show confounding effects. 
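
    A minimal sketch of building that long-format dataset, assuming the raw answers sit in a 300 x 500 matrix with one row per subject (the objects `people` and `answers` and the simulated values are hypothetical):

```r
# Simulated stand-in data (hypothetical).
set.seed(4)
n_people <- 300
n_items  <- 500
people <- data.frame(
  Subject_ID = seq_len(n_people),
  Age = factor(sample(c("18-34", "35-54", "55+"), n_people, replace = TRUE)),
  Sex = factor(sample(c("F", "M"), n_people, replace = TRUE))
)
# answers[i, j] is subject i's 0/1 answer to question j.
answers <- matrix(rbinom(n_people * n_items, 1, 0.4), nrow = n_people)

# as.vector() unrolls the matrix column by column:
# all subjects for question 1, then all subjects for question 2, and so on.
data_long <- data.frame(
  Subject_ID  = rep(people$Subject_ID, times = n_items),
  Question_ID = rep(seq_len(n_items), each = n_people),
  Answer      = as.vector(answers)
)
# Attach the subject-level covariates to every row.
data_long <- merge(data_long, people, by = "Subject_ID")

fit_long <- glm(Answer ~ Age + Sex, data = data_long, family = "binomial")
print(summary(fit_long))
```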

    If you are concerned about interdependence of questions (similar questions in the questionnaire with potentially correlated answers), I'd suggest doing a bit of exploratory analysis first. You need to reshape the dataset to wide format (data %>% tidyr::spread(Question_ID, Answer) should do it; note that spread() lives in tidyr, not dplyr). This will give you one variable per question. Now you can do several things: for example, do hierarchical clustering via the heatmap() function in R with questions in rows and subjects in columns. Label the columns by the metadata variables (age, sex, ethnicity, education) and explore the patterns to see if you observe blocks of correlated questions. You can also create the heatmap from the 500x500 correlation matrix.
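
    A minimal sketch of this exploratory step in base R, on simulated data at a reduced size (the two blocks of related questions are an assumption built into the simulation so that the heatmaps have something to show). tapply() is used for the reshape here; tidyr::pivot_wider() would be a tidyverse alternative.

```r
# Simulated stand-in data (hypothetical): 100 subjects, 30 questions.
set.seed(5)
n_people <- 100
n_items  <- 30
data_long <- expand.grid(Subject_ID = seq_len(n_people),
                         Question_ID = seq_len(n_items))

# Two blocks of questions, each driven by its own latent trait per subject.
trait <- matrix(rnorm(n_people * 2), ncol = 2)
block <- ifelse(data_long$Question_ID <= n_items / 2, 1, 2)
eta   <- 1.5 * trait[cbind(data_long$Subject_ID, block)]
data_long$Answer <- rbinom(nrow(data_long), 1, plogis(eta))

# Long to wide: one row per subject, one column per question.
wide <- with(data_long, tapply(Answer, list(Subject_ID, Question_ID), mean))

# Heatmap with questions in rows and subjects in columns.
heatmap(t(wide), scale = "none")

# Heatmap of the question-by-question correlation matrix.
q_cor <- cor(wide)
heatmap(q_cor, symm = TRUE, scale = "none")
```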

    For hypothesis testing to check the effect of metadata variables on probability of answering Yes or No, you have a few options:
    1. Use the long format and the by() function in R (or subset) to run the logistic regression separately for each question. Then check out the correlation of regression coefficients and log p-values.
    2. Use the wide format and do a PCA on the 500 answer variables (each variable being the answer to one question). Keep the first few PCs that retain ~80% of the variation. The PCs will be orthogonal, so collinearity is not a concern. Check the distributions of the PCs; if they are unimodal and reasonably bell-shaped, use MANOVA with all these PCs as response variables and the metadata variables as predictors.
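
    The two options above can be sketched as follows, on simulated data at a reduced size with a single predictor (all object names and simulation settings are hypothetical, and the 80% cutoff is taken from the text):

```r
# Simulated stand-in data (hypothetical): 200 subjects, 20 questions.
set.seed(6)
n_people <- 200
n_items  <- 20
people <- data.frame(
  Subject_ID = seq_len(n_people),
  Sex = factor(sample(c("F", "M"), n_people, replace = TRUE))
)
data_long <- expand.grid(Subject_ID = seq_len(n_people),
                         Question_ID = seq_len(n_items))
data_long$Sex    <- people$Sex[data_long$Subject_ID]
data_long$Answer <- rbinom(nrow(data_long), 1,
                           plogis(-0.4 + 0.5 * (data_long$Sex == "M")))

# Option 1: one logistic regression per question.
fits  <- by(data_long, data_long$Question_ID,
            function(d) glm(Answer ~ Sex, data = d, family = binomial))
coefs <- do.call(rbind, lapply(fits, coef))   # one row per question
log_p <- sapply(fits, function(f) log(coef(summary(f))["SexM", "Pr(>|z|)"]))

# Option 2: PCA on the wide answers, then MANOVA on the retained PC scores.
wide <- with(data_long, tapply(Answer, list(Subject_ID, Question_ID), mean))
pca  <- prcomp(wide, center = TRUE, scale. = FALSE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- max(2, which(var_explained >= 0.80)[1])  # MANOVA needs at least 2 responses
scores <- pca$x[, seq_len(k), drop = FALSE]

# Rows of 'wide' (and so of 'scores') are ordered by Subject_ID, matching 'people'.
fit_manova <- manova(scores ~ Sex, data = people)
print(summary(fit_manova))
```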

    Hope this helps!

    Hossein



    ------------------------------
    Hosseinali Asgharian
    Postdoctoral Scholar
    Genentech Hall, Room S312D
    University of California San Francisco
    600 16th St, San Francisco, CA 94158
    ------------------------------