Hi all,
I was wondering if anyone on this mailing list might be able to give me some advice on a certain problem I've been tasked with. I have a particularly complex and challenging study design, and I'm trying to determine the best way to parameterize and set up the model.
The study design involves each subject getting a number of measurements repeated over the course of 1 or 2 periods. In Period 1, all subjects are observed the same number of times. At each measurement, the presence/absence of a particular Outcome Variable of interest is obtained (as a binary Yes/No), along with several (continuous) Factors believed to be predictive of the Outcome. Note that the presence of the Outcome at time t does not automatically imply it will be present at time t+1 or later times.
Subjects who exhibit the Outcome frequently enough in Period 1 to meet a specified threshold proceed to Period 2. The rest are discontinued and have no observations in Period 2. Unlike Period 1, Period 2 does not have a fixed number of observations; the count varies over a small range. However, even the subjects with the most Period 2 observations have far fewer of them than they had in Period 1.
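To make the design concrete, here is a minimal Python sketch of the long-format layout I have in mind (one row per measurement) and the Period 1 screening rule. Everything named here is a placeholder for illustration only: the `Observation` record, the `proceeds_to_period2` helper, the threshold of 3 (expressed as a count, which is an assumption), and the toy values are not the actual study data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    subject: str          # subject identifier
    period: int           # 1 or 2
    time: int             # measurement index within the period
    outcome: bool         # Outcome present (True) or absent (False)
    factors: List[float]  # Factor_1 ... Factor_j (continuous)


def proceeds_to_period2(rows: List[Observation], subject: str, threshold: int) -> bool:
    """Return True if the subject showed the Outcome at least `threshold`
    times in Period 1 (hypothetical form of the screening rule)."""
    count = sum(1 for r in rows
                if r.subject == subject and r.period == 1 and r.outcome)
    return count >= threshold


# Toy data: two subjects, four Period 1 measurements each, one factor.
rows = [
    Observation("A", 1, t, o, [x])
    for t, (o, x) in enumerate([(True, 1.2), (False, 0.7), (True, 1.5), (True, 1.1)])
] + [
    Observation("B", 1, t, o, [x])
    for t, (o, x) in enumerate([(False, 0.4), (False, 0.6), (True, 0.9), (False, 0.5)])
]

# Subject A shows the Outcome 3 times, subject B only once.
eligible = {s: proceeds_to_period2(rows, s, threshold=3) for s in ("A", "B")}
print(eligible)
```

The point of the sketch is that entry into Period 2 depends on the Period 1 Outcome, so the Period 2 observations come from a selected subset of subjects.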
There are two questions I would like to investigate. First, without considering the Outcome at all, I would like to examine whether the values of any of Factor_1 through Factor_j differ significantly, in general, between observations in Period 1 and Period 2, taking into account the multiple observations for the same subject. Might this be as simple as a GLM with MANOVA, with each of the Factors as a response and Period as the explanatory variable? (If I didn't have to account for multiple observations per subject, and there were only 1 observation per subject in each Period, I would probably consider something like a paired t-test.)
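As a crude baseline for this first question, the paired t-test idea can be sketched by collapsing each subject's measurements to one mean per Period and then testing the within-subject differences. This is only a simplification for a single Factor, not the repeated-measures model I am asking about: it ignores the unequal numbers of observations behind each mean and uses only the subjects who reached Period 2. The data values and the function names `paired_period_differences` and `paired_t_statistic` are made up for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# (subject, period, factor_value): toy long-format data, hypothetical values.
records = [
    ("A", 1, 1.0), ("A", 1, 1.4), ("A", 1, 1.2), ("A", 2, 1.6), ("A", 2, 1.8),
    ("B", 1, 0.8), ("B", 1, 1.0), ("B", 1, 0.9), ("B", 2, 1.0),
    ("C", 1, 1.5), ("C", 1, 1.7), ("C", 1, 1.6), ("C", 2, 2.0), ("C", 2, 2.2),
    ("D", 1, 0.5), ("D", 1, 0.7), ("D", 1, 0.6),  # discontinued after Period 1
]


def paired_period_differences(records):
    """Collapse to one mean per subject per period, then return
    Period 2 minus Period 1 for subjects observed in both periods."""
    values = defaultdict(lambda: defaultdict(list))
    for subject, period, x in records:
        values[subject][period].append(x)
    return {
        s: mean(p[2]) - mean(p[1])
        for s, p in values.items()
        if 1 in p and 2 in p
    }


def paired_t_statistic(diffs):
    """One-sample t statistic on the paired differences (df = n - 1)."""
    d = list(diffs)
    return mean(d) / (stdev(d) / sqrt(len(d)))


diffs = paired_period_differences(records)
t_stat = paired_t_statistic(diffs.values())
print(diffs, t_stat)
```

Subject D drops out of the comparison because there is no Period 2 data, which is exactly the selection issue that makes me unsure a simple paired analysis is appropriate.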
Second, I would like to examine whether there are any significant differences in how Factor_1 through Factor_j model the Outcome between observations in Period 1 and Period 2, again taking into account that there are multiple observations for the same subject. If it were just one period, I think it would be fairly straightforward: something like a logistic regression with Factor_1 through Factor_j as the explanatory variables, the Outcome Variable as the response, and a stratified design (stratified by subject), looking at which of the Factors are predictive. However, here I'm not interested in which of the Factors have a significant relationship with the Outcome, but rather in what (if any) the significant differences are between Period 1 and Period 2 in how the Factors model the Outcome. In other words, I might get one value for the coefficient of Factor_2 based on Period 1 and another value based on Period 2, and I would like to know whether the difference between them is meaningful. One method I was thinking about was conditional logistic regression, fit as a proportional hazards model stratified at the subject level. Another idea was to set up a GEE, but I was having trouble coming up with a model that properly accounted for everything.
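To state this second question more precisely, one way I could imagine writing it is as a single logistic model with Factor-by-Period interactions. This is only a sketch in my own notation, and it deliberately leaves open how the within-subject correlation is handled (stratification, GEE, or something else). Here Y_{it} is the Outcome for subject i at measurement t, X_{k,it} is the value of Factor_k, and P_{it} is an indicator equal to 1 for Period 2 and 0 for Period 1.

```latex
\operatorname{logit} \Pr(Y_{it} = 1)
  = \beta_0 + \gamma_0 P_{it}
  + \sum_{k=1}^{j} \beta_k X_{k,it}
  + \sum_{k=1}^{j} \gamma_k P_{it} X_{k,it}
```

Under this formulation the coefficient for Factor_k is \beta_k in Period 1 and \beta_k + \gamma_k in Period 2, so the question of whether the two Periods differ for Factor_k becomes a test of \gamma_k = 0.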
Any advice you might be able to offer would be most appreciated!
Best Regards,
Gabriel Farkas