Dear colleagues,
I have some questions on interval-censored data I would very much appreciate your help with.
The data come from a series of subjects who become members of a club through which they can make purchases.
The exact date when they sign up to become members is not known - it is approximated by the date when they make the first such purchase.
The exact date when they terminate their membership is also not known; if a subject didn't make any purchases through the club for a period of 6 months since their last recorded club purchase, they are considered to have terminated their membership.
The goal is to see how a series of time-independent or time-dependent predictor variables affect club membership termination.
For each subject, I can create time intervals of the form [0, T1), [T1, T2), etc., where 0 = date of first club purchase, T1 = 6 months after date of first club purchase, T2 = 12 months after date of first club purchase, etc., and keep track of whether or not the subject terminated their club membership in that interval. For someone who terminated their club membership, this would lead to data of the form:
Interval      MembershipStatus
[0, t1)       0   <--- still a club member
[t1, t2)      0   <--- still a club member
...
[tk, tk+1)    1   <--- not a club member
Note that the values of T1, T2, etc., will be different across subjects. As an example, T1 might be Dec. 1, 2012 for one subject and Apr. 20, 2016 for another subject.
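To make the data layout concrete, here is a minimal Python sketch of how the per-subject rows described above could be built from club purchase dates. The function names, the example dates, and the data cutoff date are hypothetical and not part of my actual data; the sketch also assumes that an interval is flagged as a termination when it contains no club purchase, and that only fully observed intervals are kept.

```python
from datetime import date
import calendar

def add_months(d, months):
    """Return d shifted by a number of calendar months, clamping the day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def person_period_rows(club_purchases, cutoff, width_months=6):
    """Build one row per fully observed interval for a single subject.

    club_purchases: list of club purchase dates (hypothetical input).
    cutoff: last date covered by the data (an assumption for this sketch).
    Returns (interval_number, start, end, status) tuples, where status is 1
    for the first interval with no club purchase and 0 otherwise.
    """
    origin = min(club_purchases)  # time 0 = date of first club purchase
    rows = []
    k = 0
    while True:
        start = add_months(origin, k * width_months)
        end = add_months(origin, (k + 1) * width_months)
        if end > cutoff:
            break  # interval not fully observed: subject is censored here
        active = any(start <= p < end for p in club_purchases)
        rows.append((k + 1, start, end, 0 if active else 1))
        if not active:
            break  # no rows after the interval flagged as termination
        k += 1
    return rows

# Hypothetical subject whose first club purchase was on June 1, 2012.
purchases = [date(2012, 6, 1), date(2012, 9, 15), date(2013, 2, 3)]
rows = person_period_rows(purchases, cutoff=date(2016, 1, 1))
for row in rows:
    print(row)
```

Stacking these rows across subjects gives a long-format data set in which each subject contributes as many rows as they have observed intervals, so the number of rows naturally differs from subject to subject.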
Question 1: At the modelling stage, what is the best way to handle the fact that different people will have different numbers of 6-month intervals?
If I were to use a GLM or GLMER model to model membership status as a function of Interval (perhaps coded as 1 for [0, t1), 2 for [t1, t2), etc.) and the predictor variables, would I have to focus the modelling on (a) just those members who have exactly k+1 intervals of follow-up (say), or (b) just those members who have k+1 or fewer intervals of follow-up? The value of k could be 1, say, so that k + 1 = 2 follow-up intervals.
Essentially, I am looking to fit a "discrete time" model (where "discrete time" refers to an interval) to the data, while also taking into account the differing lengths of follow-up of the different subjects (some of whom have quite a large number of follow-up intervals - several years' worth - and some of whom have only a few intervals). If I wanted to model the risk of club membership termination after, say, 2 years, would I need to focus on just those subjects satisfying either (a) or (b) above (whichever makes most sense)?
Question 2: At the modelling stage, what is the best way to define the time-varying covariates?
One of the reasons subjects may terminate their club membership is that they could make their purchases through other channels (e.g., online) from the same company. Thus, one of the time-dependent predictor variables is "number of purchases made through online channels". Ideally, we want a value for this variable for each of our time intervals:
Interval      MembershipStatus   NumberOnlinePurchases
[0, t1)       0                  4
[t1, t2)      0                  2
...           ...                ...
[tk, tk+1)    1                  6
The calculation of the value of this variable for the last interval seems a bit problematic. The subject would NOT have made any club purchases in that period (which is why we assumed they terminated their membership) but in principle they could have terminated their membership anywhere during that interval - so we don't want to include online purchases that would have been made AFTER the termination of the club membership. In other words, we don't want to use the future to predict the present.
I have seen recommendations in the literature to compute the value of time-dependent variables at the beginning of each time interval (i.e., at times 0, t1, ..., tk) but in this case we are really interested in computing the value of time-dependent variables across the entire time interval. Is that acceptable from a statistical viewpoint? Or, at the very least, would one have to compute the values of the time-dependent variables differently for those intervals where MembershipStatus = 0 (e.g., aggregate online sales over the entire interval) and for those where MembershipStatus = 1 (e.g., consider online sales only at the beginning of the interval, namely only at time tk)?
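To illustrate the two candidate definitions, here is a small Python sketch that computes, for each interval, (i) the number of online purchases over the entire interval and (ii) a value known at the beginning of the interval. Reading "value at the beginning of the interval" as the count from the previous interval is only one possible interpretation, and the dates, the variable names, and the choice of 0 for the first interval are all hypothetical.

```python
from datetime import date

def count_in(purchase_dates, start, end):
    """Count purchases with start <= date < end."""
    return sum(start <= p < end for p in purchase_dates)

# Hypothetical interval boundaries 0, t1, t2, t3 and online purchase dates.
bounds = [date(2012, 6, 1), date(2012, 12, 1), date(2013, 6, 1), date(2013, 12, 1)]
online = [date(2012, 7, 4), date(2012, 11, 20), date(2013, 1, 9), date(2013, 8, 30)]

rows = []
for k in range(len(bounds) - 1):
    start, end = bounds[k], bounds[k + 1]
    # Definition (i): aggregate over the whole interval [start, end).
    concurrent = count_in(online, start, end)
    # Definition (ii): only information available at the start of the
    # interval, taken here as the count in the previous interval
    # (set to 0 for the first interval, an assumption of this sketch).
    lagged = count_in(online, bounds[k - 1], start) if k > 0 else 0
    rows.append((k + 1, concurrent, lagged))

for row in rows:
    print(row)  # (interval number, whole-interval count, lagged count)
```

Under definition (ii), the value attached to the last interval [tk, tk+1) never uses purchases made after tk, so it cannot include online purchases made after the unknown termination date.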
Many thanks,
Isabella
------------------------------
Isabella Ghement
Ghement Statistical Consulting Company Ltd.
E-mail: Isabella@Ghement.ca
Phone: 604-767-1250
------------------------------