ASA Connect

  • 1.  Questions on interval censored data

    Posted 01-16-2020 14:22
    Dear colleagues, 

    I have some questions on interval-censored data I would very much appreciate your help with.  

    The data come from a series of subjects who become members of a club through which they can make purchases. 
    The exact date when they sign up to become members is not known - it is approximated by the date when they make the first such purchase.  
    The exact date when they terminate their membership is also not known; if a subject didn't make any purchases through the club for a period of 6 months since their last recorded club purchase, they are considered to have terminated their membership. 

    The goal is to see how a series of time-independent or time-dependent predictor variables affect club membership termination.   

    For each subject, I can create time intervals of the form [0, T1), [T1, T2), etc., where 0 = date of first club purchase, T1 = 6 months after date of first club purchase, T2 = 12 months after date of first club purchase, etc., and keep track of whether or not the subject terminated their club membership in that interval.  For someone who terminated their club membership, this would lead to data of the form:

    Interval      MembershipStatus
    [0, t1)       0                 <--- still a club member
    [t1, t2)      0                 <--- still a club member
    ...
    [tk, tk+1)    1                 <--- not a club member

    Note that the values of T1, T2, etc., will be different across subjects. As an example, T1 might be Dec. 1, 2012 for one subject and Apr. 20, 2016 for another subject. 

    Question 1:  At the modelling stage, what is the best way to handle the fact that different people will have different numbers of 6-month intervals?  

    If I were to use a GLM or GLMER model to model membership status as a function of Interval (perhaps coded as 1 for [0, t1), 2 for [t1, t2), etc.) and the predictor variables, would I have to focus the modelling on (a) just those members who have exactly k+1 intervals of follow-up (say), or (b) just those members who have k+1 or fewer intervals of follow-up? The value of k could be 1, say, so that k + 1 = 2 follow-up intervals.

    Essentially, I am looking to fit a "discrete time" model (where "discrete time" refers to an interval) to the data, while also taking into account the differing lengths of follow-up of the different subjects (some of whom have quite a large number of follow up intervals - several years' worth - and some of whom have only a few intervals).  If I wanted to model the risk of club membership termination after, say, 2 years, would I need to focus on just those subjects satisfying either (a) or (b) above (whichever makes most sense)?   
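
    A minimal sketch of the person-period layout that a discrete-time model works from, assuming hypothetical subjects and a censoring flag that are not in the original data description. Each subject contributes one row per interval in which they were still at risk, so subjects with short and long follow-up can be analysed together. The life-table hazards below are only an unadjusted illustration; the same rows could in principle be passed to a binomial GLM with Interval and the predictors as covariates.

```python
from collections import defaultdict

# Hypothetical subjects: (id, number of 6-month intervals observed,
# whether membership was deemed terminated in the last observed interval).
subjects = [
    ("A", 2, True),
    ("B", 5, False),  # still a member when follow-up ended (censored)
    ("C", 1, True),
    ("D", 4, True),
    ("E", 3, False),  # censored after 3 intervals
]

def to_person_period(subjects):
    """Expand each subject into one row per interval at risk."""
    rows = []
    for sid, n_intervals, terminated in subjects:
        for j in range(1, n_intervals + 1):
            status = 1 if (terminated and j == n_intervals) else 0
            rows.append({"id": sid, "interval": j, "status": status})
    return rows

def interval_hazards(rows):
    """Hazard in interval j = terminations in j / subjects at risk in j."""
    at_risk = defaultdict(int)
    events = defaultdict(int)
    for r in rows:
        at_risk[r["interval"]] += 1
        events[r["interval"]] += r["status"]
    return {j: events[j] / at_risk[j] for j in sorted(at_risk)}

def survival_through(hazards, last_interval):
    """Probability of remaining a member through the given interval."""
    s = 1.0
    for j in range(1, last_interval + 1):
        s *= 1.0 - hazards[j]
    return s

rows = to_person_period(subjects)
hazards = interval_hazards(rows)
# Risk of termination within 2 years = within the first four 6-month intervals.
risk_2y = 1.0 - survival_through(hazards, 4)
print(hazards, risk_2y)
```

    In this sketch nobody is dropped for having fewer or more than k+1 intervals; a censored subject simply stops contributing rows once follow-up ends.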

    Question 2:  At the modelling stage, what is the best way to define the time-varying covariates?  

    One of the reasons subjects may terminate their club membership is because they could make their purchases through other channels (e.g., online) from the same company.  Thus, one of the time-dependent predictor variables is "number of purchases made through online channels". Ideally, we want a value for this variable for each of our time intervals: 

    Interval      MembershipStatus    NumberOnlinePurchases
    [0, t1)       0                   4
    [t1, t2)      0                   2
    ...           ...                 ...
    [tk, tk+1)    1                   6

    The calculation of the value of this variable for the last interval seems a bit problematic.  The subject would NOT have made any club purchases in that period (which is why we assumed they terminated their membership) but in principle they could have terminated their membership anywhere during that interval - so we don't want to include online purchases that would have been made AFTER the termination of the club membership.   In other words, we don't want to use the future to predict the present.  

    I have seen recommendations in the literature to compute the value of time-dependent variables at the beginning of each time interval (i.e., at times 0, t1, ..., tk) but in this case we are really interested in computing the value of time-dependent variables across the entire time interval.  Is that acceptable from a statistical viewpoint?  Or, at the very least, would one have to compute the values of the time-dependent variables differently for those intervals where MembershipStatus = 0 (e.g., aggregate online sales over the entire interval) and for those where MembershipStatus = 1 (e.g., consider online sales only at the beginning of the interval, namely only at time tk)?
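
    One way to keep the future out of the predictor, sketched here as an assumption rather than as an answer given in this thread, is to lag the covariate: the value attached to interval j is the number of online purchases made during interval j-1, which is fully known when interval j starts. The 183-day interval length, the function name, and the purchase times are all hypothetical.

```python
from bisect import bisect_left

INTERVAL_DAYS = 183  # assumption: "6 months" approximated as 183 days

def lagged_online_counts(purchase_days, n_intervals):
    """For each interval j (1-based), return the number of online purchases
    made during interval j-1. The first interval has no earlier interval,
    so its value is None."""
    days = sorted(purchase_days)
    lagged = []
    for j in range(1, n_intervals + 1):
        if j == 1:
            lagged.append(None)
            continue
        prev_start = (j - 2) * INTERVAL_DAYS
        prev_end = (j - 1) * INTERVAL_DAYS
        # Count purchase times t with prev_start <= t < prev_end.
        count = bisect_left(days, prev_end) - bisect_left(days, prev_start)
        lagged.append(count)
    return lagged

# Hypothetical online purchase times, in days since the first club purchase.
online_days = [10, 40, 200, 250, 300, 400, 500]
lagged = lagged_online_counts(online_days, 3)
print(lagged)
```

    The purchases at days 400 and 500 fall inside the third interval and are deliberately not used to predict termination in that same interval.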

    Many thanks, 

    Isabella

    ------------------------------
    Isabella Ghement
    Ghement Statistical Consulting Company Ltd.
    E-mail: Isabella@Ghement.ca
    Phone: 604-767-1250
    ------------------------------


  • 2.  RE: Questions on interval censored data

    Posted 01-17-2020 18:00
    I see that someone who does not make purchases through the club for six months is considered to have terminated their membership. What if someone goes a whole year between making purchases through the club? Will they be considered to have terminated and then reactivated their membership?

    ------------------------------
    Eric Siegel, MS
    Biostatistics Project Manager
    Department of Biostatistics
    Univ. Arkansas Medical Sciences
    ------------------------------



  • 3.  RE: Questions on interval censored data

    Posted 01-17-2020 18:33
    Thanks very much for your question, Eric.  It's something I wondered about too.  Currently, termination of membership ignores whether someone went for a whole year without making purchases sometime before we deem they terminated the membership.  Should such people exist in the data, I guess we can be more careful with our current definition of termination and define it as "first termination of membership".  Either that or go via the recurrent events route, which I would prefer not to do at this stage if possible. 
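
    A minimal sketch of the "first termination of membership" definition, assuming that 6 months is approximated as 183 days and that a data cutoff date is available; both are assumptions, as are the function name and the example dates. The first gap longer than 6 months ends the membership, and any later purchases are ignored.

```python
from datetime import date, timedelta

GAP = timedelta(days=183)  # assumption: "6 months" approximated as 183 days

def first_termination(purchase_dates, data_cutoff):
    """Return the date of the first deemed termination, or None if the
    subject is still a member at data_cutoff (censored).
    Assumes at least one club purchase is recorded."""
    dates = sorted(purchase_dates)
    # A gap of more than 6 months between consecutive purchases counts as
    # a termination 6 months after the earlier purchase.
    for prev, nxt in zip(dates, dates[1:]):
        if nxt - prev > GAP:
            return prev + GAP
    # Otherwise check the gap between the last purchase and the cutoff.
    if data_cutoff - dates[-1] > GAP:
        return dates[-1] + GAP
    return None

cutoff = date(2016, 6, 30)
returning = [date(2015, 1, 10), date(2015, 3, 1), date(2016, 4, 15)]
active = [date(2016, 1, 5), date(2016, 3, 20)]
print(first_termination(returning, cutoff))
print(first_termination(active, cutoff))
```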
     
    Thanks also for the other two answers I received on this question (including the one referring me to Paul Allison's excellent survival book), both of which were most helpful.

    ------------------------------
    Isabella Ghement
    Ghement Statistical Consulting Company Ltd.
    ------------------------------



  • 4.  RE: Questions on interval censored data

    Posted 01-21-2020 17:36

    It is not clear to me that this is actually an interval censoring problem.

    If we assume that most customers don't actually bother to write in and request termination of their membership, then the terminating event is the company's operation of terminating their membership 6 months after their last purchase. Assuming you have access to company sales records, then you know its date exactly. It is a point event, not interval censoring.


    What are you actually interested in? Is there something else that termination is being used as a proxy for? One possibility might be the customer's loss of interest in the company. If that's the case, while last purchase is a clear begin point, it's not at all clear that termination is a valid interval endpoint. The company, and you with it, are simply assuming that customers have lost interest in the company if they haven't bought from it in 6 months.


    And if that's the case, you need data to support using the termination date as an interval end. There are many perfectly viable companies whose typical customer only buys from them once a year. If your company is taking actions based on baseless assumptions rather than data, it might be one of them and not know it, in which case a different management style (including ditching the 6-month automatic termination) would probably be more effective than what it's doing.

    It's your job to find out whether what the company thinks it knows is really the case or not. Maybe it "knows" something that ain't so.

    Embedding baseless assumptions into your methods won't help you find out.

    It might be worth finding out what proportion of terminated customers get reinstated and how long the interval to reinstatement (another point event) is.
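
    A rough sketch of that check, assuming each customer's purchase history can be followed across a deemed termination (which may not hold, as discussed in this post). The 183-day gap, the function names, and the example histories are hypothetical.

```python
from datetime import date, timedelta

GAP = timedelta(days=183)  # assumption: "6 months" approximated as 183 days

def reinstatement_delays(purchase_dates):
    """Days from each deemed termination to the next purchase."""
    dates = sorted(purchase_dates)
    delays = []
    for prev, nxt in zip(dates, dates[1:]):
        if nxt - prev > GAP:
            delays.append((nxt - (prev + GAP)).days)
    return delays

def summarize(histories, data_cutoff):
    """Count customers ever deemed terminated, those later reinstated,
    and collect the delays (in days) to reinstatement."""
    terminated = reinstated = 0
    all_delays = []
    for dates in histories.values():
        delays = reinstatement_delays(dates)
        ever_terminated = bool(delays) or (data_cutoff - max(dates) > GAP)
        terminated += ever_terminated
        reinstated += bool(delays)
        all_delays.extend(delays)
    return terminated, reinstated, all_delays

histories = {
    "c1": [date(2015, 1, 10), date(2015, 3, 1), date(2016, 4, 15)],
    "c2": [date(2015, 2, 1), date(2015, 5, 1)],
    "c3": [date(2016, 3, 1), date(2016, 6, 1)],
}
terminated, reinstated, delays = summarize(histories, date(2016, 6, 30))
proportion_reinstated = reinstated / terminated if terminated else None
print(terminated, reinstated, delays, proportion_reinstated)
```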

    But does your company even have the ability to find out? Maybe it's so confident customers lose interest after 6 months that its systems don't even have the ability to tell that new customer A who just got a new loyalty number is actually the same customer as "former" customer B whose loyalty number was terminated 6 months ago, 6 months after their last purchase.

    And if that's the case, I wouldn't bother trying to analyze this database at all because its reported results will be nonsense, a parrot of the company's baseless assumptions. I'd find out. If the company doesn't have an ability to determine that "new" customers are actually the same as "previous" terminated customers, it will continue to split customers into artificial boxes using baseless assumptions, those boxes will continue to shield its management from any useful knowledge of what's going on, and it will continue not to have a clue even whether, let alone when, its customers have actually lost interest. It won't even be able to tell how many customers it actually has. It may have a core of loyal customers who buy annually and think it has many fleeting ones.

    So in this case I'd focus on how the data it has was acquired, what that data means, and to what extent the assumptions behind its methods are biasing the data and misleading it.


    That's not an interval censoring problem at all.



    ------------------------------
    Jonathan Siegel
    Director Clinical Statistics
    ------------------------------