Hi Naomi.
I've used your Option 1 before. I'm not sure I'd call the binary outcomes you're describing "pseudo" data because you know the counts of 0s and 1s for each site. In the absence of other patient-level data, the order of the 0s and 1s doesn't matter. This approach allows you to fit a mixed logistic regression model or alternative (e.g., beta-binomial), taking clustering within clinic into account.
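As a quick illustration of why the order of the 0s and 1s doesn't matter, here is a minimal sketch with made-up numbers. It compares the Bernoulli log-likelihood of expanded 0/1 data (in two different orders) with the binomial log-likelihood computed from the counts alone. The clinic size, the count, the trial probability, and the function names are all hypothetical.

```python
import math
import random

def bernoulli_loglik(outcomes, p):
    """Log-likelihood of individual 0/1 outcomes under a common probability p."""
    return sum(math.log(p) if y == 1 else math.log(1 - p) for y in outcomes)

def binomial_kernel_loglik(successes, n, p):
    """Binomial log-likelihood from counts, omitting the constant log C(n, y) term."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

# Hypothetical clinic: 1,200 patients, 780 of them vaccinated.
n, vaccinated = 1200, 780
outcomes = [1] * vaccinated + [0] * (n - vaccinated)

# Same outcomes in a different (arbitrary) order.
shuffled = outcomes[:]
random.Random(0).shuffle(shuffled)

p = 0.6  # any trial value of the clinic's vaccination probability
ll_ordered = bernoulli_loglik(outcomes, p)
ll_shuffled = bernoulli_loglik(shuffled, p)
ll_counts = binomial_kernel_loglik(vaccinated, n, p)

print(ll_ordered, ll_shuffled, ll_counts)
```

The three values agree up to floating-point error, so the counts carry all the information the expanded data would, as long as no patient-level covariates are involved.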
It's trickier with multiple time points because you can't match patient outcomes across time points. I believe you could take the patient-level outcome at the final time point as your outcome variable and then include the clinic-level proportion at a previous time point (baseline, say) as a covariate in the model. Not ideal, but defensible.
I'm not sure about Beta regression on the clinic-level proportions unless there's a way to incorporate the precision of the proportions (reflecting sample size in each clinic) into the model.
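To make the precision point concrete, here is a small sketch under simple binomial sampling, where the standard error of a clinic's proportion is sqrt(p(1-p)/n). The two clinics and the helper names are hypothetical. With thousands of patients per clinic, between-clinic variation may well dominate this sampling error, so inverse-variance weights of this kind are only one possible way to bring clinic size into a model.

```python
import math

def proportion_se(p, n):
    """Binomial standard error of an observed proportion p based on n patients."""
    return math.sqrt(p * (1 - p) / n)

def inverse_variance_weight(p, n):
    """Weight proportional to the precision (1 / variance) of the proportion."""
    return n / (p * (1 - p))

# Two hypothetical clinics with the same observed rate but different sizes.
rate = 0.6
se_small = proportion_se(rate, 500)
se_large = proportion_se(rate, 5000)

w_small = inverse_variance_weight(rate, 500)
w_large = inverse_variance_weight(rate, 5000)

print(se_small, se_large, w_large / w_small)
```

A model that treats the two rates as equally informative ignores that the larger clinic's proportion is estimated with ten times the precision (on the variance scale).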
Finally, I'm not an expert on the topic, but epidemiologists sometimes use cluster-level methods for this kind of analysis (e.g., see Donner & Klar, American Journal of Epidemiology 1994; 140:279–289).
I'll be curious to hear what ideas others have.
Cheers,
Vince
------------------------------
Vincent Staggs, PhD
Director, Biostatistics & Epidemiology Core, Children's Mercy Research Institute;
Professor, School of Medicine, University of Missouri-Kansas City
------------------------------
Original Message:
Sent: 05-25-2023 17:02
From: Naomi Brownstein
Subject: multilevel trial analysis plan second opinion
Hi All,
Apologies for the long message. I have a few questions where I could use additional advice/opinions about the analysis plan for a study I am on.
As background, it's a community-based intervention where we are testing whether vaccination rates increase at a collection of 8 health care systems (highest level), each of which has 7-8 clinics, each of which sees thousands of patients. The original design was a stepped wedge with 1 step, so 4 systems were assigned to receive the intervention first, followed by the remaining 4 systems later. As originally designed, the study was well-powered to detect changes in vaccination rates measured at the individual (deidentified) patient level. In addition to the intervention, the PIs wanted to explore various mediators and moderators. The original plan was thus to use a generalized linear mixed model treating the individual outcomes as binary and of course accounting for the different time points (5) and nested structure (patients within clinics within systems).
Since then, I left the institution shortly before the grant was funded, but I have kept in some contact with the current statistician, who has filled in and informed me of some changes presented to us. (This has all come back because I am included on a paper about the design and methods that they intend to submit soon.) Briefly, we were told that the systems may be unable or unwilling to provide the team with individual patient-level data. At one point, to the chagrin of all of us, we were told that the outcome (vaccination rates) would be measured at the system level, collapsed across the other levels, meaning we would have only 8 individual measurements! (Note that these rates are between 0 and 1 but no longer binary.) The statistician and I agreed this is not tenable (the model will not be identifiable with only 8 points and multiple covariates), so I plan to inform them that this won't work and we need a new plan, either to get data at the patient level as planned (statistically ideal but maybe not feasible in practice), or compromise at the clinic level (~60 clinics). Thus my questions are the following.
1) The other statistician had an interesting idea. If we have proportions (vaccination rates, between 0 and 1) at the clinic level, e.g. p1, p2, ... p60 at a given time point, and we know the clinic sizes (n1 patients in clinic 1, n2 in clinic 2, ... n60 in clinic 60), then he suggested we generate pseudo-observations for each clinic: for clinic 1, for example, n1*p1 observations with the outcome of interest (1s) and n1*(1-p1) without it (0s). These should correspond to deidentified patient-level outcomes (some vaccinated: 1, some not: 0). Then he suggested running a model on the pseudo data where we could include clinic-level or system-level covariates only. (Obviously any patient-level data is masked and lost, e.g. demographics, but clinic-level or higher data should be preserved.) My question is, has anyone done this? Is it valid? It seems logical but feels weird. Does anyone have any experience or thoughts either way? If this is valid, it would probably be the easiest and most powerful option.
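The expansion described in question 1 can be sketched as follows. This is only an illustration under stated assumptions: the clinic identifiers, system labels, rates, and sizes are invented, and `expand_clinic` is a hypothetical helper. Note that if the reported rates are rounded, round(n * p) may differ from the true count by a patient or so.

```python
def expand_clinic(rate, n_patients):
    """Return a list of 0/1 pseudo-observations matching a clinic's reported rate.

    The number of 1s is round(n * p); the remaining patients are coded 0.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    vaccinated = round(rate * n_patients)
    return [1] * vaccinated + [0] * (n_patients - vaccinated)

# Hypothetical clinics: (clinic id, system id, reported rate, clinic size)
clinics = [
    ("clinic_01", "system_A", 0.65, 1200),
    ("clinic_02", "system_A", 0.40, 2500),
    ("clinic_03", "system_B", 0.72, 900),
]

# Long-format rows: one per pseudo-patient, carrying only clinic/system labels.
rows = []
for clinic_id, system_id, rate, n in clinics:
    for y in expand_clinic(rate, n):
        rows.append({"clinic": clinic_id, "system": system_id, "vaccinated": y})

print(len(rows), sum(r["vaccinated"] for r in rows))
```

Because every row within a clinic shares the same covariates, the same information can be kept in aggregated form (vaccinated and unvaccinated counts per clinic), which avoids building a very large data set when clinics have thousands of patients.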
2) Suppose we have only clinic level data (60 clinics, 5 time points, outcomes are then rates between 0 and 1). We could leave it as such or even simplify to just 2 time points (pre-intervention, post-intervention) and possibly collapse further by just subtracting (post-pre). In either case, what would be the best way to model these rates? One suggestion was to take the log odds and simply use a linear model (or linear mixed model) on log-odds. Another was to use probit regression (is a mixed version possible?). Another was to use beta regression (or mixed?). Yet another was a quasi-binomial model, which I hadn't heard of previously but sounded interesting. Does anyone have experience with such data and/or know what type of model might be best in these kinds of situations? (That is, collapsed proportions for each of 60 clinics, with at least one, and possibly a handful of additional covariates of interest in this secondary aim.)
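For readers who, like the poster, have not met the quasi-binomial model: it keeps the binomial mean structure but multiplies the binomial variance by a dispersion factor, commonly estimated as the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below computes that estimate for the simplest case, an intercept-only model at a single time point. The counts and clinic sizes are made up, and `pearson_dispersion` is a hypothetical helper, not part of any library.

```python
def pearson_dispersion(successes, sizes):
    """Pearson chi-square / residual df for an intercept-only binomial model."""
    k = len(sizes)
    pooled = sum(successes) / sum(sizes)  # fitted common vaccination rate
    chi2 = sum(
        (y - n * pooled) ** 2 / (n * pooled * (1 - pooled))
        for y, n in zip(successes, sizes)
    )
    return chi2 / (k - 1)  # one parameter (the intercept) was estimated

# Hypothetical vaccinated counts and clinic sizes at one time point.
vaccinated = [780, 1000, 648, 1500, 420]
sizes = [1200, 2500, 900, 2000, 1000]

phi = pearson_dispersion(vaccinated, sizes)
print(phi)
```

A value well above 1 indicates that clinic rates vary more than binomial sampling alone would predict, which is the situation a quasi-binomial, beta-binomial, or mixed model is meant to handle.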
Our goal is to come up with a strategy to propose to this group before we meet on Tuesday.
Thanks for reading, and any help on either or both questions would be appreciated!
Naomi
------------------------------
Naomi Brownstein
Associate Professor
Medical University of South Carolina
------------------------------