Apologies for the length. Ignore at your leisure :) But I have what I consider an interesting problem involving missing data and ignored sources of variation.
I work for a county health department. Obviously we have been consumed by COVID-19 for 28 months now. Much effort has been devoted to ongoing surveillance of disease incidence. Lately we have had less visibility into that, because our state has stopped trying to call and interview every case, and because of the advent of "kitchen table" tests, whose results never reach the state lab database. So our state, like many, has embarked upon a wastewater surveillance system. The incoming flow at the wastewater treatment plant (WWTP) is sampled and undergoes polymerase chain reaction (PCR) assay for SARS-CoV-2 RNA. I've been talking with a number of the people involved so I can understand the data-generating process, the prime directive of statistics. I have some reservations about it, but I'd like to consult the collective wisdom because I could be off the mark.
Once per week at each WWTP, a sampling bottle is connected to the incoming stream. It takes samples periodically over a 24-hour period. That bottle is then sent to the lab. Ninety percent of specimens go to a single lab; the rest are scattered among about 3-4 other labs. At the lab, 3 aliquots are taken from the bottle, and each aliquot is subjected to PCR. The result from each reaction is the number of copies of viral RNA detected. We'll call it copynumber.
Here's where it gets complicated.
The assay's lower limit of quantitation is 5, meaning any values less than 5 cannot be distinguished from one another. (I'm still trying to get clarity on how, in that case, they can claim to distinguish 1 copy from 0. I'm not a bench scientist by any means.)
If ALL three values of copynumber exceed 5, the lab reports out the mean of those 3 values; we'll call it avg. They do not report the three individual values of copynumber.
If ANY of the three reactions yields copynumber > 0 but <=5, the lab does not report any value for avg, but it does report that something was detected; call it detect = TRUE.
If all 3 reactions yield copynumber = 0, this counts as a non-detect, detect is reported as FALSE, and no value of avg is reported out.
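Here is a minimal Python sketch of the three reporting rules above, as I understand them. The names LLOQ, LabReport, and lab_report are my own, not the lab's. One case is not covered by the rules as described to me: a bottle whose aliquots mix zeros with values above 5. The sketch treats that case as "detected, no avg", which is purely an assumption.

```python
from statistics import mean
from typing import NamedTuple, Optional, Sequence

LLOQ = 5  # lower limit of quantitation, as described above


class LabReport(NamedTuple):
    detect: bool
    avg: Optional[float]  # None means the lab reported no value for avg


def lab_report(copynumbers: Sequence[float]) -> LabReport:
    """Apply the three reporting rules to the aliquot results from one bottle."""
    # Rule 1: every aliquot exceeds the LLOQ, so the mean is reported.
    if all(c > LLOQ for c in copynumbers):
        return LabReport(detect=True, avg=mean(copynumbers))
    # Rule 3: every aliquot is zero, so this is a non-detect.
    if all(c == 0 for c in copynumbers):
        return LabReport(detect=False, avg=None)
    # Rule 2: at least one aliquot is in (0, LLOQ], detected but not quantified.
    # ASSUMPTION: a mix of zeros and values above the LLOQ also lands here,
    # because the rules as described do not say what happens in that case.
    return LabReport(detect=True, avg=None)


print(lab_report([10, 20, 30]))
print(lab_report([3, 12, 40]))
print(lab_report([0, 0, 0]))
```

Note that the three individual values never leave the lab in any of the branches; only detect and (sometimes) avg do.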
Then the state analytical people get involved, and my reservations emerge. They appear to impute missing values of avg, after a fashion, but with fixed values.
If detect == FALSE, they set avg to 1. If detect == TRUE but avg is missing (non-quantifiable), they set avg to 3.
Then they calculate "intensity", which is log(avg) divided by log(number of copies of some common fecal virus). That denominator is just to normalize for population and is immaterial to my concerns. They chose to set avg = 1 when detect == FALSE so as to yield intensity values of 0 for those observations.
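The fill-in and the intensity calculation can be sketched the same way. This is my reading of what the state does, not their actual code. The function names and the argument fecal_copies are hypothetical, and I am assuming natural logs; the ratio of two logs is the same in any base.

```python
import math
from typing import Optional


def impute_avg(detect: bool, avg: Optional[float]) -> float:
    """Fill a missing avg with the fixed values described above."""
    if avg is not None:
        return avg                 # quantifiable: keep the lab's mean
    return 3.0 if detect else 1.0  # detected but unquantified -> 3, non-detect -> 1


def intensity(detect: bool, avg: Optional[float], fecal_copies: float) -> float:
    """log(avg) / log(copies of the normalizing fecal virus)."""
    if fecal_copies <= 1:
        # ASSUMPTION: the denominator must be positive, so require > 1 copy.
        raise ValueError("fecal_copies must be greater than 1")
    return math.log(impute_avg(detect, avg)) / math.log(fecal_copies)


# Hypothetical normalizer of one million copies, for illustration only.
print(intensity(False, None, 1e6))   # non-detect: log(1) = 0 in the numerator
print(intensity(True, None, 1e6))    # every unquantified detect shares log(3)
print(intensity(True, 100.0, 1e6))   # quantified sample
```

The point to notice is that every non-quantifiable sample gets the same numerator, log(3), so any differences in intensity among those samples come entirely from the normalizing denominator.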
Lastly they produce boxplots of intensity, using data from all participating WWTPs in the state. I receive them weekly, with a little dot showing where my WWTP falls. They also produce time-series plots of the weekly intensity values for my WWTPs, notionally so I can see trends over time.
None of this really matters when incidence of COVID-19 is high, and all sewage samples have lots of copies of SARS-CoV-2 RNA. But frankly, wastewater surveillance is irrelevant in that situation. The opportunity for wastewater surveillance to shine is in low-to-zero incidence periods, when the hope is it would give an early signal of an impending surge of cases. But that is the very situation in which I worry this analytical approach falls apart. There would be a lot of fixed values of 1 and 3 inserted into the data, and it seems to me that would erase some of the variability in the data, making conclusions about, say, "outlying" or anomalous values from a particular WWTP rather questionable.
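A small simulation illustrates the worry. Everything in it is invented for illustration: 200 hypothetical WWTPs, and a deliberately extreme low-incidence week in which every aliquot is drawn uniformly from 0 to 5 copies, so nothing is quantifiable. It is a sketch under those assumptions, not a model of real wastewater data.

```python
import random
from statistics import mean, pstdev

LLOQ = 5        # lower limit of quantitation, as described above
N_PLANTS = 200  # ASSUMPTION: hypothetical number of WWTPs


def reported_avg(copynumbers):
    """Lab reporting rules followed by the state's fixed-value fill-in."""
    if all(c > LLOQ for c in copynumbers):
        return mean(copynumbers)  # quantifiable: the real mean survives
    if all(c == 0 for c in copynumbers):
        return 1.0                # non-detect is filled in with 1
    return 3.0                    # detected but unquantified is filled in with 3


def simulate(seed=0):
    rng = random.Random(seed)
    # ASSUMPTION: three aliquots per plant, each uniform on 0..LLOQ.
    aliquots = [[rng.randint(0, LLOQ) for _ in range(3)] for _ in range(N_PLANTS)]
    true_means = [mean(a) for a in aliquots]
    filled = [reported_avg(a) for a in aliquots]
    return true_means, filled


true_means, filled = simulate()
print("SD of the true aliquot means: ", round(pstdev(true_means), 3))
print("SD after fixed-value fill-in: ", round(pstdev(filled), 3))
print("distinct values after fill-in:", sorted(set(filled)))
```

Under these assumptions the filled-in data can only take the values 1 and 3, so almost all of the plant-to-plant spread in the underlying means disappears before anyone draws a boxplot.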
It also seems to me that there are many sources of variation, right from the collection bottle through to which labs do the assay and through to the 3 individual values of copynumber, that are going unaccounted for.
Thoughts, from anyone who stuck with me?
Thanks.
--Chris Ryan
------------------------------
Christopher Ryan
Clinical Associate Professor of Family Medicine
------------------------------