
  • 1.  SARS-CoV-2 wastewater surveillance, missing data, imputation, and sources of variation

    Posted 06-08-2022 11:59
    Apologies for the length. Ignore at your leisure :) But I have what I consider an interesting problem involving missing data and ignored sources of variation.

    I work for a county health department. Obviously we have been consumed by COVID-19 for 28 months now. Much effort has been devoted to ongoing surveillance of disease incidence. Lately we have had less visibility into that, since our state has stopped trying to call and interview every case, and with the advent of "kitchen table" tests, whose at-home results never end up in the state lab database. So our state, like many, has embarked upon a wastewater surveillance system. The incoming flow at the wastewater treatment plant (WWTP) is sampled and undergoes polymerase chain reaction (PCR) assay for SARS-CoV-2 RNA. I've been talking with a number of the people involved so I can understand the data-generating process (the prime directive of statistics). I have some reservations about it, but I'd like to consult the collective wisdom because I could be off the mark.

    Once per week at each WWTP, a sampling bottle is connected to the incoming stream. It takes samples periodically over a 24-hour period. That bottle is then sent to the lab. 90% of specimens go to a single lab; the rest are scattered among about 3-4 other labs. At the lab, 3 aliquots are taken from the bottle, and each aliquot is subjected to PCR. The result from each reaction is the number of copies of viral RNA detected. We'll call it copynumber.

    Here's where it gets complicated.

    The assay's lower limit of quantitation is 5, meaning any values less than 5 cannot be distinguished from one another. (I'm still trying to get clarity how, in that case, they can claim to distinguish 1 copy from 0. I'm not a bench scientist by any means.)

    If ALL three values of copynumber exceed 5, the lab reports out the mean of those 3 values; we'll call it avg. They do not report the three individual values of copynumber.

    If ANY of the three reactions yields copynumber > 0 but <=5, the lab does not report any value for avg. But they report that something was detected; call it detect = TRUE.

    If all 3 reactions yield copynumber = 0, this counts as a non-detect, detect is reported as FALSE, and no value of avg is reported out.
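
    A minimal Python sketch of the three reporting rules as I understand them. The function name lab_report is my own invention, and the handling of mixed cases (for example one zero plus two quantifiable values) is an assumption, since I have not been told what the lab does there.

```python
from statistics import mean

LLOQ = 5  # lower limit of quantitation, as stated by the lab


def lab_report(copynumbers, lloq=LLOQ):
    """Return (detect, avg) for the three aliquot results from one bottle."""
    if all(c == 0 for c in copynumbers):
        # All three reactions found nothing: non-detect, no avg reported.
        return False, None
    if all(c > lloq for c in copynumbers):
        # All three values are quantifiable: report their mean.
        return True, mean(copynumbers)
    # At least one value is at or below the LLOQ: detected, but no avg.
    # ASSUMPTION: mixed cases such as (0, 8, 9) also land here.
    return True, None


print(lab_report([10, 20, 30]))
print(lab_report([3, 10, 12]))
print(lab_report([0, 0, 0]))
```

    Note that the three individual copynumber values never leave the function, which mirrors what the lab reports out.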

    Then the state analytical people get involved, and my reservations emerge. It seems they do imputation of a sort, for missing values of avg, but with fixed values.

    If detect == FALSE, they set avg to 1. If detect == TRUE but avg is missing (non-quantifiable), they set avg to 3.

    Then they calculate "intensity", which is log(avg) divided by log(number of copies of some common fecal virus). That denominator is just to normalize for population and is immaterial to my concerns. They chose to set avg = 1 when detect == FALSE so as to yield intensity values of 0 for those observations.
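
    To make the state's calculation concrete, here is a rough Python sketch of the substitution and the intensity formula as described to me. The names intensity and fecal_copies are hypothetical, and the example denominator is made up.

```python
import math


def intensity(detect, avg, fecal_copies):
    """Intensity as described: log(avg) / log(fecal_copies), after substitution."""
    if not detect:
        avg = 1.0   # non-detect: fixed value 1, so log(avg) = 0
    elif avg is None:
        avg = 3.0   # detected but non-quantifiable: fixed value 3
    # The base of the logarithm cancels in the ratio, so any base gives
    # the same intensity.
    return math.log(avg) / math.log(fecal_copies)


# Hypothetical denominator of 1e6 copies of the fecal virus.
for detect, avg in [(False, None), (True, None), (True, 100.0)]:
    print(detect, avg, intensity(detect, avg, 1e6))
```

    Every non-detect maps to exactly 0, and every non-quantifiable detect at a given denominator maps to exactly the same number.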

    Lastly they produce boxplots of intensity, using data from all participating WWTPs in the state. I receive them weekly, with a little dot showing where my WWTP compared. They also produce time-series plots of the weekly intensity values for my WWTPs, notionally so I can see trends over time.

    None of this really matters when incidence of COVID-19 is high, and all sewage samples have lots of copies of SARS-CoV-2 RNA. But frankly, wastewater surveillance is irrelevant in that situation. The opportunity for wastewater surveillance to shine is in low-to-zero incidence periods, when the hope is it would give an early signal of an impending surge of cases. But that is the very situation in which I worry this analytical approach falls apart. There would be a lot of fixed values of 1 and 3 inserted into the data, and it seems to me that would erase some of the variability in the data, making conclusions about, say, "outlying" or anomalous values from a particular WWTP rather questionable.

    It also seems to me that there are many sources of variation, right from the collection bottle through to which labs do the assay and through to the 3 individual values of copynumber, that are going unaccounted for.
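
    As a toy illustration of that last point, here is a small Python simulation with only two of those sources: a bottle-to-bottle effect and an aliquot-to-aliquot effect, both on an arbitrary log scale. The standard deviations are invented, and the normal model is purely an assumption for the sketch.

```python
import random
import statistics


def simulate_reported_avgs(n_bottles=20000, sd_bottle=1.0, sd_aliquot=1.5, seed=1):
    """Simulate the mean of 3 aliquots per bottle under a two-level normal model."""
    rng = random.Random(seed)
    avgs = []
    for _ in range(n_bottles):
        bottle = rng.gauss(0.0, sd_bottle)  # bottle-level deviation
        aliquots = [bottle + rng.gauss(0.0, sd_aliquot) for _ in range(3)]
        avgs.append(statistics.fmean(aliquots))
    return avgs


reported = simulate_reported_avgs()
observed_var = statistics.variance(reported)
# Theory for this model: Var(avg) = sd_bottle^2 + sd_aliquot^2 / 3
expected_var = 1.0 ** 2 + 1.5 ** 2 / 3
print(observed_var, expected_var)
```

    The variance of the reported averages is the sum of the two components, so with only avg in hand the aliquot-level and bottle-level contributions cannot be separated.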

    Thoughts, from anyone who stuck with me?

    Thanks.

    --Chris Ryan

    ------------------------------
    Christopher Ryan
    Clinical Associate Professor of Family Medicine
    ------------------------------


  • 2.  RE: SARS-CoV-2 wastewater surveillance, missing data, imputation, and sources of variation

    Posted 06-08-2022 12:15
    Chris,
    I'm with you in the concerns about the copynumber values below 5.  I think your request for how they distinguish 1 from 0 is appropriate.  My inclination would be:
    1) Check existing literature to understand the <=5 range
    2) Consider requesting a set of data with the actual values, ideally historical values from before home COVID tests were available, where good incidence rates for the same region and time frame are known.  Confirm whether values <5 really carry no information.

    If the data cannot answer the question, this should be made known to the supervisors so that resources are not wasted.

    Shalom,
    Jason Wilson
    Chair, Mathematics and Computer Science Department
    Director, Quantitative Consulting Center | www.biola.edu/qcc
    Associate Professor of Statistics |  (562) 944-0351 x5145 







  • 3.  RE: SARS-CoV-2 wastewater surveillance, missing data, imputation, and sources of variation

    Posted 06-08-2022 12:30
    Hello,
    I am not sure what their numbers represent. PCR typically results first in a Ct value, which is the number of PCR cycles needed to detect that the molecule is present. The lower the concentration of the molecule, the more PCR cycles are needed to detect it. I assume that they have a standard curve that relates Ct values to the numbers that get reported, with the lower limit of quantitation being 5. I assume that they are confident in the difference between 0 and 1 copies based on the number of PCR cycles they run for every sample. The more PCR cycles that are done, the more confident one would be that the molecule is not present if not detected. Based on my quick look at the literature, it is likely to be ~40 cycles. I hope this helps clarify as best I can what the numbers "represent." The critical thing for all PCR is the Ct value.
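
    For what it is worth, a standard curve of the kind I am assuming can be sketched in Python as below. The log-linear form is the usual textbook one; the intercept of 40 and the assumption of perfect doubling each cycle are made-up illustration values, not anything I know about this lab's assay.

```python
import math

# At 100% amplification efficiency the product doubles every cycle, so a
# 10-fold increase in starting copies lowers Ct by 1/log10(2), about 3.32 cycles.
SLOPE = -1 / math.log10(2)
INTERCEPT = 40.0  # HYPOTHETICAL: Ct at which a single starting copy is detected


def copies_from_ct(ct, slope=SLOPE, intercept=INTERCEPT):
    """Invert the curve Ct = intercept + slope * log10(copies)."""
    return 10 ** ((ct - intercept) / slope)


for ct in (30.0, 35.0, 39.0, 40.0):
    print(ct, copies_from_ct(ct))
```

    Under these assumptions one cycle earlier corresponds to twice as many starting copies, which is why small differences in Ct near the cycle limit translate into very small absolute differences in copies.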

    With regard to the analysis approach when the concentrations are low, you would really have categorical data, since the number of observations with numbers that have actual quantitative meaning will be small: 0 = not detected, 1 = detected, and 2 = detected & quantifiable. What one would care about is the proportion of samples being a 1 or 2.
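
    A small Python sketch of that coding, assuming each weekly result arrives as a (detect, avg) pair with avg missing when nothing quantifiable was reported. The sample data are invented.

```python
from collections import Counter


def category(detect, avg):
    """0 = not detected, 1 = detected, 2 = detected & quantifiable."""
    if not detect:
        return 0
    if avg is None:
        return 1
    return 2


# Invented results from five WWTPs in one week.
samples = [(False, None), (True, None), (True, 42.0), (False, None), (True, None)]
counts = Counter(category(d, a) for d, a in samples)
prop_detected = (counts[1] + counts[2]) / len(samples)
print(dict(counts), prop_detected)
```

    The proportion of 1s and 2s can then be tracked week over week without inventing any copy numbers.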

    Let me know if I did not understand the questions you had.

    Cheers,

    ------------------------------
    Robert Podolsky
    Biostatistician
    ------------------------------



  • 4.  RE: SARS-CoV-2 wastewater surveillance, missing data, imputation, and sources of variation

    Posted 06-08-2022 17:18
    Thanks Robert. That is the understanding I have of PCR as well.

    I think the lab is "doing the right thing" by not reporting any quantitative value of copynumber if it appears to be less than their limit of quantitation.  But I worry about what happens next, analytically: imputing a constant value of 3 for all those results. The true value is not a constant 3; it is somewhere between 1 and 5. Seems like multiple imputation, e.g. from a U(1,5) distribution, would be a more sound way to proceed. Especially if regions in the state are going to be compared to the statewide distribution (which would be falsely narrow). And I agree that it might also be better, especially in a low, near-zero incidence environment, to consider just 3 possible outcomes, which I'd treat as ordinal: non-detect, non-quantifiable detect, and quantifiable detect.
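
    Here is roughly what I have in mind, as a simplified Python sketch: draw each non-quantifiable value from U(1,5), repeat M times, and pool the estimate of the mean log(avg) with Rubin's rules. The data are invented, non-detects are left out for simplicity, and a proper multiple imputation would also reflect uncertainty in the imputation model itself, which this does not.

```python
import math
import random
import statistics


def impute_once(avgs, rng, low=1.0, high=5.0):
    """Replace each missing (non-quantifiable) avg with a U(low, high) draw."""
    return [a if a is not None else rng.uniform(low, high) for a in avgs]


def pooled_mean_log(avgs, m=50, seed=0):
    """Pool the mean of log(avg) over m imputed datasets using Rubin's rules."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        logs = [math.log(a) for a in impute_once(avgs, rng)]
        estimates.append(statistics.mean(logs))
        variances.append(statistics.variance(logs) / len(logs))
    q_bar = statistics.mean(estimates)       # pooled point estimate
    within = statistics.mean(variances)      # average within-imputation variance
    between = statistics.variance(estimates) # between-imputation variance
    total = within + (1 + 1 / m) * between   # Rubin's total variance
    return q_bar, within, total


# None marks a detect that was below the limit of quantitation.
example_avgs = [None, 12.0, None, 48.0, 7.5, None, 150.0, None]
q_bar, within, total = pooled_mean_log(example_avgs)
print(q_bar, within, total)
```

    The between-imputation term is exactly the variability that a fixed value of 3 throws away.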

    ------------------------------
    Christopher Ryan
    Clinical Associate Professor of Family Medicine
    ------------------------------



  • 5.  RE: SARS-CoV-2 wastewater surveillance, missing data, imputation, and sources of variation

    Posted 06-09-2022 07:36
    Chris R,

    With respect to the analysis of the data, your logic is going in the right direction.  You should not substitute the value of 3 in the analysis for data >0 and <5.  This will definitely bias the results.  Imputation is an approach often used in the literature, but results may be subject to bias depending on the assumptions of the imputation.  Yet, you may not need to convert the data to three categories either.  You will lose information that way.  In addition, if you want to compare across localities in the categorical approach, you need to ensure that operational definitions of the categories are the same across locations.

    In my past life as a collaborative statistician in a materials science company, we dealt with this type of censored data often.  Of course, you want to use methods accepted in your area of application and not materials science.   Here are a few resources (one link, one attachment) that discuss estimation in data sets with values below the LLOQ in a practical way from a biomedical perspective.  Disclaimer:  Additional research is needed as "Google" may not be providing the top/latest scholarly approaches.  

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4448983/

    Also, given that you have a "0" value as well, zero inflated methods in combination with imputation or censored data estimation may also be appropriate depending on the number of zeros.  Good luck with the analysis!
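
    To give a flavor of censored data estimation, here is a minimal Python sketch that treats log(avg) among detects as normal, with non-quantifiable detects left-censored at log(5). The lognormal assumption, the invented data, and the crude grid search are all simplifications of my own; non-detects (the zeros) are not modeled here, and a real analysis would use a proper optimizer or a survival/Tobit routine.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def censored_loglik(mu, sigma, observed_logs, n_censored, log_limit):
    """Log-likelihood: normal density for quantified values, CDF for censored."""
    ll = 0.0
    for x in observed_logs:
        z = (x - mu) / sigma
        ll += -math.log(sigma) - 0.5 * math.log(2 * math.pi) - 0.5 * z * z
    # Each censored value contributes P(value < limit); guard against log(0).
    p_below = STD_NORMAL.cdf((log_limit - mu) / sigma)
    ll += n_censored * math.log(max(p_below, 1e-300))
    return ll


def grid_mle(observed_logs, n_censored, log_limit):
    """Crude grid search over mu in [-2, 6] and sigma in [0.1, 5]."""
    mus = [-2.0 + 0.05 * i for i in range(161)]
    sigmas = [0.1 + 0.05 * j for j in range(99)]
    best = None
    for mu in mus:
        for sigma in sigmas:
            ll = censored_loglik(mu, sigma, observed_logs, n_censored, log_limit)
            if best is None or ll > best[0]:
                best = (ll, mu, sigma)
    return best


quantified = [7.5, 12.0, 48.0, 150.0]          # invented quantifiable avg values
observed_logs = [math.log(v) for v in quantified]
ll_hat, mu_hat, sigma_hat = grid_mle(observed_logs, n_censored=4,
                                     log_limit=math.log(5))
print(mu_hat, sigma_hat)
```

    The censored observations pull the estimated mean below the mean of the quantified values alone, without pretending to know where in the censored range each one fell.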

    Best,
    Jen

    ------------------------------
    Jennifer Van Mullekom, PhD
    Director of SAIG
    Associate Professor of Practice
    Department of Statistics
    Virginia Tech
    vanmuljh@vt.edu
    ------------------------------

    Attachment: DH05.pdf (524 KB)


  • 6.  RE: SARS-CoV-2 wastewater surveillance, missing data, imputation, and sources of variation

    Posted 06-08-2022 13:54
    Hi, thank you for asking the question. If I understand the background, you have a wastewater dataset from your county. CDC can also provide wastewater datasets for the U.S. on request (a few weeks ago I requested a wastewater dataset from CDC).   As to the "lower limit of quantitation" (LLOQ or LLQ), you can likely find the answer to that in the background documents for CLIA (Clinical Laboratory Improvement Amendments), which is produced through CMS (Medicare) / CDC. FDA has input (my limited understanding is that FDA is not directly responsible for CLIA), but FDA does participate in establishing the standards.

    CDC
    https://www.cdc.gov/clia/law-regulations.html   
    CMS
    https://www.cms.gov/Regulations-and-Guidance/Legislation/CLIA.
    FDA
    https://www.fda.gov/medical-devices/ivd-regulatory-assistance/clinical-laboratory-improvement-amendments-clia

    There is a series of manuals/guidelines about CLIA standards, but I'm unable to find the links at the moment. I do vaguely recall reading a technical document about LLOQ.
    • A caveat: I don't want to send you on a wild goose chase, but I believe CLSI has links to the standards manuals (these can be expensive $$)
    https://www.clsi.org/standards/products/method-evaluation/documents/ep19/

    ------------------------------
    Chris Barker, Ph.D.
    2022 Statistical Consulting Section
    Chair-elect
    Consultant and
    Adjunct Associate Professor of Biostatistics
    www.barkerstats.com


    ---
    "In composition you have all the time you want to decide what to say in 15 seconds, in improvisation you have 15 seconds."
    -Steve Lacy
    ------------------------------



  • 7.  RE: SARS-CoV-2 wastewater surveillance, missing data, imputation, and sources of variation

    Posted 06-08-2022 17:45

    Thanks Chris. I found this CDC page that purports to offer downloadable data. Sadly, it seems to be far from raw data; rather, it is a highly pre-processed version with some interpretations already baked in.

    https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6

    Is this where you obtained your dataset?



    ------------------------------
    Christopher Ryan
    Clinical Associate Professor of Family Medicine
    ------------------------------



  • 8.  RE: SARS-CoV-2 wastewater surveillance, missing data, imputation, and sources of variation

    Posted 07-17-2023 13:04

    Hi all. I'm resurrecting this old thread, as there is now a related manuscript (in preprint) using wastewater surveillance to predict (or at least improve the prediction of) COVID-19 hospital admission rates in advance.

    https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4498418

    I can certainly see the value in the concept: with increasing sophistication and sensitivity of molecular methods, wastewater can be collected and assayed for pathogens' nucleic acids relatively easily, and it does not depend on human behavior. For example, you can test sewage influent for SARS-CoV-2 RNA regularly, regardless of whether people in the community are getting tested.

    I'd be interested in any opinions about the modeling methods--aside from the issues of handling non-quantifiable values as previously discussed. There's a lot here, and I'm still working through the manuscript, but my initial, vague thoughts, possibly misguided, are:

    1. It's a lot of model-fitting. Over-fitting?
    2. The researchers say they used stepwise methods for variable selection
    3. I'd think there'd be much collinearity between some of the predictors. Effect on stability of coefficient estimates?
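
    On point 3, a quick Python sketch of why I worry. With two predictors, the variance inflation factor is 1 / (1 - r^2), where r is their correlation. The series below are invented stand-ins (think of two closely related weekly wastewater measures), not the preprint's actual predictors.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from centered sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for either predictor in a two-predictor linear model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# HYPOTHETICAL predictors that move almost in lockstep.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
print(vif_two_predictors(x1, x2))
```

    A large value means the coefficient variances are inflated by that factor relative to uncorrelated predictors, which is the instability I am asking about.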

    Thanks.
    --Chris Ryan



    ------------------------------
    Christopher Ryan
    Clinical Associate Professor of Family Medicine
    SUNY Upstate Clinical Campus
    ------------------------------