Barbara,

I agree with you and Constantine. The flasks are the experimental units, and the cells are the observational units.

I like your idea of a mixed-effects model. I vote for two random effects, one for repetition and one for flask within repetition. I think you have to explain to the investigator that the 60,000 data points, although real, are misleading because they are not independent. As Constantine indicated, you have 12 clusters of 5,000 observations each, not 60,000 independent observations. I don't know if this analogy will work for your investigator, but the 12 flasks with 5,000 cells per flask are much like 12 pregnant rats that one treats before they give birth so that one can see what happens to their offspring from the in-utero exposure. Ask the investigator if she can really consider different pups from the same litter to be independent.
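To put a rough number on "not independent" for the investigator, here is a back-of-the-envelope sketch in Python. It is not the mixed model itself: it uses the standard design-effect formula for equal-sized clusters, 1 + (m - 1) * ICC, and it ignores the repetition level. The intraclass correlation (ICC) values are made up for illustration, since nothing in this thread tells us the real within-flask correlation.

```python
def design_effect(cluster_size: int, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters: int, cluster_size: int, icc: float) -> float:
    """Number of independent observations carrying the same information
    about a mean as n_clusters * cluster_size clustered observations."""
    total = n_clusters * cluster_size
    return total / design_effect(cluster_size, icc)


if __name__ == "__main__":
    # 12 flasks of 5,000 cells, as in the study described above.
    # The ICC values below are hypothetical, chosen only for illustration.
    for icc in (0.0, 0.01, 0.05, 0.20):
        n_eff = effective_sample_size(12, 5000, icc)
        print(f"ICC = {icc:.2f}: effective sample size ~ {n_eff:.0f}")
```

Even a modest within-flask correlation of 0.05 shrinks the 60,000 cells to an effective sample size of roughly 239, and as the correlation approaches 1 the effective sample size approaches the 12 flasks.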

However, although I like your idea of a mixed-effects model, all that it addresses will be how the means of the 5,000 cells/flask vary with treatment. But what if the treatment also affects variability? What if some treatments make the cell expressions tightly clustered around their means while other treatments make the cell expressions really spread out? Or how about skewness? What if some treatments make the cell expressions really right-skewed while other treatments make for roughly symmetric cell-expression distributions?

For that reason, I'd like to propose a second approach, the heart of which is this: The 5,000 cells per flask are not just 5,000 observations, they are a distribution. That distribution has not only a mean, but also a standard deviation (SD), a skewness, a kurtosis, and various percentiles of potential interest, each of which will vary from flask to flask. My proposal is to summarize the distribution within each flask using the above summary measures, and then to treat each summary measure as a flask-level outcome in the analysis. If you want to, you should be able to use the REPEATED statement in SAS PROC MIXED to analyze flask means, SDs, skewnesses, etc., as components of a vector-valued outcome.
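The summarizing step could be sketched as follows, in Python rather than SAS and using only the standard library. The simulated lognormal intensities, the treatment labels A/B/C, and the function names are all hypothetical stand-ins for the real data. The skewness and kurtosis here are the simple moment-based versions, which may differ slightly from the bias-corrected ones that SAS reports.

```python
import random
import statistics


def summarize_flask(values):
    """Return distribution summaries for one flask's cell intensities."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    # Central moments, used for moment-based skewness and excess kurtosis.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    # quantiles(..., n=100) returns the 99 percentile cut points.
    pct = statistics.quantiles(values, n=100)
    return {
        "mean": mean,
        "sd": sd,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
        "p10": pct[9],
        "median": pct[49],
        "p90": pct[89],
    }


def simulate_flasks(cells_per_flask=5000, seed=0):
    """Build one summary row per flask for a 2 reps x 3 treatments x 2 flasks design.

    The lognormal spreads per treatment are invented for illustration only.
    """
    rng = random.Random(seed)
    spread = {"A": 0.4, "B": 0.7, "C": 1.0}  # hypothetical treatment effects on variability
    rows = []
    for rep in (1, 2):
        for treatment in ("A", "B", "C"):
            for flask in (1, 2):
                cells = [rng.lognormvariate(0.0, spread[treatment])
                         for _ in range(cells_per_flask)]
                row = {"rep": rep, "treatment": treatment, "flask": flask}
                row.update(summarize_flask(cells))
                rows.append(row)
    return rows


if __name__ == "__main__":
    for row in simulate_flasks():
        print(row["rep"], row["treatment"], row["flask"],
              f"mean={row['mean']:.3f} sd={row['sd']:.3f} skew={row['skewness']:.2f}")
```

The result is a 12-row table, one row per flask, and each column of that table can then be carried into the flask-level analysis as its own outcome.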

Did you say that the investigator normalized the flow-cytometry intensities to min-max for each flask? On the one hand, ouch, I wish she hadn't done that, but on the other hand, hmmm, each flask must now have a minimum of zero, a maximum of one, and 4,998 values in between that could maybe be modelled using a beta distribution. Hmmm....
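Just to make the beta idea concrete, here is a very rough Python sketch. It assumes that the two endpoint values (the exact 0 and 1 created by the min-max normalization) are simply set aside, since the beta density is usually fit on the open interval (0, 1), and it uses method-of-moments estimates instead of maximum likelihood. Both choices and the function names are my own assumptions, not something the investigator did.

```python
import statistics


def min_max_normalize(values):
    """Rescale values so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def beta_method_of_moments(unit_values):
    """Method-of-moments estimates (alpha, beta) from values inside (0, 1).

    Exact 0s and 1s are dropped before estimating; that is an assumption
    made for this sketch, not a recommendation.
    """
    interior = [x for x in unit_values if 0.0 < x < 1.0]
    m = statistics.fmean(interior)
    v = statistics.variance(interior)
    common = m * (1.0 - m) / v - 1.0
    if common <= 0.0:
        raise ValueError("variance too large for a beta distribution")
    return m * common, (1.0 - m) * common


if __name__ == "__main__":
    # Hypothetical raw intensities for one tiny "flask".
    raw = [3.0, 3.5, 4.0, 4.2, 5.0, 6.5, 9.0]
    scaled = min_max_normalize(raw)
    alpha, beta = beta_method_of_moments(scaled)
    print(f"alpha ~ {alpha:.3f}, beta ~ {beta:.3f}")
```

Keep in mind that the minimum and maximum are themselves sample quantities, so the normalized values depend on each flask's most extreme cells. That is one reason to treat any beta fit here as exploratory.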

Good luck. It sounds like you have an interesting problem.

------------------------------

Eric Siegel, MS

Biostatistics Project Manager

Department of Biostatistics

Univ. Arkansas Medical Sciences

------------------------------

Original Message:

Sent: 02-21-2024 12:46

From: Barbara Graham

Subject: Single-cell analyses - suggestions requested

I am working with a researcher who has done single-cell flow cytometry. She has 6 flasks with 3 different treatments (each treatment repeated in 2 flasks), and each flask contains 5,000 cells. This setup was then repeated, for 12 total flasks. The flow cytometry intensities were "normalized" to min-max for each flask.

The flasks are obviously (to me) the experimental unit, but I am a bit at a loss as to the best way to compare intensities between treatments. Is this a case for a mixed model with flask as a random effect? The researcher is concerned that this type of analysis wipes out the advantage of single-cell analysis (60,000 data points vs. 12 EUs). The measurements in each flask are very right-skewed, and log transformation still results in some heavy tails.

Thoughts? Suggestions? Recommended manuscripts?

Thanks,

------------------------------

Barbara Graham

Biostatistician

Colorado State University

------------------------------