We could find a distribution to fit. If nothing else, R's logspline package has a very effective semiparametric approach. But why do you want to fit the distribution? What is your research question?

It sounds like relating A1c to other covariates is the main interest. Besides nonparametric methods, which can be difficult to interpret, you could use quantile regression to fit, say, the 10%, 50%, and 90% quantiles as a function of covariates. I find this appealing because in biology we sometimes don't get a clean mean shift with constant variance as covariates change; instead, a subset of cases shifts while its complement does not, so the shape of the distribution itself changes with the covariates.

A related possibility is the gamlss family of R packages, which allows location, scale, and shape (skewness and kurtosis) to vary smoothly or discretely with predictor variables. I'm still learning about it. It offers a large set of distributions parameterized by location, scale, and one or both of skewness and kurtosis, organized by their ranges (unrestricted, >0, and [0, 1]). Some care in the choice of distribution is still required, though. One of the distributions in that set could probably also be fitted to the marginal distribution.
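To make the "a subset shifts while its complement does not" point concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the lognormal baseline, the 30% affected fraction, the size of the shift, and the function names are all made up, and the numbers are not real A1c data. It relies on the fact that with a single binary covariate, a quantile-regression fit essentially reduces to comparing the two groups' sample quantiles, so only Python's standard library is needed.

```python
import random
import statistics

random.seed(1)  # reproducible simulated data; not real A1c values


def simulate_a1c(n, exposed):
    """Simulate A1c (%) for n patients (purely hypothetical numbers).

    Everyone starts from the same baseline distribution. In the exposed
    group, roughly 30% of patients get an extra upward shift; the rest
    are left unchanged.
    """
    values = []
    for _ in range(n):
        a1c = random.lognormvariate(2.0, 0.12)  # median about exp(2.0) = 7.4
        if exposed and random.random() < 0.30:
            a1c += random.uniform(2.0, 4.0)  # only this subset shifts
        values.append(a1c)
    return values


def quantiles_10_50_90(values):
    """Return the 10%, 50%, and 90% sample quantiles."""
    cuts = statistics.quantiles(values, n=10)  # 9 cut points: 10%, ..., 90%
    return cuts[0], cuts[4], cuts[8]


unexposed = simulate_a1c(5000, exposed=False)
exposed = simulate_a1c(5000, exposed=True)

q_unexposed = quantiles_10_50_90(unexposed)
q_exposed = quantiles_10_50_90(exposed)

# With one binary covariate, the "slope" at each quantile level is just
# the difference between the two groups' quantiles at that level.
shifts = [e - u for e, u in zip(q_exposed, q_unexposed)]
for level, shift in zip(("10%", "50%", "90%"), shifts):
    print(f"{level} quantile shift: {shift:+.2f}")
```

In this setup the 90% quantile moves far more than the 10% quantile, which a single mean-shift coefficient would hide. With continuous or multiple covariates you would use a proper quantile-regression routine rather than per-group quantiles.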

------------------------------

Jim Garrett

------------------------------

Original Message:

Sent: 08-28-2024 16:18

From: Christopher Ryan

Subject: distributions for modeling percent (or proportion) of rather narrow plausible range--specifically, hemoglobin A1c in diabetics

I'm helping a colleague and his team of residents and medical students with a cross-sectional chart review study of hemoglobin A1c levels in patients with type 2 diabetes, and how (if at all) they relate to a variety of "social determinants of health": transportation problems, substance use, various measures of wealth/poverty, and so on.

For those unfamiliar with it, hemoglobin A1c is a measure of what percent of one's hemoglobin molecules are glycosylated, that is, have glucose molecules hooked to them. It's a permanent bond, persisting for the life of that hemoglobin molecule, which is about 3 months. So it's a measure of long-term diabetes control: lower hemoglobin A1c = better diabetes control, and vice versa.

By rights, hemoglobin A1c is a percentage: it can't be less than zero or greater than 100. In clinical reality, values less than about 4 or greater than about 16 are biologically implausible; in 25 years of practice I don't think I've ever seen one outside that range.

I've always been intrigued by the beta distribution and keep my eye out for opportunities to use it. It has a theoretical attraction for proportions/percentages. Any experience with, or thoughts about, modeling a percent outcome variable with the beta distribution, when the plausible range is narrow like this?

The log-normal also occurs to me. Using the fitdistrplus package in R, the observed data seem a tad closer to the lognormal theoretical CDF than to the beta or the normal. (I've attached two quick and dirty graphs. These are unconditional plots, of course, with no predictors, so they might not be definitive.)

I'd welcome any thoughts. Thanks.

------------------------------

Christopher Ryan

Agency Statistical Consulting, LLC

------------------------------