Responding to:
How to deal with the old evil '0' which implies a control in such fields. Some standard statistical packages add a scaling or fudge factor (f) to the "dose" metamer so that instead of 0, 1, 10, 100,... etc. one now deals with f, 1+f, 10+f, 100+f, ... etc. Since 'f' is a constant scaling factor, it has no contribution to the variances.
f is not "constant scaling factor," it is an additive factor. While the standard deviation of Y+f, sd(Y+f), does not depend on f, the issue here is about sd(log(Y+f)), which certainly depends on f.
For example:
# Generate X~logNormal, but rounding to 0.1 creates x = 0.00 values.
x <- round(exp(rnorm(100000, 0.1, 1.1)),1)
f <- c(0.00, 0.001, 0.005, 0.01, 0.05, 0.10)
SD <- sd(log(x+f[1]))
for (i in 2:length(f)) { SD <- c(SD, sd(log(x+f[i]))) }
data.frame(f,SD=round(SD,3))
# f SD
# 1 0.000 NaN
# 2 0.001 1.138
# 3 0.005 1.110
# 4 0.010 1.095
# 5 0.050 1.029
# 6 0.100 0.973
Given below is an R function that handles the issue by Winsorizing only the X = 0.00 values. It is completely data-driven, i.e., there is no need to arbitrarily choose a value (here, f) for some patch-up repair like log(X + f).
logWin0 <- function(X) {
# Returns log(X) after Winsorizing X = 0.00 values so that log(0.00) is
# set as follows. Let X.u be the sorted unique values of X. Thus, X.u[1] = 0.00; X.u[2]
# < X.u[3] are the next smallest unique values of X. To equalize the spacing
# between log(X.u[1]), log(X.u[2]), and log(X.u[3]), make
# log(X.u[1]) = 2*log(X.u[2]) - log(X.u[3]).
# This is equivalent to Winsorizing X = 0.00 to X' = w*X.u[2], where w =
# X.u[2]/X.u[3] < 1. So, if X.u[2] and X.u[3] are far apart, then X' is closer
# to 0.00. If X.u[2] and X.u[3] are nearly equal, then X' will be slightly
# less than X.u[2]. See examples.
# Ralph O'Brien, 23 December 2018, obrienralph@gmail.com
if (any(X < 0)) { stop("At least one X is negative.") }
logX <- numeric(length(X))
X.u <- sort(unique(X))  # sort so that X.u[1] = 0.00 and X.u[2] < X.u[3]
if (length(X.u) > 2) {
w <- X.u[2]/X.u[3]
} else { stop("X does not have at least 3 unique values.") }
logX[X!=0] <- log(X[X!=0])
logX[X==0] <- log(w*min(X[X!=0]))
return(logX)
RunExamples <- FALSE
if (RunExamples) {
# Generate 100 X ~ logNormal observations, but rounding to 0.1 creates
# one X = 0 value.
set.seed(170322)
(x <- sort(round(exp(rnorm(100, 0.1, 1.1)),1)))
# [1] 0.0 0.1 0.1 0.2 0.2 0.2 0.3 0.3 0.4 0.4 0.4 0.4 0.4
# [14] 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
# [27] 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.7 0.7
# [40] 0.7 0.7 0.8 0.8 0.9 0.9 1.0 1.1 1.1 1.1 1.1 1.1 1.2
# [53] 1.3 1.4 1.4 1.4 1.4 1.5 1.6 1.6 1.6 1.7 1.7 1.8 1.9
# [66] 1.9 1.9 2.0 2.0 2.1 2.4 2.4 2.4 2.4 2.5 2.5 2.5 2.5
# [79] 2.6 2.9 3.1 3.1 3.3 3.5 3.6 3.9 3.9 4.1 4.1 4.3 4.8
# [92] 5.0 5.8 5.9 6.6 7.4 8.2 21.0 21.5 34.8
# Example 1. Use x as generated. Note the single X = 0 value. The first positive
# unique value (0.10) is half of the next unique value (0.20), so
# x = 0 is Winsorized to x' = 0.50*0.10 = 0.05. No other values are changed.
log.X <- logWin0(x)
data.frame(X=x[1:6], X.Win=exp(log.X[1:6]), logX.Win=log.X[1:6])
# X X.Win logX.Win
# 1 0.0 0.05 -2.995732
# 2 0.1 0.10 -2.302585
# 3 0.1 0.10 -2.302585
# 4 0.2 0.20 -1.609438
# 5 0.2 0.20 -1.609438
# 6 0.2 0.20 -1.609438
# Example 2. Make first two positive x values different but close together.
x. <- x
x.[2] <- 0.90*x.[3]
log.X <- logWin0(x.)
data.frame(X=x.[1:6], X.Win=exp(log.X[1:6]), logX.Win=log.X[1:6])
# X X.Win logX.Win
# 1 0.00 0.081 -2.513306
# 2 0.09 0.090 -2.407946
# 3 0.10 0.100 -2.302585
# 4 0.20 0.200 -1.609438
# 5 0.20 0.200 -1.609438
# 6 0.20 0.200 -1.609438
# Example 3. Set first two positive x values different and far apart.
x. <- x
x.[2] <- 0.10*x.[3]
log.X <- logWin0(x.)
data.frame(X=x.[1:6], X.Win=exp(log.X[1:6]), logX.Win=log.X[1:6])
# X X.Win logX.Win
# 1 0.00 0.001 -6.907755
# 2 0.01 0.010 -4.605170
# 3 0.10 0.100 -2.302585
# 4 0.20 0.200 -1.609438
# 5 0.20 0.200 -1.609438
# 6 0.20 0.200 -1.609438
} # end RunExamples
} # end logWin0()
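As a quick usage check (this bit is mine, not part of the function above): applying logWin0() to the unsorted 100,000-value x simulated in the sd(log(x + f)) demo at the top gives a finite standard deviation, whereas sd(log(x)) was NaN.
sd(log(x))      # NaN, since log(0) = -Inf
sd(logWin0(x))  # finite; the zeros have been Winsorized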
------------------------------
Ralph O'Brien
Professor of Biostatistics (officially retired; still keenly active)
Case Western Reserve University
http://rfuncs.weebly.com/about-ralph-obrien.html
------------------------------
Original Message:
Sent: 12-22-2018 10:47
From: Ajit Thakur
Subject: A classroom discussion problem or quiz question?
In the preclinical sciences (toxicology, immunology, biochemistry, etc.), many situations arise where investigators and statisticians feel the need for a logarithmic transformation of the ordinate, the abscissa, or both. Such transformations offer certain advantages in analysis of variance/covariance and in regression analyses for determining a dose response (i.e., trend). Here are some of the reasons:
1. Linearization of the dose-response curve for use in dose extrapolation/interpolation. An exponential response can be linearized using the log-logistic or log-probit transformations, as is generally done in estimating median lethal/effective doses and their confidence intervals. In biochemistry and some other biological systems one deals with simple linear decays of radioactivity and other particles, expressed as a first-order linear ordinary differential equation such as dx/dt = -k*x(t), which has the exponential solution x(t) = C*exp(-kt); taking logarithms gives ln(x(t)) = ln(C) - kt, where k is the decay rate (a constant) and C is the constant of integration. (A short R sketch illustrating this follows the list below.)
2. Getting rid of (or minimizing) heteroscedasticity of the error variances; homoscedasticity is a requirement for the standard univariate analyses.
3. Producing equal or approximately equal spacing along the X-axis by log-transforming that axis. The design often uses geometric or other unequal spacing (such as 0, 1, 10, 100, ...). Equal spacing of the independent variable improves the statistical properties of the design and is easier to handle programmatically.
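A small illustrative R sketch of point 1 (this sketch and its made-up constants are mine, not part of the original message): regressing log(x) on t recovers the decay rate k from x(t) = C*exp(-kt).
# Simulate first-order decay with multiplicative noise, then fit log(x) on t.
set.seed(1)
k <- 0.3; C <- 100
t <- seq(0, 10, by = 0.5)
x <- C * exp(-k * t) * exp(rnorm(length(t), 0, 0.05))
fit <- lm(log(x) ~ t)
coef(fit)  # intercept should be near log(C) = 4.61, slope near -k = -0.3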
There may be other reasons for such transformations. However, some problems arise that people do not always point out. For example:
1. How to deal with the old evil '0' which implies a control in such fields. Some standard statistical packages add a scaling or fudge factor (f) to the "dose" metamer so that instead of 0, 1, 10, 100,... etc. one now deals with f, 1+f, 10+f, 100+f, ... etc. Since 'f' is a constant scaling factor, it has no contribution to the variances.
2. If your data actually are derived from a normal (or at least a symmetric) distribution, log-transforming them produces a skewed distribution. In other words, although you may be improving the homoscedasticity of the error variances, you may be producing a long-tailed (skewed, asymmetric) distribution. It is always an excellent idea to check normality under such transformations, either graphically (e.g., by examining the residuals) or with a formal test. (A small sketch follows this list.)
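A small illustrative sketch of point 2 (again mine, with made-up simulated data, not part of the original message): after log-transforming the response, a normal Q-Q plot of the residuals gives the graphical check, and shapiro.test() a formal one.
# Simulate a roughly lognormal dose-response, fit on the log scale,
# and inspect the residuals for normality.
set.seed(2)
dose <- rep(c(1, 10, 100), each = 20)
y <- exp(1 + 0.5*log(dose) + rnorm(60, 0, 0.4))
fit <- lm(log(y) ~ log(dose))
qqnorm(resid(fit)); qqline(resid(fit))  # graphical check
shapiro.test(resid(fit))                # formal check, if desired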
In summary, then, there may not be a magic solution; it has to be handled case by case.
------------------------------
Ajit K. Thakur, Ph.D.
Retired Statistician
Original Message:
Sent: 12-20-2018 17:42
From: Martha Smith
Subject: A classroom discussion problem or quiz question?
For more on lognormal distributions and more examples where they are appropriate, see https://web.ma.utexas.edu/users/mks/ProbStatGradTeach/LognormalDistributions1.pdf (a handout I used in a summer course for secondary math teachers).
------------------------------
Martha Smith
University of Texas
Original Message:
Sent: 12-17-2018 10:45
From: Ralph O'Brien
Subject: A classroom discussion problem or quiz question?
Saw this on a TV ad and can't resist sharing. At first, the plot is pretty stupid. However, what labeling of the X-axis would make it reasonable? Strong hint: 770/148 = 5.2; 4100/770 = 5.3. Of course, a TV ad would never do that.
------------------------------
Ralph O'Brien
Professor of Biostatistics (officially retired; still keenly active)
Case Western Reserve University
http://rfuncs.weebly.com/about-ralph-obrien.html
------------------------------