ASA Connect

  • 1.  A classroom discussion problem or quiz question?

    Posted 12-17-2018 10:45
    Saw this in a TV ad and can't resist sharing. At first glance, the plot looks pretty stupid. However, what labeling of the X-axis would make it reasonable? Strong hint: 770/148 = 5.2; 4100/770 = 5.3. Of course, a TV ad would never do that.

    ------------------------------
    Ralph O'Brien
    Professor of Biostatistics (officially retired; still keenly active)
    Case Western Reserve University
    http://rfuncs.weebly.com/about-ralph-obrien.html
    ------------------------------


  • 2.  RE: A classroom discussion problem or quiz question?

    Posted 12-18-2018 13:50
    The X-axis must be logarithmic.

    ------------------------------
    Emil M Friedman, PhD
    emilfriedman@gmail.com
    http://www.statisticalconsulting.org
    ------------------------------



  • 3.  RE: A classroom discussion problem or quiz question?

    Posted 12-19-2018 11:53
    Correct, Emil. Thanks for responding.

    If the X-axis had been labeled, it could have read "Roofing Costs (log-scaling)". With respect to measurement properties, this implies that roofing cost is ratio scaled, meaning that the difference between $x and $2x is the same "true impact" as the difference between $2x and $4x. This is not so if I were paying.
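    A quick numerical check (using the values 148, 770, and 4100 from the hint above) shows why equal spacing is defensible on a log scale:

    # On the log scale the successive gaps are nearly equal, so the three values
    # would sit almost evenly spaced along a logarithmic X-axis.
    diff(log(c(148, 770, 4100)))   # roughly 1.65 and 1.67, i.e., nearly equal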

    But it is often justifiable with biological measurements, such as concentrations or counts. (Depends on various factors.) And when used wisely, log transforming variables and then exponentiating the results often leads to summary statements that are easier for everyone to comprehend, remember, and act upon.

    Here's an example based on using the Welch t test. A sentence in the abstract might be, "The geometric mean for group A was 39.2% greater than for Group B; 95% CI: [12.5%, 72.4%]." A little more thought and effort on our part leads to more straightforward communications to our investigators and their audiences.


    n <- c(38, 37)
    group <- rep(c("A","B"),n)
    Median1 <- 50       # Any positive value yields the same ratio of geometric means.
    TrueEffect <- 1.40  # Group A's geometric mean is 40% greater than group B's.
    RelSpread95 <- 6    # ratio of the 0.975 and 0.025 quantiles of Y ~ logNormal
    SD.logY <- log(RelSpread95)/(1.96*2)  # SD of log(Y) ~ Normal
    set.seed(170322)
    Y <- exp(c(rnorm(n[1], log(Median1*TrueEffect), SD.logY),
               rnorm(n[2], log(Median1), SD.logY)))
    (Welcht <- t.test(log(Y) ~ group))
    # Welch Two Sample t-test

    # data:  log(Y) by group
    # t = 3.0919, df = 69.573, p-value = 0.002861
    # alternative hypothesis: true difference in means is not equal to 0
    # 95 percent confidence interval:
    #  0.1174653 0.5445459
    # sample estimates:
    # mean in group A mean in group B 
    #        4.307968        3.976962

    # ratio of geometric means
    exp(Welcht$estimate[1] - Welcht$estimate[2])
    # 1.392368

    # 95% CI for ratio of geometric means
    exp(Welcht$conf.int)
    # [1] 1.124643 1.723825



    ------------------------------
    Ralph O'Brien
    Professor of Biostatistics (officially retired; still keenly active)
    Case Western Reserve University
    http://rfuncs.weebly.com/about-ralph-obrien.html
    ------------------------------



  • 4.  RE: A classroom discussion problem or quiz question?

    Posted 12-19-2018 12:23
    Log-scaling is also common when looking at polymer molecular weight distributions. It also means that terms like "bimodal" can be ambiguous unless one specifies what sort of scale is used for the X-axis (see "Modality of Molecular Weight Distributions", Emil M Friedman, Polymer Engineering and Science, 30, 569 (1990), http://dx.doi.org/10.1002/pen.760301002, attached).



    ------------------------------
    Emil M Friedman, PhD
    emilfriedman@gmail.com
    http://www.statisticalconsulting.org
    ------------------------------




  • 5.  RE: A classroom discussion problem or quiz question?

    Posted 12-20-2018 17:42
    For more on lognormal distributions and more examples where they are appropriate, see https://web.ma.utexas.edu/users/mks/ProbStatGradTeach/LognormalDistributions1.pdf (a handout I used in a summer course for secondary math teachers).

    ------------------------------
    Martha Smith
    University of Texas
    ------------------------------



  • 6.  RE: A classroom discussion problem or quiz question?

    Posted 12-22-2018 10:48
    In the preclinical sciences (toxicology, immunology, biochemistry, etc.), many situations arise in which investigators and statisticians feel the need for a logarithmic transformation of the ordinate, the abscissa, or both. Such transformations offer certain advantages in analysis of variance/covariance and in regression analysis for determining a dose response (i.e., a trend). Here are some of the reasons:

    1. Linearization of the dose-response curve for use in dose extrapolation/interpolation.  An exponential response can be linearized using the log-logistic or log-probit transformations, as is generally done in estimating the median lethal/effective dose and its confidence interval.  In biochemistry and some other biological systems one deals with simple first-order decays of radioactivity and other particles, expressed as the linear ordinary differential equation dx/dt = -k x(t), which has the exponential solution x(t) = C exp(-kt); taking logarithms gives ln x(t) = ln C - kt, where k is the decay rate (a constant) and C is the constant of integration.  (A minimal sketch of fitting such a decay follows this list.)
    2. Getting rid of (or minimizing) heteroscedasticity of the error variances; homoscedasticity is a requirement for the standard univariate analyses.
    3. Producing equal or approximately equal spacing on the X-axis by logarithmically transforming that axis.  Often the design uses geometric or otherwise unequal spacing of doses (such as 0, 1, 10, 100, ...).  Equal spacing of the independent variable helps to bring about optimal statistics and is easier to handle programmatically.
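    A minimal sketch of reason 1 (the values C = 100, k = 0.3, and the noise level are illustrative choices, not from any real assay), fitting a first-order decay on the log scale by ordinary least squares:

    # A first-order decay x(t) = C*exp(-k*t) becomes linear after taking logs,
    # so C and k can be estimated with lm(). Multiplicative error keeps the
    # log-scale residuals roughly homoscedastic (cf. reason 2).
    set.seed(1)
    t <- seq(0, 10, by = 0.5)
    x <- 100 * exp(-0.3*t) * exp(rnorm(length(t), 0, 0.05))
    fit <- lm(log(x) ~ t)
    exp(coef(fit)[1])   # back-transformed intercept estimates C
    -coef(fit)[2]       # slope (sign-flipped) estimates the decay rate k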

    There may be other reasons for such transformations.  However, some problems can arise that people do not always point out.  For example:

    1. How to deal with the old evil '0' which implies a control in such fields.  Some standard statistical packages add a scaling or fudge factor (f) to the "dose" metamer so that instead of 0, 1, 10, 100,... etc. one now deals with f, 1+f, 10+f, 100+f, ... etc. Since 'f' is a constant scaling factor, it has no contribution to the variances.
    2. If your data actually come from a normal (or at least a symmetric) distribution, then taking logarithms produces a skewed (asymmetric) distribution.  In other words, even while you are producing homoscedasticity of the error variances, you may be producing a long-tailed distribution.  It is always an excellent idea to check normality after such transformations, graphically (e.g., by examining the residuals) or with a formal test; see the sketch below.
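    A minimal sketch of that check (the dose levels, coefficients, and error structure are illustrative assumptions):

    # After a log transformation, inspect the residuals graphically and, if
    # desired, with a formal test of normality.
    set.seed(2)
    dose <- rep(c(1, 10, 100), each = 10)
    resp <- exp(1 + 0.5*log(dose) + rnorm(30, 0, 0.3))   # multiplicative errors
    fit  <- lm(log(resp) ~ log(dose))
    qqnorm(residuals(fit)); qqline(residuals(fit))       # graphical check
    shapiro.test(residuals(fit))                         # formal test (use with care)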

    In summary, there may be no magic solution; it has to be handled case by case.

    ------------------------------
    Ajit K. Thakur, Ph.D.
    Retired Statistician
    ------------------------------



  • 7.  RE: A classroom discussion problem or quiz question?

    Posted 12-23-2018 09:14
    Responding to:

    How to deal with the old evil '0' which implies a control in such fields.  Some standard statistical packages add a scaling or fudge factor (f) to the "dose" metamer so that instead of 0, 1, 10, 100,... etc. one now deals with f, 1+f, 10+f, 100+f, ... etc. Since 'f' is a constant scaling factor, it has no contribution to the variances.

    f is not a "constant scaling factor"; it is an additive constant. While the standard deviation of Y + f, sd(Y + f), does not depend on f, the issue here is about sd(log(Y + f)), which certainly does depend on f.

    For example:

    # Generate X~logNormal, but rounding to 0.1 creates x = 0.00 values.
    x <- round(exp(rnorm(100000, 0.1, 1.1)),1)
    f <- c(0.00, 0.001, 0.005, 0.01, 0.05, 0.10)
    SD <- sd(log(x+f[1]))
    for (i in 2:length(f)) { SD <- c(SD, sd(log(x+f[i]))) }
    data.frame(f,SD=round(SD,3))
    #             f         SD
    # 1   0.000      NaN
    # 2   0.001    1.138
    # 3   0.005    1.110
    # 4   0.010    1.095
    # 5   0.050    1.029
    # 6   0.100    0.973

    Given below is an R function that handles the issue by Winsorizing only the X = 0.00 values. It is completely data driven, i.e., there is no need to arbitrarily define a value (here, f) for some patch-up repair like log(X + f).

    logWin0 <- function(X) {
      # Returns log(X) after Winsorizing the X = 0.00 values so that log(0.00) is
      # set as follows. Let X.u be the sorted unique values of X, so X.u[1] = 0.00
      # and X.u[2] < X.u[3] are the next smallest unique values of X. To equalize
      # the spacing between log(X.u[1]), log(X.u[2]), and log(X.u[3]), set
      #                 log(X.u[1]) = 2*log(X.u[2]) - log(X.u[3]).
      #
      # This is equivalent to Winsorizing X = 0.00 to X' = w*X.u[2], where
      # w = X.u[2]/X.u[3] < 1. So if X.u[2] and X.u[3] are far apart, X' is closer
      # to 0.00; if X.u[2] and X.u[3] are nearly equal, X' is slightly less than
      # X.u[2]. See the examples below.
      #
      # Ralph O'Brien, 23 December 2018, obrienralph@gmail.com

      if (any(X < 0)) { stop("At least one X is negative.") }
      X.u <- sort(unique(X))  # sorted, so X.u[2] and X.u[3] are the two smallest positive values
      if (length(X.u) < 3) { stop("X does not have at least 3 unique values.") }
      w <- X.u[2]/X.u[3]
      logX <- numeric(length(X))
      logX[X != 0] <- log(X[X != 0])
      logX[X == 0] <- log(w*min(X[X != 0]))
      return(logX)
    }  # end logWin0()
    # Examples (set RunExamples <- TRUE to run them).
    RunExamples <- FALSE
    if (RunExamples) {
      # Generate 100 X ~ logNormal observations; rounding to 0.1 creates
      # one X = 0 value.
      set.seed(170322)
      (x <- sort(round(exp(rnorm(100, 0.1, 1.1)), 1)))
      # [1] 0.0 0.1 0.1 0.2 0.2 0.2 0.3 0.3 0.4 0.4 0.4 0.4 0.4
      # [14] 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
      # [27] 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.7 0.7 0.7 0.7 0.7 0.7
      # [40] 0.7 0.7 0.8 0.8 0.9 0.9 1.0 1.1 1.1 1.1 1.1 1.1 1.2
      # [53] 1.3 1.4 1.4 1.4 1.4 1.5 1.6 1.6 1.6 1.7 1.7 1.8 1.9
      # [66] 1.9 1.9 2.0 2.0 2.1 2.4 2.4 2.4 2.4 2.5 2.5 2.5 2.5
      # [79] 2.6 2.9 3.1 3.1 3.3 3.5 3.6 3.9 3.9 4.1 4.1 4.3 4.8
      # [92] 5.0 5.8 5.9 6.6 7.4 8.2 21.0 21.5 34.8

      # Example 1. Use x as generated. Note the single x = 0 value. The smallest
      # positive unique value (0.10) is half of the next unique value (0.20), so
      # x = 0 is Winsorized to x' = 0.50*0.10 = 0.05. No other values are changed.
      log.X <- logWin0(x)
      data.frame(X = x[1:6], X.Win = exp(log.X[1:6]), logX.Win = log.X[1:6])
      #       X  X.Win   logX.Win
      # 1   0.0   0.05  -2.995732
      # 2   0.1   0.10  -2.302585
      # 3   0.1   0.10  -2.302585
      # 4   0.2   0.20  -1.609438
      # 5   0.2   0.20  -1.609438
      # 6   0.2   0.20  -1.609438

      # Example 2. Make the first two positive x values different but close together.
      x. <- x
      x.[2] <- 0.90*x.[3]
      log.X <- logWin0(x.)
      data.frame(X = x.[1:6], X.Win = exp(log.X[1:6]), logX.Win = log.X[1:6])
      #        X   X.Win   logX.Win
      # 1   0.00   0.081  -2.513306
      # 2   0.09   0.090  -2.407946
      # 3   0.10   0.100  -2.302585
      # 4   0.20   0.200  -1.609438
      # 5   0.20   0.200  -1.609438
      # 6   0.20   0.200  -1.609438

      # Example 3. Make the first two positive x values different and far apart.
      x. <- x
      x.[2] <- 0.10*x.[3]
      log.X <- logWin0(x.)
      data.frame(X = x.[1:6], X.Win = exp(log.X[1:6]), logX.Win = log.X[1:6])
      #        X   X.Win   logX.Win
      # 1   0.00   0.001  -6.907755
      # 2   0.01   0.010  -4.605170
      # 3   0.10   0.100  -2.302585
      # 4   0.20   0.200  -1.609438
      # 5   0.20   0.200  -1.609438
      # 6   0.20   0.200  -1.609438
    }  # end RunExamples

    ------------------------------
    Ralph O'Brien
    Professor of Biostatistics (officially retired; still keenly active)
    Case Western Reserve University
    http://rfuncs.weebly.com/about-ralph-obrien.html
    ------------------------------