ASA Connect

  • 1.  Hypothesis testing

    Posted 05-22-2020 13:43
    Hello,

    I have a database of about 30,000 women who have tracked their weight throughout pregnancy. I have other information such as age and BMI group (BMI before pregnancy). The ages are grouped into 5-year brackets, so in total there are 5 age groups and 4 BMI groups. The number of observations per group is uneven.

    I want to test whether the weight difference differs statistically between the age/BMI groups. My null hypothesis is that the mean weight difference is the same across the 5 age groups. I am currently using ANOVA in R. The test shows me the message "Estimated effects may be unbalanced".

    Am I using the correct test? Is there any other test that works best with the unbalanced data?


    Thank you,
    Tanaya

    ------------------------------
    Tanaya Kavathekar
    George Washington University
    ------------------------------


  • 2.  RE: Hypothesis testing

    Posted 05-25-2020 06:55
    Which SS type did you select - II or III? Type II provides adjusted main effects similar to running SS(A |B) and SS(B|A) in regression. With unequal n's in the cells of a two-way ANOVA, there is no unique analysis for the main effects (they are confounded). Check on-line for info on Type I to III SS in ANOVA.
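
    As a rough illustration of why this matters with unequal n's, here is a minimal Python sketch (plain standard library, not R) that computes SS(A), SS(A|B), SS(B) and SS(B|A) by comparing residual sums of squares of nested regression models. The tiny 2x2 data set and the helper names design and rss are made up for illustration and have nothing to do with the data in the question.

```python
# Toy unbalanced 2x2 layout (hypothetical data): rows are (a, b, y),
# with cell sizes 4, 1, 1, 4.
rows = [(0, 0, 10), (0, 0, 11), (0, 0, 12), (0, 0, 11), (0, 1, 14),
        (1, 0, 12), (1, 1, 16), (1, 1, 17), (1, 1, 15), (1, 1, 16)]
y = [float(r[2]) for r in rows]


def design(use_a, use_b):
    """Intercept plus 0/1 dummy columns for the requested main effects."""
    return [[1.0] + ([float(a)] if use_a else []) + ([float(b)] if use_b else [])
            for a, b, _ in rows]


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    n, p = len(X), len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         + [sum(X[k][i] * y[k] for k in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((y[k] - sum(b * x for b, x in zip(beta, X[k]))) ** 2
               for k in range(n))


rss_0 = rss(design(False, False), y)   # intercept only
rss_a = rss(design(True, False), y)    # A only
rss_b = rss(design(False, True), y)    # B only
rss_ab = rss(design(True, True), y)    # A + B, no interaction

ss_a = rss_0 - rss_a            # A entered first (sequential)
ss_a_given_b = rss_b - rss_ab   # A adjusted for B
ss_b = rss_0 - rss_b            # B entered first (sequential)
ss_b_given_a = rss_a - rss_ab   # B adjusted for A

print("SS(A)   =", round(ss_a, 1), "  SS(A|B) =", round(ss_a_given_b, 1))
print("SS(B)   =", round(ss_b, 1), "  SS(B|A) =", round(ss_b_given_a, 1))
```

    On this toy data SS(A) is 32.4 while SS(A|B) is only 3.6, so a sequential table changes with the order in which the factors are entered, whereas SS(A|B) and SS(B|A) are the adjusted main effects described above.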

    ------------------------------
    Chauncey Dayton
    ------------------------------



  • 3.  RE: Hypothesis testing

    Posted 05-26-2020 08:23
    Tanaya

    This post on Cross Validated (Stack Exchange) may be relevant to you.

    https://stats.stackexchange.com/questions/76640/estimated-effects-may-be-unbalanced-message-when-running-aov-in-r-what-does-i

    Blaise

    ------------------------------
    Blaise Egan
    Lead Data Scientist
    British Telecommunications PLC
    ------------------------------



  • 4.  RE: Hypothesis testing

    Posted 05-26-2020 12:05
    The adjustment implied by "Type II provides adjusted main effects similar to running SS(A |B) and SS(B|A)..." is often misinterpreted. What is really tested by both SS(A|B) and SS(B|A) are the marginal means from a table of AB cell means ADJUSTED to have no AB interaction!

    For details see Turner, David L. (1990) 'An easy way to tell what you are testing in analysis of variance', Communications in Statistics - Theory and Methods, 19:12, 4807 - 4832
    URL: http://dx.doi.org/10.1080/03610929008830475


    ------------------------------
    David L. Turner
    USDA Forest Service Research Statistician, Retired
    ------------------------------



  • 5.  RE: Hypothesis testing

    Posted 05-25-2020 08:08
    Hello Tanaya,

    I would suggest using lm() (i.e., regression) instead.

    Robert

    ------------------------------
    Robert O'Brien
    ------------------------------



  • 6.  RE: Hypothesis testing

    Posted 05-25-2020 08:29
    Balance is not required for estimation or testing of these effects. Balance achieves the minimum standard errors, but it might be impractical or impossible to achieve.

    Your data suggests that a two-way ANOVA with an interaction term is possible. This analysis is easily performed with R.
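
    To make the point about balance and standard errors concrete, here is a small Python sketch. It assumes a common within-group standard deviation sigma, which is an assumption of this illustration and not something known about the data in question. For a fixed total sample size, the standard error of a difference between two group means is smallest for an even split.

```python
import math


def se_diff(n1, n2, sigma=1.0):
    """Standard error of the difference between two independent group means,
    assuming a common standard deviation sigma in both groups."""
    return sigma * math.sqrt(1.0 / n1 + 1.0 / n2)


total = 100  # hypothetical total sample size
for n1 in (50, 70, 90):
    n2 = total - n1
    print(n1, n2, round(se_diff(n1, n2), 3))
```

    With 100 observations in total, the 50/50 split gives a standard error of 0.2 sigma, while a 90/10 split gives about 0.333 sigma. The unbalanced design is still estimable, just less precise.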

    ------------------------------
    Mark Bailey
    Principal Analytical Training Consultant
    SAS Institute, Inc.
    ------------------------------



  • 7.  RE: Hypothesis testing

    Posted 05-25-2020 10:32
    Hello Tanaya,

    I have questions rather than answers:
    1. Why would you do hypothesis testing?
         a. I doubt that the sample is random.
         b. It seems to me that it was not a planned situation.
         c. What would be the target population for the intended inference?

    2. I would rather calculate some descriptive statistics.
         a. With 30,000 averages (I suppose that you are taking the average for each woman; otherwise your degrees of freedom will be inflated) and 20 groups (5x4), there are, on average, 1,500 women per group. With those data you can get an idea of the corresponding population distribution.
         b. Most probably there are other grouping factors (ethnicity, etc.), so you will still have some mixture in your 20 groups.
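
    A descriptive summary per age/BMI cell, as suggested in point 2, could be sketched in Python along these lines. The records, the bracket labels and the variable names are hypothetical placeholders, with one row per woman.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (age_group, bmi_group, weight_change_kg), one per woman.
records = [
    ("20-24", "normal", 12.5), ("20-24", "normal", 14.0), ("20-24", "obese", 8.0),
    ("25-29", "normal", 13.0), ("25-29", "normal", 11.5), ("25-29", "normal", 15.5),
    ("25-29", "obese", 7.5), ("25-29", "obese", 9.5),
]

# Collect the weight changes belonging to each (age group, BMI group) cell.
cells = defaultdict(list)
for age, bmi, change in records:
    cells[(age, bmi)].append(change)

summary = {}
for key, values in sorted(cells.items()):
    summary[key] = {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # The sample SD needs at least two observations in the cell.
        "sd": statistics.stdev(values) if len(values) > 1 else float("nan"),
    }
    print(key, summary[key])
```

    Reporting n alongside each mean keeps the uneven cell sizes visible.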

    Regards,

    Rene Valverde-Ventura

    ------------------------------
    Rene Valverde-Ventura
    ------------------------------



  • 8.  RE: Hypothesis testing

    Posted 05-25-2020 10:47
    Tanaya,
    Yes on the SS calculation recommendations offered by Chauncey. But with your sample size any tiny difference will be significant so I doubt "significant" differences will tell you anything beyond that you have N=30,000. Maybe consider mean (or median, depending on how weight is distributed) values by condition with bootstrapped CIs.
    Also, BMI is calculated from weight so will be a very good predictor of weight 9 months later; not sure what the BMI effect in your ANOVA model would mean. 
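
    The bootstrapped CI idea mentioned above might look roughly like this in Python. It is a minimal percentile-bootstrap sketch: the function name bootstrap_ci, its defaults and the simulated weight changes are all assumptions for illustration, not the actual data.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    reps = sorted(stat(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = reps[int(n_boot * alpha / 2)]
    upper = reps[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# Simulated weight changes (kg) for one hypothetical age/BMI cell.
rng = random.Random(42)
gains = [rng.gauss(12.0, 4.0) for _ in range(300)]

lo, hi = bootstrap_ci(gains)
print("mean:", round(statistics.mean(gains), 2),
      "95% CI:", (round(lo, 2), round(hi, 2)))
```

    Passing stat=statistics.median gives the interval for the median instead, which may suit a skewed weight distribution better.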
    Bruce

    ------------------------------
    Evan Blaine, PhD, PStat
    Statistics Program Director
    St. John Fisher College
    Rochester NY
    ------------------------------



  • 9.  RE: Hypothesis testing

    Posted 05-25-2020 12:39
    It is not clear from your e-mail what type of ANOVA you are doing in R. Hopefully it is not a single-classification ANOVA but a factorial ANOVA, and you are testing not only the simple main effects but also the interaction (Age Group x BMI index). The imbalance warning you are getting relates mostly to this interaction term in your model. It is simply pointing out that the imbalance in your data may produce F-statistics that are not exactly F-distributed. With your large sample sizes, I would not be too worried about it. In real-world data you may never have sets that are completely balanced. Also, well-known packages such as BMDP, SAS, SPSS, and many others (I am sure both R and S-PLUS as well) have procedures with pseudo F-statistics that may be better behaved than the classical F-statistics. This may be especially true if you also have repeated measures in your model (do you?).

    You have two sources of imbalance. One is in the Age factor and the other is in the BMI index. These two factors are also imbalanced between themselves (5 and 4 levels). Is it possible to have an equal number of levels for both of them by regrouping your data? If you can, that will minimize your source of imbalance. Unequal sample sizes within each factor are going to be a fact of real life, but given such large sample sizes, the impact of the imbalance will be minimal.

    There may be other statisticians with other solutions. In any case, please specify your model statement and what kind of post hoc hypotheses you are testing so that the community can make other suggestions. One further point: if you are collaborating with clinicians in your research, I would use the simplest yet defensible model so that there will be no ambiguity among your collaborators. Hope it helps.

    Ajit K. Thakur, Ph.D.
    Retired Statistician

    ------------------------------
    Ajit K. Thakur, Ph.D.
    Retired Statistician
    ------------------------------