  • 1.  Repeated measurements?

    Posted 01-30-2015 02:33
    This message has been cross-posted to the following eGroups: Statistical Consulting Section and Biometrics Section.
    -------------------------------------------

    Dear all,
    I fear I have a quite specific question concerning a statistical issue. I would be very happy if you could help me with it.

    I am currently analyzing data that are collected annually via surveys in a large company. Every year around 200 people are asked ca. 50 yes/no questions. Then a binomial test is performed for each question to check whether an answer was given significantly more often than expected. So far so good and legitimate.
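
    For readers unfamiliar with the yearly analysis described above, here is a minimal sketch of a one-sided exact binomial test using only the Python standard library. The counts (120 "yes" out of 200) and the expected share of 0.5 are hypothetical values chosen for illustration; the post does not state them.

```python
from math import comb

def binom_upper_p(k: int, n: int, p0: float) -> float:
    """One-sided exact p-value P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

# Hypothetical example: 120 of 200 respondents answer "yes",
# tested against an assumed expected share of 0.5.
p_value = binom_upper_p(120, 200, 0.5)
print(f"one-sided exact p-value: {p_value:.4f}")
```

    The test treats all 200 answers within a year as independent, which is the assumption the rest of the thread questions for the pooled 15-year analysis.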

    What caught my attention was the fact that - besides doing the yearly analyses - this whole thing was planned to run for 15 years and should result in an overall analysis over all years. The sample size N=3000 was calculated in the beginning and was then spread across the 15 years, which is okay I guess. In my opinion the problem lies in the fact that it was not recorded which employee took part in the test multiple times, which certainly happened quite often.

    So now I think I have partially replicated and partially independent data points, but I cannot even say which are which. This is the point at which I am stuck - I do not really know what to search for to find literature for my situation.

    The initial N was calculated assuming independent samples, which some of them are not, as repeated samples tend to be correlated. On the other hand, those samples are not completely similar: although the same person might have taken part multiple times, he/she was always asked about the preceding year and thus not about the very same events in every survey.

    In the end my question is how I can handle this longitudinal data set if I am unaware of the correlation that might be apparent for a portion of my data points.



    -------------------------------------------
    Kerstin Schmidt
    Director
    BioMath GmbH
    -------------------------------------------


  • 2.  RE: Repeated measurements?

    Posted 01-30-2015 05:01
    Dear Kerstin,
    The finite population appears to be crucial in your situation, together with the fact that you don't know the identity of the units sampled.

    From a design-based inference point of view, samples at different time-points are presumably taken independently (SRS each time?), which implies that for example the corresponding sample means are mutually uncorrelated. If you had known the identities of the sample units, a model-based analysis could possibly have utilized that information, but without that knowledge you probably cannot improve on the design-based analysis.

    Parenthetically, note that many surveys repeated over time deliberately use panels which are (partially) the same, in order to improve inference about population changes over time. 

    -------------------------------------------
    Rolf Sundberg
    Stockholm University
    -------------------------------------------




  • 3.  RE: Repeated measurements?

    Posted 01-30-2015 08:24
    Dear Kerstin,

    First, let me compliment you for describing the problem so clearly.

    You can only use the information you have, so the question is how not knowing which answers came from the same person affects your results. Since the problem also exists with the annual analyses, we might start the discussion there.

    If people don't remember last year's answer, you may have less of a problem, but if they remember, the answers may often be the same. You could check for the expected overlap in two independent samples from this finite population and add some ties. With the conventional binomial test it doesn't change the results (unless you treat the ties as 'inexact' and break them evenly), so you have, in fact, just a smaller sample size.
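
    The "expected overlap in two independent samples from this finite population" can be sketched as follows. This assumes each year's sample is a simple random sample without replacement, drawn independently of the other years from a stable workforce of N employees; the company sizes below are hypothetical, since the thread does not give N. Under those assumptions the overlap is hypergeometric with mean n1 * n2 / N.

```python
def expected_overlap(n1: int, n2: int, population: int) -> float:
    """Expected number of people who appear in both of two independent
    simple random samples (sizes n1 and n2) from the same finite population."""
    return n1 * n2 / population

# Hypothetical company sizes; 200 respondents per year as in the original post.
for population in (500, 2_000, 10_000):
    overlap = expected_overlap(200, 200, population)
    print(f"N = {population:>6}: expected overlap between two years = {overlap:.1f}")
```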

    Whether a combined analysis makes sense is an open question. One would hope that issues being identified were addressed the next year, so I'm not so sure what the underlying hypothesis is that would be tested and how the results would be interpreted. One would expect these ties to reduce both the effect size and the variance, so the impact may be less than the impact of using an arbitrary ("conventional") p-value.

    I hope this helps. 

    -------------------------------------------
    Knut M. Wittkowski
    Head, Dept. Biostatistics, Epidemiology, and Research Design
    Rockefeller University
    -------------------------------------------



  • 4.  RE: Repeated measurements?

    Posted 01-30-2015 14:26
    Hello Dr. Wittkowski,

    Exactly my observation! I was very impressed by the clarity with which Kerstin presented the research problem! Thank you for keeping the conversation open on this very interesting study!

    Patricia

    -------------------------------------------
    Patricia Rodriguez de Gil
    University of South Florida
    -------------------------------------------




  • 5.  RE: Repeated measurements?

    Posted 01-31-2015 12:36
    Hi Kerstin,

    We need to understand more about the original intent of the longitudinal study - in particular, what questions was it trying to address at the end of the 15 years? (The sample size calculation presumably reflected these questions - if not, it should be treated with caution). Also, from an inferential standpoint, was each yearly survey trying to assess a fixed inferential target or did the target change across years as a result of the company implementing some sort of intervention (e.g., better training programs, salary increases)?

    Assuming a fixed inferential target across time for this study, if the intent was to track the proportion of participants who responded "yes" to each question over time to see if it changes for the better (or worse) as time progresses, you could collapse the data from each year into a summary (i.e., sample proportion) and then look at the temporal trend in these yearly summaries. Of course, these summaries would be correlated across different time points since they are derived from what we suspect are overlapping sets of employees. But you could look at the yearly summary data for guidance into how to best model the nature and strength of that correlation.

    Some challenges with this approach include the fact that by collapsing your data to a yearly summary, you lose information and hence power, and that proportions are bounded between 0 and 1 so you will need to consider a model which adequately reflects this (e.g., a linear model applied to a transformation of the response meant to shift the response from the 0 - 1 scale to the real scale). This collapsing exercise might possibly give you some insight into the amount of correlation you would expect to see each year among the raw data values from the same person.

    Using this insight, you could conduct some sensitivity analyses in a genuine longitudinal data setting where you could assume no correlation for raw observations within the same subject, as well as different amounts of correlation. Based on these analyses, you could develop an understanding of how your inferential conclusions change as a function of the assumed strength of the within-subject correlation. If they don't change very much, then all is fine. Otherwise, you may decide that collapsing the data is your best bet given the circumstances. Or you may ask the company to conduct a 2-year pilot study where the same 200 subjects are asked to complete the survey so that you can get some sort of upper limit on the strength of the within-subject correlation.

    These are just some ideas worth exploring, though perhaps they may not prove useful to you in the present context. As others have noted, you have an interesting problem which can provide fertile ground for considering creative solutions.

    Isabella

    -------------------------------------------
    Isabella Ghement
    Ghement Statistical Consulting Company Ltd.
    -------------------------------------------
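
    One simple way to sketch the sensitivity analysis suggested in this post is to inflate the variance of the pooled proportion by a design effect, deff = 1 + (m - 1) * rho, where rho is the assumed within-person correlation and m the average number of responses per person. This is the standard Kish approximation for equal-sized clusters, not a method named in the thread, and every number below except N = 3000 (the pooled proportion, the null value, m, and the grid of rho values) is a hypothetical placeholder.

```python
from math import sqrt

def design_effect(m: float, rho: float) -> float:
    """Kish design effect for clusters of average size m with
    intra-cluster (here: within-person) correlation rho."""
    return 1.0 + (m - 1.0) * rho

def z_statistic(p_hat: float, p0: float, n: int, deff: float = 1.0) -> float:
    """Normal-approximation z statistic for H0: p = p0, with the
    variance inflated by the design effect."""
    standard_error = sqrt(p0 * (1.0 - p0) * deff / n)
    return (p_hat - p0) / standard_error

n_total = 3000           # planned total sample size from the original post
p_hat, p0 = 0.53, 0.50   # hypothetical pooled proportion and null value
m = 3.0                  # hypothetical average number of responses per person

for rho in (0.0, 0.2, 0.5, 0.8):
    deff = design_effect(m, rho)
    z = z_statistic(p_hat, p0, n_total, deff)
    print(f"rho = {rho:.1f}: deff = {deff:.2f}, "
          f"effective n = {n_total / deff:.0f}, z = {z:.2f}")
```

    If the conclusion is stable across the plausible range of rho and m, the unknown overlap matters little; if it flips, the collapsed yearly-summary analysis or a pilot study with tracked identities becomes more attractive.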


  • 6.  RE: Repeated measurements?

    Posted 01-31-2015 01:12
    How big is this "large company"? If it's large enough, then 200 might be a relatively small numerator by comparison, and the chances of sampling the same person twice over time might be small enough to be ignorable in practice. (Assuming, of course, that every employee has equal chance of being sampled in any one year.)
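
    To put rough numbers on this, the sketch below computes the expected share of all responses that come from someone who was also sampled in at least one other year. It assumes, as stated above, that every employee has an equal chance n/N of being sampled each year, and additionally that years are sampled independently and the workforce is stable; the company sizes are hypothetical. A given response is from a repeat respondent exactly when that person is drawn in at least one of the other 14 years, which has probability 1 - (1 - n/N)^14.

```python
def repeat_share(n_per_year: int, population: int, years: int) -> float:
    """Expected share of responses given by people who were sampled in
    at least one other year, under equal-probability independent sampling."""
    p = n_per_year / population
    return 1.0 - (1.0 - p) ** (years - 1)

# Hypothetical company sizes; 200 respondents per year over 15 years.
for population in (2_000, 10_000, 100_000):
    share = repeat_share(200, population, 15)
    print(f"N = {population:>7}: about {share:.1%} of responses from repeat respondents")
```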

    -------------------------------------------
    Eric Siegel
    Biostatistician
    Univ of Arkansas for Medical Sciences of Biostatistics
    -------------------------------------------




  • 7.  RE: Repeated measurements?

    Posted 01-31-2015 13:52
    Hi Kerstin,

    I am on the case. It is an interesting predicament, and I don't immediately know the solution, but I will look into it and see what I can come up with.


    jt

