Correlation Between X and Y, where each Y has multiple X values

  • 1.  Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 12:40
    We are working on the problem of calculating the correlation between X and Y, where each Y (n=20) has ten X values associated with it.  A simple answer would be to take the mean of the ten X values and use the 20 pairs to find the correlation.  Is there a more sophisticated way to do this that would retain more of the information in the X values?

    For those of you who want more information, Y = nurse satisfaction and X = patient perception of nurse care for each of ten patients cared for by a given nurse.



    -------------------------------------------
    DeAnne Grunden
    Beverly Grunden
    Statistical Consultant
    Wright State University
    -------------------------------------------


  • 2.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 12:48

    You might consider replicating each Y value 10 times and pairing each copy with one of the ten X values.


    -------------------------------------------
    Dr. N. Shirlene Pearson
    Statistical Consultant & Research Support Specialist
    Southern Methodist University
    Dallas, Texas USA
    -------------------------------------------








  • 3.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 12:53

    You could try a repeated measures ANOVA.
    -------------------------------------------
    Barbara Elashoff
    Director
    Myraqa
    -------------------------------------------








  • 4.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 13:08

    Are the ten patients the same for each nurse?  I wouldn't expect this to be the case, but if so, you could simply run a multiple regression of Y on the X's (where Xi = patient i's perception) and use the positive square root of the model R^2.  This is the correlation of the observed Y's with the model-predicted Y's (which are themselves weighted functions of the X's).

    -------------------------------------------
    Michael Hughes
    Manager, Statistical Consulting Center
    Miami University
    -------------------------------------------








  • 5.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-25-2011 15:43

    I second this, since it could be viewed as something like

    \[
    \frac{\sum_{i=1}^{n} \sum_{j=1}^{J} (y_i - \bar{y})(x_{ij} - \bar{x})}{J \cdot n \cdot SD(y) \cdot SD(x)}
    \]

    following the idea of the correlation coefficient.
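    A minimal sketch of the pooled formula above, using only the Python standard library. The data are simulated and the function names are hypothetical. It assumes SD(y) and SD(x) are population (divide-by-N) standard deviations and that x-bar is the grand mean of all X values; under those assumptions the formula gives the same number as the ordinary Pearson correlation computed after replicating each Y ten times.

```python
import math
import random


def pooled_correlation(y, x):
    """y: n nurse scores; x: n lists, each holding J patient scores."""
    n, J = len(y), len(x[0])
    all_x = [v for row in x for v in row]
    y_bar = sum(y) / n
    x_bar = sum(all_x) / (n * J)
    # Population standard deviations (assumption, see lead-in).
    sd_y = math.sqrt(sum((v - y_bar) ** 2 for v in y) / n)
    sd_x = math.sqrt(sum((v - x_bar) ** 2 for v in all_x) / (n * J))
    num = sum((y[i] - y_bar) * (x[i][j] - x_bar)
              for i in range(n) for j in range(J))
    return num / (J * n * sd_y * sd_x)


def pearson(a, b):
    """Ordinary Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


# Simulated data: 20 nurses, 10 patients each (purely illustrative).
rng = random.Random(1)
y = [rng.gauss(50, 10) for _ in range(20)]
x = [[0.5 * yi + rng.gauss(0, 8) for _ in range(10)] for yi in y]

r_formula = pooled_correlation(y, x)
# Same quantity via replication: each y repeated once per patient.
y_rep = [yi for yi, row in zip(y, x) for _ in row]
x_flat = [v for row in x for v in row]
r_replicated = pearson(y_rep, x_flat)
```

    The agreement of the two numbers only shows that the formula and the replication approach are the same estimator; it says nothing about whether the 200 pairs can be treated as independent.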

    This is my first time posting; is that right?



    -------------------------------------------
    Xiang Lu
    UCLA Biostatistics
    -------------------------------------------








  • 6.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-25-2011 15:48
    What a bad idea! Did anyone read my explanation why this is a terrible approach?

    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 7.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 12:55
    Beverly,

    Replicating the Y's 10 times for each set of 10 X observations and finding the correlation between the 200 values of X and the 200 values of Y should work.  Not sure what the properties of the estimator are.

    Margot

    -------------------------------------------
    Margot Tollefson
    Owner
    Vanward Statistical Consulting
    -------------------------------------------








  • 8.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 13:06


    -------------------------------------------
    Sheela Talwalker, Ph.D.
    T'Walker Consulting
    -------------------------------------------
    As I understand it, you have 20 observations, each consisting of a Y and a corresponding 10-dimensional X vector.
    You could consider doing a principal component analysis of X and then selecting the first principal component (the one with maximum variance) to represent the X vector. You would then have a single independent variable representing patients' perception for each Y, and you could calculate the correlation.
    Another choice is to calculate the multiple correlation coefficient between Y and the 10-dimensional vector X.

    Both methods involve a lot of calculation, as well as assumptions of normality of the observations and nonsingularity of the variance-covariance matrix.






  • 9.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 13:09

    Hi Beverly,

    I wish to add that if the sample is such that the nurses come from different departments in the hospital, you may subset the data and use weighted means (especially if there are unequal representations). Thus, you would need to know each nurse's department. I am thinking of a situation where the level of satisfaction depends on the nurses' area of work.

    Hope this helps.

    -------------------------------------------
    Edwin Ndum
    Research Associate
    ACT, Inc
    -------------------------------------------








  • 10.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 13:19

    I might not be understanding the problem, but wouldn't you need to define the parameter of interest under some conceptual model that generates such data, and then (and only then) find a good estimator?

    -------------------------------------------
    James Baldwin
    Station Statistician
    US Forest Service
    -------------------------------------------








  • 11.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 13:20
    Sorry if this is a repeat, my connection suddenly disappeared.

    Anyways, think about what question you want to answer.

    Correlating Y and X-bar gives you the correlation between the nurse and the mean patient satisfaction.
    In this case you don't have to worry about the correlation structure of the patients.

    I'm not sure what question is being answered by correlating the Y with the individual X's,
    but because of the nested structure of the X's, this has to be done carefully.
    Replicating the Y's 10 times will change the structure of the problem, unless you make
    some very strong assumptions that I expect will affect your conclusions.

    You might be interested in estimating the mean and the variability of the patient responses
    among nurses, in which case a correlation would not be a good answer to the question.

    I hope this helps; I often find that refining the question helps me determine the
    appropriate method.

    Ray

    -------------------------------------------
    Raymond Hoffmann
    Associate Professor
    Medical College of Wisconsin
    -------------------------------------------








  • 12.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 13:25
    Hi Beverly,

    Are the Y variables an average overall satisfaction score from the 10 different patients who rated that specific nurse?


    -------------------------------------------
    Kenita Hall
    -------------------------------------------








  • 13.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 13:37

    Thank you for all your responses. 

    Clarification requested:  The Y values are the satisfaction scores - each nurse completed one questionnaire about his/her overall satisfaction with the job, supervisor, etc.  Each of the nurse's 10 patients filled out a questionnaire about the care he or she received from that nurse.  It is of interest to see if the level of satisfaction felt by the nurse is correlated to the level of care received by the patient.  The patient care scores are nested within nurse satisfaction scores.


    I hope this helps.  Thank you.
    -------------------------------------------
    Beverly Grunden
    Statistical Consultant
    Wright State University
    -------------------------------------------








  • 14.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 13:42
    Sounds more like the patient care scores are the Y and nurse satisfaction is the X.

    I would do a repeated measures ANOVA modeling mean patient care scores as a function of nurse satisfaction, adjusting standard error estimates based on the intra-nurse correlation among measurements.
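    To see how much the intra-nurse correlation can matter, here is a rough sketch (not the repeated measures model itself) that estimates a one-way intraclass correlation and the corresponding design effect, 1 + (J - 1) * ICC. The function names and the simulated data are assumptions for illustration, and the sketch assumes every nurse has the same number of patients.

```python
import random


def icc_oneway(groups):
    """One-way ANOVA intraclass correlation.

    groups: n lists (one per nurse), each with J patient scores.
    """
    n, J = len(groups), len(groups[0])
    means = [sum(g) / J for g in groups]
    grand = sum(means) / n
    # Between-nurse and within-nurse mean squares.
    msb = J * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((v - m) ** 2
              for g, m in zip(groups, means) for v in g) / (n * (J - 1))
    return (msb - msw) / (msb + (J - 1) * msw)


def effective_sample_size(groups):
    """Total observations divided by the design effect 1 + (J - 1) * ICC."""
    n, J = len(groups), len(groups[0])
    deff = 1 + (J - 1) * icc_oneway(groups)
    return n * J / deff


# Simulated data: a nurse-level effect plus patient-level noise.
rng = random.Random(2)
nurse_effect = [rng.gauss(0, 5) for _ in range(20)]
scores = [[70 + e + rng.gauss(0, 10) for _ in range(10)] for e in nurse_effect]

icc = icc_oneway(scores)
n_eff = effective_sample_size(scores)
```

    When the ICC is well above zero, the 200 patient scores carry much less information than 200 independent observations would, which is the reason for adjusting the standard errors.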

    -------------------------------------------
    Jarrod Dalton
    Biostatistician
    Cleveland Clinic Foundation
    -------------------------------------------








  • 15.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 13:42
    Sounds like you need to do a regression in that case. One of the previous responses mentioned a repeated measures ANOVA, which I think would be essentially the same thing in this case.

    In my opinion, you don't want to just replicate the Y's 10 times, because that would be artificially inflating the sample size of the data, or something along those lines.

    -------------------------------------------
    Gabriel Farkas
    -------------------------------------------








  • 16.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 14:28
    We still need a little more information. Why are there 10 care scores per patient? Is that the same type of care measured 10 times per patient  (e.g., on 10 separate days) or is it, for example, the care score in each of ten different areas of care--such as delivering medications, changing dressings, etc.? 

    Thank you.

    Nayak

    -------------------------------------------
    Nayak Polissar
    Consultant
    The Mountain Whisper Light
    -------------------------------------------








  • 17.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 14:39
    Beverly,

    I think you should just take the average of the X values within a nurse, then correlate X-bar with Y.
    Y is the nurse's overall satisfaction with her job, integrated (in her mind) over all patients she has seen recently.  Y is not her satisfaction with a particular patient interaction.  I would frame the question as how strong is the correlation between the nurse's job satisfaction and the average patient satisfaction with care, averaged over the population of all patients the nurse has cared for recently.  You don't know the mean for that population, but you can estimate it from the sample of 10 patients from that population.  There is of course some measurement error because X-bar is only an estimate of the true mean, but I would not worry about that - there is also measurement error in Y which could be bigger.
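    As a concrete version of the averaging approach, the sketch below computes each nurse's mean patient score and correlates the 20 means with the 20 nurse scores. The helper names and the simulated numbers are hypothetical; only the Python standard library is used.

```python
import math
import random


def pearson(a, b):
    """Ordinary Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def correlation_with_nurse_means(y, x):
    """y: n nurse scores; x: n lists of patient scores for each nurse."""
    x_bar = [sum(row) / len(row) for row in x]  # one mean per nurse
    return pearson(y, x_bar)


# Simulated data: 20 nurses, 10 patients each (illustrative only).
rng = random.Random(3)
y = [rng.gauss(50, 10) for _ in range(20)]
x = [[0.5 * yi + rng.gauss(0, 8) for _ in range(10)] for yi in y]

r_means = correlation_with_nurse_means(y, x)
```

    The unit of analysis here is the nurse, so any test of this correlation is based on 20 pairs, not 200.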

    -------------------------------------------
    Kevin Cain
    Univ of Washington-Seattle
    -------------------------------------------








  • 18.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 15:20
    The answer below is my favorite.  It makes a lot of sense, and is simple, and easy to explain.  Given the limited understanding we have about the data and the situation, the method below sounds great.

    -------------------------------------------
    Jonathan Gatlin
    SAS Institute
    JMP Division
    -------------------------------------------








  • 19.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 14:58
    I think you could easily treat this as a multilevel model (nurses are level 2 units, patients are level 1 units). The outcome is a patient-level variable, so you would look at using nurse satisfaction as a Level 2 predictor of the random intercept.



    -------------------------------------------
    Steven Pierce
    Associate Director
    Center for Statistical Training and Consulting, Michigan State University
    -------------------------------------------








  • 20.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 15:10

    If I understand the data set-up (multiple Y's and multiple X's), then consider a canonical correlation and/or a principal components analysis.

    -------------------------------------------
    Christopher Barker
    Statistical Planning and Analysis Services, Inc.
    -------------------------------------------








  • 21.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 15:21
    Have you considered looking at the correlation between the satisfaction levels and a measure of the dispersion of the x variables?   Maybe a regression with Y against the median X and the lower semideviation of each group of Xs would be informative.  (In terms of other variables, is there a measure of how long each patient experienced care from the given nurse? How long the total hospital stay was? Some ranking of severity of condition? It sounds like an interesting marketing perception problem!)


    -------------------------------------------
    G. Michael Phillips
    Statistician
    Phillips Fractor Gorman
    -------------------------------------------








  • 22.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 15:23
    If I understand the problem correctly, the data are ordinal categorical in nature.
    Talking about correlation coefficient and regular regression may not be the best approach.
    Subjective rating or scoring could be analyzed with discrete data analysis.
    With only 20 nurses, there may not be many distinct frequency counts in your categorical data model.
    Collapsing the score categories may be needed.

    Good luck.

    -------------------------------------------
    Winson Taam
    The Boeing Company
    -------------------------------------------








  • 23.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 15:20
    Beverly,

    I found the answer to your question in "Linear Models", by S. R. Searle, Wiley 1971, on page 103.

    If you let the X's be the nurse satisfaction scores and the Y's be the patients' evaluations, do a linear regression of Y on X, where the X's are replicated 10 times to match the Y's.  Then, to test if X predicts Y, let SSE = the regression error sum of squares, and let SSPE = the sum over the 20 nurses of (the sum over that nurse's patients of (the patient's evaluation - the mean of the evaluations for the nurse) squared).  Then F = ((SSE - SSPE) / (20 - 2)) / (SSPE / (200 - 20)), on 18 and 180 degrees of freedom, can be used to test the model.  (SSPE stands for the sum of squares of pure error.)

    Hope this is helpful. 

    Margot
    -------------------------------------------
    Margot Tollefson
    Owner
    Vanward Statistical Consulting
    -------------------------------------------








  • 24.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-21-2011 17:36
    Beverly,

    I made a mistake in my last post.  On a more careful reading of Searle, the test I described is a lack-of-fit test for the model.  If the test is not significant, then it is okay to use the regular regression statistics (the F or t tests) to test whether the regression coefficients are significantly different from zero.  For a linear regression with just one independent variable, regression is equivalent to correlation.
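    The lack-of-fit statistic from the earlier post can be sketched as follows, with X = nurse satisfaction and Y = patient evaluations as in that post. The function name and the simulated data are assumptions, and the degrees of freedom (18 and 180) presuppose 20 distinct nurse scores with 10 patients each. The sketch returns only the F statistic; looking up the p-value would need an F distribution from a statistics package.

```python
import random


def lack_of_fit_f(x, y_groups):
    """x: n distinct nurse scores; y_groups: n lists of patient scores."""
    n = len(x)
    N = sum(len(g) for g in y_groups)
    # Replicate each nurse score once per patient to match the Y's.
    xs = [xi for xi, g in zip(x, y_groups) for _ in g]
    ys = [v for g in y_groups for v in g]
    mx, my = sum(xs) / N, sum(ys) / N
    sxx = sum((u - mx) ** 2 for u in xs)
    sxy = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    # SSE: residual sum of squares from the straight-line fit.
    sse = sum((v - (intercept + slope * u)) ** 2 for u, v in zip(xs, ys))
    # SSPE: pure error, deviations of patients from their own nurse's mean.
    sspe = sum((v - sum(g) / len(g)) ** 2 for g in y_groups for v in g)
    df_lof, df_pe = n - 2, N - n
    f = ((sse - sspe) / df_lof) / (sspe / df_pe)
    return f, df_lof, df_pe


# Simulated data: 20 nurses, 10 patients each (illustrative only).
rng = random.Random(7)
nurse = [rng.gauss(50, 10) for _ in range(20)]
patients = [[20 + 0.6 * s + rng.gauss(0, 8) for _ in range(10)] for s in nurse]

f_stat, df_lof, df_pe = lack_of_fit_f(nurse, patients)
```

    A large F suggests that the nurse means deviate from a straight line in X by more than the within-nurse scatter would explain.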

    Sorry.

    Margot

    -------------------------------------------
    Margot Tollefson
    Owner
    Vanward Statistical Consulting
    -------------------------------------------








  • 25.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 08:40
    Bootstrap?  Randomly select one patient observation from the 10 associated with each nurse's score, compute the correlation coefficient across the 20 nurses, and store the value.  Repeat the resampling (with replacement) until a stable density plot is obtained.  If the distribution is skewed, use the median; if it is roughly normal, use the mean as the overall correlation between nurse satisfaction and patient satisfaction.  You also get standard errors and percentiles this way.  This can be done in R or in a spreadsheet program that has a RAND function.
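    A minimal sketch of this resampling scheme in Python rather than R or a spreadsheet. The function name, the number of repetitions, and the simulated data are all assumptions. Note that the scheme as described redraws patients within nurses but keeps the same 20 nurses every time, so the spread of the stored correlations reflects patient-level variability only.

```python
import math
import random
import statistics


def pearson(a, b):
    """Ordinary Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def one_per_nurse_correlations(y, x, reps=2000, seed=0):
    """Draw one patient per nurse, correlate with y, and repeat."""
    rng = random.Random(seed)
    results = []
    for _ in range(reps):
        picks = [rng.choice(row) for row in x]  # one patient score per nurse
        results.append(pearson(y, picks))
    return results


# Simulated data: 20 nurses, 10 patients each (illustrative only).
data_rng = random.Random(4)
y = [data_rng.gauss(50, 10) for _ in range(20)]
x = [[0.5 * yi + data_rng.gauss(0, 8) for _ in range(10)] for yi in y]

rs = one_per_nurse_correlations(y, x, reps=2000, seed=1)
summary = {
    "mean": statistics.mean(rs),
    "median": statistics.median(rs),
    "sd": statistics.stdev(rs),
}
```

    Plotting a histogram of `rs` would show whether the mean or the median is the more sensible summary.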

    -------------------------------------------
    Scott Holcomb
    -------------------------------------------








  • 26.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 11:06
    I've found this discussion thread very stimulating and close to worthy of the topic's being a doctoral orals question in a research methodology program!  However, it also reminds me of the gag, if you get 12 statisticians in a room to discuss something, you'll get 13 opinions! 

    -------------------------------------------
    Milton Goldsamt
    Survey Statistician
    -------------------------------------------








  • 27.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 11:12


    Or:

    Variance (noun):  What any two statisticians are at



    -------------------------------------------
    David Lyon
    Aurora Market Modeling, LLC
    -------------------------------------------








  • 28.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 11:16
    If I were the OP, I'd be interested in more than the effect of the central tendency.  What if it is the lowest rating a given patient gives that affects a nurse's rating (i.e., you get one complaining patient and it ruins your whole day)?  Or maybe it's the highest one (at least you got something right that day), or maybe it's the range (consistency).  The problem seems to me to beg for a solution that allows for testing the different possible models.
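    One simple way to explore these alternatives is to reduce each nurse's ten patient scores to several candidate summaries and correlate each summary with the nurse's score. This is only an exploratory sketch with hypothetical names and simulated data; it does not formally compare or test the competing models.

```python
import math
import random


def pearson(a, b):
    """Ordinary Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


# Candidate per-nurse summaries of the patient scores.
SUMMARIES = {
    "mean": lambda row: sum(row) / len(row),
    "min": min,
    "max": max,
    "range": lambda row: max(row) - min(row),
}


def summary_correlations(y, x):
    """Correlate nurse scores y with each summary of the patient scores x."""
    return {name: pearson(y, [f(row) for row in x])
            for name, f in SUMMARIES.items()}


# Simulated data: 20 nurses, 10 patients each (illustrative only).
rng = random.Random(5)
y = [rng.gauss(50, 10) for _ in range(20)]
x = [[0.5 * yi + rng.gauss(0, 8) for _ in range(10)] for yi in y]

correlations = summary_correlations(y, x)
```

    With only 20 nurses, differences among these correlations will be noisy, so they are better read as hints about which model to examine than as evidence for one of them.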

    -------------------------------------------
    Bridget Bly
    -------------------------------------------








  • 29.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 11:41

    You are being modest, Milton.  More likely, out of 2 statisticians you will get 3 opinions.


    -------------------------------------------
    Mansour Fahimi
    VP, Statistical Research Services
    Marketing Systems Group
    -------------------------------------------








  • 30.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 12:29

    As this is the statistical consulting section, I think we should concentrate on what constitutes good consulting when a problem is poorly posed. This is one of the most common experiences we have in consulting.  First, Beverly seems to have a real problem, which has something to do with whether the nurses are satisfied with their performance compared to the judges (the ten patients that they treated).  But Beverly posed this in purely statistical terms as a problem of correlating the nurse's satisfaction score with ten other scores.  Should we summarize the ten responses by an average and then correlate that with the nurse's response?  So some of us took it purely as a mathematical/statistical question and proposed answers.  Some worried about ordinal vs. continuous measurement.  Others thought about changing the analysis to principal components or multiple regression.  One person even suggested the bootstrap.  Well, we all have our favorite ways of analyzing data, and that seems to be entering into the discussion because the question is open-ended enough to allow it.

    A few of us saw this as a consulting exercise and did the proper thing.  First, let's not commit the type III error, the cardinal sin of consulting (i.e., applying a perfectly brilliant solution to the wrong problem).  I don't think that Beverly's initial message gave us a well-posed problem.  Whether it did or not, it did not describe the application.  Why are the nurses and patients being surveyed?  Are we trying to see if there is agreement on individual nurses among the 10 raters?  That would involve just a measure of the level of agreement between the raters.  That is a problem we know how to solve.  It might be one of many problems that Beverly is trying to address.  We know measures of interrater agreement and how to estimate them.  But the nurses are also providing their own rating, and it appears more likely that Beverly is interested in whether the nurses agree with the patients about their performance.  In that case maybe we want to make a pairwise comparison between the nurse and each of her ten patients.  Some of us have asked some of these questions, but we have not gotten all the answers.  So we are not in a position to give good consulting advice yet.  We need to hear the answers to all our questions, and these answers will likely raise additional questions.  My advice is to be a good consultant and make sure you know the "real" problem before you jump at the solution.

    There is some joking going around now as we read the myriad of solutions, often to different problems than what Beverly may have intended.  In consulting there is no one "right" solution, but there are good ones and there are "bad" ones.  I am sure that everyone responding with solutions is a well-trained statistician, and all the solutions presented would be good solutions if they address the right question.  But I don't think we have enough background as to what the real problem is.  My guess is that most of us are repeatedly committing the type III error!

    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------








  • 31.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 12:54

    Wow!!!

    Thanks for writing this email.
    -------------------------------------------
    Kenita Hall
    -------------------------------------------








  • 32.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 14:30
    I particularly enjoyed encountering the Type III error definition. That is one I will remember.

    Best wishes,

    Nayak



    -------------------------------------------
    Nayak Polissar
    Consultant
    The Mountain Whisper Light
    -------------------------------------------








  • 33.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 17:39
    I agree with Michael's insightful comments.  Here are some additional reflections in this domain:  I think that there is very often a question behind the problem that is more general and, in many cases, more interesting.  An important job of the consulting statistician is to help translate from the more general to something that can be addressed with our methods.  This challenge can be viewed as a poorly posed problem by an inadequately trained client or as an opportunity for an interesting collaboration.  Many of us have seen examples of each.  In the present setting, the more general question might be something like: do workers who have greater job satisfaction perform better?  It gets refined to nurses in a particular setting, self-reported job satisfaction, and self-reported patient perception of care provided by the nurse.  The discussion has revealed that we, as statisticians, have something to add to the process of formulating the specific question to be addressed by statistical methods.  Most people see mu while we see sigma; it has been noted that the variability of the patient responses might be interesting and important.  Given this idea, one could go one step further and look at the proportion of failures: scores in the unacceptable range, which might require some retraining or corrective action.

    -------------------------------------------
    George McCabe
    Purdue University
    -------------------------------------------








  • 34.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 17:56
    When I started consulting in '72, I framed the process as moving from the presenting question to the underlying question. In my whole career there have been 3 times when the presenting question was the underlying question. One time was with a researcher who had come before for several projects.  One was from a staffer for Senate Foreign Relations. The last was from a staffer from House Armed Services. I had been consulting on stat and methods for 18 years before my first encounter with the presenting question being the same as the underlying question.

    This was less than 1 percent of the time.


    -------------------------------------------
    Arthur Kendall
    Social Research Consultants
    -------------------------------------------








  • 35.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-25-2011 11:14

    First I want to say I think this is about the most hobbled, inept email system I have ever seen.  I am not even sure this will reach anyone but will try.  A lot of good thoughts from a lot of smart people.

    My only comment:  A problem well-stated is a problem half solved ( maybe more than half like 2/e or whatever).  I forgot how this is supposed to go but I think it is what a lot of these emails are saying.

    -------------------------------------------
    J. Dobbins
    Delmarva Foundation
    -------------------------------------------








  • 36.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 18:38
    I think I'm going to disagree with Michael.  I think that Beverly's question was fairly well-posed and rather clearly stated.  Michael's concern seems to be about whether a correlation coefficient is the best thing to be chasing after for Beverly's practical problem...which, I'll grant, is a legitimate concern.

    -------------------------------------------
    Eric Siegel
    Biostatistician
    Univ of Arkansas for Medical Sciences
    -------------------------------------------








  • 37.  RE:Correlation Between X and Y, where each Y has multiple X values

    Posted 04-22-2011 19:04
    I think you miss the point Eric.  It is not whether or not Beverly has a well posed statistical problem.  I may think there is a little ambiguity but I won't quibble on that.  Suppose she just wants to know that there is a statistically valid way of computing an estimate of bivariate correlation when one nurse's score is to be compared to her ten patients.   Well there may not be a standard answer in the literature.  So some of us have been creative and come up with a variety of different potentially reasonable approaches.

    The problem is that I think this is a real problem searching for a solution.  I can accept that to average the ten responses to create a bivariate pair might be a legitimate solution to the problem that was posed.  But an experienced consulting statistician would not believe that giving that answer is solving the underlying problem! 

    That is why so many excellent consultants came out of the woodwork when they saw my response.  To them what we were doing for Beverly was a bunch of pointless exercises.  We just heard one consultant say in his experience only 1% of the problems posed to him initially were the underlying question.  So what has changed the direction of the discussion is that first and foremost the consulting statistician must sit down and talk with the client to discover the underlying question and then to see if there is a way to use his statistical expertise to solve that problem.

    Before that the discussion was mostly silly and uninteresting.  It may be interesting mathematically or theoretically, but I would have no confidence that any of the suggestions posed, no matter how well thought out, really provides a good service to the client, because I get the strong feeling that I haven't heard the whole story.  Remember, Beverly invited questions but only a few of us took her up on it.  Maybe I haven't tracked the discourse well enough, but I didn't hear answers that would lead me to say, "Aha, I understand your problem."

    -------------------------------------------
    Michael Chernick
    Director of Biostatistical Services
    Lankenau Institute for Medical Research
    -------------------------------------------