ASA Connect

  • 1.  Advice on proportion confidence interval for a stratified random sample

    Posted 09-09-2015 09:28

    Hi, I have a large dataset (18,000 records) in which I would like to estimate the proportion of records where a certain condition is found. I stratified the population into 5 groups based on hospital coverage performance, since the number of records per hospital varies widely (some have thousands, some only tens). I drew a 2% SRS from each stratum, which is as much manual checking as I can afford. However, after I finished checking each sample, in one stratum I did not find any records matching the condition, which gives me p_i = 0. The other strata also gave low proportions (I think the condition may be rare).
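
    For reference, the usual stratified estimate of a proportion weights each stratum's sample proportion by its share of the population, and the textbook Wald interval is built from the weighted variances with a finite population correction. A minimal Python sketch follows; the function name and the stratum counts are hypothetical (only the total of 18,000 records, the 5 strata, and the 2% sampling rate come from the description above).

```python
import math

def stratified_proportion(strata, z=1.96):
    """Stratified estimate of a proportion with a Wald-type interval.

    strata: list of (N_h, n_h, x_h) = stratum size, sample size,
            number of sampled records with the condition (n_h >= 2).
    """
    N = sum(N_h for N_h, _, _ in strata)
    p_st = 0.0
    var = 0.0
    for N_h, n_h, x_h in strata:
        W_h = N_h / N                 # stratum weight
        p_h = x_h / n_h               # sample proportion in the stratum
        fpc = 1 - n_h / N_h           # finite population correction
        p_st += W_h * p_h
        var += W_h ** 2 * fpc * p_h * (1 - p_h) / (n_h - 1)
    se = math.sqrt(var)
    ci = (max(0.0, p_st - z * se), min(1.0, p_st + z * se))
    return p_st, se, ci

# Hypothetical counts: sizes sum to 18,000, each stratum sampled at 2%.
example_strata = [(9000, 180, 4), (5000, 100, 2), (2500, 50, 1),
                  (1000, 20, 0), (500, 10, 0)]
p_st, se, ci = stratified_proportion(example_strata)
print(p_st, se, ci)
```

    Note that a stratum with p_i = 0 adds nothing to the estimated variance in this formula, which is one reason a Wald interval can look narrower than it should when counts are zero or very small.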

     

    My questions are

    1) Can I still calculate confidence intervals when some strata give me a proportion of 0?

    2) What are the limitations of stratified sampling? I recall reading somewhere that if the proportion in a stratum is too low (less than 0.1), the estimate may not be good.

    I would appreciate any explanations and references.

    Thanks!



    ------------------------------
    Iris Cheng

    Graduate Student at Baruch College
    ------------------------------



  • 2.  RE: Advice on proportion confidence interval for a stratified random sample

    Posted 09-10-2015 06:47

    Hello,

    My main suggestion would be to utilize all of your data.  In the age of computers, why is it necessary to reduce 18000 observations to 360?  When you've then stratified beyond that, it leaves you with samples of size roughly 72 (assuming equal sample sizes, which they may or may not be).  In particular, since it appears your event is rare, samples of that size would not lead to confidence intervals that do a good job (i.e., have reasonable margins of error) of estimating the probability of your event.  There are of course some methods out there that would adjust for the rare event issue (sorry, I don't know the references off the top of my head).  But I'd start with using everything I had rather than just 2% of it.
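
    The arithmetic above can be sketched in a few lines of Python. The assumed 2% event rate is purely illustrative, the helper names are made up for this example, and both formulas ignore the finite population correction.

```python
import math

N = 18000          # records in the full dataset
rate = 0.02        # sampling fraction
n_strata = 5

n_total = round(N * rate)        # total sample size
n_per = n_total // n_strata      # per-stratum size if split equally

def wald_margin(p, n, z=1.96):
    """Half-width of the Wald interval for a proportion p with sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

def exact_upper_zero(n, alpha=0.05):
    """Upper limit of the two-sided exact (Clopper-Pearson) binomial
    interval when 0 events are observed in n trials."""
    return 1 - (alpha / 2) ** (1 / n)

assumed_p = 0.02   # hypothetical event rate, not taken from the thread
print(n_total, n_per)
print(wald_margin(assumed_p, n_per))
print(exact_upper_zero(n_per))
```

    Under these assumptions the Wald margin of error at n = 72 comes out near 0.03, which is larger than the assumed 2% rate itself, and a stratum with zero events out of 72 still has an exact upper limit of roughly 5%.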

    Cheers,
    Joe


    ------------------------------
    Joseph Nolan
    Associate Professor of Statistics
    Director, Burkardt Consulting Center
    NKU Department of Mathematics & Statistics
    ------------------------------





  • 3.  RE: Advice on proportion confidence interval for a stratified random sample

    Posted 09-11-2015 12:48

    Thanks Joe,

    I would also like to use all of the data. The reason for sampling is that it takes a lot of manual work to check many records against another dataset that does not share a common ID. The goal is to find matching records and report them as findings in order to evaluate the existing matching algorithm, which is why it is a rare event.

    Do you have other thoughts?

    Thanks,



    ------------------------------
    Iris Cheng

    Graduate Student at Baruch College
    ------------------------------




  • 4.  RE: Advice on proportion confidence interval for a stratified random sample

    Posted 10-27-2015 09:27


    I just had a paper published by International Researchers on CIs for the hypergeometric distribution.  Finding the CI for the number with the condition in each stratum, then adding the lower limits together and the upper limits together, should give a CI for the number in the population; dividing by the size of the population should then give the CI for the proportion.  The International Researchers website has been down for the last few days.  I have attached the R function that calculates the CIs.
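
    As a rough illustration of that recipe (and not the attached R function, whose contents are not shown here), the Python sketch below inverts the hypergeometric tail probabilities to get limits for the count in each stratum, sums the lower limits and the upper limits, and divides by the population size. The function names and the stratum counts are hypothetical, and the coverage of the summed interval is not examined.

```python
from math import comb

def hyper_cdf(x, N, K, n):
    """P(X <= x) when X counts units with the condition in a sample of n
    drawn without replacement from N units, K of which have the condition."""
    if x < 0:
        return 0.0
    num = sum(comb(K, k) * comb(N - K, n - k) for k in range(min(x, K, n) + 1))
    return num / comb(N, n)

def hyper_ci(x, N, n, alpha=0.05):
    """Limits for K (population count with the condition) by tail inversion."""
    k_min, k_max = x, N - (n - x)      # values of K consistent with the sample
    # Lower limit: smallest K with P(X >= x | K) > alpha / 2.
    lo, hi = k_min, k_max
    while lo < hi:
        mid = (lo + hi) // 2
        if 1 - hyper_cdf(x - 1, N, mid, n) > alpha / 2:
            hi = mid
        else:
            lo = mid + 1
    lower = lo
    # Upper limit: largest K with P(X <= x | K) > alpha / 2.
    lo, hi = k_min, k_max
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if hyper_cdf(x, N, mid, n) > alpha / 2:
            lo = mid
        else:
            hi = mid - 1
    return lower, lo

def stratified_hyper_ci(strata, alpha=0.05):
    """strata: list of (N_h, n_h, x_h). Sums the per-stratum limits for the
    counts, then divides by the total population size."""
    N = sum(N_h for N_h, _, _ in strata)
    limits = [hyper_ci(x_h, N_h, n_h, alpha) for N_h, n_h, x_h in strata]
    return sum(l for l, _ in limits) / N, sum(u for _, u in limits) / N

# Hypothetical counts: sizes sum to 18,000, each stratum sampled at 2%.
example_strata = [(9000, 180, 4), (5000, 100, 2), (2500, 50, 1),
                  (1000, 20, 0), (500, 10, 0)]
print(stratified_hyper_ci(example_strata))
```

    A stratum with no matches in the sample still gets a lower limit of 0 and a positive upper limit, so it contributes to the width of the combined interval.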

    Best,

    Margot Tollefson

    ------------------------------
    Margot Tollefson
    Consultant
    Vanward Statistics

    Attachment(s): hyper.CI.R.txt (6 KB)


  • 5.  RE: Advice on proportion confidence interval for a stratified random sample

    Posted 10-28-2015 09:48


    You have two problems.  One is that you have a clustered sample, which you need to take into account in estimating both the population proportion and the standard error of that estimate.  The other is that the standard Wald coverage interval will likely not be very good.  For two-sided coverage, it can be replaced by a modified Wilson interval.  See, for example, https://fcsm.sites.usa.gov/files/2014/05/2001FCSM_Kott.pdf.  Good one-sided intervals are more difficult to construct.
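
    To see why the Wald interval struggles here, the sketch below compares it with the plain Wilson score interval for a simple random sample. This is the standard Wilson formula, not the modified version in the linked paper, and it makes no adjustment for the sample design; the function names and the count of 0 out of 72 are illustrative.

```python
import math

def wald_interval(x, n, z=1.96):
    """Standard Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(x, n, z=1.96):
    """Plain Wilson score interval for x events in n trials."""
    p = x / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Illustrative case: no events found in a sample of 72.
print(wald_interval(0, 72))
print(wilson_interval(0, 72))
```

    With zero events, the Wald interval collapses to the single point 0, while the Wilson interval still gives a positive upper limit of z^2 / (n + z^2).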

    ------------------------------
    Phillip Kott
    RTI International