I would also like to use all of the data. The reason for sampling is that it takes a lot of manual labor to check many samples against another dataset that does not share a common ID. The goal is to find a matching record and report it as a finding, in order to evaluate the existing matching algorithm; such a match is therefore a rare event.
Original Message:
Sent: 09-10-2015 06:47
From: Joseph Nolan
Subject: Advice on proportion confidence interval for a stratified random sample
Hello,
My main suggestion would be to use all of your data. In the age of computers, why is it necessary to reduce 18000 observations to 360? Once you stratify beyond that, you are left with samples of size roughly 72 each (assuming equal sample sizes, which may or may not be the case). In particular, since it appears your event is rare, that sample size would not lead to confidence intervals that do a good job (i.e. have reasonable margins of error) of estimating the probability of your event. There are of course methods that adjust for the rare-event issue (sorry, I don't know the references off the top of my head). But I'd start by using everything I had rather than just 2% of it.
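To make the margin-of-error point concrete, here is a minimal sketch that computes a Wilson score interval for a single stratum sample of size 72. The Wilson interval is only one commonly used choice for small counts, not necessarily the method alluded to above, and the count of 2 events is a made-up illustration rather than a result from the actual data.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clip to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical example: 2 events observed in a stratum sample of 72.
low, high = wilson_interval(2, 72)
print(f"p_hat = {2 / 72:.4f}, interval = ({low:.4f}, {high:.4f})")
```

With so few events, the interval is wide relative to the point estimate, which is the concern raised about samples of this size.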
Cheers,
Joe
------------------------------
Joseph Nolan
Associate Professor of Statistics
Director, Burkardt Consulting Center
NKU Department of Mathematics & Statistics
------------------------------
Original Message:
Sent: 09-09-2015 09:28
From: Iris Cheng
Subject: Advice on proportion confidence interval for a stratified random sample
Hi, I have a big dataset (18000 records) in which I would like to estimate the proportion of records that meet a certain condition (found vs. not found). I stratified the population into 5 groups based on hospital coverage performance, since the number of records per hospital varies (some have thousands, some only tens). I drew a 2% SRS from each stratum, which is the amount of manual work I can afford. However, after I finished checking each sample, in one stratum I did not find any records that match the condition, which gives me p_i = 0. The other strata also gave low proportions (I think the condition may be rare).
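For reference, the design described above can be sketched with the textbook stratified estimator of a proportion and a normal-approximation interval. The stratum sizes and counts below are hypothetical placeholders (chosen only so that they total 18000 records and a 2% sample), and the function name is made up for illustration.

```python
import math

def stratified_proportion(strata, z=1.96):
    """Stratified estimate of a proportion with a normal-approximation interval.

    strata: list of (N_h, n_h, x_h) = stratum size, sample size, events found.
    Each n_h must be at least 2 for the variance term to be defined.
    """
    N = sum(N_h for N_h, _, _ in strata)
    p_st = 0.0
    var = 0.0
    for N_h, n_h, x_h in strata:
        W_h = N_h / N            # stratum weight
        p_h = x_h / n_h          # within-stratum sample proportion
        fpc = 1 - n_h / N_h      # finite population correction
        p_st += W_h * p_h
        var += W_h**2 * fpc * p_h * (1 - p_h) / (n_h - 1)
    half = z * math.sqrt(var)
    return p_st, max(0.0, p_st - half), min(1.0, p_st + half)

# Hypothetical strata: (records in stratum, records sampled, matches found).
example = [(9000, 180, 4), (5000, 100, 2), (3000, 60, 1), (900, 18, 0), (100, 2, 0)]
p_st, low, high = stratified_proportion(example)
print(f"p_st = {p_st:.4f}, interval = ({low:.4f}, {high:.4f})")
```

Note that a stratum with p_h = 0 adds nothing to the estimated variance, so this interval can look narrower than the data really justify when events are rare.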
My questions are:
1) Can I still calculate confidence intervals when some strata give a proportion of 0?
2) What are the limitations of stratified sampling? I recall reading somewhere that if the proportion in a stratum is too low (less than 0.1), the estimate may not be good.
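On question 1, one standard way to put a bound on a stratum with zero observed events is the exact one-sided upper limit, with the "rule of three" as a quick approximation. The sketch below assumes simple binomial sampling within the stratum (it ignores the finite population correction), and the sample size of 72 is a hypothetical value, not the actual stratum size.

```python
def upper_bound_zero_events(n, conf=0.95):
    """Exact one-sided upper confidence bound for p when 0 events occur in n trials.

    Solves (1 - p)**n = 1 - conf for p.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1 - (1 - conf) ** (1 / n)

def rule_of_three(n):
    """Quick approximation to the 95% upper bound when 0 events occur in n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 3 / n

n = 72  # hypothetical stratum sample size
print(f"exact upper bound: {upper_bound_zero_events(n):.4f}")
print(f"rule of three:     {rule_of_three(n):.4f}")
```

So a sample proportion of 0 does not mean the true proportion is 0; it only says the true proportion is unlikely to exceed the bound.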
I appreciate any explanations and references.
Thanks!
------------------------------
Iris Cheng
Graduate Student at Baruch College
------------------------------