Thanks, David, for the comments. I definitely want to read up on the two resources you mentioned.
For starters, I am not the one actually conducting the sampling and/or estimating. My task is to support the office in my organization that leads this effort by bringing an outsider's perspective through a Quality Assurance review of their whole process. It was while reviewing the sampling and estimation plan (and also looking at a Penn State online course [STAT 506] that uses the Sampling textbook by Steven Thompson) that I developed the concerns listed in my initial post.
As for the software, I understand that this process has been carried out for many years, probably with no real change over time. Judging from the effort done in a previous fiscal year (which I was provided as part of my review), everything seems to have been done with SAS. For the current fiscal year, though, I think some of the process is being done with R (but I can only see up to the first stage process). I am a little light on SAS specifics (i.e., a lot of Google searching), but I was at least able to get to the point where I now understand what was done previously.
As for the rationale behind the pooling of tertiary units, I honestly don't know why it is there, but I can say that the pooling of tertiary units takes place AFTER the secondary units have been sampled.
As for the variance calculation, the current plan that I am reviewing uses a single summation in which each term is a third stage ratio calculation scaled up by a set of three multiplicative weights (one based on the first stage PPS; the other two are connected to the second stage).
- Connected with the tertiary pooling observation above, the two weights for the second stage seem to be based only on the secondary units selected and not on the secondary units that remain represented among the selected tertiary units. For example, suppose three secondary units were selected that create a pool of four tertiary units (i.e., two of the secondary units have one tertiary unit each and the other secondary unit has two tertiary units). If only two tertiary units are to be selected in Stage 3, then at least one of the secondary units always leaves the picture (e.g., when the two singleton tertiary units are selected), and two of them leave the picture when both tertiary units from the paired secondary unit are selected. However, the weighting process for the second stage is still based on all of the selected secondary units.
- To make matters worse, some of the selected primary (and possibly secondary) sampling units have all of their secondary (and possibly tertiary) sampling units included, whereas other primary (secondary) units have only a subset selected because they have more units than the others. An extreme example of this is a primary sampling unit that has only one connected secondary unit, which in turn has only one connected tertiary unit. This means that the "variable of interest" for some of the primary sampling units is completely known (as would happen in a cluster sampling plan), whereas for others it has to be estimated because not all secondary units are used.
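To make the pooling example in the first bullet concrete, here is a minimal Python sketch that enumerates every possible Stage 3 draw of two tertiary units from the pool of four. The unit labels (A, B, C, a1, ...) are hypothetical, and the sketch only lists which draws are possible; it assumes nothing about how the actual plan assigns selection probabilities at Stage 3.

```python
from itertools import combinations

# Hypothetical labels: secondary units A and B each contribute one tertiary
# unit to the pool; secondary unit C contributes two.
pool = [("A", "a1"), ("B", "b1"), ("C", "c1"), ("C", "c2")]
selected_secondary = {"A", "B", "C"}

outcomes = []
for draw in combinations(pool, 2):  # every way to pick 2 of the 4 pooled units
    represented = {ssu for ssu, _ in draw}
    dropped = sorted(selected_secondary - represented)
    outcomes.append((draw, dropped))
    print([tertiary for _, tertiary in draw], "-> secondary units dropped:", dropped)

# Number of selected secondary units with no tertiary unit in each possible draw
dropped_counts = [len(dropped) for _, dropped in outcomes]
```

In every one of the six possible draws at least one selected secondary unit ends up with no tertiary unit in the sample, and in one draw two of them do, which is why second stage weights built from all three selected secondary units look questionable to me.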
As for the first stage PPS, I know that they are sorting the transactions in increasing order, but it is also my belief that a single transaction could be part of multiple primary sampling units. For example, suppose there were a set of 9 transactions with dollar amounts of 1, 2, 3, 4, 5, 6, 7, 8, and 9. If the interval used is 9, then I think the primary sampling units being considered (where a transaction belongs to a PSU if its dollar amount is listed) are the following:
- PSU #1 = 1, 4, 6, 7, 9
- PSU #2 = 2, 5, 6, 8, 9
- PSU #3 = 2, 5, 6, 8, 9
- PSU #4 = 3, 5, 7, 8, 9
- PSU #5 = 3, 5, 7, 8, 9
- PSU #6 = 3, 5, 7, 8, 9
- PSU #7 = 4, 6, 7, 8, 9
- PSU #8 = 4, 6, 7, 8, 9
- PSU #9 = 4, 6, 7, 8, 9
Of this set of 9 primary sampling units, only 1 gets chosen, based on the random starting value from 1 to 9.
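The list above can be reproduced with a short Python sketch of systematic PPS selection. This is my reading of the procedure rather than the plan's actual code: it assumes each transaction "owns" the range of cumulative dollars ending at its running total, and that selection points fall at the random start and every 9 dollars after it. The function name `systematic_pps_sample` is just an illustrative choice.

```python
from itertools import accumulate

def systematic_pps_sample(amounts, interval, start):
    """Return the amounts hit by selection points start, start + interval, ...

    Assumes positive integer amounts, already sorted in increasing order.
    A transaction larger than the interval could be hit (and listed) twice.
    """
    cum = list(accumulate(amounts))  # running dollar totals
    selected = []
    idx = 0
    point = start
    while point <= cum[-1]:
        # advance to the transaction whose cumulative range contains the point
        while cum[idx] < point:
            idx += 1
        selected.append(amounts[idx])
        point += interval
    return selected

amounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]
interval = 9

# One candidate PSU per possible random start
psus = {s: systematic_pps_sample(amounts, interval, s) for s in range(1, interval + 1)}
for s, psu in psus.items():
    print(f"PSU #{s} = " + ", ".join(map(str, psu)))

# How many of the 9 candidate PSUs contain each transaction
inclusion_counts = {a: sum(a in psu for psu in psus.values()) for a in amounts}
print(inclusion_counts)
```

Under these assumptions the nine candidate PSUs match the list above, and a transaction of amount a appears in a of the 9 candidates, so its inclusion probability is a/9, as expected under PPS.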
Thanks again for the comments and I hope to learn more by reading the suggested resources.
------------------------------
Andrew Tew
Data Scientist
ASRC Federal
------------------------------
Original Message:
Sent: 09-16-2024 09:57
From: David Wilson
Subject: Multistage Variability Concerns
Andrew,
1) I find the pooling of tertiary units odd. If you do that, it's possible you sample no units from a secondary sampled unit. What's the rationale for pooling tertiary units prior to sampling?
2) Given the first stage is systematic PPS, there is no unbiased estimator of design variance. Under this design, I would use a variance estimator that only considered the first stage of sampling but there's no singular estimator to use. You could consider using the variance estimator that assumes the first stage was actually PPS WR. For other estimators, see Wolter (2007) Introduction to Variance Estimation, Section 8.7.
You should check Wolter to see what assumptions and implications lie behind assuming a PPS WR design when you actually used a systematic PPS.
In practice, I've seen the PPS WR variance estimator used and I've seen a Jackknife replicate variance estimator used where pseudo-strata are formed by pairing first stage PSUs and replicate weights produced from that.
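For reference, the PPS WR variance estimator mentioned above is usually written in the Hansen-Hurwitz form. The Python sketch below is only an illustration under stated assumptions: `psu_totals` are the estimated totals for the sampled PSUs (built from the later stages), `draw_probs` are their single-draw selection probabilities (size divided by total size), and the numbers are made up.

```python
def pps_wr_estimate(psu_totals, draw_probs):
    """Hansen-Hurwitz estimate of a population total and its estimated variance.

    Treats the n sampled PSUs as if drawn independently with replacement,
    which is an approximation when the first stage was systematic PPS.
    """
    n = len(psu_totals)
    if n < 2:
        raise ValueError("need at least two sampled PSUs to estimate variance")
    # Each PSU total scaled up by its single-draw selection probability
    z = [y / p for y, p in zip(psu_totals, draw_probs)]
    total_hat = sum(z) / n
    var_hat = sum((zi - total_hat) ** 2 for zi in z) / (n * (n - 1))
    return total_hat, var_hat

# Hypothetical values for three sampled PSUs
psu_totals = [10.0, 24.0, 45.0]
draw_probs = [0.1, 0.2, 0.5]
total_hat, var_hat = pps_wr_estimate(psu_totals, draw_probs)
print(total_hat, var_hat)
```

Only the between-PSU spread of the scaled totals enters the formula, which is what makes it possible to ignore the later sampling stages for variance estimation purposes.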
As a general comment, when you have more than two stages of sampling or more than one stage of non-simple random sampling, the variance estimation process becomes very complicated (see Sarndal, Swensson, and Wretman (1992), Model Assisted Survey Sampling, Section 4.4, Multistage sampling). A practical way to avoid this complication is to develop a sampling process that permits you to focus only on the variance contribution from the first stage. One way to do this is to use WR sampling at the first stage and then ignore the subsequent sampling stages (for variance estimation purposes).
I'd also like you to consider how you plan to come up with variance estimates. If you plan to use existing software such as SAS, Stata, or R, I do not believe that there is support for variance estimation for all possible multistage sample designs. (I'm less certain about R because someone could have created a package for some unique design.) Anyway, if you develop a sample design that requires you to write your own code to generate variance estimates, then you should take that into consideration and see whether your budget, schedule, and capabilities permit it.
------------------------------
David Wilson
Director, Statistics
RTI International
------------------------------