Closing the loop on your basic question. The systematic sample will produce unbiased point estimates. The standard variance formulas from a systematic sample will produce an overestimate of the true variance, which is generally considered conservative (and acceptable for most purposes). As mentioned by William, it is possible to improve on those variance estimates, but it may not be worth the effort in your situation.
Original Message:
Sent: 09-05-2024 14:26
From: Andrew Tew
Subject: Does sort order imply that a sample is not representative?
Thank you William for the book suggestion and other comments. As much as I would like to do a simulation effort, I don't think it is all that feasible with the full scope of the project itself. The Systematic PPS sampling that I described previously is the first stage of a three-stage process that simply selects a particular subset of financial disbursement records that will undergo a more-detailed review by means of a data call and the last two stages of the sampling process. So the simulation effort would learn more information about the potential first stage properties, but it unfortunately will not have any of the more-detailed information that is used in the end.
And as I mentioned above in a response to @David Marker, the same sort order / Systematic PPS sampling scheme I described is applied to other strata that are not necessarily state-level entities. Because these are financial records, the other strata could contain items that are individuals/businesses, counties/tribal regions, or some other entity type. In total it is about a five-tier hierarchy where the ultimate goal is a confidence interval about the highest tier's financial data using knowledge gained from every subsequent tier.
But my original question was about any bias being introduced by how the data is sorted for the Systematic PPS sampling. I don't even have access to the more-detailed data but am tasked with merely reviewing the sampling and estimation plan itself.
------------------------------
Andrew Tew
Data Scientist
ASRC Federal
Original Message:
Sent: 09-05-2024 09:21
From: William Huber
Subject: Does sort order imply that a sample is not representative?
It's worth noting that even when the start point and sort order are randomized, the standard variance calculations are likely to be poor estimates and can lead to incorrect standard errors and incorrect p-values. The variance calculations need to account for the joint probabilities of being in the sample. (Steven Thompson explains this very well in his book, Sampling, now in its second edition, and derives relevant formulas.) You have noted that some of these joint probabilities are zero--but that's only a part of the problem.
I have frequently encountered this issue in systematic spatial samples of irregularly-shaped regions: this sounds like it's exactly the same problem (albeit in just one dimension). The calculation of the joint probabilities can be complicated. A quick simulation (iterate the sampling plan 50+ times and track all pairs of states) will give useful estimates. In this one-dimensional situation it is straightforward to obtain closed-form results but they would be conditional on the sort order--over which you would have to average if you are randomizing it. I am not aware of any source of those results; you might find the simulation approach to be quicker and easier to carry out.
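The quick simulation suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the size measures are made up, `draw` and `joint_inclusion` are hypothetical helper names, and the number of repetitions is arbitrary. It repeats the systematic PPS draw with a random start (and optionally a randomized sort order) and records how often each pair of units lands in the same sample.

```python
import random
from collections import Counter
from itertools import accumulate, combinations

def draw(sizes, n, rng):
    """One systematic PPS draw; returns positions hit (repeats possible)."""
    interval = sum(sizes) / n
    start = rng.random() * interval          # random start in [0, interval)
    cum = list(accumulate(sizes))
    hits, i = [], 0
    for k in range(n):
        point = start + k * interval
        # Advance to the unit whose cumulative range contains the point.
        while i < len(cum) - 1 and cum[i] <= point:
            i += 1
        hits.append(i)
    return hits

def joint_inclusion(sizes, n, reps=2000, shuffle=False, seed=0):
    """Estimate pairwise joint inclusion probabilities by simulation."""
    rng = random.Random(seed)
    units = list(range(len(sizes)))
    pair_counts = Counter()
    for _ in range(reps):
        order = units[:]
        if shuffle:                          # randomize the sort order
            rng.shuffle(order)
        positions = draw([sizes[u] for u in order], n, rng)
        chosen = sorted({order[p] for p in positions})
        pair_counts.update(combinations(chosen, 2))
    # Pairs that never occur together are simply absent (estimate of zero).
    return {pair: c / reps for pair, c in pair_counts.items()}

# Made-up size measures, listed in ascending order (not the project's data).
sizes = [4, 6, 8, 12, 15, 25, 30, 100]
fixed = joint_inclusion(sizes, n=4, shuffle=False)
mixed = joint_inclusion(sizes, n=4, shuffle=True)
print("pairs seen, fixed order:   ", len(fixed))
print("pairs seen, shuffled order:", len(mixed))
```

With the fixed ascending order, the five smallest units in this toy example never appear together, so those pairs are missing from `fixed`. Under shuffling some of them do appear, though depending on the size measures, certain pairs can remain impossible under every ordering.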
------------------------------
William Huber
Quantitative Decisions
Original Message:
Sent: 09-03-2024 20:43
From: Andrew Tew
Subject: Does sort order imply that a sample is not representative?
I have a work project that is reviewing the work done by another team, and this question is about the sampling plan that was implemented. For some background, the project has several population types, but the easiest to explain is the population that consists of state-level entities. For the state-level entities, there is a preference for conducting the sampling using probability proportional to size (PPS) sampling with an ancillary variable such as a state's land area or population, to get a better sense of a state's contribution to the overall population estimate. In addition to using PPS, it has been implemented as a systematic sampling plan, with the data sorted by the ancillary variable.
However, the sampling interval (i.e., skip every Kth data point) is large enough that it exceeds the combined sum of the ancillary variable for the first 6, 7, or even 8 states, and some of the states will be selected several times because their ancillary variable alone spans at least 2, 3, or even 4 multiples of the sampling interval. As a result, the sampling plan is guaranteed to never select any combination of states from the first 6, 7, or 8 values, whereas a simple random sampling plan would at least allow for this possibility. So to tie in the question referenced in the discussion subject, is it possible that the resulting sample of the Systematic PPS Sampling plan is not considered to be representative based on the sort order used?
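To make the mechanism concrete, here is a minimal sketch of a systematic PPS selection on sorted data. It assumes made-up size measures and a hypothetical function name `systematic_pps`; it is not the other team's implementation, only an illustration of why small units at the start of the sort order cannot be selected together and why a very large unit is selected more than once.

```python
from itertools import accumulate
import random

def systematic_pps(sizes, n, start=None, rng=random):
    """Return the positions hit by a systematic PPS sample of n draws.

    sizes: size measure of each unit, in the sort order being used.
    A unit whose size exceeds the interval can be hit more than once.
    """
    interval = sum(sizes) / n
    if start is None:
        start = rng.random() * interval      # random start in [0, interval)
    cum = list(accumulate(sizes))            # cumulative size totals
    hits, i = [], 0
    for k in range(n):
        point = start + k * interval
        # Advance to the unit whose cumulative range contains the point.
        while i < len(cum) - 1 and cum[i] <= point:
            i += 1
        hits.append(i)
    return hits

# Made-up size measures sorted ascending; total 200, so n=4 gives interval 50.
sizes = [4, 6, 8, 12, 15, 25, 30, 100]
print(systematic_pps(sizes, n=4, start=10.0))   # [2, 5, 7, 7]
```

In this toy example the first five units together cover less than one interval (45 out of 50), so only the first selection point can ever fall among them and no two of them can be in the same sample. The last unit spans two full intervals and is hit twice on every draw.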
I would think that the Systematic PPS Sampling plan can still be used, but that the data should be sorted by other means such as in alphabetical order, ZIP code order, or even a different ancillary variable. At least these orderings would potentially allow the states with the lower values of the primary ancillary variable to be selected at the same time.
Any comments or other ideas would be greatly appreciated.
------------------------------
Andrew Tew
Data Scientist
ASRC Federal
------------------------------