I have a work project reviewing the work done by another team, and this question is about the sampling plan that was implemented. For some background, the project has several population types, but the easiest to explain is the population of state-level entities. For the state-level entities, there is a preference for a probability proportional to size (PPS) sampling plan that uses an ancillary variable, such as a state's land area or population, to better reflect each state's contribution to the overall population estimate. In addition to using PPS, the plan was implemented as a systematic sample, with the data sorted by the ancillary variable.
However, the sampling interval (i.e., the skip K between selection points) is large enough that it exceeds the combined sum of the ancillary variable for the first 6, 7, or even 8 states, and some of the larger states will be selected several times because their ancillary variable alone spans at least 2, 3, or even 4 multiples of the sampling interval. As a result, the sampling plan is guaranteed never to select two or more of the first 6, 7, or 8 states in the same sample, whereas a simple random sampling plan would at least allow for this possibility. So, to tie in the question in the discussion subject: is it possible that the sample resulting from the systematic PPS sampling plan would not be considered representative because of the sort order used?
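To make the concern concrete, here is a minimal sketch of systematic PPS selection on a list sorted by the size measure. The ten sizes, the sample size of 4, and the function name `systematic_pps` are all made up for illustration and are not the project's data or code. The sketch assumes integer sizes and an integer random start between 1 and K, and it enumerates every possible start to see which samples can occur.

```python
from bisect import bisect_left
from itertools import accumulate


def systematic_pps(sizes, n, start):
    """Return the indices hit by systematic PPS selection.

    sizes: positive integer size measures, in the list's sort order
    n:     number of selections (assumed to divide sum(sizes) evenly)
    start: integer random start in 1..k, where k = sum(sizes) // n
    """
    cum = list(accumulate(sizes))
    k = cum[-1] // n
    points = [start + j * k for j in range(n)]
    # Unit i is hit when cum[i-1] < point <= cum[i].
    return [bisect_left(cum, p) for p in points]


# Hypothetical size measures, sorted ascending (illustration only).
sizes = [1, 2, 3, 4, 5, 6, 9, 20, 90, 100]
n = 4
k = sum(sizes) // n          # sampling interval
small = set(range(8))        # first 8 units; their sizes sum to less than k

# Every sample the plan can produce, one per possible random start.
samples = [systematic_pps(sizes, n, r) for r in range(1, k + 1)]

# Largest number of distinct small units appearing in any one sample.
max_small = max(len(small & set(s)) for s in samples)

# True if every possible sample selects some unit more than once.
always_repeats = all(len(set(s)) < n for s in samples)

print(k, max_small, always_repeats)
```

With these made-up numbers, no possible start puts two of the first eight units in the same sample, and every possible sample hits one of the two large units more than once.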
I would think that the systematic PPS sampling plan can still be used, but that the data should be sorted some other way, such as alphabetically, by ZIP code, or by a different ancillary variable. These orderings would at least potentially allow two or more of the states with lower values of the primary ancillary variable to be selected in the same sample.
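As a rough check on this idea, the sketch below compares a list sorted by size with one alternative ordering of the same units. The sizes, the labels such as `u20`, the sample size of 4, and the helper names are hypothetical, and the alternative order was hand-picked for illustration. Whether a given reordering actually allows small units to appear together depends on the sizes involved, so this is an illustration under those assumptions rather than a general result.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate


def systematic_pps(units, n, start):
    """units: list of (label, size) in list order; returns the selected labels.

    Assumes integer sizes, n dividing the total size, and an integer
    start in 1..k with k = total // n.
    """
    cum = list(accumulate(size for _, size in units))
    k = cum[-1] // n
    # Unit i is hit when cum[i-1] < point <= cum[i].
    return [units[bisect_left(cum, start + j * k)][0] for j in range(n)]


def summarize(units, n, small):
    """Enumerate every possible start and summarize the resulting samples."""
    k = sum(size for _, size in units) // n
    samples = [systematic_pps(units, n, r) for r in range(1, k + 1)]
    # Largest number of distinct small units in any single sample.
    most_small = max(len(small & set(s)) for s in samples)
    # Total hits per unit across all k equally likely starts.
    total_hits = Counter(label for s in samples for label in s)
    return most_small, total_hits


# Hypothetical units; each label encodes its (made-up) size measure.
by_size = [(f"u{s}", s) for s in [1, 2, 3, 4, 5, 6, 9, 20, 90, 100]]
other = [(f"u{s}", s) for s in [20, 100, 9, 1, 90, 2, 3, 4, 5, 6]]
small = {f"u{s}" for s in [1, 2, 3, 4, 5, 6, 9, 20]}

sorted_max, sorted_hits = summarize(by_size, 4, small)
other_max, other_hits = summarize(other, 4, small)

print(sorted_max, other_max)
print(sorted_hits == other_hits)
```

In this example the reordered list lets two small units land in the same sample, while each unit's total number of hits across all starts (and so its expected number of selections) is the same under both orderings. The sort order changes which combinations can occur, not the individual selection probabilities.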
Any comments or other ideas would be greatly appreciated.
------------------------------
Andrew Tew
Data Scientist
ASRC Federal
------------------------------