ASA Connect

Does sort order imply that a sample is not representative?

  • 1.  Does sort order imply that a sample is not representative?

    Posted 30 days ago

    I have a work project that is reviewing the work done by another team, and this question is about the sampling plan that was implemented.  For some background, the project has several population types, but the easiest to explain is the population of state-level entities.  For the state-level entities, there is a preference for probability proportional to size (PPS) sampling, using an ancillary variable such as a state's land area or population to get a better sense of each state's contribution to the overall population estimate.  In addition to using PPS, the plan has been implemented as a systematic sample, with the data sorted by the ancillary variable.

    However, the sampling interval (i.e., skip every Kth data point) is large enough that it exceeds the combined sum of the ancillary variable for the first 6, 7, or even 8 states, and some of the larger states will be selected several times because their ancillary value alone spans at least 2, 3, or even 4 sampling intervals.  As a result, the sampling plan is guaranteed never to select any combination of two or more states from the first 6, 7, or 8 values, whereas a simple random sampling plan would at least allow for this possibility.  So to tie in the question referenced in the discussion subject: is it possible that the resulting sample from the Systematic PPS Sampling plan would not be considered representative because of the sort order used?
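
    To make the mechanics concrete, here is a minimal Python sketch of a systematic PPS draw with a random start. The `sizes` values and the function name `systematic_pps` are made up for illustration; they are not the project's data or code.

```python
import random

def systematic_pps(sizes, n, rng=random):
    """Return the unit indices hit by a systematic PPS draw of n selections.

    sizes: ancillary (size) values, in the list order used for sampling.
    A unit whose size exceeds the sampling interval can be hit more than once.
    """
    total = sum(sizes)
    interval = total / n
    start = rng.random() * interval          # random start in [0, interval)
    hits, cum, i = [], 0.0, 0
    for k in range(n):
        point = start + k * interval
        # Walk forward until the selection point falls inside unit i.
        while i < len(sizes) - 1 and cum + sizes[i] <= point:
            cum += sizes[i]
            i += 1
        hits.append(i)
    return hits

# Hypothetical sizes, sorted ascending; total 400, so n=4 gives interval 100.
sizes = [1, 2, 3, 4, 5, 5, 30, 50, 100, 200]
print(systematic_pps(sizes, 4, random.Random(42)))
```

    With these made-up sizes the first eight units together fill exactly one interval, so every draw contains exactly one of them, while the largest unit (index 9) is hit twice in every draw.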

    I would think that the Systematic PPS Sampling plan can still be used, but that the data should be sorted by other means such as in alphabetical order, ZIP code order, or even a different ancillary variable.  At least these orderings would potentially allow the states with the lower values of the primary ancillary variable to be selected at the same time.

    Any comments or other ideas would be greatly appreciated.



    ------------------------------
    Andrew Tew
    Data Scientist
    ASRC Federal
    ------------------------------


  • 2.  RE: Does sort order imply that a sample is not representative?

    Posted 30 days ago

    I'm not a sampling expert, but if I did a sample like this I would choose the starting point with a random number, which would give the smallest categories a possibility of being selected proportional to the number in them.

    Say the first 6 states had 500 relevant subjects and the sampling plan was to skip 1,000 between selections.  If you started with a uniform random number, the 6 would collectively have a 50% chance of being included.
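
    A quick way to check the 50% figure is a small simulation. This is only a sketch under the assumptions of the example (the six states sit at the head of the sorted list and together are smaller than the interval); `prob_group_hit` is a hypothetical helper name.

```python
import random

def prob_group_hit(group_size, interval, trials=100_000, seed=1):
    """Estimate the chance that a group at the head of the list is selected.

    Assumes group_size <= interval, so only the first selection point
    (the random start) can land inside the group.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        start = rng.random() * interval      # uniform random start
        if start < group_size:               # the start falls inside the group
            hits += 1
    return hits / trials

estimate = prob_group_hit(500, 1000)
print(estimate)
```

    The estimate should land close to 500/1000 = 0.5, which is the exact value under these assumptions.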

    Starting in any way but with a random number would invalidate the sampling, IMHO.

    Ed



    ------------------------------
    Edward Gracely
    Associate Professor
    Drexel University
    ------------------------------



  • 3.  RE: Does sort order imply that a sample is not representative?

    Posted 24 days ago

    A sort order in sampling implies the use of a systematic random sampling technique, which requires a random start applied to the sorted list.

    Although the technique is probabilistic and introduces randomness into the sampling, it may lead to a biased sample, because the sorting, combined with the choice of a random start, can systematically edge out some members of the population.



    ------------------------------
    Nureni Adeboye
    Lecturer
    Osun State University, Osogbo, Nigeria
    ------------------------------



  • 4.  RE: Does sort order imply that a sample is not representative?

    Posted 30 days ago

    Every unit needs to have a non-zero probability of being sampled. For systematic sampling, this means incorporating a random start. 

    It should be fine to sort by whatever information you'd like and then use systematic PPS, as long as the first unit selected is randomly selected such that each unit has a non-zero chance of being selected.



    ------------------------------
    David Wilson
    Director, Statistics
    RTI International
    ------------------------------



  • 5.  RE: Does sort order imply that a sample is not representative?

    Posted 29 days ago

    Thank you for the comments, David and @Edward Gracely.  I can say that the starting point is random, so every value does have a non-zero probability of being selected.



    ------------------------------
    Andrew Tew
    Data Scientist
    ASRC Federal
    ------------------------------



  • 6.  RE: Does sort order imply that a sample is not representative?

    Posted 30 days ago

    What you describe is quite common, and appropriate for producing the most accurate national estimates. Whatever you use for your sort order reduces variation in the possible samples, and thus produces a more accurate estimate (if the sort variable is correlated with the outcome measure). While you are correct that you can't get two or more of the small states included together, you have also eliminated (or at least reduced) the possibility that none of the small states is included in the sample. If, in addition to national estimates, you want to produce, say, regional ones, then the sort should first group all states in each region; within that sort you can then further sort by state. This will make the number of sampled cases from a region consistent across possible samples, improving the accuracy of regional estimates.  If you want accurate state estimates you will need to increase the number of sampled units.
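
    The point about grouping by region can be illustrated with a small sketch. For simplicity it uses an equal-probability 1-in-10 systematic sample rather than PPS, and the frame, region labels, and helper name are all hypothetical.

```python
import random

def count_in_region(frame, region, step, start):
    """Number of sampled units from `region` in a 1-in-`step` systematic sample."""
    return frame[start::step].count(region)

# Hypothetical frame of 100 units in three regions.
frame_sorted = ["A"] * 30 + ["B"] * 50 + ["C"] * 20
frame_shuffled = frame_sorted[:]
random.Random(0).shuffle(frame_shuffled)     # same units, no regional grouping

step = 10
sorted_counts = [count_in_region(frame_sorted, "A", step, s) for s in range(step)]
shuffled_counts = [count_in_region(frame_shuffled, "A", step, s) for s in range(step)]
print("grouped by region:", sorted_counts)
print("shuffled order:   ", shuffled_counts)
```

    When the frame is grouped by region, every possible start yields the same number of units from region A; in the shuffled frame the count is free to vary from start to start.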



    ------------------------------
    David Marker
    Senior Statistician
    Marker Consulting
    ------------------------------



  • 7.  RE: Does sort order imply that a sample is not representative?

    Posted 29 days ago

    Thank you, David, for the comments.  One part that I didn't mention earlier is that it is a blended population (it deals with financial records), and multiple strata are created, to each of which the Systematic PPS sampling described above is applied.  Some of the strata are state-level entities, whereas others could be an individual or even on par with a county or congressional district.

    As the sort order is based on the financial entity, I guess it makes more sense that the estimate itself would be more accurate within the context of the sampling plan used for each stratum.  Whether or not that accuracy holds as the different strata are then consolidated into a single point estimate (and interval) is probably a different question entirely.



    ------------------------------
    Andrew Tew
    Data Scientist
    ASRC Federal
    ------------------------------



  • 8.  RE: Does sort order imply that a sample is not representative?

    Posted 29 days ago

    As I re-read your post more carefully I realize that I misread it. You were concerned about the fact that this method, even with a random start, would never select 2 (or 3) from the first small states. A purely random selection could do that. You knew that there was a chance that at least one of the first few would be selected.

    I don't think that invalidates the method. In a typical systematic sample, every point has to have the same probability of being selected. Not combinations. 

    Say I sorted subjects for a weight reduction study by weight and picked every 10th one from a long list. Many people close to each other could not BOTH be selected, but the method is still perfectly fine.
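
    The every-10th example can be checked exactly by enumerating all ten possible starts. The sketch below assumes a list of 100 subjects and uses a hypothetical helper, `inclusion_probs`; it computes each subject's inclusion probability and the joint probability for every pair that can occur together.

```python
from fractions import Fraction
from itertools import combinations

def inclusion_probs(n_units, step):
    """Exact single and pairwise inclusion probabilities for a 1-in-`step`
    systematic sample when each of the `step` starts is equally likely."""
    single = [Fraction(0)] * n_units
    joint = {}
    for start in range(step):
        sample = range(start, n_units, step)
        for i in sample:
            single[i] += Fraction(1, step)
        for pair in combinations(sample, 2):
            joint[pair] = joint.get(pair, Fraction(0)) + Fraction(1, step)
    return single, joint

single, joint = inclusion_probs(100, 10)
print(single[0], single[57])          # probability for two arbitrary subjects
print(joint.get((0, 1), 0))           # neighbours in the sorted list
print(joint.get((0, 10), 0))          # subjects exactly one step apart
```

    Every subject has the same inclusion probability, while neighbouring subjects have a joint probability of zero, which is the distinction drawn above between points and combinations.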

    Ed



    ------------------------------
    Edward Gracely
    Associate Professor
    Drexel University
    ------------------------------



  • 9.  RE: Does sort order imply that a sample is not representative?

    Posted 29 days ago

    It's worth noting that even when the start point and sort order are randomized, the standard variance calculations are likely to be poor estimates and can lead to incorrect standard errors and incorrect p-values.  The variance calculations need to account for the joint probabilities of being in the sample.  (Steven Thompson explains this very well in his book, Sampling, now in its second edition, and derives relevant formulas.)  You have noted that some of these joint probabilities are zero--but that's only a part of the problem.

    I have frequently encountered this issue in systematic spatial samples of irregularly-shaped regions: this sounds like it's exactly the same problem (albeit in just one dimension).  The calculation of the joint probabilities can be complicated.  A quick simulation (iterate the sampling plan 50+ times and track all pairs of states) will give useful estimates.  In this one-dimensional situation it is straightforward to obtain closed-form results but they would be conditional on the sort order--over which you would have to average if you are randomizing it.  I am not aware of any source of those results; you might find the simulation approach to be quicker and easier to carry out.
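
    A rough sketch of the simulation idea, in Python: repeat the systematic PPS draw many times and record how often each pair of units appears together. The `sizes` list is invented for illustration, the sort order is held fixed, and the function names are hypothetical; a real check would use the actual frame and plan.

```python
import bisect
import random
from collections import Counter
from itertools import accumulate, combinations

def draw(sizes, n, rng):
    """One systematic PPS draw with a random start; returns the hit indices."""
    ends = list(accumulate(sizes))           # cumulative size boundaries
    interval = ends[-1] / n
    start = rng.random() * interval
    return [min(bisect.bisect_right(ends, start + k * interval), len(sizes) - 1)
            for k in range(n)]

def joint_inclusion(sizes, n, reps=5000, seed=0):
    """Estimate pairwise joint inclusion probabilities by repeated draws."""
    rng = random.Random(seed)
    pair_counts = Counter()
    for _ in range(reps):
        units = sorted(set(draw(sizes, n, rng)))   # distinct units in this draw
        pair_counts.update(combinations(units, 2))
    return {pair: count / reps for pair, count in pair_counts.items()}

# Hypothetical sizes, sorted ascending; pairs never drawn together are absent.
sizes = [1, 2, 3, 4, 5, 5, 30, 50, 100, 200]
estimates = joint_inclusion(sizes, 4)
for pair in sorted(estimates):
    print(pair, round(estimates[pair], 3))
```

    Pairs missing from the output are the ones with an estimated joint probability of zero. If the sort order were also randomized, the draw would need to reshuffle `sizes` inside each repetition.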



    ------------------------------
    William Huber
    Quantitative Decisions
    ------------------------------



  • 10.  RE: Does sort order imply that a sample is not representative?

    Posted 29 days ago

    Thank you William for the book suggestion and other comments.  As much as I would like to do a simulation effort, I don't think it is all that feasible with the full scope of the project itself.  The Systematic PPS sampling that I described previously is the first stage of a three-stage process that simply selects a particular subset of financial disbursement records that will undergo a more-detailed review by means of a data call and the last two stages of the sampling process.  So the simulation effort would learn more information about the potential first stage properties, but it unfortunately will not have any of the more-detailed information that is used in the end.

    And as I mentioned above in a response to @David Marker, the same sort order / Systematic PPS sampling scheme I described is applied to other strata that are not necessarily state-level entities.  With it being financial records, these other strata could contain items that are simply individuals/businesses, are a county/tribal region, or some other entity type.  In total it is about a five-tier hierarchy where the ultimate goal is a confidence interval about the highest tier's financial data using knowledge gained from every subsequent tier.

    But my original question was about any bias being introduced by how the data is sorted for the Systematic PPS sampling.  I don't even have access to the more-detailed data but am tasked with merely reviewing the sampling and estimation plan itself.



    ------------------------------
    Andrew Tew
    Data Scientist
    ASRC Federal
    ------------------------------



  • 11.  RE: Does sort order imply that a sample is not representative?

    Posted 29 days ago

    Closing the loop on your basic question.  The systematic sample will produce unbiased point estimates. The standard variance formulas from a systematic sample will produce an overestimate of the true variance, which is generally considered conservative (and acceptable for most purposes). As mentioned by William, it is possible to improve on those variance estimates, but it may not be worth the effort in your situation.
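
    The unbiasedness of the point estimate can be checked numerically. The sketch below assumes the usual PPS expansion estimator of a total (each hit contributes y times interval divided by size) and averages it over a fine, evenly spaced grid of starting points; the `sizes` and `y` values and the function name are invented for illustration.

```python
import bisect
from itertools import accumulate

def pps_estimate(sizes, y, n, start):
    """Estimated total from one systematic PPS draw with the given start."""
    ends = list(accumulate(sizes))
    interval = ends[-1] / n
    total = 0.0
    for k in range(n):
        i = min(bisect.bisect_right(ends, start + k * interval), len(sizes) - 1)
        total += y[i] * interval / sizes[i]  # a repeated hit counts again
    return total

# Hypothetical size measures and outcome values for ten units.
sizes = [1, 2, 3, 4, 5, 5, 30, 50, 100, 200]
y = [2, 3, 7, 9, 8, 12, 61, 95, 210, 380]

n, m = 4, 1000                               # 4 selections, 1000 grid starts
interval = sum(sizes) / n
starts = [(j + 0.5) * interval / m for j in range(m)]
mean_estimate = sum(pps_estimate(sizes, y, n, s) for s in starts) / m
print(mean_estimate, sum(y))
```

    Averaged over the starts, the estimate matches the true total regardless of how the list is sorted, although individual draws still vary around it.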



    ------------------------------
    David Marker
    Senior Statistician
    Marker Consulting
    ------------------------------



  • 12.  RE: Does sort order imply that a sample is not representative?

    Posted 28 days ago

    The originally described plan seemed to cluster the sample units into k-number of clusters and then select a sample of size 1 from the k clusters, using a random starting point within the first cluster. An approach I would recommend instead is to use multiple random starts, independently selected, say r-number of them, with the systematic spacing set so that every (r*k)th unit is selected. (These are interpenetrating subsamples. Setting r=5 could be a good, manageable choice.) Calculate your population-level estimates independently for each independent subsample, and then combine these r-number of population-level estimates by computing their arithmetic mean and the std. error of the mean. The within-cluster variability is extrinsic to this approach.

    (Early in my career I worked with data like this for state fish & wildlife agencies that wanted to analyze the experiences of their licensed game hunters periodically with mail surveys, using as a data frame the agency's central office collection of hunting license applications that were stored on paper pages in file cabinets. [Affordable microcomputers didn't exist yet, and some client agencies had limited or no access to mainframe computers other than what we consultants provided.] Systematic sampling could be carried out by measuring the linear length of the entire license file and dividing that length by the sample size desired, then using a ruler or caliper set to that distance to choose the members of each subsample.)
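
    A minimal sketch of the interpenetrating-subsample idea follows. It assumes a PPS expansion estimator of a total within each subsample; the data and function names are invented for illustration and are not taken from the plan under review.

```python
import bisect
import random
import statistics
from itertools import accumulate

def subsample_estimate(sizes, y, n_sub, rng):
    """Estimated total from one systematic PPS subsample with its own start."""
    ends = list(accumulate(sizes))
    interval = ends[-1] / n_sub              # wider spacing than the full plan
    start = rng.random() * interval
    total = 0.0
    for k in range(n_sub):
        i = min(bisect.bisect_right(ends, start + k * interval), len(sizes) - 1)
        total += y[i] * interval / sizes[i]
    return total

def interpenetrating(sizes, y, n_total, r, seed=0):
    """Split n_total selections into r independent systematic subsamples and
    combine their estimates by the mean and the standard error of the mean."""
    rng = random.Random(seed)
    ests = [subsample_estimate(sizes, y, n_total // r, rng) for _ in range(r)]
    mean = statistics.mean(ests)
    se = statistics.stdev(ests) / r ** 0.5
    return mean, se, ests

# Hypothetical size measures and outcome values for ten units.
sizes = [1, 2, 3, 4, 5, 5, 30, 50, 100, 200]
y = [2, 3, 7, 9, 8, 12, 61, 95, 210, 380]
mean, se, ests = interpenetrating(sizes, y, n_total=10, r=5)
print(round(mean, 1), round(se, 1))
```

    The total number of selections is unchanged; each subsample simply uses a spacing r times wider, and the spread among the r estimates supplies the standard error directly.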



    ------------------------------
    Wayne Cornelius
    ------------------------------



  • 13.  RE: Does sort order imply that a sample is not representative?

    Posted 27 days ago

    Thank you, Wayne, for the response.  I didn't want to go into super fine details initially, but within a given stratum (for each of which the Systematic PPS sampling and the other stage designs are done independently):

    • the overall population contains a list of disbursements (either issued by the main program office as a debit payment or received by the main office as a credit)

    • the population is sorted by the individual recipients based on total disbursement amounts (i.e., absolute value of money exchanged)

    • the Systematic PPS sampling then selects a set of disbursements from the sorted list as the primary objective but is also selecting a set of recipients as a secondary objective (with some recipients being selected multiple times depending on their total disbursement size or the randomly selected starting point of the Systematic PPS sampling)

    • the selected recipients are then asked in a data call to provide some additional transaction detail (such as invoices and line items) for each selected disbursement.  This returned "more detailed" data is then fed into many subsequent stages with summaries that eventually roll back up to calculate an estimated stratum mean and variance.

    • Finally, the individual stratum estimates are then combined to create a "whole program" estimate that is a confidence interval (well, an upper confidence bound to be exact).
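
    As a rough illustration of that last step only, the sketch below combines stratum estimates into a one-sided upper confidence bound. It assumes independent strata, known stratum weights, and a normal approximation; none of these details are confirmed for the actual program, and the function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def upper_confidence_bound(weights, means, variances, level=0.95):
    """One-sided upper bound for a weighted combination of stratum means.

    weights:   stratum weights (assumed known, e.g. share of the population)
    means:     estimated stratum means
    variances: estimated variances of those stratum mean estimates
    """
    estimate = sum(w * m for w, m in zip(weights, means))
    variance = sum(w ** 2 * v for w, v in zip(weights, variances))
    z = NormalDist().inv_cdf(level)          # one-sided normal quantile
    return estimate + z * sqrt(variance)

# Hypothetical two-stratum example.
bound = upper_confidence_bound([0.5, 0.5], [10.0, 20.0], [4.0, 4.0])
print(round(bound, 3))
```

    Any dependence between strata, or variance estimates that are themselves unreliable (as discussed earlier in the thread), would change how much trust such a bound deserves.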

    The issue with the potential multiple random starts is that the data call for the selected disbursements (and subsequent stages) is a fairly taxing exercise from what I understand and as a result the program office doing the review is trying to limit the number they do to as few as "statistically required" (which could be a whole different discussion in its own right).



    ------------------------------
    Andrew Tew
    Data Scientist
    ASRC Federal
    ------------------------------