ASA Connect

  • 1.  Multistage Variability Concerns

    Posted 09-13-2024 13:51

    I have a sampling and estimation plan that I need to review (to which this question is connected), but I have a feeling that the variance being described is not accurate.  For brief background, the grand total is a rolled-up summary from a five-layer entity, with the first-to-second layer being a stratification process that applies the process given below within each identified stratum.

    Once the focus shifts to within a stratum, a three-stage sampling plan is conducted where

    • the first stage is based on a recorded set of transactions, where the dollar amount is recorded as the absolute value of the money exchanged.

    • the second stage is based on a recorded set of invoices that, when added up, equal the total dollar amount of a first stage unit.

    • the third stage is based on a recorded set of line items that, when added up, equal the total dollar amount of a second stage unit.

    However, when the sampling and estimation process begins, the variable of interest is tied to the third stage items and requires a separate data call to record its actual value.  This data call isn't performed until after the first stage units are identified.

    In terms of the random sampling processes being used,

    • The primary sampling units are selected by means of a systematic Probability Proportional to Size (PPS) sampling scheme (which is where the connection to the previous question comes in).  It should be pointed out, though, that although multiple disbursements are selected in stage 1, they are all connected to a single primary sampling unit within the systematic PPS sampling scheme.

    • The secondary sampling units are selected by means of a Simple Random Sample (SRS) scheme based on the number of invoices a given primary sampling unit has.

    • The tertiary sampling units are also selected by means of an SRS scheme, but they are not drawn directly within their corresponding secondary sampling units.  Instead, all line items included in any of the selected secondary sampling units are pooled together, and the tertiary sampling units are drawn from that combined set of line items.

    • It is also possible, for a particular transaction included with the primary sampling unit, that the entirety of its secondary or tertiary sampling units is selected if the count for that particular stage is under a pre-established threshold.  (A toy sketch of the full pipeline follows this list.)
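
    As an illustration of how I read this, below is a toy sketch in Python of the selection pipeline; the frame, dollar amounts, interval, and thresholds are all made up, so treat it as my reading of the plan rather than the plan itself.

        import random

        # Toy sketch (hypothetical data) of the three-stage selection described
        # above: systematic PPS on transactions, SRS of invoices within the
        # selected PSU, then an SRS over the POOLED line items of those invoices.
        random.seed(2024)

        # Hypothetical frame: transaction -> {invoice -> [line-item dollar amounts]}
        frame = {
            "T1": {"I1": [40, 60], "I2": [100]},
            "T2": {"I3": [250]},
            "T3": {"I4": [30, 50, 20], "I5": [150, 50]},
        }
        txn_amount = {t: sum(sum(v) for v in invs.values()) for t, invs in frame.items()}

        # Stage 1: systematic PPS over cumulative dollars (one random start,
        # so all hit transactions form a single primary sampling unit).
        interval = 300
        total = sum(txn_amount.values())
        start = random.randint(1, interval)
        points, cum, psu = set(range(start, total + 1, interval)), 0, []
        for t in sorted(frame, key=txn_amount.get):
            lo, cum = cum, cum + txn_amount[t]
            if any(lo < p <= cum for p in points):
                psu.append(t)

        # Stage 2: SRS of invoices within each selected transaction
        # (take-all when the invoice count is under a threshold, per the plan).
        TAKE_ALL, n2 = 2, 2
        invoices = []
        for t in psu:
            invs = list(frame[t])
            invoices += invs if len(invs) <= TAKE_ALL else random.sample(invs, n2)

        # Stage 3: pool ALL line items from the selected invoices, then SRS the pool.
        pool = [(t, i, amt) for t in psu for i in frame[t] if i in invoices
                for amt in frame[t][i]]
        line_items = random.sample(pool, min(2, len(pool)))

        print("PSU:", psu, "| invoices:", invoices, "| line items:", line_items)
        # Note: the pooled SRS can leave a selected invoice with no sampled line
        # items, which is part of what worries me about the Stage 2/Stage 3 handoff.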

    Now, after the tertiary sampling units have been selected, the actual variable of interest is recorded, with some sort of extrapolation process to summarize it back to the primary sampling unit.  In terms of the estimator (and its variance), there is only a single summation based on the third stage variable of interest and a series of weighting factors.  In terms of the variability estimate being made, I have some concerns that

    • The variance components of the second and third stages of sampling are not being taken into consideration.

    • The fact that a single primary sampling unit is used makes it hard to incorporate the standard sample variance formula (because n=1).

    • The pooling behavior that happens between Stage 2 and Stage 3, which creates the Stage 3 sampling frame, might be breaking up the pipeline.  I think this also makes the secondary sampling units (i.e., the invoices) largely unnecessary to the overall process.

    Are these concerns justified?  Are there concerns that I might be missing?  And given that the three stages are all connected to dollar amounts (with Transaction >= Invoice >= Line Item), I think an improvement can be made that reduces the problem to a two-stage process: the transactions are sampled using PPS with replacement and then the line items are sampled using PPS with replacement.  If this is used, I'm still unsure what the estimator variance equation would be, so any comments would be greatly appreciated.

    I look forward to any feedback that anyone is willing to share.



    ------------------------------
    Andrew Tew
    Data Scientist
    ASRC Federal
    ------------------------------


  • 2.  RE: Multistage Variability Concerns

    Posted 09-16-2024 09:58

    Andrew,

    1) I find the pooling of tertiary units odd. If you do that, it's possible you sample no units from a selected secondary unit. What's the rationale for pooling the tertiary units prior to sampling?

    2) Given the first stage is systematic PPS, there is no unbiased estimator of the design variance. Under this design, I would use a variance estimator that only considers the first stage of sampling, but there's no single estimator to use. You could consider using the variance estimator that assumes the first stage was actually PPS WR. For other estimators, see Wolter (2007), Introduction to Variance Estimation, Section 8.7.
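
    In sketch form (my notation, the usual WR-style estimator): with n PSUs drawn with replacement with selection probabilities p_i, and \hat{Y}_i an estimate of the i-th PSU's total built from the later stages,

        \[ \hat{Y} = \frac{1}{n}\sum_{i=1}^{n}\frac{\hat{Y}_i}{p_i}, \qquad \hat{v}(\hat{Y}) = \frac{1}{n(n-1)}\sum_{i=1}^{n}\left(\frac{\hat{Y}_i}{p_i}-\hat{Y}\right)^{2}. \]

    Because the draws are independent, \hat{v} picks up the later-stage variability through the \hat{Y}_i themselves; note that it is undefined when n = 1, which is directly relevant to your single-PSU concern.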

    You should check Wolter to see what assumptions and implications lie behind assuming a PPS WR design when you actually used systematic PPS.

    In practice, I've seen the PPS WR variance estimator used, and I've seen a jackknife replicate variance estimator used where pseudo-strata are formed by pairing first stage PSUs and replicate weights are produced from that.
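
    To make the paired-jackknife idea concrete, here is a toy sketch in Python; the PSU-level estimates, pairing, and weights are all made up for illustration and are not tied to your design.

        import numpy as np

        # Toy sketch of a paired (JK2-style) jackknife: PSUs are paired into
        # pseudo-strata; each replicate drops one PSU of a pair and doubles its
        # partner's weight. Data below are made up for illustration.
        rng = np.random.default_rng(0)

        n_psu = 6                                   # paired as (0,1), (2,3), (4,5)
        psu_totals = rng.uniform(50, 150, n_psu)    # weighted PSU-level estimates
        theta_full = psu_totals.sum()

        replicates = []
        for drop in range(n_psu):
            w = np.ones(n_psu)
            w[drop] = 0.0                           # drop this PSU
            partner = drop + 1 if drop % 2 == 0 else drop - 1
            w[partner] = 2.0                        # double its pair partner
            replicates.append((w * psu_totals).sum())
        replicates = np.array(replicates)

        # With both units of each pair used as replicates, each squared
        # deviation gets a factor of (n_h - 1) / n_h = 1/2 per pseudo-stratum.
        v_jk2 = 0.5 * ((replicates - theta_full) ** 2).sum()
        print(f"estimate: {theta_full:.1f}, jackknife variance: {v_jk2:.1f}")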

    As a general comment, when you have more than two stages of sampling or more than one stage of non-simple random sampling, the variance estimation process becomes very complicated (see Särndal, Swensson, and Wretman (1992), Model Assisted Survey Sampling, Section 4.4, Multistage Sampling). A practical way to avoid this complication is to develop a sampling process that permits you to focus only on the variance contribution from the first stage. One way to do this is to use WR sampling at the first stage and then ignore the subsequent sampling stages (for variance estimation purposes).

    I'd also like you to consider how you plan to come up with variance estimates. If you plan to use existing software such as SAS, Stata, or R, I do not believe there is support for variance estimation for all possible multistage sample designs. (I'm less certain about R because someone could have created a package for some unique design.) Anyway, if you develop a sample design that requires you to write your own code to generate variance estimates, then you should take that into consideration to see whether your budget, schedule, and capabilities permit it.
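
    For what it's worth, the WR-style estimator I sketched above is simple enough to code by hand if the packaged designs don't fit; a minimal sketch in Python (made-up inputs):

        # Minimal sketch of hand-rolled PPS WR estimation: y_hat are estimated
        # PSU totals from the later stages, p the PSU draw probabilities.
        def ppswr_estimate(y_hat, p):
            n = len(y_hat)
            z = [y / pi for y, pi in zip(y_hat, p)]
            est = sum(z) / n
            var = (sum((zi - est) ** 2 for zi in z) / (n * (n - 1))
                   if n > 1 else float("nan"))      # undefined for a single PSU
            return est, var

        print(ppswr_estimate([120.0, 90.0, 150.0], [0.02, 0.015, 0.025]))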



    ------------------------------
    David Wilson
    Director, Statistics
    RTI International
    ------------------------------



  • 3.  RE: Multistage Variability Concerns

    Posted 09-16-2024 13:21

    Thanks, David, for the comments; I definitely want to go read up on the two resources you mentioned.

    For starters, I am not the one actually conducting the sampling and/or estimation.  My task is to support another office in my organization that leads this effort, providing an outsider's perspective by conducting a Quality Assurance review of their whole process.  It was upon reviewing the sampling and estimation plan (and also looking at a Penn State online course [STAT 506] that uses the Sampling textbook by Steven Thompson) that I arrived at the concerns listed in my initial post.

    As for the software, I know that this has been done for many years (probably with no real change over time).  From the effort done in a previous fiscal year (which I was provided as part of my review), everything seemed to be done with SAS.  For the current fiscal year, though, I think some of the process is being done with R (but I can only see up to the first stage of the process).  I can say that I am a little light on SAS-specific knowledge (i.e., a lot of Google searching), but I was at least able to get to the point where I now understand what was being done previously.

    As for the rationale behind the pooling of tertiary units, I honestly don't know why it is there, but I can say that the pooling of tertiary units takes place AFTER the secondary units have been sampled.

    As for the variance calculation, the current plan that I am reviewing has a single summation whose terms are a third stage ratio calculation scaled up by a set of three multiplicative weights (one based on the first stage PPS, while the other two are connected to the second stage).
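
    In symbols, my reading of the plan's estimator is something like (notation mine, reconstructed from the plan):

        \[ \hat{Y} = \sum_{k \in s_3} w^{(1)}\, w^{(2)}_k\, w^{(3)}_k\, r_k , \]

    where s_3 is the set of selected tertiary units, r_k is the third stage ratio term for line item k, w^{(1)} comes from the first stage PPS, and w^{(2)}_k and w^{(3)}_k are the two weights connected to the second stage.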

    • Connected with the tertiary pooling observation above, the two weights relating to the second stage seem to be based only on the secondary units selected, and not on the secondary units that remain attached to the selected tertiary units.  For example, suppose a set of three secondary units were selected that create a pool of four tertiary units (i.e., two of the secondary units have one tertiary unit each and the other secondary unit has two tertiary units).  If only two tertiary units are to be selected in Stage 3, there are situations where one of the secondary units leaves the picture (i.e., the two singleton tertiary units are selected) or where two of the secondary units leave the picture (i.e., the paired tertiary units are selected).  However, the weighting process for the second stage is still based on all of the selected secondary units.

    • To make matters worse, some of the selected primary (and possibly secondary) sampling units have all of their secondary (and possibly tertiary) sampling units included, whereas other primary (secondary) units will only have a subset selected because they have more units than others.  An extreme example of this is when a primary sampling unit has only one connected secondary unit that in turn has only one connected tertiary unit.  This means that the "variable of interest" for some of the primary sampling units is completely known (like what would happen in a clustered sampling plan), whereas for others it has to be estimated because not all secondary units are used.

    As for the first stage PPS, I know that they are sorting the transactions in increasing order of dollar amount, but it is also my belief that a single transaction could be a part of multiple primary sampling units.  For example, suppose there were a set of 9 transactions that had dollar amounts of 1, 2, 3, 4, 5, 6, 7, 8, and 9 (which sum to 45, so an interval of 9 yields 5 selection points per sample).  If the interval used is 9, then I think the primary sampling units being considered (where a transaction is included if its dollar amount is shown) are the following:

    • PSU #1 = 1, 4, 6, 7, 9
    • PSU #2 = 2, 5, 6, 8, 9
    • PSU #3 = 2, 5, 6, 8, 9
    • PSU #4 = 3, 5, 7, 8, 9 
    • PSU #5 = 3, 5, 7, 8, 9
    • PSU #6 = 3, 5, 7, 8, 9
    • PSU #7 = 4, 6, 7, 8, 9
    • PSU #8 = 4, 6, 7, 8, 9
    • PSU #9 = 4, 6, 7, 8, 9

    And of this set of 9 possible primary sampling units, only 1 gets chosen, based on the random starting value from 1 to 9.  (A quick sketch that reproduces this table follows.)
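
    As a double-check on my reading, here is a quick Python sketch that reproduces the table above:

        # Systematic PPS over cumulative dollar amounts 1..9 (total 45,
        # interval 9), enumerating all 9 possible random starts.
        amounts = list(range(1, 10))                # already in increasing order
        interval, total = 9, sum(amounts)           # total = 45 -> 5 hits per start
        for start in range(1, interval + 1):
            points, cum, psu = set(range(start, total + 1, interval)), 0, []
            for amt in amounts:
                lo, cum = cum, cum + amt            # this transaction covers (lo, cum]
                if any(lo < p <= cum for p in points):
                    psu.append(amt)
            print(f"PSU #{start} = {psu}")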

    Thanks again for the comments and I hope to learn more by reading the suggested resources.



    ------------------------------
    Andrew Tew
    Data Scientist
    ASRC Federal
    ------------------------------