ASA Connect

 View Only
Expand all | Collapse all

A samping question

  • 1.  A samping question

    Posted 13 days ago

    A colleague is interested in the following situation.  There is a finite set of n points, each with a certain weight.  A weighted random sample of k points is drawn without replacement.  What is the probability that a given x will be in this sample? 

    Is this a known problem?  Any information or references would be appreciated.



    ------------------------------
    Virginia Rovnyak
    ------------------------------


  • 2.  RE: A samping question

    Posted 13 days ago

    it depends on how the sampling occurs.

    For k=2, several sampling methods (Brewer, Rao, and Durbin) show the probability of selection is 2*weight of x / sum of weights

    There are other means of selecting samples of size 2 that don't yield the same probability of selection as Brewer, Rao, and Durbin's methods.

    There is a method by Sunter that supports k>2 but the form of the selection probabilities are not all proportional to the weight of x.

    see Sunter, A.B. (1977b). List sequential sampling with equal or unequal probabilities without replacement. Applied Statistics 26, 261-268.

    Sarndal, Swensson, and Wretman in Model Assisted Survey Sampling talk about samples of size 1, 2, and n>2 and is a good reference.



    ------------------------------
    David Wilson
    ------------------------------



  • 3.  RE: A samping question

    Posted 13 days ago

    Virginia, here's a reply from Perplexity.AI that describes the problem, outlines a solution, and provides some references.

    https://www.perplexity.ai/search/a-colleague-is-interested-in-t-uckvx4wMSrKXKnhIyD3blw



    ------------------------------
    Paul Auclair
    Corporate Operations Research Analyst
    LinQuest Corporation
    ------------------------------



  • 4.  RE: A samping question

    Posted 11 days ago

    Virginia, I think the following R program should provide reasonable estimates for the probabilities for each point (note: they do not add to one). Also note that points with the same weights should have the same selection probabilities. This could be achieved by averaging probabilities for points with the same weights

    ####################

    n <- 6 ; k <- 2  # choose n and k
    x <- 1:n # number the points from 1 to n
    w <- c(1/12,1/12,2/12,3/12,3/12,2/12) # given weights, assumed to be sampling probabilities
    nsim <- 10000 # choose number of simulations
    M <- matrix(nrow = nsim,ncol = k, byrow = TRUE) # M is nsim by k, lists points selected for each simulation
    for(i in 1:nsim){
      y <- sample(x,k,w,replace = F)
      M[i,] <- y
    }
    MT <- table(M)
    MT/nsim #gives the estimated probability for each point to be selected

    ###################



    ------------------------------
    [Giles] [Warrack]
    [Retired]
    [North Carolina A&T State University]
    ------------------------------



  • 5.  RE: A samping question

    Posted 10 days ago

    Note that the exact inclusion probabilities for Anthony Warrack 's R code when k=2 can be obtained using his notation with

    w*(1 - w/(1 - w) + sum(w/(1 - w)))

    Here is some Mathematica code to obtain exact inclusion probabilities for other values of k.  (Note that with even moderate values of k and n the calculations might take longer that the age of the earth.)

    (* Function to calculate probability of each sample *)
    pr[indices_, p_] := Module[{remaining, prob},
      remaining = 1 - p[[indices[[1]]]];
      prob = p[[indices[[1]]]];
      Do[prob = prob*p[[indices[[j]]]]/remaining; 
       remaining = remaining - p[[indices[[j]]]], {j, 2, Length[indices]}];
      prob
      ]
     
    (* Function to calculate inclusion probabilities *)
    inclusionProbabilities[p_, k_] := Module[{perms, allProb, data},
      If[k == 2, p (1 - p/(1 - p) + Total[p/(1 - p)]),
       perms = Permutations[Range[Length[p]], {k}];  (* Generate all possible samples *)
       allProb = pr[#, p] & /@ perms ; (* Get probabilities of every possible sample: these sum to 1 *)
       data = Transpose[{perms, allProb}];  (* 
       Combine samples with associated probability of selection *)
       (* Find the probabilities of inclusion for each element *)
       Table[Total[Select[data, MemberQ[#[[1]], i] &][[All, 2]]], {i, 1, Length[p]}]]
      ]
     
    (* Samples of size 3 *)
    p = {1/12, 1/12, 2/12, 3/12, 3/12, 2/12};
    inclusionProbabilities[p, 3]
    (* {8203/27720,8203/27720,1819/3465,1255/1848,1255/1848,1819/3465} *)
    inclusionProbabilities[p, 3] // N
    (* {0.295924, 0.295924, 0.524964, 0.679113, 0.679113, 0.524964} *)

    Here is a slightly larger example:

    p = RandomVariate[UniformDistribution[{0, 1}], 20];
    p = p/Total[p]

    (* {0.0423276, 0.0266268, 0.0823042, 0.0855377, 0.0168512, 0.0535499, 0.068562, 0.0471883, 0.0100306,

         0.0546233, 0.0347716, 0.0926354, 0.00674148, 0.0390817, 0.0559113, 0.0806157, 0.0890967, 0.00176155,

         0.0600508, 0.0517321} *)


    inclusionProbabilities[p, 4]

    (* {0.176228, 0.113839, 0.318929, 0.329457, 0.0732076, 0.218636, 0.272497, 0.194821, 0.044056, 0.222596,

         0.146647, 0.352041, 0.0297645, 0.163622, 0.227326, 0.313371, 0.340871, 0.00783841, 0.242363, 0.211891} *)



    ------------------------------
    Jim Baldwin
    Retired
    ------------------------------



  • 6.  RE: A samping question

    Posted 9 days ago

    The question comes from a physical situation, where values of k would be in the 100-300 range, out of n=500-1,000.  An asymptotic formula would be the best.  The sampling will be done sequentially.



    ------------------------------
    Virginia Rovnyak
    ------------------------------



  • 7.  RE: A samping question

    Posted 9 days ago

    Virginia,

    Here's an example with n = 800, k = 300. The times are not huge.

     #user  system elapsed 
     #2.121   0.062   2.297

    n <- 800 ; k <- 300
    x <- 1:n
    w <- rep(c(1,2,3,3,4,5,2,6), each=100)
    w <- w/sum(w)
    nsim <- 10000
    M <- matrix(nrow = nsim,ncol = k, byrow = TRUE)
    ##############
    system.time(
    for(i in 1:nsim){  
    y <- sample(x,k,w,replace = F)  
    M[i,] <- y }
    )
    ###############
    MT <- table(M)
    MT/nsim


    ------------------------------
    [Giles] [Warrack]
    [Retired]
    [North Carolina A&T State University]
    ------------------------------



  • 8.  RE: A samping question

    Posted 9 days ago

    Here's Anthony's code with minor modifications.  Weights are used to create a population with an appropriate number of repeats. We can make this work for any set of given weights. The estimated probabilities sum to one when normalized by k. There must be a theoretical justification for this (I don't know what it is, though ;-) Also, marginally the weights seem to be equal to the original weights we started with--which should be expected, I think. 

    n <- 6 ; k <- 4  # choose n and k
    x <- 1:n # number the points from 1 to n
    w <- c(1/12,1/12,2/12,3/12,3/12,2/12) # given weights, assumed to be sampling probabilities
    fw <- w*12                            # Multiply by a number to make all fractions into whole numbers
    fx <- rep(x,fw)                       # Create a population vectors with required number of repeats
    nsim <- 100000 # choose number of simulations
    M <- matrix(nrow = nsim,ncol = k, byrow = TRUE) # M is nsim by k, lists points selected for each simulation
    for(i in 1:nsim){
    #  y <- sample(x,k,w,replace = F)
       y <- sample(fx,k,replace = F)
      M[i,] <- y
    }
    MT <- table(M)
    MT/nsim #gives the estimated probability for each point to be selected
    (MT/nsim)/k #gives the estimated probability for each point to be selected
    sum((MT/nsim)/k)                       #Sum of probability check

    Nagaraj 



    ------------------------------
    Nagaraj Neerchal
    Professor and Chair
    UMBC
    ------------------------------



  • 9.  RE: A samping question

    Posted 13 days ago

    Hello Virginia, this is a unique and interesting and quite cute problem. I thought about it for 10 or 15 minutes and did not have an answer. A practical answer is to run a simulation to get a very good approximation, but let's take on the challenge of getting an answer in a closed form.

    Nayak



    ------------------------------
    Nayak Polissar
    Principal Statistician
    The Mountain-Whisper-Light: Statistics & Data Science
    ------------------------------



  • 10.  RE: A samping question

    Posted 13 days ago
      |   view attached

    Hi Virginia,

    The problem is likely not new.  Some relevant ideas are covered in the following link:

    https://math.stackexchange.com/questions/2729561/probability-of-an-unordered-sample-under-weighted-sampling-without-replacement

    A paper linked below is not about your problem, but is somehow related too:

    https://www.sciencedirect.com/science/article/pii/S002001900500298X

    Efraimidis, P.S., Spirakis, P.G. (2006). Weighted random sampling with a reservoir.  Information Processing Letters, 97(5), 181-185.

    I typed some elementary notes about the algorithm to calculate the target probability, it uses the same logic as in the  stackexchange link; see attached. With large k, the implementation will be messy, but I doubt that it's possible to simplify the summation formulas.

    Cheers,

    Sergei



    ------------------------------
    Sergei Leonov
    Head, Statistical Innovation
    CSL Behring
    ------------------------------

    Attachment(s)

    pdf
    SL_ASA_RandomS.pdf   99 KB 1 version


  • 11.  RE: A samping question

    Posted 12 days ago
    The answer depends on what sampling algorithm you use; there are many.
    Some give selection probabilities equal to weight * sample size.
    The greedy algorithm used by R's sample(), sampling each observation with probabilities proportional to weights for items not yet selected, has overall selection probabilities that are in general intractable, but are smaller than weight * sample size for observations with large weight.
    @Book{brew83,
      author = {K. R. W. Brewer and Muhammad Hanif},
      title = {Sampling with unequal probabilities},
      publisher = {Springer-Verlag},
      year = 1983,
      volume = 15,
      series = {Lecture Notes in Statistics},
      annote = {Describes 50 different procedures for sampling without replacement with unequal probabilities}
    }


    ------------------------------
    Tim Hesterberg
    Staff Data Scientist
    Instacart
    ------------------------------



  • 12.  RE: A samping question

    Posted 12 days ago

    Many thanks to everyone.  I have passed all of your responses on to my colleague.  It is clear that the answer depends on the sampling method.  I took a sample of 2 without replacement from a set of 3, using a sequential method of sampling, and I got a different answer for the desired probability from the one using the formula given by Perplexity.AI. 



    ------------------------------
    Virginia Rovnyak
    ------------------------------



  • 13.  RE: A samping question

    Posted 12 days ago

    You might want to consult Chapter 6 entitled "Unequal Probability Sampling" of the textbook "Sampling, Third Edition" by Steven K. Thompson.(Wiley, 2012).  On  page 71, look at example 2 involving the Horvitz-Thompson estimator.



    ------------------------------
    Stephen Mansour
    ------------------------------