ASA Connect

View Only

Back to discussions

Expand all | Collapse all

A samping question

1. A samping question

Recommend
Virginia Rovnyak
Posted 08-28-2024 22:12
A colleague is interested in the following situation. There is a finite set of n points, each with a certain weight. A weighted random sample of k points is drawn without replacement. What is the probability that a given x will be in this sample?

Is this a known problem? Any information or references would be appreciated.

------------------------------
Virginia Rovnyak
------------------------------
2. RE: A samping question

Recommend
David Wilson
Posted 08-29-2024 10:03
it depends on how the sampling occurs.

For k=2, several sampling methods (Brewer, Rao, and Durbin) show the probability of selection is 2*weight of x / sum of weights

There are other means of selecting samples of size 2 that don't yield the same probability of selection as Brewer, Rao, and Durbin's methods.

There is a method by Sunter that supports k>2 but the form of the selection probabilities are not all proportional to the weight of x.

see Sunter, A.B. (1977b). List sequential sampling with equal or unequal probabilities without replacement. Applied Statistics 26, 261-268.

Sarndal, Swensson, and Wretman in Model Assisted Survey Sampling talk about samples of size 1, 2, and n>2 and is a good reference.

------------------------------
David Wilson
------------------------------

Original Message
3. RE: A samping question

Recommend
Paul Auclair
Posted 08-29-2024 10:27
Virginia, here's a reply from Perplexity.AI that describes the problem, outlines a solution, and provides some references.

https://www.perplexity.ai/search/a-colleague-is-interested-in-t-uckvx4wMSrKXKnhIyD3blw

------------------------------
Paul Auclair
Corporate Operations Research Analyst
LinQuest Corporation
------------------------------

Original Message
4. RE: A samping question

Recommend
Anthony Warrack
Posted 08-31-2024 13:10
Virginia, I think the following R program should provide reasonable estimates for the probabilities for each point (note: they do not add to one). Also note that points with the same weights should have the same selection probabilities. This could be achieved by averaging probabilities for points with the same weights

####################

n <- 6 ; k <- 2 # choose n and k
x <- 1:n # number the points from 1 to n
w <- c(1/12,1/12,2/12,3/12,3/12,2/12) # given weights, assumed to be sampling probabilities
nsim <- 10000 # choose number of simulations
M <- matrix(nrow = nsim,ncol = k, byrow = TRUE) # M is nsim by k, lists points selected for each simulation
for(i in 1:nsim){
y <- sample(x,k,w,replace = F)
M[i,] <- y
}
MT <- table(M)
MT/nsim #gives the estimated probability for each point to be selected

###################

------------------------------
[Giles] [Warrack]
[Retired]
[North Carolina A&T State University]
------------------------------

Original Message
5. RE: A samping question

Recommend
Jim Baldwin
Posted 09-01-2024 02:48
Note that the exact inclusion probabilities for Anthony Warrack 's R code when k=2 can be obtained using his notation with

w*(1 - w/(1 - w) + sum(w/(1 - w)))

Here is some Mathematica code to obtain exact inclusion probabilities for other values of k. (Note that with even moderate values of k and n the calculations might take longer that the age of the earth.)

(* Function to calculate probability of each sample *)

pr[indices_, p_] := Module[{remaining, prob},

remaining = 1 - p[[indices[[1]]]];

prob = p[[indices[[1]]]];

Do[prob = prob*p[[indices[[j]]]]/remaining;

remaining = remaining - p[[indices[[j]]]], {j, 2, Length[indices]}];

prob

]

(* Function to calculate inclusion probabilities *)

inclusionProbabilities[p_, k_] := Module[{perms, allProb, data},

If[k == 2, p (1 - p/(1 - p) + Total[p/(1 - p)]),

perms = Permutations[Range[Length[p]], {k}]; (* Generate all possible samples *)

allProb = pr[#, p] & /@ perms ; (* Get probabilities of every possible sample: these sum to 1 *)

data = Transpose[{perms, allProb}]; (*

Combine samples with associated probability of selection *)

(* Find the probabilities of inclusion for each element *)

Table[Total[Select[data, MemberQ[#[[1]], i] &][[All, 2]]], {i, 1, Length[p]}]]

]

(* Samples of size 3 *)

p = {1/12, 1/12, 2/12, 3/12, 3/12, 2/12};

inclusionProbabilities[p, 3]

(* {8203/27720,8203/27720,1819/3465,1255/1848,1255/1848,1819/3465} *)

inclusionProbabilities[p, 3] // N

(* {0.295924, 0.295924, 0.524964, 0.679113, 0.679113, 0.524964} *)

Here is a slightly larger example:

p = RandomVariate[UniformDistribution[{0, 1}], 20];
p = p/Total[p]

(* {0.0423276, 0.0266268, 0.0823042, 0.0855377, 0.0168512, 0.0535499, 0.068562, 0.0471883, 0.0100306,

0.0546233, 0.0347716, 0.0926354, 0.00674148, 0.0390817, 0.0559113, 0.0806157, 0.0890967, 0.00176155,

0.0600508, 0.0517321} *)

inclusionProbabilities[p, 4]

(* {0.176228, 0.113839, 0.318929, 0.329457, 0.0732076, 0.218636, 0.272497, 0.194821, 0.044056, 0.222596,

0.146647, 0.352041, 0.0297645, 0.163622, 0.227326, 0.313371, 0.340871, 0.00783841, 0.242363, 0.211891} *)

------------------------------
Jim Baldwin
Retired
------------------------------

Original Message
6. RE: A samping question

Recommend
Virginia Rovnyak
Posted 09-02-2024 11:26
The question comes from a physical situation, where values of k would be in the 100-300 range, out of n=500-1,000. An asymptotic formula would be the best. The sampling will be done sequentially.

------------------------------
Virginia Rovnyak
------------------------------

Original Message
7. RE: A samping question

Recommend
Anthony Warrack
Posted 09-02-2024 13:54
Virginia,

Here's an example with n = 800, k = 300. The times are not huge.

#user system elapsed #2.121 0.062 2.297

n <- 800 ; k <- 300
x <- 1:n
w <- rep(c(1,2,3,3,4,5,2,6), each=100)
w <- w/sum(w)
nsim <- 10000
M <- matrix(nrow = nsim,ncol = k, byrow = TRUE)
##############
system.time(
for(i in 1:nsim){
y <- sample(x,k,w,replace = F)
M[i,] <- y }
)
###############
MT <- table(M)
MT/nsim

------------------------------
[Giles] [Warrack]
[Retired]
[North Carolina A&T State University]
------------------------------

Original Message
8. RE: A samping question

Recommend
Nagaraj Neerchal
Posted 09-02-2024 11:51
Here's Anthony's code with minor modifications. Weights are used to create a population with an appropriate number of repeats. We can make this work for any set of given weights. The estimated probabilities sum to one when normalized by k. There must be a theoretical justification for this (I don't know what it is, though ;-) Also, marginally the weights seem to be equal to the original weights we started with--which should be expected, I think.

n <- 6 ; k <- 4 # choose n and k
x <- 1:n # number the points from 1 to n
w <- c(1/12,1/12,2/12,3/12,3/12,2/12) # given weights, assumed to be sampling probabilities
fw <- w*12 # Multiply by a number to make all fractions into whole numbers
fx <- rep(x,fw) # Create a population vectors with required number of repeats
nsim <- 100000 # choose number of simulations
M <- matrix(nrow = nsim,ncol = k, byrow = TRUE) # M is nsim by k, lists points selected for each simulation
for(i in 1:nsim){
# y <- sample(x,k,w,replace = F)
y <- sample(fx,k,replace = F)
M[i,] <- y
}
MT <- table(M)
MT/nsim #gives the estimated probability for each point to be selected
(MT/nsim)/k #gives the estimated probability for each point to be selected
sum((MT/nsim)/k) #Sum of probability check

Nagaraj

------------------------------
Nagaraj Neerchal
Professor and Chair
UMBC
------------------------------

Original Message
9. RE: A samping question

Recommend
Nayak Polissar
Posted 08-29-2024 13:54
Hello Virginia, this is a unique and interesting and quite cute problem. I thought about it for 10 or 15 minutes and did not have an answer. A practical answer is to run a simulation to get a very good approximation, but let's take on the challenge of getting an answer in a closed form.

Nayak

------------------------------
Nayak Polissar
Principal Statistician
The Mountain-Whisper-Light: Statistics & Data Science
------------------------------

Original Message
10. RE: A samping question

Recommend
Sergei Leonov
Posted 08-29-2024 14:49
| view attached
Hi Virginia,

The problem is likely not new. Some relevant ideas are covered in the following link:

https://math.stackexchange.com/questions/2729561/probability-of-an-unordered-sample-under-weighted-sampling-without-replacement

A paper linked below is not about your problem, but is somehow related too:

https://www.sciencedirect.com/science/article/pii/S002001900500298X

Efraimidis, P.S., Spirakis, P.G. (2006). Weighted random sampling with a reservoir. Information Processing Letters, 97(5), 181-185.

I typed some elementary notes about the algorithm to calculate the target probability, it uses the same logic as in the stackexchange link; see attached. With large k, the implementation will be messy, but I doubt that it's possible to simplify the summation formulas.

Cheers,

Sergei

------------------------------
Sergei Leonov
Head, Statistical Innovation
CSL Behring
------------------------------

Attachment(s)

SL_ASA_RandomS.pdf 99 KB 1 version

Original Message
11. RE: A samping question

Recommend
Tim Hesterberg
Posted 08-29-2024 19:16
The answer depends on what sampling algorithm you use; there are many.

Some give selection probabilities equal to weight * sample size.

The greedy algorithm used by R's sample(), sampling each observation with probabilities proportional to weights for items not yet selected, has overall selection probabilities that are in general intractable, but are smaller than weight * sample size for observations with large weight.

@Book{brew83,

author = {K. R. W. Brewer and Muhammad Hanif},

title = {Sampling with unequal probabilities},

publisher = {Springer-Verlag},

year = 1983,

volume = 15,

series = {Lecture Notes in Statistics},

annote = {Describes 50 different procedures for sampling without replacement with unequal probabilities}

}

------------------------------
Tim Hesterberg
Staff Data Scientist
Instacart
------------------------------

Original Message
12. RE: A samping question

Recommend
Virginia Rovnyak
Posted 08-29-2024 22:09
Many thanks to everyone. I have passed all of your responses on to my colleague. It is clear that the answer depends on the sampling method. I took a sample of 2 without replacement from a set of 3, using a sequential method of sampling, and I got a different answer for the desired probability from the one using the formula given by Perplexity.AI.

------------------------------
Virginia Rovnyak
------------------------------

Original Message
13. RE: A samping question

Recommend
Stephen Mansour
Posted 08-29-2024 22:11
You might want to consult Chapter 6 entitled "Unequal Probability Sampling" of the textbook "Sampling, Third Edition" by Steven K. Thompson.(Wiley, 2012). On page 71, look at example 2 involving the Horvitz-Thompson estimator.

------------------------------
Stephen Mansour
------------------------------

Original Message

ASA Connect

A samping question

Virginia Rovnyak08-28-2024 22:12

David Wilson08-29-2024 10:03

Paul Auclair08-29-2024 10:27

Anthony Warrack08-31-2024 13:10

Jim Baldwin09-01-2024 02:48

Virginia Rovnyak09-02-2024 11:26

Anthony Warrack09-02-2024 13:54

Nagaraj Neerchal09-02-2024 11:51

Nayak Polissar08-29-2024 13:54

Sergei Leonov08-29-2024 14:49

Tim Hesterberg08-29-2024 19:16

Virginia Rovnyak08-29-2024 22:09

Stephen Mansour08-29-2024 22:11

1. A samping question

2. RE: A samping question

3. RE: A samping question

4. RE: A samping question

####################

5. RE: A samping question

####################

6. RE: A samping question

####################

7. RE: A samping question

####################

8. RE: A samping question

####################

9. RE: A samping question

10. RE: A samping question

11. RE: A samping question

12. RE: A samping question

13. RE: A samping question

Contact Us

Membership

Privacy

Follow Us

ASA Connect

A samping question

Virginia Rovnyak08-28-2024 22:12

David Wilson08-29-2024 10:03

Paul Auclair08-29-2024 10:27

Anthony Warrack08-31-2024 13:10

Jim Baldwin09-01-2024 02:48

Virginia Rovnyak09-02-2024 11:26

Anthony Warrack09-02-2024 13:54

Nagaraj Neerchal09-02-2024 11:51

Nayak Polissar08-29-2024 13:54

Sergei Leonov08-29-2024 14:49

Tim Hesterberg08-29-2024 19:16

Virginia Rovnyak08-29-2024 22:09

Stephen Mansour08-29-2024 22:11

1. A samping question

2. RE: A samping question

3. RE: A samping question

4. RE: A samping question

####################

5. RE: A samping question

####################

6. RE: A samping question

####################

7. RE: A samping question

####################

8. RE: A samping question

####################

9. RE: A samping question

10. RE: A samping question

11. RE: A samping question

12. RE: A samping question

13. RE: A samping question

Related Content

RE: A samping question

CSL Behring is looking for Prinicipal Biostatistician!

Short course on estimating propensity scores and inverse probability of treatment weights when drawing causal inferences

Reminder: Short course on estimating propensity scores and inverse probability of treatment weights when drawing causal inferences

Probability/Non-probability Sampling vs. Parametric/Non-parametric Test

Contact Us

Membership

Privacy

Follow Us