Sampling with replace=True

Pavel_Zahariev · July 9, 2023, 8:15pm

Hi All,
I have a question related to a piece of code in C3_W3_Lab_1_Central_Limit_Theorem:

def sample_means...
      # Get a sample of the data WITH replacement
      sample = np.random.choice(data, size=sample_size)

In the comments it is stressed that “WITH replacement” is very important to ensure independence.

You take random samples out of the population (the sampling is done with replacement, which means that once you select an element you put it back in the sampling space so you could choose a particular element more than once). This ensures that the independence condition is met.

The way this piece of NumPy code works, if we have [1,2,3,4,5], then [1, 2, 2] is a possible sample of size 3, with “2” selected twice. But wasn’t the point in the lectures to allow repeated values across different samples, so that the samples are independent of each other?
If I change the code to use np.random.choice(…, replace=False), the distribution also looks pretty Gaussian to me (although the sampling takes more time).
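
For anyone who wants to reproduce this comparison, here is a minimal sketch. It assumes a made-up skewed population rather than the lab's data, and the sample_means signature below is hypothetical, not the lab's own function:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

# Hypothetical skewed population; the lab's actual data is not reproduced here.
population = rng.exponential(scale=1.0, size=10_000)

def sample_means(data, sample_size, n_samples, replace):
    """Return the means of n_samples random samples drawn from data."""
    return np.array([
        rng.choice(data, size=sample_size, replace=replace).mean()
        for _ in range(n_samples)
    ])

means_with = sample_means(population, sample_size=30, n_samples=2_000, replace=True)
means_without = sample_means(population, sample_size=30, n_samples=2_000, replace=False)

print("population mean:    ", population.mean())
print("with replacement:   ", means_with.mean(), means_with.std())
print("without replacement:", means_without.mean(), means_without.std())
```

With a sample size this small relative to the population, the two sets of sample means should have very similar centre and spread, which is consistent with both histograms looking Gaussian.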

What am I missing here?

Thanks
Pavel

Mujassim_Jamal · July 10, 2023, 11:53am

I haven’t taken this course yet, but based on my knowledge, I would say that each item is sampled with replacement to ensure independence ‘within the sample’.

Elemento · July 10, 2023, 2:12pm

Hey @Pavel_Zahariev,

Abbreviations

Sampling with Replacement → SWHR
Sampling without Replacement → SWOR
Sampling Distribution of Sample Means → SDSM

And before starting, note that “replacement” gives a greater degree of independence among the samples, i.e., the SDSM is more likely to resemble a Gaussian distribution.

Now, there are 4 different scenarios which can be laid out here:

1. Across different sample sets (i.e., the sample of samples), we use SWHR, and for the values within a single sample set, we also use SWHR. In other words, a value selected once can repeat in any of the sample sets, including the one it is already selected in. This is the one used in the lab and described in the lecture videos.
2. Across different sample sets, we use SWHR, and for the values within a single sample set, we use SWOR. In other words, a value selected once can’t repeat in the same sample set, but it can be present in other sample sets. This is what you are proposing.
3. Across different sample sets, we use SWOR, and for the values within a single sample set, we use SWHR. In other words, a value selected once can repeat in the same sample set, but it can’t be present in other sample sets.
4. Across different sample sets, we use SWOR, and for the values within a single sample set, we also use SWOR. In other words, a value selected once can’t repeat in any of the sample sets, including the one it is already selected in.
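
To make the first two scenarios concrete, here is a small sketch. The population, sizes, and variable names are assumptions chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, only for reproducibility
population = np.arange(1, 11)    # hypothetical population {1, ..., 10}
sample_size = 3
n_sample_sets = 5

# Scenario 1: every sample set is drawn from the full population,
# and a value may repeat inside a sample set.
scenario_1 = [
    rng.choice(population, size=sample_size, replace=True)
    for _ in range(n_sample_sets)
]

# Scenario 2: every sample set is still drawn from the full population,
# but values are unique inside each sample set.
scenario_2 = [
    rng.choice(population, size=sample_size, replace=False)
    for _ in range(n_sample_sets)
]

for s1, s2 in zip(scenario_1, scenario_2):
    print("scenario 1:", s1, "| scenario 2:", s2)
```

In both cases each call to rng.choice starts again from the whole population, so a value can show up in more than one sample set. Only the replace flag changes what can happen inside a single set.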

Let’s start with the 4th scenario. You might believe that in this case the SDSM won’t look like a Gaussian, but the fact is that under certain conditions, it could. Let me quote this article here:

It’s often cited that a sample should be no more than 10% of a population if sampling is done without replacement.
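
One way to see where a rule of thumb like this comes from is the standard finite population correction for the variance of a sample mean drawn without replacement. The symbols here are assumptions for illustration: N is the population size, n the sample size, and σ² the population variance.

```latex
\operatorname{Var}(\bar{X}_{\text{SWOR}})
  = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1},
\qquad
\operatorname{Var}(\bar{X}_{\text{SWHR}})
  = \frac{\sigma^2}{n}
```

For large N the extra factor is roughly 1 − n/N, so at n/N = 0.1 the standard error under SWOR is about √0.9 ≈ 0.95 times the SWHR value, i.e., the two are already close.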

Now, if we take a look at the 2nd and 3rd scenarios, they offer “replacement” in at least one aspect, unlike the 4th scenario. In other words, if the 4th scenario can work out in certain cases, I believe that the 2nd and 3rd scenarios can at least work out under those same conditions, and perhaps even under more relaxed constraints than the 4th scenario.

All of this makes it easier to think about the 1st scenario. It should work under even more relaxed constraints than the 2nd and 3rd scenarios, since it offers replacement in both aspects.

Cherry on the cake, it is the easiest to implement and, most likely, the most computationally efficient too. So, why would we want to pick any other scenario? Let me know if you have any reason in mind.

Cheers,
Elemento

Pavel_Zahariev · July 10, 2023, 3:43pm

Hi Elemento,
Thanks for this detailed answer!
I don’t have any reason in mind. Sampling with replacement is something new to me, as I lack a statistical background. In my mind, a sample of three people’s heights, for example, means three different people. The following is taken from here:

Sampling without replacement – Selected subjects will not be in the “pool” for selection. All selected subjects are unique. This is the default assumption for statistical sampling.

I am probably struggling to understand the concept of independence. I thought that it was completely covered in case 2, as you numbered them in your explanation. I should read a bit more on the matter, I guess. Do you think there is something in the videos that I am missing?

Thanks
Pavel

Elemento · July 12, 2023, 3:19am

Hey @Pavel_Zahariev,

Please don’t interpret this as meaning that “Sampling without replacement” is done more commonly than the other way around. It just means that when nothing is mentioned about how the sampling is done, we can assume that it is done without replacement.

As to this, note that in the 2nd scenario, when the ratio Sample Size / Population Size is small, then for a single sample set, SWOR is more or less the same as SWHR, since the probability of the same element being randomly sampled again is quite low. Moreover, since elements can repeat across different sample sets (using SWHR), we don’t need to worry about the number of sample sets at all.
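
As a rough check of the “small ratio” point, the sketch below computes the probability that a with-replacement sample happens to contain no repeated element, in which case it looks exactly like a without-replacement sample. The function name and the example sizes are assumptions for illustration:

```python
def prob_all_distinct(population_size, sample_size):
    """Probability that sample_size draws WITH replacement from
    population_size equally likely elements never pick the same element twice."""
    p = 1.0
    for k in range(sample_size):
        # On draw k+1, k elements are already "used"; avoid all of them.
        p *= (population_size - k) / population_size
    return p

# Large ratio: repeats are common, so SWHR and SWOR differ noticeably.
print(prob_all_distinct(10, 3))

# Small ratio: repeats are rare, so SWHR and SWOR behave almost the same.
print(prob_all_distinct(100_000, 30))
```

The closer this probability is to 1, the less it matters which of the two is used inside a single sample set.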

Lastly, regarding this, we can easily understand why “SWOR might create an issue with the independence of sample sets” with the help of a simple example.

Population = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
Sample_1 = {1, 9, 4}
Remaining Population = {2, 3, 5, 6, 7, 8, 10}
Sample_2 = {3, 7, 2}

Here, note that Sample_1 ∩ Sample_2 will always be a null set (in the case of SWOR), i.e., the sample sets are not independent. However, in the case of SWHR, we can state that the sample sets are independent, since each of the sample sets is sampled from the entire population, and not a subset of the population.
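
The same contrast can be sketched in code. This is an illustration under assumed names and sizes, with rng.permutation standing in for “draw without replacement across sample sets”:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, only for reproducibility
population = np.arange(1, 11)    # {1, ..., 10}, as in the example above

# SWOR across sample sets: shuffle once, then cut off disjoint chunks,
# so the second set is effectively drawn from the remaining population.
shuffled = rng.permutation(population)
swor_1, swor_2 = shuffled[:3], shuffled[3:6]
overlap_swor = set(swor_1.tolist()) & set(swor_2.tolist())  # always empty

# SWHR across sample sets: each set is drawn from the full population.
swhr_1 = rng.choice(population, size=3, replace=False)
swhr_2 = rng.choice(population, size=3, replace=False)
overlap_swhr = set(swhr_1.tolist()) & set(swhr_2.tolist())  # may or may not be empty

print("SWOR sets:", swor_1, swor_2, "overlap:", overlap_swor)
print("SWHR sets:", swhr_1, swhr_2, "overlap:", overlap_swhr)
```

In the SWOR case, knowing the first set tells you which values the second set cannot contain. In the SWHR case, the first set tells you nothing about the second.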

Let us know if this helps.

Cheers,
Elemento
