Sampling with replace=True

Hi All,
I have a question related to a piece of code in C3_W3_Lab_1_Central_Limit_Theorem:

def sample_means...
      # Get a sample of the data WITH replacement
      sample = np.random.choice(data, size=sample_size)

In the comments it is stressed that “WITH replacement” is very important to ensure independence.

  • You take random samples out of the population (the sampling is done with replacement, which means that once you select an element you put it back in the sampling space so you could choose a particular element more than once). This ensures that the independence condition is met.

The way this NumPy code works is that, for example, if we have [1,2,3,4,5], then [1, 2, 2] is a possible sample of size 3, where “2” is selected twice. Wasn’t the point in the lectures to allow repeated values across different samples, to ensure the independence of the samples?
If I change the code to use np.random.choice(…, replace=False), the distribution also looks pretty Gaussian to me (though the sampling takes more time).

What am I missing here?

Thanks
Pavel

I haven’t taken this course yet, but based on my knowledge, I would say that each item is sampled with replacement to ensure independence ‘within the sample’.

Hey @Pavel_Zahariev,

Abbreviations

  • Sampling with Replacement → SWHR
  • Sampling without Replacement → SWOR
  • Sampling Distribution of Sample Means → SDSM

And before starting, note that “replacement” ensures a greater degree of independence among the samples, i.e., the SDSM is more likely to resemble a Gaussian distribution.


Now, there are 4 different scenarios which can be laid out here:

  1. Across different sample sets (i.e., the sample of samples), we use SWHR, and for the values within a single sample set, we use SWHR as well. In other words, a value selected once can repeat in any of the sample sets, including the one it was already selected in. This is the one used in the lab and described in the lecture videos.
  2. Across different sample sets, we use SWHR, and for the values within a single sample set, we use SWOR. In other words, a value selected once can’t repeat in the same sample set, but it can be present in other sample sets. This is what you are proposing.
  3. Across different sample sets, we use SWOR, and for the values within a single sample set, we use SWHR. In other words, a value selected once can repeat in the same sample set, but it can’t be present in other sample sets.
  4. Across different sample sets, we use SWOR, and for the values within a single sample set, we use SWOR as well. In other words, a value selected once can’t repeat in any of the sample sets, including the one it was already selected in.
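
To make scenarios 1 and 2 concrete, here is a minimal sketch (not the lab’s actual code). The exponential population, the seed, and the sizes are assumptions made up for illustration, and the helper only borrows the name sample_means from the lab; its signature is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
# Hypothetical skewed population, standing in for the lab's data
population = rng.exponential(scale=2.0, size=10_000)

def sample_means(data, sample_size, n_samples, replace):
    """Draw n_samples sample sets from the full data and return their means.

    Every sample set is drawn from the whole population, so values can repeat
    across sample sets. `replace` only controls repeats within one sample set.
    """
    return np.array([
        rng.choice(data, size=sample_size, replace=replace).mean()
        for _ in range(n_samples)
    ])

means_s1 = sample_means(population, 30, 2_000, replace=True)   # scenario 1
means_s2 = sample_means(population, 30, 2_000, replace=False)  # scenario 2

print("scenario 1:", means_s1.mean(), means_s1.std())
print("scenario 2:", means_s2.mean(), means_s2.std())
```

With a sample size of 30 out of 10,000, the two sets of means should come out nearly identical, which matches the observation that replace=False still looks Gaussian.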

Let’s start out with the 4th scenario. Now, you might believe that in this case the SDSM won’t look like a Gaussian, but the fact is that under certain conditions, it could. Let me quote this article here:

It’s often cited that a sample should be no more than 10% of a population if sampling is done without replacement.
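
One way to get a feel for that 10% rule of thumb is the finite population correction (FPC), the factor by which the standard error of the mean shrinks under SWOR compared to SWHR. The sketch below is only an illustration; the function name fpc and the population size are assumptions, not something taken from the article or the lab.

```python
import math

def fpc(population_size, sample_size):
    """Finite population correction for the standard error of the mean under SWOR.

    Under SWOR, SE = (sigma / sqrt(n)) * sqrt((N - n) / (N - 1)).
    """
    return math.sqrt((population_size - sample_size) / (population_size - 1))

N = 10_000  # hypothetical population size
for n in (100, 1_000, 5_000):
    print(f"n/N = {n / N:.2f}: SE multiplier = {fpc(N, n):.3f}")
```

When the sample is a small fraction of the population the multiplier stays close to 1, so SWOR behaves almost like SWHR; as the fraction grows, the two drift apart.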

Now, if we take a look at the 2nd and 3rd scenarios, they do offer “replacement” in at least one aspect, as compared to the 4th scenario. In other words, if the 4th scenario can work out in certain cases, I believe that the 2nd and 3rd scenarios can at least work out under the same conditions, and perhaps even under more relaxed constraints than the 4th scenario.

Now, all of this makes it easier to reason about the 1st scenario. In other words, the 1st scenario should work under even more relaxed constraints than the 2nd and 3rd scenarios, since it offers replacement in both aspects.

As the cherry on the cake, it is the easiest to implement and, most likely, the most computationally efficient too. So, why would we want to pick any other scenario? Let me know if you have any reason in mind.

Cheers,
Elemento

Hi Elemento,
Thanks for this detailed answer!
I don’t have any reason in mind. Sampling with replacement is something new to me, as I lack a statistical background. In my mind, a sample of three people’s heights, for example, means three different people. The following is taken from here:

Sampling without replacement – Selected subjects will not be in the “pool” for selection. All selected subjects are unique. This is the default assumption for statistical sampling.

I am probably struggling to understand the concept of independence; I thought that it was completely covered by case 2, as you numbered them in your explanation. I should read a bit more on the matter, I guess. Do you think there is something in the videos that I am missing?

Thanks
Pavel

Hey @Pavel_Zahariev,

Please don’t interpret this to mean that “Sampling without replacement” is more commonly done than the other way around. It just means that when nothing is said about how the sampling is done, we can assume that it is done without replacement.

As to this, note that in the 2nd scenario, when the ratio Sample Size / Population Size is small, SWOR within a single sample set is more or less the same as SWHR, since the probability of the same value being randomly sampled again is quite low. Moreover, since values can repeat across different sample sets (using SWHR), we don’t need to worry about the number of sample sets at all.
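
To put a rough number on “quite low”, here is a small sketch that computes the probability that a sample drawn with replacement contains no repeated element at all. It assumes every population element is equally likely to be picked; the function name prob_all_distinct is made up for this example.

```python
def prob_all_distinct(population_size, sample_size):
    """Probability that SWHR picks sample_size different elements.

    The k-th draw (k = 0, 1, ...) must avoid the k elements already picked,
    which happens with probability (N - k) / N.
    """
    p = 1.0
    for k in range(sample_size):
        p *= (population_size - k) / population_size
    return p

for N, n in [(10, 3), (10_000, 3), (10_000, 30)]:
    print(f"N = {N}, n = {n}: P(no repeats) = {prob_all_distinct(N, n):.4f}")
```

For a population of 10 and a sample of 3 this gives 0.72, so repeats are common; for a population of 10,000 the probability is very close to 1, and SWHR and SWOR become hard to tell apart.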

Lastly, regarding this, we can easily understand why “SWOR might create an issue with the independence of sample sets” with the help of a simple example.

Population = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
Sample_1 = {1, 9, 4}
Remaining Population = {2, 3, 5, 6, 7, 8, 10}
Sample_2 = {3, 7, 2}

Here, note that Sample_1 ∩ Sample_2 will always be an empty set (in the case of SWOR): once you know Sample_1, you also know which values can never appear in Sample_2, i.e., the sample sets are not independent. However, in the case of SWHR, we can state that the sample sets are independent, since each of the sample sets is sampled from the entire population, and not from a subset of the population.
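
The same example can be checked with a quick simulation. This is only a sketch: the seed and the number of trials are arbitrary, and for the “full population” case I keep replace=False inside each sample set (the 2nd scenario) so that only the effect across sample sets is visible.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
population = np.arange(1, 11)   # {1, 2, ..., 10}

# SWOR across sample sets: Sample_2 comes from what is left after Sample_1
shuffled = rng.permutation(population)
sample_1, sample_2 = shuffled[:3], shuffled[3:6]
overlap_swor = set(sample_1) & set(sample_2)  # empty by construction

# Each sample set drawn from the full population (scenario 2)
trials = 10_000
overlaps = 0
for _ in range(trials):
    s1 = rng.choice(population, size=3, replace=False)
    s2 = rng.choice(population, size=3, replace=False)
    overlaps += bool(set(s1) & set(s2))
share_overlap = overlaps / trials

print("overlap under SWOR across sets:", overlap_swor)
print("share of trials with shared values:", share_overlap)
```

Under SWOR across sets the intersection is always empty, while drawing each set from the full population should give an overlap in roughly 1 - C(7,3)/C(10,3) ≈ 0.71 of the trials.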

Let us know if this helps.

Cheers,
Elemento