Understanding Sampling With and Without Replacement (Python)

Sampling with replacement is random sampling that allows a sampling unit to be drawn more than once. It has many applications throughout data science, most notably bootstrapping: a statistical procedure that uses sampling with replacement on a dataset to create many simulated samples. A dataset created by sampling with replacement so that it has the same number of samples as the original dataset is called a bootstrapped dataset. Bootstrapped data is used in machine learning algorithms like bagged trees and random forests as well as in statistical methods like bootstrapped confidence intervals, and more.

This tutorial will dive into sampling with and without replacement and will touch on some common applications of these concepts in data science. As always, the code used in this tutorial is available on my GitHub. With that, let’s get started!

What is Sampling with Replacement?

Sampling with replacement can be defined as random sampling that allows sampling units to occur more than once. Sampling with replacement consists of

  1. A sampling unit (like a glass bead or a row of data) being randomly drawn from a population (like a jar of beads or a dataset).
  2. Recording which sampling unit was drawn.
  3. Returning the sampling unit to the population.

Imagine you have a jar of 12 unique glass beads. If you are sampling with replacement from the jar, the chance of randomly selecting any 1 of the glass beads is 1/12. After selecting a bead, you return it to the jar so that the probability of selecting any of the 12 beads in future draws doesn’t change (1/12). This means that if you repeat the process, it is entirely possible to randomly draw the same bead again (again a 1/12 chance in this case).

The remaining parts of this section show how sampling with replacement can be done using the Python libraries NumPy and pandas, and cover related concepts like bootstrapped datasets and how many duplicate samples you should expect when sampling with replacement to create a bootstrapped dataset.

Sampling with Replacement using NumPy

To better understand sampling with replacement, let’s simulate the process with Python. The code below loads NumPy and samples with replacement 12 times from a NumPy array containing the unique numbers from 0 to 11.

import numpy as np
np.random.seed(3)
# a parameter: generate a list of unique random numbers (from 0 to 11)
# size parameter: how many samples we want (12)
# replace = True: sample with replacement
np.random.choice(a=12, size=12, replace=True)

Notice how there are multiple repeating numbers in the output. We sampled 12 times because the original jar (dataset) we are sampling from has 12 beads (sampling units) in it. The 12 beads we selected now form a bootstrapped dataset, which is a dataset created with sampling with replacement that has the same number of values as the original dataset.

Sampling with Replacement using Pandas

Since most people aren’t interested in sampling beads out of a jar, it is important to mention that a sampling unit can also be something like an entire row of data. The code below creates a bootstrapped dataset using Kaggle’s King County dataset, which contains the prices at which houses sold in King County (which includes Seattle) between May 2014 and May 2015. You can download the dataset from Kaggle or load it from my GitHub.

# Import libraries
import numpy as np
import pandas as pd

# Load dataset
url = 'https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv'
df = pd.read_csv(url)

# Selecting columns I am interested in
columns = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','price']
df = df.loc[:, columns]

# Only use 15 rows of the dataset for illustrative purposes
df = df.head(15)

# Notice how we have 3 rows with the index label 8
df.sample(n=15, replace=True, random_state=2)

Notice there are multiple repeating rows.
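If you want to count the repeats directly, pandas can flag index labels that have already appeared in the sample. Below is a minimal sketch on a small stand-in DataFrame (the column name and values here are made up for illustration, not taken from the King County data):

```python
import pandas as pd

# Small illustrative DataFrame standing in for the housing data
df = pd.DataFrame({'price': range(15)})

# Bootstrap sample: same number of rows, drawn with replacement
boot = df.sample(n=len(df), replace=True, random_state=2)

# Rows whose index label already appeared earlier in the sample are repeats
n_repeats = boot.index.duplicated().sum()
print(n_repeats)
```

The count of repeats equals the sample size minus the number of unique index labels, which is a quick way to cross-check the result.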

How many duplicate samples/rows should you expect when sampling with replacement to create a bootstrapped dataset?

Bootstrapped data is used in bagged trees and usually in random forest models; the two differ in how the decision trees are fit on the bootstrapped data.

It is important to note that when you do sample with replacement to generate data you will likely get duplicate samples/rows. In practice, the average bootstrapped dataset contains about 63.2% of the original rows. This means that for any particular row of data in the original dataset, 36.8% of the bootstrapped datasets will not contain it.

This subsection briefly shows how you can derive these numbers statistically, as well as approximate them experimentally using the Python library pandas.

Basic Statistics

Let’s start by deriving why, for any particular row of data in the original dataset, 36.8% of bootstrapped datasets will not contain that row.

Assume there are N rows of data in the original dataset. If you want to create a bootstrapped dataset, you need to sample with replacement N times.

For a SINGLE draw, the probability that a particular row of data is NOT sampled from the dataset is

(N - 1)/N = 1 - 1/N

Since a bootstrapped dataset is obtained by sampling N times from a dataset of size N, the probability that a particular row is not chosen in any of those N draws (and therefore does not appear in a given bootstrapped dataset) is

(1 - 1/N)^N

If we take the limit as N goes to infinity, this probability approaches e^-1:

lim (N→∞) (1 - 1/N)^N = e^-1 ≈ .368

The probability that any particular row of data from the original dataset appears in the bootstrapped dataset is then just 1 - e^-1 ≈ .63212. Note that in real life, the larger your dataset is (the larger N is), the closer you will get to these numbers.
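You can sanity-check the limit numerically. The short sketch below evaluates (1 - 1/N)^N for increasing N and compares it against e^-1:

```python
import math

# (1 - 1/N)^N approaches e^-1 as N grows
for N in (10, 100, 10_000, 1_000_000):
    print(N, (1 - 1/N) ** N)

# The limiting value
print(math.exp(-1))
```

Even at N = 10 the value is already within about 0.02 of the limit, which is why the 63.2%/36.8% figures hold so well for realistically sized datasets.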

Using pandas

The code below uses pandas to show that a bootstrapped dataset will contain about 63.2% of the original rows.

# Import libraries
import numpy as np
import pandas as pd

# Load dataset
url = 'https://raw.githubusercontent.com/mGalarnyk/Tutorial_Data/master/King_County/kingCountyHouseData.csv'
df = pd.read_csv(url)

# Selecting columns I am interested in
columns = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','price']
df = df.loc[:, columns]

"""
Generate bootstrapped dataset (a dataset generated with sampling with
replacement that has the same number of values as the original dataset).
The % of original rows will vary depending on random_state.
"""
bootstrappedDataset = df.sample(frac=1, replace=True, random_state=2)

Note that the bootstrapped dataset contains about 63.2% of the original rows. This is because the sample size is large (len(df) is 21613). It also means that each bootstrapped dataset will fail to include about 36.8% of the rows from the original dataset.

len(df)

len(bootstrappedDataset.index.unique()) / len(df)

What is Sampling without Replacement?

Sampling without replacement can be defined as random sampling that DOES NOT allow sampling units to occur more than once. Let’s now go over a quick example of how sampling without replacement works.

Imagine you have a jar of 12 unique glass beads. If you are sampling without replacement from the jar, the chance of randomly selecting any 1 of the glass beads is 1/12. After selecting a bead, it is NOT returned to the jar, so the probability of selecting any of the remaining 11 beads on the next draw is now 1/11. This means that with each additional draw there are fewer and fewer beads in the jar, until eventually there are no more beads to sample (after 12 draws).

Sampling without Replacement using NumPy

To ingrain this knowledge, let’s simulate the process with Python. The code below loads NumPy and samples without replacement 12 times from a NumPy array containing the unique numbers from 0 to 11.

import numpy as np
np.random.seed(3)

# a parameter: generate a list of unique random numbers (from 0 to 11)
# size parameter: how many samples we want (12)
# replace = False: sample without replacement
np.random.choice(a=12, size=12, replace=False)

Notice how there aren’t repeating numbers.

Note that if you try to generate a sample WITHOUT replacement that is larger than the original population (12 in this case), you will get an error. Going back to the jar of beads example, you can’t sample more beads than there are in the jar.

np.random.seed(3)
np.random.choice(a=12, size=20, replace=False)
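Concretely, NumPy raises a ValueError in this situation, which you can catch if you want the failure handled gracefully. A minimal sketch:

```python
import numpy as np

np.random.seed(3)
try:
    # Asking for 20 samples from a population of 12 without replacement
    np.random.choice(a=12, size=20, replace=False)
except ValueError as err:
    # NumPy refuses: you can't take a larger sample than the population
    print(err)
```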

Examples of Sampling without Replacement in Data Science

Sampling without replacement is used throughout data science. One very common use is in model validation procedures like train test split and cross validation. In short, each of these procedures allows you to simulate how a machine learning model would perform on new/unseen data.

The train test split procedure consists of splitting a dataset into two pieces: a training set and a testing set. This is done by randomly sampling WITHOUT replacement about 75% of the rows (you can vary this) into your training set and putting the remaining 25% into your test set. For a particular train test split, the features and target end up divided into X_train, X_test, y_train, and y_test.
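As a sketch of what this looks like in code (assuming scikit-learn is installed; the toy feature and target arrays here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features (10 rows, 2 columns) and target standing in for a real dataset
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Rows are sampled without replacement into the two pieces,
# so no row appears in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(len(X_train), len(X_test))
```

Because the split is without replacement, the training and test sets are disjoint, which is exactly what lets them simulate performance on unseen data.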

If you would like to learn more about train test split, you can check out another community post Understanding Train Test Split.

Conclusion

Understanding the concept of sampling with and without replacement is important in statistics and data science. Bootstrapped data is used in machine learning algorithms like bagged trees and random forests as well as in statistical methods like bootstrapped confidence intervals, and more.

A future tutorial will take some of this knowledge and go over how it applies to understanding bagged trees and random forests. If you have any questions or thoughts on the tutorial, feel free to reach out in the comments below.