Construction of mini-batches: could one just "sample w/o replacement from the full batch"?

In the final assignment of “Improving DNNs/Optimization Algorithms”, the “mini-batch gradient descent” exercise goes through some effort to generate mini-batches in such a way that every example from the original batch appears in exactly one of the mini-batches processed during an epoch.

Why not just build a mini-batch by sampling at random without replacement from the original batch?

I do not feel we will miss much.

If we have:

  • Total number of examples M
  • Size of a mini-batch N
  • Number of mini-batches s = M/N (assuming N divides M exactly)

Then the probability that a given example appears in exactly k mini-batches is given by the binomial distribution:

p(k) = \binom{s}{k} \left(\frac{N}{M}\right)^k \left(1 - \frac{N}{M}\right)^{s-k}

ChatGPT did a quick tabulation (Babbage, FRS, RAS, would freak out!), showing that about 1/3 of the examples would be dropped, 1/3 would appear exactly once, and the remainder would appear twice or more:

\begin{array}{|c|c|}
\hline
k & P(X = k) \\ \hline
0 & 0.3677 \\
1 & 0.3681 \\
2 & 0.1840 \\
3 & 0.0613 \\
4 & 0.0153 \\
5 & 0.0030 \\
6 & 0.0005 \\
7 & 0.00007 \\
8 & 0.000009 \\
9 & 0.00000099 \\
10 & 0.000000098 \\
\hline
\end{array}
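For reference, a quick way to reproduce such a tabulation directly from the formula (a sketch; M = 65536 and N = 64, i.e. s = 1024, are assumed values consistent with the figures above):

from math import comb

M = 65536       # assumed total number of examples
N = 64          # assumed mini-batch size, so s = 1024
s = M // N
p = N / M       # probability that a given example lands in one particular mini-batch

for k in range(11):
    # Binomial probability of appearing in exactly k of the s mini-batches
    print(k, round(comb(s, k) * p**k * (1 - p)**(s - k), 7))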


Yes, that is effectively what we are doing: before every epoch, we randomly shuffle the full dataset and then walk through it taking “slices” of the mini-batch size. Note that the mini-batch size can be any value from 1 to m. Every sample appears exactly once per epoch, just in a random position.
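For concreteness, a minimal sketch of that shuffle-and-slice scheme (my own illustration, not the assignment's exact code; it assumes the course convention that X has shape (n_x, m) with examples as columns):

import numpy as np

def random_mini_batches(X, Y, mini_batch_size, rng):
    # Sketch: shuffle once per epoch, then slice the shuffled data into mini-batches
    m = X.shape[1]                       # number of examples (columns)
    permutation = rng.permutation(m)     # one fresh random permutation per epoch
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]
    mini_batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size    # the last slice may be smaller if m % mini_batch_size != 0
        mini_batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return mini_batches

Because the slices partition the permuted index range, every example lands in exactly one mini-batch per epoch.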

I have not explored all the random functions available in numpy. How would you implement your version, and why would it be better than what we are actually implementing here, taking into account that compute cost also matters?


I think David was suggesting a different approach from what we usually do, as Paul explained. Instead, David was thinking of sampling each mini-batch from the full dataset without replacement (within the mini-batch), independently of the other mini-batches.

I am assuming that David is well aware of Paul’s approach, which is the DLS approach and the widely accepted one, but he still wanted to look into this alternative.


So, David, your work shows that the model would never get to use about 36.8% of the examples in a given epoch. On the other hand, doing the random sampling is also slower.

PS: I edited your post to make the formula more readable.


Yes, that would be the idea. No permutation of the full dataset, just grab examples at random to build the next mini-batch.

Even the concept of an “epoch” would disappear: the dataset is just a set from which one samples mini-batches until the neural network satisfies some termination criterion.

But I’m not sure whether one would gain anything; it just seems simpler than arranging things so that every example appears exactly once per epoch.
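A minimal sketch of that idea, under the assumption that each mini-batch is drawn independently of the others (train_step and should_stop are hypothetical placeholders):

import numpy as np

rng = np.random.default_rng(0)

def sample_mini_batch(X, Y, N, rng):
    # One mini-batch: N distinct examples drawn from the full dataset.
    # Each draw is independent of every other mini-batch, so an example can
    # show up in several mini-batches, or in none at all.
    idx = rng.choice(X.shape[1], size=N, replace=False)
    return X[:, idx], Y[:, idx]

# No notion of an epoch: just keep sampling until a termination criterion is met.
# while not should_stop(model):
#     X_mb, Y_mb = sample_mini_batch(X, Y, N, rng)
#     train_step(model, X_mb, Y_mb)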

That’s okay, because it’s also completely wrong. :downcast_face_with_sweat:

Fixing…

I’m asking Grok to shepherd me along because ChatGPT doesn’t cut the mustard.

Edit: the formula is right after all. Nice.

Probabilities:

k     N = 64, s = 1024   N = 128, s = 512   N = 256, s = 256   N = 512, s = 128
0     0.367879441        0.367879441        0.367879441        0.367879441
1     0.367879441        0.367879441        0.367879441        0.367879441
2     0.183939721        0.183939721        0.183939721        0.183939721
3     0.061313240        0.061313240        0.061313240        0.061313240
4     0.015328310        0.015328310        0.015328310        0.015328310
5     0.003065662        0.003065662        0.003065662        0.003065662
6     0.000510944        0.000510944        0.000510944        0.000510944
7     0.000072992        0.000072992        0.000072992        0.000072992
8     0.000009124        0.000009124        0.000009124        0.000009124
9     0.000001014        0.000001014        0.000001014        0.000001014
10    0.000000101        0.000000101        0.000000101        0.000000101

Yes, the formula should be fine. However, the numbers in the table don’t look right, because all the columns are the same. Were they generated by an LLM?

To validate the formula, it’s also possible to code a simulation program ourselves, or with the assistance of an LLM.

It might look simpler in one way, but on the other hand, rephrasing what I said earlier, it also makes it harder to ensure that every example gets used in each epoch (or stage of the training). In Course 4, there will be a technique that uses some examples less often in the later stages of training, but those choices are informed, based on how well or badly the model is predicting each example. In my opinion, effectively making some examples less used at random may yield unexpected outcomes, which would have to be checked empirically.


Here is an example:

import tqdm
import numpy as np
import pandas as pd

M = 1024 * 64
N = 256
s = int(M / N)
epochs = 1000

rng = np.random.default_rng(1)
samples = np.arange(M)

occurrence_buffer = []
for _ in tqdm.tqdm(range(epochs), desc='Simulating...'):
    # Simulate the random sampling process: s mini-batches of size N,
    # each drawn without replacement from the full dataset
    choices = np.concatenate(
        [rng.choice(samples, size=N, replace=False, shuffle=False) for _ in range(s)]
    )
    # Count the number of occurrences of each example in this epoch
    occurrence_buffer.append(
        np.bincount(
            choices,
            minlength=M,
        )
    )

# With shape (epochs, M), `occurrence` tells us how many times
# each sample appeared in each epoch
occurrence = np.stack(occurrence_buffer)

# We then need the counts of examples by their number of occurrences
max_occurrence = occurrence.max()
res = np.stack(
    [
        # Count the number of examples by their number of occurrences
        np.bincount(
            occurrence[e],
            minlength=max_occurrence + 1,  # + 1 is needed to accommodate zero occurrences
        ) for e in range(epochs)
    ]
)  # takes the shape (epochs, max_occurrence + 1)

print(
    pd.Series(
        (res / M).mean(axis=0).round(4),  # / M gives us a probability; the mean is taken over epochs
        index=pd.Series(np.arange(max_occurrence + 1), name='k'),
    )
)

Results:


Were they generated by an LLM?

Yes, Grok generated them and it noted the weird fact that the columns are all the same, but this is because as you increase the size of the mini-batch, the number of mini-batches decreases in proportion, so the probability of finding any example k times in the set of mini-batches stays the same.

Grok is amazing, as it first worked on what I asked it in an incomplete fashion but found the (technically correct) solution too trivial (p being either 0 or 1), then went back, made an additional assumption, and reworked everything properly. I can’t imagine how one would even have attempted to do that with search+heuristics based GOFAI. :scream:

:thinking:

But your formula suggests those columns shouldn’t be exactly the same, because the change in N can’t exactly cancel out the change in s.
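For example, evaluating the formula for the four configurations (M = 65536 = N × s in every column) with the standard library should give slightly different columns; they are all close to the Poisson(1) values e^{-1}/k!, which is why they look so similar, but they are not identical:

from math import comb

M = 65536  # N * s is the same (65536) in all four columns
for N in (64, 128, 256, 512):
    s = M // N
    p = N / M
    # P(a given example appears in exactly k of the s mini-batches), for k = 0..3
    row = [comb(s, k) * p**k * (1 - p)**(s - k) for k in range(4)]
    print(f"N={N:4d}, s={s:5d}:", [round(v, 6) for v in row])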


On the other hand, I slightly modified my code and it also showed the columns were different:

Its flow of work seems amazing, but we still need to check its work and results :wink:


Indeed we do!

I have to go over this once more.


Finally we have the following (hmmm… I renamed “the number of mini-batches” s to L … oh well)

(I can’t upload LibreOffice spreadsheet … so here are screenshots)

The basic question: if you subdivide the batch into mini-batches with no remainder, using random selection without replacement:

A variation of the above with your numbers (larger M, various N)

The related question: if one doesn’t care about epochs but just keeps generating mini-batches with random selection without replacement, i.e. “number of mini-batches” × “mini-batch size” >> batch size. It’s the same formula (unsurprisingly).
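In other words, with L mini-batches drawn this way (whether or not L × N equals M), the count for a given example is still binomial, just with L in place of s:

p(k) = \binom{L}{k} \left(\frac{N}{M}\right)^k \left(1 - \frac{N}{M}\right)^{L-k}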

Spreadsheet available at: