In the final assignment of “Improving DNNs/Optimization Algorithms”, the “mini-batch gradient descent” part goes through some effort to generate mini-batches in such a way that every example from the original batch appears in exactly one of the mini-batches processed during an epoch.
Why not just build a mini-batch by sampling at random without replacement from the original batch?
I do not feel we will miss much.
If we have:
Total number of examples M
Size of a mini-batch N
Assuming N divides M exactly, number of mini-batches s = M/N
Then the probability p(\text{an example appears in } k \text{ mini-batches}) is given by the binomial distribution with s trials and per-trial probability N/M = 1/s:

p(k) = \binom{s}{k}\left(\frac{1}{s}\right)^{k}\left(1-\frac{1}{s}\right)^{s-k}
ChatGPT did a quick tabulation (Babbage, FRS, RAS, would freak out!), showing that about 1/3 of the examples would be dropped, 1/3 would appear exactly once, and the remainder would appear twice or more.
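As a cross-check on that claim, here is a small sketch that tabulates p(k) directly from the binomial formula above for a few mini-batch sizes; the total M and the list of sizes N are just assumed values for illustration.

import math

M = 1024 * 64  # total number of examples (assumed value for illustration)

def p_k(k, s):
    # Binomial probability that a given example appears in exactly k of the s mini-batches
    return math.comb(s, k) * (1 / s) ** k * (1 - 1 / s) ** (s - k)

for N in (64, 256, 1024):  # a few mini-batch sizes (assumed values)
    s = M // N  # number of mini-batches per epoch
    row = [round(p_k(k, s), 4) for k in range(5)]
    print(f"N={N:5d}  s={s:5d}  p(k=0..4) = {row}")

For sizes like these, the rows come out very close to one another, which matches the roughly 1/3, 1/3, remainder split quoted above.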
Yes, that is effectively what we are doing: before every epoch, we randomly shuffle the full dataset and then walk through it, taking “slices” of the mini-batch size. Note that the mini-batch size can be any value from 1 to m. Every sample will appear exactly once per epoch, in a random position.
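For concreteness, a minimal numpy sketch of that shuffle-and-slice idea (the names X, Y, and mini_batch_size are placeholders, and the layout with examples along the columns is an assumption, not taken from the assignment code):

import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    # Sketch: shuffle the full dataset once, then slice it into mini-batches.
    # Assumes examples are stored along the columns of X and Y (shape (features, m)).
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)      # random order of the m examples
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size     # the last slice may be smaller than mini_batch_size
        mini_batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return mini_batches

In an actual training loop one would typically vary the seed (or reuse one generator) so that the shuffle differs from epoch to epoch.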
I have not explored all the random functions available in numpy. How would you implement your version, and why would it be better than what we are actually implementing here, taking into account that compute cost also matters?
I think David was suggesting a different approach from what we usually do, as Paul explained. David instead was thinking of drawing each mini-batch from the full dataset by sampling without replacement, independently for each mini-batch.
I am assuming that David is well aware of Paul’s approach, which is the DLS approach and the widely accepted one, but he still wanted to look into his alternative.
Yes, that would be the idea. No permutation of the full dataset, just grab examples at random to build the next mini-batch.
Even the concept of “epoch” would disappear: the set of examples is just a pool from which one samples the mini-batches until the neural network satisfies some termination criterion.
But I’m not sure whether one would gain anything; it just seems simpler than arranging things so that every example appears exactly once per epoch.
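As a sketch of what that could look like in numpy (the function name, the layout with examples along the columns, and the stopping criterion in the usage comment are all assumptions for illustration):

import numpy as np

def minibatch_stream(X, Y, mini_batch_size=64, seed=0):
    # Sketch: endlessly yield mini-batches, each sampled without replacement
    # from the full dataset, with no notion of an epoch.
    rng = np.random.default_rng(seed)
    m = X.shape[1]                                    # examples along the columns, as above
    while True:
        idx = rng.choice(m, size=mini_batch_size, replace=False)
        yield X[:, idx], Y[:, idx]

# Usage sketch: pull mini-batches until some termination criterion is satisfied.
# for X_mb, Y_mb in minibatch_stream(X, Y):
#     cost = train_step(X_mb, Y_mb)                   # hypothetical training step
#     if cost < 1e-3:                                 # hypothetical stopping criterion
#         break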
Yes, the formula should be fine. However, the numbers in the table don’t look right, because all the columns look to be the same. Were they generated by an LLM?
To validate the formula, it’s also possible to code a simulation program ourselves, or with the assistance of an LLM.
It might look simpler in one way, but on the other hand, rephrasing what I said earlier, it also makes it more difficult to ensure every example gets used in each epoch (or stage of the training). In Course 4, there will be a technique which makes some examples less used at the later stages of the training, but those choices are informed, based on how well or badly the model is predicting the examples. In my opinion, effectively making some examples less used at random may yield unexpected outcomes, which remains to be seen in empirical studies.
import tqdm
import numpy as np
import pandas as pd

M = 1024 * 64        # total number of examples
N = 256              # mini-batch size
s = int(M / N)       # number of mini-batches per epoch
epochs = 1000
rng = np.random.default_rng(1)
samples = np.arange(M)
occurrence_buffer = []
for _ in tqdm.tqdm(range(epochs), desc='Simulating...'):
    # Simulate the random sampling process: s mini-batches, each drawn
    # without replacement from the full set of M examples
    choices = np.concatenate(
        [rng.choice(samples, size=N, replace=False, shuffle=False) for _ in range(s)]
    )
    # Count the number of occurrences of each example in this epoch
    occurrence_buffer.append(
        np.bincount(
            choices,
            minlength=M,
        )
    )
# Taking the shape of (epochs, M), `occurrence` tells us the number of occurrences
# of each sample in each epoch
occurrence = np.stack(occurrence_buffer)
# We then need the counts grouped by the number of occurrences
max_occurrence = occurrence.max()
res = np.stack(
    [
        # Count the number of examples by their number of occurrences
        np.bincount(
            occurrence[e],
            minlength=max_occurrence + 1,  # + 1 is needed to accommodate zero occurrences
        ) for e in range(epochs)
    ]
)  # takes the shape of (epochs, max_occurrence + 1)
print(
    pd.Series(
        (res / M).mean(axis=0).round(4),  # / M gives us probabilities; the mean is taken over epochs
        index=pd.Series(np.arange(max_occurrence + 1), name='k'),
    )
)
Yes, Grok generated them, and it noted the weird fact that the columns are all the same. But this is because, as you increase the size of the mini-batch, the number of mini-batches decreases in proportion, so the probability of finding any example k times in the set of mini-batches stays essentially the same.
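To spell out why, using the notation from earlier in the thread: with s = M/N mini-batches of size N, the per-mini-batch inclusion probability is 1/s, so

p(k) = \binom{s}{k}\left(\frac{1}{s}\right)^{k}\left(1-\frac{1}{s}\right)^{s-k} \;\approx\; \frac{e^{-1}}{k!} \quad \text{for large } s,

which no longer depends on N. In particular p(0) \approx p(1) \approx e^{-1} \approx 0.368 and p(k \ge 2) \approx 0.264, matching the roughly 1/3, 1/3, remainder split mentioned earlier.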
Grok is amazing, as it first worked on what I asked it in an incomplete fashion but found the (technically correct) solution too trivial (p being either 0 or 1), then went back, made an additional assumption, and reworked everything properly. I can’t imagine how one would even have attempted to do that with search+heuristics based GOFAI.
A related question is what happens if one doesn’t care about epochs but just keeps on generating mini-batches by random selection without replacement, i.e. “number of mini-batches” × “mini-batch size” >> batch size. It’s the same formula (unsurprisingly).
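Concretely, writing T for the total number of mini-batches drawn (a symbol introduced here just for this note), the same binomial form applies with T in place of s:

p(\text{an example appears in } k \text{ mini-batches}) = \binom{T}{k}\left(\frac{N}{M}\right)^{k}\left(1-\frac{N}{M}\right)^{T-k}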