DLS, C1_W4. 'Parameters vs Hyperparameters' lecture

Hello, I have a question about hyperparameters.

In the ‘Parameters vs Hyperparameters’ lecture, mini-batch size is mentioned among the hyperparameters that impact the parameters (w, b).

So, if the total number of training examples is 6000 and we take mini-batches of size 500, will that result in different weight updates compared to using a mini-batch size of 600?

What is the mechanism of this impact?

I suppose that new training examples bring new information, which results in different Losses and hence different weight updates.

However, all examples in the dataset, whether they appear in an early mini-batch or a late one, will be fed to the neural network.

Does the size of mini-batches impact the speed of neural network learning?

Thanks.


Look at the compute_cost function in the Course 1 Week 3 Assignment 1 notebook and note that the cost is the average of logprobs over the examples in the batch. Hope this sheds light on the relationship between weight updates and mini-batch size.
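To make that concrete, here is a minimal sketch (my own variable names, not the notebook's exact code) of a cross-entropy cost averaged over however many examples the mini-batch contains; the division by m is why a 500-example batch and a 600-example batch generally produce different costs and therefore different gradients:

```python
import numpy as np

def compute_cost(A, Y):
    """Cross-entropy cost averaged over the m examples in the (mini-)batch.

    A -- sigmoid activations of the output layer, shape (1, m)
    Y -- true labels, shape (1, m)
    """
    m = Y.shape[1]
    logprobs = Y * np.log(A) + (1 - Y) * np.log(1 - A)
    return -np.sum(logprobs) / m  # dividing by m is what makes batch size matter

# Illustration: the same predictions averaged over batches of different sizes
np.random.seed(0)
A = np.random.uniform(0.05, 0.95, (1, 600))
Y = (np.random.rand(1, 600) > 0.5).astype(float)

print(compute_cost(A[:, :500], Y[:, :500]))  # cost over a 500-example mini-batch
print(compute_cost(A, Y))                    # cost over a 600-example mini-batch
```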

Course 2 covers details about mini-batch gradient descent and tips to pick mini-batch size.

Here’s a mention of gradient accumulation (this method isn’t covered in the specialization) that’ll come in handy when you have limited resources.
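In case it helps, here is a rough sketch of the idea using a toy logistic regression (my own code for illustration, not from the course or from that link): gradients are computed on small micro-batches, accumulated with weights proportional to their sizes, and only then applied as a single parameter update.

```python
import numpy as np

def grads_logistic(w, b, X, Y):
    """Gradients of the averaged cross-entropy cost for logistic regression."""
    m = X.shape[1]
    A = 1 / (1 + np.exp(-(w.T @ X + b)))   # predictions, shape (1, m)
    dw = X @ (A - Y).T / m
    db = np.sum(A - Y) / m
    return dw, db

def accumulate_step(w, b, X, Y, micro_batch=100, lr=0.01):
    """One update from an effective batch of X.shape[1] examples,
    processed micro_batch examples at a time (gradient accumulation)."""
    m = X.shape[1]
    dw_acc, db_acc = np.zeros_like(w), 0.0
    for start in range(0, m, micro_batch):
        Xb, Yb = X[:, start:start + micro_batch], Y[:, start:start + micro_batch]
        dw, db = grads_logistic(w, b, Xb, Yb)
        # weight each micro-batch by its size so the sum equals the full-batch average
        dw_acc += dw * Xb.shape[1] / m
        db_acc += db * Xb.shape[1] / m
    return w - lr * dw_acc, b - lr * db_acc
```

The result of accumulate_step is the same update you would get from one step on the full batch passed in, while only one micro-batch worth of activations needs to be held in memory at a time.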


Hello @VeronikaS,

I think the following line is very insightful, because the loss surface changes at each mini-batch, which contains a different set of data. In other words, throughout one epoch we are optimizing the neural network against as many different loss surfaces as there are mini-batches.

As for the impact of mini-batch size, I am sure you will want to research how people discuss the difference between stochastic gradient descent, mini-batch GD, and batch GD, because essentially the only difference between them is the batch size. On the other hand, as you go through papers for SOTA models, such as OpenAI’s GPT, the authors sometimes discuss their choice of batch size; their experience may give you a new perspective or reinforce your understanding too.
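To make that last point concrete, the three variants can be written as the same training loop where only batch_size changes (a minimal sketch; update_fn here is just a placeholder that I assume computes gradients on the given examples and updates w and b):

```python
def train_epoch(update_fn, num_examples, batch_size):
    """One epoch of gradient descent, written so that only batch_size
    distinguishes the three classic variants:
      batch_size == 1               -> stochastic gradient descent
      1 < batch_size < num_examples -> mini-batch gradient descent
      batch_size == num_examples    -> (full) batch gradient descent
    """
    for start in range(0, num_examples, batch_size):
        update_fn(range(start, min(start + batch_size, num_examples)))

# e.g. with 6000 examples:
# train_epoch(update_fn, 6000, 1)     -> 6000 updates per epoch (SGD)
# train_epoch(update_fn, 6000, 500)   -> 12 updates per epoch
# train_epoch(update_fn, 6000, 6000)  -> 1 update per epoch (batch GD)
```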

Cheers,
Raymond


@balaji.ambresh, @rmwkwok, thank you for your guidance and answers.

@balaji.ambresh mentioned that the size of the mini-batch impacts the parameters because the Cost is computed as an average of Losses. Therefore, using different mini-batch sizes for each iteration will result in different Costs and consequently different updates of parameters (w, b), affecting the efficiency of model training.

The example referenced by @balaji.ambresh (gradient accumulation), along with some machine learning blogs, helped me understand that the mini-batch size depends on computational constraints, particularly the memory needed to store the caches for all the training examples in a batch.

‘Mini-batch sizes … are often tuned to an aspect of the computational architecture on which the implementation is being executed. Such as a power of two that fits the memory requirements of the GPU or CPU hardware like 32, 64, 128, 256, and so on… A good default for batch size might be 32’ (https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/).

or:

‘More Hardware (Larger Batches), Less Hardware (Smaller Batches)’ (https://arxiv.org/pdf/1812.06162.pdf).

On the other hand, the batch size depends on the dataset domain (batch sizes for ImageNet and for RL can differ a lot - https://arxiv.org/pdf/1812.06162.pdf).

@rmwkwok mentioned that there are discussions about determining the most suitable mini-batch size for state-of-the-art (SOTA) models.

On the OpenAI forum, I came across the following information:

‘By default, the batch size will be dynamically configured to be ~0.2% of the number of examples in the training set, capped at 256 - in general’ ( Why is the default batch size set to 1 for fine-tuning the ChatGPT Turbo model? - API - OpenAI Developer Forum).
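If I read that rule literally, it is easy to compute (a toy calculation based only on the quote above, not on anything in the OpenAI API docs):

```python
def default_batch_size(num_train_examples, fraction=0.002, cap=256):
    """Batch size per the quoted rule: ~0.2% of the training set, capped at 256."""
    return min(cap, max(1, round(fraction * num_train_examples)))

print(default_batch_size(6000))    # 12
print(default_batch_size(200000))  # 256 (hits the cap)
```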

Returning to my question:

If I understood correctly, the mechanism by which mini-batches impact the parameters is mainly related to the number of examples in each mini-batch.

So, this raises one more question:

Will the organization of the sequence of mini-batches, in other words not the quantitative but the qualitative organization, affect the parameters and therefore the efficiency of model training?
In that case, the mini-batch impact mechanism could be described as a gradually increasing noise level: simpler training examples with less noise at the beginning, followed by more complex training examples.
Or, conversely, will mixed batches be more effective, because they shape the cost function surface better?

Can you recommend some papers on this topic?

Hello @VeronikaS,

I have not tried this strategy myself, so I cannot comment on it. However, if you draw your mini-batches at random, there is a higher chance that each mini-batch follows a similar data distribution to the whole dataset, and if your whole dataset follows a similar data distribution to the population (note that your whole dataset is still just a subset of the population), then each of your mini-batches is more likely to follow the data distribution of the population. This property might be an advantage for you because, as I said, the cost surface is co-constructed by the training data, which is the current mini-batch. If your mini-batch is, to some degree, like the population, then the cost surface so constructed would be more relevant to the problem you are trying to solve?
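For what it's worth, drawing mini-batches at random is usually implemented by reshuffling the example indices once per epoch and then slicing, roughly like this (a minimal numpy sketch of mine; Course 2 builds a similar helper):

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the columns (examples) of X and Y, then cut them into mini-batches.

    X -- input data, shape (n_features, m)
    Y -- labels, shape (1, m)
    Returns a list of (mini_batch_X, mini_batch_Y) tuples.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)            # fresh shuffle each epoch
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]

    mini_batches = []
    for start in range(0, m, batch_size):
        mini_batches.append((X_shuffled[:, start:start + batch_size],
                             Y_shuffled[:, start:start + batch_size]))
    return mini_batches
```

Because each epoch reshuffles, every mini-batch tends to look like a small random sample of the whole training set, which is the property I described above.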

Cheers,
Raymond


Have you seen this?
