@balaji.ambresh, @rmwkwok, thank you for your guidance and your answers.
@balaji.ambresh mentioned that the size of the mini-batch impacts the parameters because the Cost is computed as an average of Losses. Therefore, using different mini-batch sizes for each iteration will result in different Costs and consequently different updates of parameters (w, b), affecting the efficiency of model training.
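Just to make that mechanism explicit for myself, here is a minimal sketch (assuming a simple logistic-regression-style setup; the names `w`, `b`, `X_batch`, `y_batch` are my own placeholders) of how the batch size m enters the parameter update through the averaged cost:

```python
import numpy as np

def minibatch_update(w, b, X_batch, y_batch, lr=0.01):
    """One gradient step where the cost is the average of the
    per-example losses over the mini-batch."""
    m = X_batch.shape[0]              # mini-batch size
    z = X_batch @ w + b
    a = 1.0 / (1.0 + np.exp(-z))      # sigmoid activation
    dz = a - y_batch                  # dLoss/dz for sigmoid + cross-entropy
    dw = (X_batch.T @ dz) / m         # gradient averaged over the m examples
    db = dz.mean()                    # so m directly shapes the update
    return w - lr * dw, b - lr * db
```

With this in place, the same data split into batches of 32 versus 256 gives different (noisier versus smoother) gradient estimates per step, which is exactly the effect on (w, b) described above.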
The example referenced by @balaji.ambresh (gradient accumulation; a minimal sketch of that pattern is included below, after the quotes), along with some machine learning blogs, helped me understand that the mini-batch size depends on computational constraints, particularly the memory needed to store the caches for all the training examples in the batch.
‘Mini-batch sizes … are often tuned to an aspect of the computational architecture on which the implementation is being executed. Such as a power of two that fits the memory requirements of the GPU or CPU hardware like 32, 64, 128, 256, and so on… A good default for batch size might be 32’ (https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/).
or:
‘More Hardware (Larger Batches), Less Hardware (Smaller Batches)’ (https://arxiv.org/pdf/1812.06162.pdf).
On the other hand, the batch size also depends on the dataset domain (suitable batch sizes for ImageNet and for RL can differ a lot - https://arxiv.org/pdf/1812.06162.pdf).
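As for the gradient-accumulation pattern mentioned above, this is how I understand it (a PyTorch-style sketch; the tiny model, loss, data loader, and `accum_steps` value are all made up for illustration):

```python
import torch
from torch import nn

# Made-up tiny setup, only to illustrate the accumulation pattern
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(32)]  # 32 micro-batches of 4 examples

accum_steps = 8                                   # 8 micro-batches emulate one batch of 32 examples
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps     # scale so the summed gradients match the large-batch average
    loss.backward()                               # gradients accumulate in the parameters' .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                          # one parameter update per emulated large batch
        optimizer.zero_grad()
```

So the memory needed per backward pass is only that of the micro-batch, while the update behaves approximately like one computed on the full accumulated batch.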
@rmwkwok mentioned that there are discussions about determining the most suitable mini-batch size for state-of-the-art (SOTA) models.
On the OpenAI forum, I came across the following information:
‘By default, the batch size will be dynamically configured to be ~0.2% of the number of examples in the training set, capped at 256 - in general’ (from the thread ‘Why is the default batch size set to 1 for fine-tuning the ChatGPT Turbo model?’ in the API category of the OpenAI Developer Forum).
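If I read that correctly, the rule works out to roughly the following (my own interpretation of the quoted sentence, not OpenAI's actual code; `default_batch_size` is a hypothetical helper):

```python
def default_batch_size(n_train_examples: int) -> int:
    """Approximation of the quoted rule: ~0.2% of the training set, capped at 256."""
    return min(256, max(1, round(0.002 * n_train_examples)))

print(default_batch_size(10_000))    # -> 20
print(default_batch_size(500_000))   # -> 256 (hits the cap)
```

For example, 10,000 training examples would give a batch size of about 20, and anything above roughly 128,000 examples would hit the 256 cap.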
Returning to my question:
If I understood correctly, the mechanism by which mini-batches affect the parameters is mainly related to the number of examples in each mini-batch.
So, this raises one more question:
Will the organization of the sequence of mini-batches (their qualitative rather than quantitative organization) affect the parameters and therefore the efficiency of model training?
In that case, could the impact mechanism be described as a gradually increasing noise level: simpler, less noisy training examples at the beginning, followed by more complex ones (a curriculum-style ordering)?
Or, conversely, will mixed (shuffled) batches be more effective, because they shape the cost surface better?
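To make the contrast concrete, this is the kind of difference I have in mind (a toy sketch; the per-example `difficulty` scores are assumed to come from some heuristic or a pretrained model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size = 1024, 32
difficulty = rng.random(n)          # assumed per-example "difficulty" scores

# (a) Curriculum-style ordering: easy (low-noise) examples first, harder ones later
curriculum_order = np.argsort(difficulty)
curriculum_batches = np.array_split(curriculum_order, n // batch_size)

# (b) Mixed batches: reshuffle every epoch so each batch mixes difficulties
shuffled_order = rng.permutation(n)
shuffled_batches = np.array_split(shuffled_order, n // batch_size)
```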
Can you recommend some papers on this topic?