Seasonality, window, and batch size

The sunspot dataset includes a seasonal cycle of about 11 years. At one point, Laurence experimented with extending the window size to cover a full cycle for training (a window size of 132 time steps, i.e. roughly 11 years of monthly data). This seemed quite intuitive but ultimately hurt training performance. In conjunction, Laurence also experimented with different batch sizes.

While I generally understand the technical meaning of window and batch size, their conceptual meaning (in this context) is not fully clear to me. What does changing the batch size mean for the learning process (apart from efficiency)?

It seems to affect learning effectiveness as well. I would assume a longer batch/window implies a longer series from which the algorithm can learn. Could you explain a bit more about the meaning of window and batch size and their difference in this context? In which situations would a larger window and/or a larger batch size be indicated?

Besides, how can setting the window to 132 months hurt performance? Intuitively, I would have thought that training is slowed down but learning is improved, since for each prediction the model can now take into account and learn from all the information of the prior 132 months, and the LSTM can make its own smart selections about which information to keep or throw away using its memory state. I must have a flawed understanding in some aspect and would be grateful to learn more :pray:

window_size refers to the input shape of the model. Depending on the data, it is your decision whether to capture the full seasonality (which is often done) or to use smaller windows.

Batch size refers to the number of rows the model processes in one shot. This is the same concept as in the first lab of course 1, so please retain that definition. In TensorFlow, the default batch size is 32.
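To make the distinction concrete, here is a minimal sketch of a course-style windowed dataset pipeline (the exact helper in the notebooks may differ slightly): `window_size` determines the length of each input example, while `batch_size` only controls how many of those examples are grouped together for one gradient update.

```python
import tensorflow as tf

def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    # Each example is `window_size` consecutive values plus 1 value as the label.
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size + 1))
    ds = ds.shuffle(shuffle_buffer)
    ds = ds.map(lambda w: (w[:-1], w[-1]))   # (inputs, label)
    # Only here does batch_size appear: it groups examples per gradient update.
    return ds.batch(batch_size).prefetch(1)
```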

When batch size == len(training data), we perform batch gradient descent. Gradients are smooth, so it’s okay to use a higher learning rate. The downside is that the entire dataset needs to be held in memory and we need a full pass over the dataset before each update to the model parameters.

When batch size == 1, we do stochastic gradient descent. Gradients are noisy, so we pick a smaller learning rate. The upside is that the parameters are updated often and the memory / CPU requirements are the lowest.

When 1 < batch size < len(training data), it’s called mini-batch gradient descent. This approach combines the best of both worlds and, with a good choice of learning rate, almost always converges faster than the other two.
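To see how the three regimes differ in practice, here is a tiny illustration (the example count is made up) of how the batch size sets the number of parameter updates per epoch:

```python
import math

n_examples = 3000  # hypothetical number of windowed training examples

for batch_size in (1, 32, 256, n_examples):
    updates = math.ceil(n_examples / batch_size)
    print(f"batch_size={batch_size:>5} -> {updates:>5} parameter updates per epoch")

# batch_size = 1 is stochastic GD (many noisy updates),
# batch_size = n_examples is batch GD (one smooth update per epoch),
# anything in between is mini-batch GD.
```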

There’s one more hyperparameter to keep in mind: the number of steps into the future you want to forecast (aka the horizon). In the course notebooks, we predict 1 step into the future. Depending on your needs, it’s okay to design a model that predicts multiple steps into the future, i.e. horizon > 1.
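For horizon > 1, the label side of the pipeline and the output layer both grow to `horizon` values. A hedged sketch (the function name is mine, and the course notebooks do it differently since they predict a single step):

```python
import tensorflow as tf

def windowed_dataset_multi_step(series, window_size, horizon, batch_size):
    # Each example: `window_size` inputs followed by `horizon` labels.
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(window_size + horizon, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size + horizon))
    ds = ds.map(lambda w: (w[:window_size], w[window_size:]))
    return ds.batch(batch_size).prefetch(1)

# The model's final layer then needs `horizon` units, e.g.:
# tf.keras.layers.Dense(horizon)
```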

Once the assignment is complete, why don’t you try using a bigger window size of 132?


I think changing the batch size affects not only the learning process but also performance and the time taken to train our model, so it matters for cost efficiency as well.

The reason Laurence chose 11 years was not only intuitive but also physical: the entire Sun, from North Pole to South Pole, is a giant magnet, but not a simple one. The Sun’s magnetic fields are on the move, so approximately every 11 years the entire field flips and the north and south magnetic poles switch. Another 11 years and the poles switch back again. This is basically the reason for choosing 11 years.

Happy Learning!!!
Regards
DP


@balaji.ambresh: Thank you so much for your detailed explanations!! I have also tried using a larger window size (as shown below).
@Deepti_Prasad: That makes a lot of sense, thank you! What an interesting case to work on and experiment with!

My take-aways and remaining questions:

Batch and window sizes are not related! Each training example covers the specified window, and each batch contains a number of such examples. A larger batch size just means processing more training examples at once. Despite this, in the course I have often seen them increased together. Is there a reason for that?

While batching is a method to speed up learning, with (mini-batch) stochastic gradient descent it also affects learning performance by determining how many examples are processed before each parameter update, and hence how often the gradient direction can change. Besides, the batch size determines how much information (how many training examples) is taken into account for each update. In all, setting the batch size neither too low nor too high is an important optimization objective.

Remaining question: why isn’t a larger window size always better? The only two explanations that come to my mind are overfitting, and the fact that the larger the window, the fewer training examples, since later values cannot form a full window; this may become especially problematic when testing with smaller datasets.

Here are the results of my training on the sunspot dataset:
window_size 30, batch_size 32 => MAE 16.9
window_size 60, batch_size 64 => MAE 14.34
window_size 132, batch_size 100 => MAE 14.07

Batch sizes for the assignments are probably picked after some experimentation by the staff. Unless the staff ask you not to change the batch size, consider playing around with powers of 2.

The default learning rate provided by frameworks like TensorFlow assumes a batch size of 32.

There’s nothing wrong with picking a larger batch size. In fact, a few frameworks help with picking the largest batch size your GPU can process.
One algorithm that’s often used is to try batch sizes in increasing powers of 2 and run a dummy forward pass through the model until an out-of-memory error occurs or the entire dataset is consumed. The batch size just before the breaking point (or the size of the entire dataset) is the largest batch size your GPU supports.
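A rough sketch of that doubling search (the helper name, input-shape handling, and error type are assumptions of mine; real framework utilities are more careful about restoring model state afterwards):

```python
import tensorflow as tf

def find_max_batch_size(model, sample_input, n_examples, start=32):
    """Double the batch size until OOM or the whole dataset fits (sketch)."""
    batch_size, largest_ok = start, None
    while batch_size <= n_examples:
        try:
            # Dummy forward pass with a batch of the candidate size.
            dummy = tf.repeat(sample_input[tf.newaxis, ...], batch_size, axis=0)
            model(dummy, training=False)
            largest_ok = batch_size
            batch_size *= 2
        except tf.errors.ResourceExhaustedError:
            break  # out of GPU memory: keep the last size that worked
    return largest_ok
```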

Once you find the largest batch size, don’t forget to adjust the learning rate to speed up model convergence (Adam is a widely used optimizer).
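For example, one common heuristic (an assumption here, not something from the course) is to scale the learning rate roughly linearly with the batch size relative to the framework default:

```python
import tensorflow as tf

base_lr, base_batch = 1e-3, 32   # Keras' default Adam learning rate and batch size
batch_size = 256                 # hypothetical larger batch size found above

# Scale the learning rate in proportion to the larger batch size.
optimizer = tf.keras.optimizers.Adam(learning_rate=base_lr * batch_size / base_batch)
```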

Picking a good window size requires domain knowledge / a hyperparameter search. Here are 2 more things to consider (see the toy sketch after the list):

  1. Drop windows that don’t perfectly match the window size (i.e. drop_remainder=True).
  2. Overlapping windows.
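A toy illustration of both points (toy numbers, not the sunspot series):

```python
import tensorflow as tf

series = tf.range(10)  # toy series: 0..9

# shift=1 makes the windows overlap; drop_remainder=True discards the
# trailing windows that would be shorter than 5 values.
ds = tf.data.Dataset.from_tensor_slices(series)
ds = ds.window(5, shift=1, drop_remainder=True)
for window in ds:
    print([int(x) for x in window])
# [0, 1, 2, 3, 4], [1, 2, 3, 4, 5], ..., [5, 6, 7, 8, 9]  (nothing shorter than 5)
```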