No overfitting issue without dropout layers on "Predicting Sunspots with Neural Networks"

Hi,

I have been wondering how this model was able to get around the overfitting problem without any regularization code (dropout layers). Can tuning other hyperparameters, like batch size or window size, overcome the issue?
If needed, where would be the best place to put dropout layers in the code below?

[snippet removed by mentor]

Hey @Justin,

Well, there are multiple options that can overcome the overfitting problem (high variance). Regularization is one of the most common, but don’t forget the other options, for example:

  • More data can overcome the overfitting problem.

  • Neural network (NN) architecture search: experimenting with the architecture until you find the one that best fits your case is another option.

Now, coming to the second part of your question, “Best place to put dropout layers?”
The answer: the best place to put a dropout layer in a neural network architecture is typically after a fully connected layer or a convolutional layer.

Remember that I am describing the general best practice, and it doesn’t mean it will be the best option for your case. Build your model as quickly as possible and start to fine-tune later.

So you need to experiment with different dropout rates and positions until you find the best combination for your case; see the sketch below.
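As a minimal illustration (the layer sizes, input_shape, and the 0.2 rate are placeholder choices of mine, not values from the lab), a dropout layer goes right after the layer whose activations you want to regularize:

import tensorflow as tf

# A minimal sketch: Dropout placed directly after a dense layer.
# The 64 units, input_shape=[30], and rate 0.2 are illustrative assumptions.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=[30]),
    tf.keras.layers.Dropout(0.2),  # randomly zeroes ~20% of activations, during training only
    tf.keras.layers.Dense(1),
])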

I hope it’s clear now and feel free to ask for more clarifications.
Regards,
Jamal

@Justin For more clarification on your question, “Can batch size overcome the issue?”

Well, batch size can affect overfitting. In general, smaller batch sizes tend to be less prone to overfitting, while larger batch sizes tend to be more prone to it.

This is because smaller batch sizes make the model learn more slowly, which gives it more time to explore the data and find a good fit that generalizes well to new data. Larger batch sizes make the model learn more quickly, but this can lead to the model memorizing the training data too well, which can cause it to overfit.

But you need to consider the other factors I mentioned, as well as the learning rate and the number of epochs, so the relationship between batch size and overfitting is not always straightforward.

We are back to the same point: it needs experimentation.

Neural network hyperparameters include the number of hidden layers, neurons per hidden layer, learning rate, and batch size. Hyperparameter tuning methods include grid search, random search, and optimization. As this analysis is a time-series analysis of sunspot prediction, window_size is set to 30 points (equal to 2.5 years) but can be changed later on if you want to experiment.

If you look at the window dataset function, each window of window_size points is flattened and then shuffled, and the result is further divided into batches of batch_size.
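For reference, the windowed_dataset helper from the lab looks roughly like this (reproduced from memory, so treat the details as approximate):

import tensorflow as tf

def windowed_dataset(series, window_size, batch_size, shuffle_buffer):
    # Slice the series into overlapping windows of window_size + 1 points
    dataset = tf.data.Dataset.from_tensor_slices(series)
    dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
    # Flatten each window sub-dataset into a single tensor
    dataset = dataset.flat_map(lambda w: w.batch(window_size + 1))
    # Shuffle, split each window into (features, label), and batch
    dataset = dataset.shuffle(shuffle_buffer)
    dataset = dataset.map(lambda w: (w[:-1], w[-1]))
    return dataset.batch(batch_size).prefetch(1)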

Dropout, which you can also tune via grid search (and apply inside an LSTM layer through its dropout argument), helps prevent overfitting by ignoring randomly selected neurons during training, and hence reduces the sensitivity to the specific weights of individual neurons.

The Lambda layer allows us to perform arbitrary operations to effectively expand the functionality of TensorFlow’s Keras, and we can do this within the model definition itself. The first Lambda layer is used to help us with dimensionality.

If you recall, when we wrote the window dataset helper function, it returned two-dimensional batches of windows on the data, with the first dimension being the batch size and the second the number of timestamps. But an RNN expects three dimensions: batch size, the number of timestamps, and the series dimensionality. With a Lambda layer, we can fix this without rewriting the window dataset helper function; we just expand the array by one dimension.

Similarly, if we scale up the outputs by 400, we can help training. The default activation function in the RNN layers is tanh, the hyperbolic tangent activation, which outputs values between negative one and one. Since the time series values are on a much larger scale, scaling the outputs up to the same ballpark can help with learning. We can do that in a Lambda layer too; we simply multiply by a multiple of 100, depending on the dataset. This again interacts with the learning rate as well as with the time-series training of the model; a sketch of the resulting model follows.
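Putting that together, the lab’s model looks roughly like this (the RNN layer sizes are from memory and may differ from the actual notebook):

import tensorflow as tf

model = tf.keras.models.Sequential([
    # Expand (batch, timestamps) to (batch, timestamps, 1) for the RNN
    tf.keras.layers.Lambda(lambda x: tf.expand_dims(x, axis=-1),
                           input_shape=[None]),
    tf.keras.layers.SimpleRNN(40, return_sequences=True),
    tf.keras.layers.SimpleRNN(40),
    tf.keras.layers.Dense(1),
    # Scale the tanh-range outputs up to the ballpark of the sunspot values
    tf.keras.layers.Lambda(lambda x: x * 400.0),
])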

Here again, in this lab the optimizer used is SGD, a variant of the Gradient Descent algorithm used for optimizing machine learning models. It addresses the computational inefficiency of traditional Gradient Descent when dealing with large datasets.

Momentum of 0.9:
If the momentum hyperparameter is set too close to 1 (e.g., 0.99999) when using an SGD optimizer, the algorithm will likely pick up a lot of speed, hopefully moving roughly toward the global minimum, but its momentum may carry it right past the minimum.
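For concreteness, the optimizer setup looks like this in Keras (the Huber loss matches what comes up later in this thread; the exact learning rate is whatever the scheduler sweep suggests, and `model` is the network sketched above):

import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-8, momentum=0.9)
model.compile(loss=tf.keras.losses.Huber(), optimizer=optimizer, metrics=["mae"])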

As for why dropout is not used in this model: the window batches have already been flattened and shuffled in the window dataset function, and for the aim of this model, predicting sunspots, a dropout layer is not required again in the model, since the data has already been divided into batches of batch_size by the window dataset function, and the last Lambda layer, with its multiple-of-100 factor, creates uniform scaling.

Regards
DP

Thanks for the detailed answer, Jamal022. Appreciate it! :grinning:

Thanks, Deepti_Prasad. I appreciate your answer. While I am still digesting it, I have another question.
In “Predicting Sunspots with NN”, the last Lambda layer is as below:

tf.keras.layers.Lambda(lambda x:x*400)

But why multiply by 400 instead of another figure?
Is it because the input data (the max value of sunspots) is around 400?

I am building my own model using exchange-rate data. The input series moves between 1000 and 1500. Should I then multiply by 1500?

Note that a Lambda scale_layer does not directly track its scale variable; it will not appear in scale_layer.trainable_weights and will therefore not be trained if scale_layer is used in a Model.

I think one needs to understand that the main reason for using 400 in the last Lambda layer is scaling the output to help with the learning rate.

If you noticed this:

Set the learning rate scheduler

lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10**(epoch / 20))

this is where the scaling in the last Lambda layer will have an effect. So based on this, one needs to decide what scaling to use in the last Lambda layer, which again depends on your dataset.
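The scheduler is then passed to training as a callback, roughly like this (a sketch; `train_set` is the windowed dataset from the helper above, and 100 epochs is just the length of the learning-rate sweep):

history = model.fit(train_set, epochs=100, callbacks=[lr_schedule])
# Afterwards, plot the loss per epoch against the learning rate per epoch
# and pick a rate in the region where the loss is still decreasing smoothly.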

The instructors scaled this based on getting the desired MSE and loss output, which you can correlate in the graphs.

In the same lab, it is also mentioned that the instructor eventually got the desired output with 500 epochs, which was again dependent on the learning rate.

So you could start with a multiple in the 100s and keep adjusting according to what trains your model best.

Thanks Deepti_Prasad,

So every time I adjust the Lambda multiple, do I need to go through lr_schedule = tf.keras.callbacks.LearningRateScheduler

and pick a desirable learning rate?

After starting around the 100s, is there any rule of thumb for picking the Lambda multiple?
If the volatility of the time series data is big, is it better to go large?

No, you don’t need to re-tune the learning rate every time you change the Lambda scale_layer.

Hi Deepti_Prasad,

I have replicated the “Predicting Sunspots with Neural Networks” code with Korean exchange-rate time series data. (I forgot to amend the X and Y axis titles; please disregard them. These are daily Korean exchange-rate closing values since 2008.)

I amended the Lambda multiple to 100 and picked a desirable learning rate after running

tf.keras.callbacks.LearningRateScheduler

learning_rate = 5.6234e-07

After fitting the model with 300 epochs, I got the result below.

The last training epoch is as below:

Epoch 300/300
93/93 [==============================] - 3s 32ms/step - loss: 49.3195 - mae: 49.8155

But the MAE on the train/validation comparison is much bigger:
29/29 [==============================] - 1s 16ms/step
98.448715

Also, the prediction part shows a very strange result (whatever x_valid data I input, the forecast is almost the same rate).
Can you see what issue I am having?

Can I send you the Python code and source CSV file so you can take a look?
The Python code is barely changed (except for the data-reading part, learning rate, and Lambda multiple).
I have been working on this for several days but couldn’t figure out why the results are wrong.

Hello Justin,

Aren’t you using too high a learning rate? That would cause larger weight updates and affect the performance of the model (the loss on the training dataset over each epoch).

Your MAE and loss curves still don’t look similar.

Can I know your dataset details,
like the batch size?
Keep the last Lambda layer at 100 and reduce the learning rate, then run the model again and see if there is a smaller difference between MAE and loss.

I suppose the dataset size and the splitting of the dataset need to be looked at, along with the other pointers mentioned before.

Based on your graph, it looks like your model is not training uniformly. This can happen if your model architecture has too much variation in features, or due to other factors…

Yes, you can send the code via personal DM, but I am in the middle of some personal tasks, so it might take some time to review your model. I don’t want to simply hand out a solution; I like studying the matter I work on and then giving a review.

Regards
DP

I picked my learning rate based on the results of running lr_schedule = tf.keras.callbacks.LearningRateScheduler, as below.

Epoch 32/100
93/93 [==============================] - 3s 26ms/step - loss: 926.4095 - lr: 3.5481e-07
Epoch 33/100
93/93 [==============================] - 2s 26ms/step - loss: 839.1348 - lr: 3.9811e-07
Epoch 34/100
93/93 [==============================] - 2s 25ms/step - loss: 670.5697 - lr: 4.4668e-07
Epoch 35/100
93/93 [==============================] - 3s 26ms/step - loss: 288.0587 - lr: 5.0119e-07
Epoch 36/100
93/93 [==============================] - 2s 25ms/step - loss: 49.4215 - lr: 5.6234e-07
Epoch 37/100
93/93 [==============================] - 2s 25ms/step - loss: 49.2459 - lr: 6.3096e-07
Epoch 38/100
93/93 [==============================] - 3s 25ms/step - loss: 49.5009 - lr: 7.0795e-07

I kept the same dataset details as the example.

Parameters

window_size = 30
batch_size = 32
shuffle_buffer_size = 1000

I hadn’t noticed that the gap between my MAE and the Huber loss was so big.

Let me send you the code and CSV file. Please take your time.
Any advice will be greatly appreciated.

Your dataset seems to be comparatively smaller than the sunspots one, so why are you using a Lambda scale_layer?

Even if you do want to use one because your model is a time-series analysis, use a smaller scale. Also, since the dataset is about a currency exchange rate and not seasonality-based data, would you even require the Lambda layer? It only changes the dimensionality of your windows, as in, say, a time-and-temperature analysis. Are you trying to do a currency-and-time analysis with this dataset?

I am sorry, I am just trying to understand your dataset. Perhaps you could scale the exchange rate based on financial-year parameters. Sorry, I do not have an understanding of the Korean exchange rate, so I need more information about why you would use the Lambda layer for this dataset!

Regards
DP

I was able to resolve my issue by normalizing the input data.
Thanks for the help, Deepti_Prasad.
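In essence, it was along these lines (a simplified sketch; `series` holds the raw exchange-rate values, and the actual code has more detail):

import numpy as np

# Min-max normalize the series before windowing and training
series_min, series_max = np.min(series), np.max(series)
normalized = (series - series_min) / (series_max - series_min)
# ... train on `normalized`, then invert the scaling on the forecasts:
forecast_rate = forecast * (series_max - series_min) + series_min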

Happy to help!!

I would surely like to know all the changes you made :slight_smile:

Keep Learning!

Regards
DP

Hi Jamal022,

I have a question about the last Lambda layer, which multiplies by 400.
Where does this 400 come from? Does the size of this figure affect overfitting or underfitting for the model, say, a bigger number being more prone to overfitting, etc.?

Hey @Justin,

You need to know that overfitting and underfitting in a neural network are generally related to the complexity of the model and the amount of available data.

So the value of 400 in the context of a neural network’s last lambda layer does not inherently relate to overfitting or underfitting directly. Instead, it is typically a design choice made during the development of the neural network architecture, and its specific purpose depends on the task and network design.

It is similar to putting 10 units in the softmax layer for a classification problem when you have 10 classes; it is something tied to your case and the design of your neural network.

So the value 400 is related to your case. I have tried to answer your question generally, but if you need a specific answer as to why exactly 400, provide me with more information about your case so that I can understand it better.

Cheers!
Jamal

Hi @Justin,

Absolutely, you’re on the right track. Let me elaborate further.

In the context of the neural network you’re referring to, the lambda layer with a multiplication factor of 400 is used for scaling the output. The lambda function in Python is a concise way to define simple functions, and in this neural network, it’s used to adjust the model’s output.

The value 400 is chosen based on domain knowledge and the specific problem being addressed. It relates to the maximum of the Monthly Mean Total Sunspot Number observed over the observation period. By scaling the output with this factor, the model’s predictions are aligned with the typically observed values in the dataset.
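To make that concrete, one way to see where the factor comes from is to derive it from the data rather than hard-coding it (a sketch, not the lab’s actual code; `series` is assumed to hold the sunspot values):

import numpy as np
import tensorflow as tf

# The maximum of the Monthly Mean Total Sunspot Number is roughly 400,
# so taking the series maximum recovers the same output scale.
scale = float(np.max(series))
scale_layer = tf.keras.layers.Lambda(lambda x: x * scale)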

Now, regarding your point on scaling aiding in model fitting: Yes, proper scaling can indeed help in better convergence during training. When the outputs are scaled appropriately, it can ensure that the gradients during backpropagation are neither too small nor too large, leading to more stable and faster convergence. This is especially true when using activation functions that are sensitive to the scale of input values, as it can prevent the values from reaching saturated regions of the activation function where gradients are near zero.

However, it’s essential to note that while scaling can aid in convergence and potentially lead to a better fit, it doesn’t directly influence overfitting or underfitting. Those are more related to the model’s complexity, the amount of training data, and how well the model generalizes to unseen data. But ensuring that the model’s output and, in general, the input features are in a meaningful and appropriate range can certainly help in achieving better training dynamics.

I hope this aids your understanding and much success with your further learning!

Best regards,
Steffen


The International Sunspot Number, R_I, is the key indicator of solar activity. This is not because everyone agrees that it is the best indicator, but rather because of the length of the available record. Traditionally, sunspot numbers are given as daily numbers, monthly averages, yearly averages, and smoothed numbers. The standard smoothing is a 13-month running mean centered on the month in question, using half-weights for the months at the start and end. Solar cycle maxima and minima are usually given in terms of these smoothed numbers.

So this centered-month mean was considered for the time series, and if you take that span in days over 13 months, it comes to roughly 400; hence that scaling was used for the last Lambda layer, though it didn’t have much effect on the accuracy of the training model, as both of the mentors have already mentioned to you.

Regards
DP

Thanks a lot for the clarification, Steffen Cologne. It solves my question :slight_smile: