Increasing loss with decreasing learning rate

It feels strange when the training loss starts to increase while the learning rate is decreasing. I am training a CNN model on medical images. I have applied an exponential decay learning rate scheduler that decreases the learning rate by 20% after every half epoch. The training set consists of 20 thousand images and I am using the EfficientNet-B4 architecture. After roughly 1.5 epochs, the training loss starts to increase even though the learning rate is decreasing. Why is this so? Increasing loss means the model is moving away from a local/global optimum, but how can it move away if the learning rate is decreasing? Is this because of some gradient-related problem, such as exploding or vanishing gradients?
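
For concreteness, a schedule like the one described might be set up along the following lines. This is only a sketch assuming Keras’s ExponentialDecay; the 5e-4 starting rate and the batch size of 32 are assumptions, not values taken from the actual code.

import tensorflow as tf

# ~20,000 images at a batch size of 32 gives ~625 steps per epoch,
# so "every half epoch" is roughly 312 steps (illustrative numbers).
steps_per_half_epoch = (20_000 // 32) // 2

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=5e-4,       # assumed starting value
    decay_steps=steps_per_half_epoch,
    decay_rate=0.8,                   # "decrease by 20%" at each decay step
    staircase=True,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)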

Hmm,

Just a few questions: Why are you decreasing the learning rate mid-epoch (or are you only halfway through a set training run)?

Also, a decay of 20% each time sounds extremely high. Perhaps you are defining that figure in another way.
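
To put a number on it, a 20% decay every half epoch compounds quickly. Illustrative arithmetic only; the 5e-4 starting value is an assumption:

initial_lr = 5e-4                     # assumed starting value
for epochs in range(6):
    decays = 2 * epochs               # two decays per epoch (every half epoch)
    print(f"after {epochs} epochs: lr = {initial_lr * 0.8 ** decays:.2e}")
# After 5 epochs (10 decays) the rate is 0.8**10, about 11% of the original.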

I mean, if your loss then still starts increasing greatly, it could be that you’ve overshot the target.

Yet there is also something more interesting to me, which I, at least, haven’t heard discussed here formally yet:

What if your network has ‘nothing to learn’? I am not saying that is what is happening in your particular case, but in the end these tools are not magic wands that produce signal when ‘there is no signal’.

Honestly, I am not sure yet what the indication of that would be, i.e. how you could tell.


Why?

Aside from the weird learning rate decay (as TMosh mentioned), there could be some other reasons why your training loss is increasing.

My first guess would be regularization (in this case, it could be a dropout hyperparameter you’re including). Depending on how your mini-batches are configured, you might also see an increasing training loss, but the overall trend should still be downwards (or stay roughly the same).
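
As a toy illustration of the dropout point, here is a small self-contained Keras example (nothing to do with the actual medical-imaging model): the loss reported during training is computed with dropout active, so it can sit above, and fluctuate more than, the loss on the same data in inference mode.

import numpy as np
import tensorflow as tf

# Toy data and a model with heavy dropout, just to show the effect.
x = np.random.rand(1024, 16).astype("float32")
y = (x.sum(axis=1) > 8).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

history = model.fit(x, y, epochs=3, batch_size=32, verbose=0)
print("training-mode loss (dropout on): ", history.history["loss"][-1])
print("inference-mode loss (dropout off):", model.evaluate(x, y, verbose=0))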

I imagine exploding and/or vanishing gradients could contribute, but it wouldn’t be my first guess.


I don’t follow the logic here. The learning rate is a non-negative multiplicative constant. A change (e.g. a decrease) in its value doesn’t determine the direction of the step; it only influences its magnitude. Also, there is no learning rate memory in the Adam computation. By that I mean, it doesn’t look at the current learning rate, compare it to the previous learning rate, and then behave differently depending on whether the rate is increasing or decreasing. It just uses the current value in the multiplication. The direction of the step is determined solely by the gradient / partial derivatives.

Maybe one of the more math-literate community members can correct this if I got it wrong.


Not that I claim mathematical literacy; I just have a lot of lumps from accumulated mistakes.

The downfall of a too-large learning rate is that the magnitude of the updates could cause the solution to oscillate around the minimum, or even diverge and increase to infinite cost.
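
A toy illustration of that overshoot/divergence behaviour, using plain gradient descent on a one-dimensional quadratic (not the actual model or optimizer):

# Gradient descent on f(w) = w**2, whose gradient is 2w.
# Each step multiplies w by (1 - 2*lr), so the iterates converge,
# oscillate, or diverge depending on the learning rate.
def run(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print("lr=0.1:", run(0.1))   # shrinks monotonically toward the minimum
print("lr=0.9:", run(0.9))   # oscillates around the minimum but still converges
print("lr=1.1:", run(1.1))   # overshoots more on every step and diverges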

Increasing loss in batch training is a sign of trouble, in either your method, your model, or the hyperparameters you’ve selected.

There can be small variations in cost around the minimum for some training methods, such as stochastic gradient descent.


That’s a sign of trouble. Not just strangeness.


Thanks for the questions you’ve put up. Let me briefly explain the scenario. The actual training set consists of around 500 thousand samples. I trained the model using the Adam optimizer and an initial learning rate of 5e-4. I also used the ReduceLROnPlateau callback to reduce the learning rate if the validation loss doesn’t decrease for 3 epochs. The result was that the validation loss did not decrease after the 2nd epoch. A similar trend was observed in 2 other models trained for different tasks.

I figured that this might be due to too high a learning rate in the initial epochs, or it might be due to a lack of regularization. I took out a sample of 20 thousand images for iteration purposes. I tried using exponential decay, which decays the learning rate in a staircase fashion after every epoch. The result was almost the same. Analysing the progression of the loss within an epoch suggested that there might be a need to alter the learning rate midway through an epoch, so that is what I tried.
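
For reference, that original setup was roughly along the following lines. This is a sketch only (Keras assumed); the ReduceLROnPlateau factor shown here is illustrative, since only the patience of 3 epochs was stated.

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)

# Reduce the learning rate when the validation loss hasn't improved for 3 epochs.
plateau_cb = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3   # factor is illustrative
)

# model.fit(train_ds, validation_data=val_ds, epochs=..., callbacks=[plateau_cb])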

@Harshit1097 I am still in the midst of learning this myself (around CNNs), so I’m probably not the best one to ask, but have you looked into DLS C4W2’s discussion of ‘ResNets’?

At least superficially, it sounds as if using a ResNet instead might apply to your problem.

For those interested in getting into the weeds, you can read the code of the Adam update_step method on GitHub. This is the line I was thinking about when I wrote above that the learning rate is a non-negative multiplier.


alpha = lr * ops.sqrt(1 - beta_2_power) / (1 - beta_1_power)

There is no sign for the learning rate; to the best of my knowledge there is no ‘unlearn’ concept in ML. And there is no retained learning-rate state or memory, so the computation doesn’t depend on any change, rate of change, or direction of change: just the current value. Which is why my intuition is that, just from knowing the learning rate has decreased, one cannot assume the gradient step or the loss will also decrease. Too lazy to follow it all the way through the code to prove or disprove, sorry.
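
For anyone who doesn’t want to dig through the Keras source, here is a minimal NumPy sketch of a single Adam step in its standard textbook form (not the exact Keras implementation): the learning rate appears only as a non-negative scale on the step, and nothing in the update looks at previous learning rates.

import numpy as np

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)   # lr only scales the step
    return w - step, m, v

w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
w, m, v = adam_step(w, grad=np.array([0.5]), m=m, v=v, t=1, lr=1e-3)
print(w)   # the direction of the step comes from the gradient, not from lr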

Um… it sounds like there are a couple of things that might be wrong:

  1. Are you talking about training loss or validation loss? In your first post, you say that your training loss was increasing, and in your second post, you say that you’re monitoring the validation loss.
  2. Is the training/validation loss not decreasing, or is it increasing? There’s a big difference here. If it’s not decreasing (stays the same), you might have hit a minimum. If it’s increasing, then there’s an issue with the model.

Perhaps you can post a table or graph of the training loss and the validation loss you are seeing? That would help with the investigation.

Either way, I don’t think changing the decay rate midway through an epoch will help the problem. I don’t see any major problem with applying the decay at the end of epochs instead of midway through.
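
For what it’s worth, applying the decay only at epoch boundaries is straightforward with a per-epoch schedule. A sketch assuming Keras’s LearningRateScheduler callback; the 20% figure is just an example:

import tensorflow as tf

def end_of_epoch_decay(epoch, lr):
    # Keep the initial rate for epoch 0, then decay by 20% at each epoch boundary.
    return lr if epoch == 0 else lr * 0.8

scheduler_cb = tf.keras.callbacks.LearningRateScheduler(end_of_epoch_decay, verbose=1)
# model.fit(..., callbacks=[scheduler_cb])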

The training dataset consists of around 100 thousand data points. The batch size is 32. I’ve saved the training loss and validation loss after every 50th batch. The resulting graph is as follows:

[Graph: training loss (blue) and validation loss (orange), recorded every 50th batch over 5 epochs]

Training happened for 5 epochs, so the graph above shows the variation of the loss over 5 epochs. The blue line is the training loss saved every 50th batch, and the orange one is the validation loss.
I’ve used a learning rate scheduler that decreases the learning rate by 50% after every 2 epochs, so the variation of the learning rate over these 5 epochs is as follows:

[Graph: learning rate over the 5 epochs, halved after every 2 epochs]

What I said earlier was that the training loss increases after a certain point in time, which is what I’m confused about.
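
For reference, a sketch of that run (Keras assumed; the names and the 5e-4 starting rate are illustrative): the learning rate is halved every 2 epochs and the training loss is recorded every 50th batch.

import tensorflow as tf

def halve_every_two_epochs(epoch, lr, initial_lr=5e-4):
    # Assumed starting value of 5e-4, halved after every 2 epochs.
    return initial_lr * 0.5 ** (epoch // 2)

class BatchLossLogger(tf.keras.callbacks.Callback):
    """Record the running training loss every `every` batches."""
    def __init__(self, every=50):
        super().__init__()
        self.every, self.losses = every, []

    def on_train_batch_end(self, batch, logs=None):
        if batch % self.every == 0:
            self.losses.append(logs["loss"])

# model.fit(..., callbacks=[tf.keras.callbacks.LearningRateScheduler(halve_every_two_epochs),
#                           BatchLossLogger()])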

Most likely, the statistics of your dataset are not uniform, and you’re getting some new examples, starting around the 200th recorded point on your plot, that don’t fit the model that was learned over those first 200 points.

This sort of thing is very common when using batch processing.


Thanks for the insight. Just one more thing to add: I did not see this kind of behaviour of the training loss when I did not use learning rate scheduling. In fact, during the past 5-6 months, I have trained several models starting with an initial learning rate of 5e-4 and using ReduceLROnPlateau to decrease the learning rate if the validation loss isn’t improving for 3 epochs. The dataset used for training all these models is nearly the same, but I never witnessed an increasing training loss. I suspect that there is some issue related to using a learning rate schedule which I’m unable to figure out.

I’m not entirely sure why you would decrease the learning rate. Doesn’t that just slow down the convergence rate?

I’m inclined to think there may be an error in your implementation of this model and training method.