Why is the learning rate still considered a hyperparameter?

A hyperparameter differs from a normal parameter because it remains untouched during the training process, and can be changed when running a different training process. So it should remain constant for the current process. Also, from Andrew's lectures I can see that the choice of LR influences the model's convergence.

Yet we use the ReduceLROnPlateau callback to update the LR when the model is not performing well.

To me this is the same as updating the weights when the model is not performing well, because in both cases "not performing well" means it has not fully converged, or still has some margin left to be optimized.

The Google Cloud documentation on hyperparameter tuning makes more sense to me:

"Hyperparameters are tuned by running your whole training job, looking at the aggregate accuracy, and adjusting."

Here the values are tuned after the first training job is finished.
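
To make that concrete, here is a minimal sketch of tuning in that sense, where each candidate LR gets its own complete training job (the data arrays X_train, y_train, X_val, y_val and the tiny model are placeholder assumptions, not anything specific from the docs):

import tensorflow as tf

results = {}
for lr in [0.1, 0.01, 0.001]:
    # each candidate LR gets a fresh model and a full training job
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss='mse')
    model.fit(X_train, y_train, epochs=20, verbose=0)
    # aggregate results are compared only after the job has finished
    results[lr] = model.evaluate(X_val, y_val, verbose=0)

best_lr = min(results, key=results.get)  # the LR with the lowest val loss

Within each job the LR never moves; the "tuning" happens between jobs.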


You are correct. You would not change the learning rate during a training session.


But what about this callback, then?

Sorry, I do not understand your question.

According to you, and what I have understood so far, the hyperparameters for the current training job remain constant, and they are changed for the next training job. This can be shown programmatically as below:

import tensorflow as tf

W, B = None, None  # initialize the weights and bias here

def get_model(W, B, lr):
    # define layers here, initialized with weights W and bias B
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss='mse')
    return model

Now, for training job #1:

model = get_model(W, B, 0.1)

and for training job #2:

model = get_model(W, B, 0.01)

As you can see, in the two cases only the learning rate is changed, from 0.1 to 0.01. This is what defines the model hyperparameters: for the first training job the LR will always be 0.1, and for the second training job the LR will always be 0.01. This makes sense to me. But consider the following, where we pass a ReduceLROnPlateau callback to the fit method, and it changes the learning rate during the training process:

from tensorflow.keras.callbacks import ReduceLROnPlateau

lr = 0.1
lr_callback = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10)
model = get_model(W, B, lr)
# validation data is needed so the callback can monitor val_loss
model.fit(X_train, y_train, validation_split=0.2, callbacks=[lr_callback])

In this case the initial value of lr may not equal the final value of lr when the execution of the .fit method finishes.
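
You can check this directly; a minimal sketch, assuming the built-in Keras optimizers (like Adam here) expose their learning_rate variable, which ReduceLROnPlateau updates in place:

# after .fit returns, the optimizer's learning rate reflects any
# reductions the callback applied along the way
final_lr = float(model.optimizer.learning_rate)
print(f"initial lr: {lr}, final lr: {final_lr}")  # e.g. 0.1 vs. 0.01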

Note: here by "training job" I mean iterating through all the examples in the training set X_train for all the epochs.

Where does this appear in the notebook? You have not said what part of the course you’re working on.

It is not from the course; I should have changed the category to General Discussion, my bad.

It is now updated

1 Like

Note that I don’t personally consider the learning rate to be a hyperparameter of the model. To me it’s just part of the mechanism for finding the minimum cost using the gradient descent method. If you use some other optimizers, they don’t even have a learning rate that you can access.

The hyperparameters of interest would be more closely related to the implementation of the model, including any additional features that are added for more complexity, or to avoid overfitting.


Well, I found another person who thinks like I do; nice to meet you @TMosh. But rather than a mechanism for finding the minimum, I consider it a scaling factor for the gradient descent step. It also sometimes helps in escaping local minima.
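
In update-rule terms, that scaling role is easy to see; a minimal 1-D sketch (the cost function and values here are just illustrative):

# minimize J(w) = w**2, whose gradient is 2*w
lr, w = 0.1, 5.0
for _ in range(100):
    grad = 2 * w
    w = w - lr * grad  # lr only scales the step size, not its direction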

For linear regression and logistic regression, the cost functions are convex, so there are no local minima.

Exactly, but I was talking generally. In more complex problems the cost function is not always bowl-shaped.

Yes, I would consider the number of layers, the units per layer, and the batch size to fit the definition of a hyperparameter.
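
As a sketch of that distinction (all names and values here are illustrative), these hyperparameters configure the model and the training loop rather than the optimization step itself:

import tensorflow as tf

def build_model(num_layers=2, units=64):
    # num_layers and units shape the architecture itself
    layers = [tf.keras.layers.Dense(units, activation='relu')
              for _ in range(num_layers)]
    model = tf.keras.Sequential(layers + [tf.keras.layers.Dense(1)])
    model.compile(optimizer='adam', loss='mse')
    return model

model = build_model(num_layers=3, units=128)
model.fit(X_train, y_train, batch_size=32)  # batch size: fixed per training job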

Since this is an introductory course, it doesn't go very deeply into the details of cost functions with local minima.

Hello,

The definition of a hyperparameter is not super precise. Methods like the ReduceLROnPlateau callback that update the LR try to fix a well-known problem: vanilla gradient descent is not a very "optimal" algorithm. By "optimal" I mean that it is inefficient in real-case scenarios and often very slow to reach the minimum. For example, as you showed in the pictures, the convergence depends critically on the value of alpha.
In particular, methods like the ReduceLROnPlateau callback want to solve the issue of running a complete, long training procedure and finding out at the end that our LR was not good, and that for half of the steps of gradient descent the algorithm basically did nothing. Then we would have to restart the training with a new LR.
So here, instead, we try to find a "smart" way to decide when the LR should change, and change it automatically during the training.

Sorry for the long introduction. Going back to your question.
Again, the definition of hypers is not super precise, but in my opinion there are two ways to look at a procedure like that. A training procedure with something like ReduceLROnPlateau is equivalent to having many trainings one after the other, each of length "patience", during which the LR maintains a constant value and is a hyperparameter in your definition. If you instead look at the whole training process, the hypers are the inputs of ReduceLROnPlateau, e.g. factor, patience, and the starting and ending LR, which at the end of the day are also the values that you want to choose optimally to have a "fast" training.
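
To illustrate the first reading, here is a rough manual equivalent, reusing the model and data from the earlier posts (a simplification of what the callback actually does: the real ReduceLROnPlateau tracks the best monitored value epoch by epoch and waits patience epochs without improvement before reducing):

lr, factor, patience = 0.1, 0.1, 10
best_val_loss = float('inf')
for segment in range(5):  # several short "training jobs", back to back
    model.optimizer.learning_rate.assign(lr)  # LR is constant within a segment
    history = model.fit(X_train, y_train, validation_split=0.2,
                        epochs=patience, verbose=0)
    val_loss = min(history.history['val_loss'])
    if val_loss >= best_val_loss:  # plateau: no improvement in this segment
        lr *= factor
    best_val_loss = min(best_val_loss, val_loss)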

I hope it was useful and not too pedantic.

Wow, thanks for providing more information.

So the callback is basically used to save time, by efficiently decreasing the value of the LR based on the configuration provided in the constructor and the training metadata from the fit method.

This is a smart hack :smile: