Why is the learning rate still considered a hyperparameter?

A hyperparameter differs from a normal parameter in that it remains untouched during the training process and is only changed between training runs, so it should stay constant for the current run. Also, from Andrew's lectures I can see that the choice of LR influences the model's convergence.

But then we use the ReduceLROnPlateau callback to update the LR when the model is not performing well.

To me that is the same as updating the weights when the model is not performing well, because in both cases "not performing well" means it has not fully converged, or still has some margin left to be optimized.
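For context, the weight update I have in mind is the ordinary gradient descent step, where the LR simply scales the gradient. A toy one-parameter sketch (the quadratic cost and its gradient are made up purely for illustration):

```python
# Toy gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# The weight w changes on every step, while lr is normally held fixed
# for the whole run.
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)   # gradient of the toy cost at the current w
    w -= lr * grad       # the update: lr scales the gradient
print(round(w, 4))       # converges to the minimum at w = 3
```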

The Google Cloud documentation on hyperparameter tuning makes more sense to me:

Hyperparameters are tuned by running your whole training job, looking at the aggregate accuracy, and adjusting

Here the values are tuned after the first training job has finished.
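That workflow can be sketched as a plain sweep over candidate LRs, where each candidate gets its own complete training job. Here train_and_evaluate is a hypothetical stand-in for a full job that returns the aggregate accuracy:

```python
# Hypothetical stand-in for one complete training job: train with a fixed
# LR from start to finish and return the aggregate validation accuracy.
def train_and_evaluate(lr):
    # Toy score that happens to peak at lr = 0.01, for illustration only.
    return 1.0 - abs(lr - 0.01) * 10

# The LR is constant inside each job; only the outer sweep changes it.
candidate_lrs = [0.1, 0.01, 0.001]
results = {lr: train_and_evaluate(lr) for lr in candidate_lrs}
best_lr = max(results, key=results.get)
print(best_lr)  # 0.01
```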


You are correct. You would not change the learning rate during a training session.


But what about this callback then?

Sorry, I do not understand your question.

According to you, and what I have understood so far, the hyperparameters for the current training job remain constant and are only changed for the next training job. This can be shown programmatically as below:

W, B = # initialize the weights and bias here

def get_model(W, B, lr):
    # define the layers with weights W and bias B, then build the model
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), ...)
    return model

Now for training job #1:

model = get_model(W, B, 0.1)

and for training job #2:

model = get_model(W, B, 0.01)

As you can see, between the two jobs only the learning rate changes, from 0.1 to 0.01. This is what defines the model hyperparameters: for the first training job the LR will always be 0.1, and for the second training job the LR will always be 0.01. That makes sense to me. But then there is the following, where we pass the ReduceLROnPlateau callback to the fit method and it changes the learning rate during the training process:

from tensorflow.keras.callbacks import ReduceLROnPlateau

lr = 0.1
lr_callback = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10)
model = get_model(W, B, lr)
model.fit(X_train, y_train, callbacks=[lr_callback])

In this case the initial value of lr will not be equal to its final value when the execution of the .fit method finishes.

Note: by "training job" I mean iterating through all the examples in the training set X_train for all the epochs.

Where does this appear in the notebook? You have not said what part of the course you’re working on.

It is not from the course; I should have changed the category to General Discussion, my bad.

It is now updated


Note that I don’t personally consider the learning rate to be a hyperparameter of the model. To me it’s just part of the mechanism for finding the minimum cost using the gradient descent method. If you use some other optimizers, they don’t even have a learning rate that you can access.

The hyperparameters of interest would be more closely related to the implementation of the model, including any additional features that are added for more complexity, or to avoid overfitting.


Well, I found another person who thinks like I do; nice to meet you @TMosh. But instead of a mechanism for finding the minimum, I consider it a scaling factor for the gradient descent update. It also sometimes helps in escaping local minima.

For linear regression and logistic regression, the cost functions are convex, so there are no local minima.

Exactly, but I was talking generally: in more complex problems the cost function is not always bowl-shaped.

Yes, I would think the layers, units, and batch size fit the definition of a hyperparameter.

Since this is an introductory course, it doesn’t very much get into the details of cost functions with local minima.


Given that the definition of a hyperparameter is not precisely pinned down, methods like the ReduceLROnPlateau callback that update the LR try to fix a well-known problem: vanilla gradient descent is not a very "optimal" algorithm. By "optimal" I mean that it is inefficient in real-world scenarios and often very slow to reach the minimum. For example, as you showed in the pictures, convergence depends critically on the value of alpha.
In particular, a method like the ReduceLROnPlateau callback aims to solve the problem of running a complete, long training procedure only to find out at the end that our LR was not good, and that for half of the gradient descent steps the algorithm basically did nothing. Then we would have to restart the training with a new LR.
So here, instead, we try to find a "smart" way to detect when the LR should change, and change it automatically during training.
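The idea can be illustrated with a minimal pure-Python sketch of the reduce-on-plateau rule (not Keras's actual implementation): whenever the monitored loss has not improved for patience epochs in a row, multiply the LR by factor and keep training.

```python
# Minimal sketch of the reduce-on-plateau rule (illustration only):
# if val_loss has not improved for `patience` epochs in a row,
# shrink the learning rate by `factor` and continue training.
def schedule_lr(val_losses, lr=0.1, factor=0.1, patience=2):
    best = float("inf")
    wait = 0
    lr_per_epoch = []
    for loss in val_losses:
        if loss < best:       # improvement: reset the counter
            best = loss
            wait = 0
        else:                 # plateau epoch
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        lr_per_epoch.append(lr)  # record the LR used for this epoch
    return lr_per_epoch

# val_loss improves twice, then plateaus: the LR drops after two stale epochs
print(schedule_lr([1.0, 0.8, 0.8, 0.8, 0.8]))
```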

Sorry for the long introduction. Going back to your question.
Again, the definition of hyperparameters is not precisely pinned down, but in my opinion there are two ways to look at a procedure like this. A training run with something like ReduceLROnPlateau is equivalent to many consecutive trainings of length "patience", in each of which the LR holds a constant value, so it is a hyperparameter in your definition. If instead you look at the whole training process, the hyperparameters are the inputs of ReduceLROnPlateau, e.g. factor, patience, and the starting and ending LR, which at the end of the day are also the values that you want to choose optimally to get a "fast" training.

I hope it was useful and not too pedantic.

Wow, thanks for providing more information.

So the callback is basically used to save time by efficiently decreasing the value of the LR, based on the configuration provided in the constructor and the training metadata from the fit method.

That is a smart hack :smile: