Why don't we use a vector of initial learning rates in Adam optimization instead of a single one?

Class: Additional Neural Network Concepts > Advanced Optimization

Hey everyone,
In Adam optimization we use only one value for the initial learning rate, whereas, as discussed in the video, we have alpha values alpha[1] through alpha[11]. Shouldn't we use 11 values for the optimization?

There is no method for adjusting multiple independent learning rates at the same time.

I mean, why didn't they implement it that way, taking an array instead of a single value?

Because it would be very complicated, and is not necessary.

If you normalize the features, then one learning rate applies equally to everything, and normalization also makes the minimization more efficient.
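For instance, here is a minimal sketch (using a made-up feature matrix) of the z-score normalization being referred to; once every feature is on a comparable scale, the gradients for all the weights are too, so one scalar learning rate can serve them all:

```python
import numpy as np

# Hypothetical feature matrix: column 0 is in the thousands, column 1 is tiny.
X = np.array([[2104.0, 0.5],
              [1416.0, 0.7],
              [1534.0, 0.2]])

# Z-score normalization: each column ends up with mean 0 and std 1,
# so the gradients for all the weights live on a similar scale and a
# single learning rate can step them all reasonably.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm)
```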

Also, to understand why the learning rate can't be an array, one needs to know that the learning rate argument is a float value, a constant float tensor, or a callable that takes no arguments and returns the actual value to use.

So passing an array will cause an error.
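As a rough sketch (exact type checking can differ between TensorFlow versions), these are the forms the learning_rate argument is documented to accept, and why a per-parameter array doesn't fit:

```python
import tensorflow as tf

# A plain float -- the usual case.
opt_a = tf.keras.optimizers.Adam(learning_rate=0.001)

# A callable that takes no arguments and returns the value to use.
opt_b = tf.keras.optimizers.Adam(learning_rate=lambda: 0.001)

# A learning-rate schedule object is also accepted.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=1000, decay_rate=0.96)
opt_c = tf.keras.optimizers.Adam(learning_rate=schedule)

# Something like learning_rate=[0.001] * 11 is not one of these forms,
# so there is no supported way to pass Adam one rate per parameter
# through this argument.
```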

But in the lecture, the teacher said that:

It uses a different learning rate for every single parameter of your model.

Please give the lecture title and time mark.

The lecture is slightly misleading.

Here is how TensorFlow implements Adam optimization (see below). Notice that the learning rate is a scalar value. This is the only learning rate that you have access to.
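For reference, here is a sketch of a typical constructor call with the documented defaults (these may shift slightly between TensorFlow releases); learning_rate, beta_1, beta_2, and epsilon are each a single scalar:

```python
import tensorflow as tf

# learning_rate, beta_1, beta_2, and epsilon are all single scalar values;
# there is no argument that takes one learning rate per parameter.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,
)
```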

Adam adapts the effective step size for each weight individually, based on first- and second-order moment estimates of its gradients. These are controlled by the beta_1 and beta_2 factors (also scalars). But you (as the designer) cannot set a separate learning rate for each feature.
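To make that concrete, here is a minimal NumPy sketch of the Adam update in its standard textbook form (not TensorFlow's actual code): alpha is one scalar, but because the moment estimates m and v are per-parameter arrays, the effective step alpha * m_hat / (sqrt(v_hat) + eps) differs from parameter to parameter.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    """One Adam update in its standard textbook form (not TensorFlow's code).

    alpha is a single scalar, but m and v are per-parameter arrays, so the
    effective step alpha * m_hat / (sqrt(v_hat) + eps) is per-parameter.
    """
    m = beta_1 * m + (1 - beta_1) * grad        # 1st-moment (mean) estimate
    v = beta_2 * v + (1 - beta_2) * grad ** 2   # 2nd-moment (uncentered) estimate
    m_hat = m / (1 - beta_1 ** t)               # bias correction
    v_hat = v / (1 - beta_2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy example: parameter 0 gets a noisy (sign-flipping) gradient, parameter 1
# a steady one. Their effective steps diverge even though alpha is one scalar.
w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
grads = [np.array([ 1.0, 1.0]),
         np.array([-1.0, 1.0]),
         np.array([ 1.0, 1.0]),
         np.array([-1.0, 1.0])]
for t, g in enumerate(grads, start=1):
    w, m, v = adam_step(w, g, m, v, t)
print(w)  # parameter 1 has moved noticeably further than parameter 0
```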

I may have some of the details wrong; the implementation of an optimizer is complex and not my area of expertise.

Thanks, this explains a lot!