Hey everyone,
In the Adam optimization we use only one value for the initial learning rate, whereas in the video we discussed alpha[1-11] values. Shouldn't we use 11 values for the optimization?

Also, to understand why the learning rate can't be an array, one needs to know that the learning rate is a float value, a constant float tensor, or a callable that takes no arguments and returns the actual value to use.
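To illustrate the "callable that takes no arguments" form: a minimal sketch of what such a schedule could look like in plain Python (the `make_schedule` helper, the decay factor, and the closure-held step counter are all illustrative, not part of any real API). The point is that even a schedule hands the optimizer a single scalar at each step:

```python
# Hypothetical no-argument learning-rate schedule. An optimizer would
# call schedule() once per step to fetch the current scalar rate.
def make_schedule(initial_lr=0.001, decay=0.96):
    state = {"step": 0}  # step counter held in a closure

    def schedule():
        lr = initial_lr * decay ** state["step"]
        state["step"] += 1
        return lr  # always one scalar, never an array

    return schedule

schedule = make_schedule()
first = schedule()   # 0.001 at step 0
second = schedule()  # 0.001 * 0.96 at step 1
```

Note that the return value is still one float per call: the schedule varies the rate over time, not across weights.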

Here is how TensorFlow implements Adam optimization (see below). Notice that the learning rate is a scalar value, and it is the only learning rate you have access to.

The Adam optimization allows each feature weight to be updated at an individual effective rate, based on the 1st- and 2nd-order moment estimates. These are controlled by the beta_1 and beta_2 factors (also scalars). But you (as the designer) cannot set a separate learning rate for each feature.
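The effect described above can be sketched with a bare-bones Adam step in NumPy (this is the standard Adam update rule, not TensorFlow's actual source; the function name and example values are illustrative). Even though `lr` is one scalar, the per-weight moment estimates `m` and `v` make the effective step size differ per weight:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    """One Adam update. lr is a single scalar, but the effective step
    per weight differs because m and v are tracked per weight."""
    m = beta_1 * m + (1 - beta_1) * g        # 1st-moment estimate
    v = beta_2 * v + (1 - beta_2) * g ** 2   # 2nd-moment estimate
    m_hat = m / (1 - beta_1 ** t)            # bias correction
    v_hat = v / (1 - beta_2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, 1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
g = np.array([0.1, 10.0])  # gradients differing by a factor of 100
w, m, v = adam_step(w, g, m, v, t=1)
```

After this first step both weights move by almost exactly `lr`, despite the 100x gradient difference: each weight's step is normalised by the square root of its own 2nd-moment estimate. That is the sense in which Adam adapts per weight without exposing per-weight learning rates.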

I may have some of the details wrong; the implementation of an optimizer is complex and not my area of expertise.