As we can see below, to implement s_corrected and v_corrected we need to use t (a function argument). But I believe we should be using the for-loop variable l instead. Can someone help me tell l and t apart, both in this context and in the context of exponentially weighted averages? Any help appreciated.
for l in range(1, L + 1):
t is the iteration number of the overall loop, whereas l is (as you say) the layer number of the network. To see why, you need to understand what the math formula there means. Notice that t is used as the exponent for \beta_1 and \beta_2. The point is that the EWA is computed individually at each layer w.r.t. the “timestep” or iteration count t.
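To make the role of t as an exponent concrete, here is a minimal sketch of a bias-corrected EWA over a plain list of numbers. The function name `ewa_with_bias_correction` and the input values are made up for illustration; they are not from the assignment.

```python
def ewa_with_bias_correction(values, beta=0.9):
    """Return the bias-corrected exponentially weighted average at each step."""
    v = 0.0
    corrected = []
    # t counts the steps taken so far, starting at 1 (hypothetical example data)
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x          # raw EWA, biased toward 0 early on
        corrected.append(v / (1 - beta ** t))  # the correction depends only on t
    return corrected


result = ewa_with_bias_correction([5.0, 5.0, 5.0])
print(result)
```

With a constant input, the corrected average should sit at that constant from the very first step, which is what the division by 1 - beta ** t is for. There is no layer index anywhere in this calculation.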
The ‘t’ value is provided by the function that calls the Adam optimizer.
You can see it being incremented in the “model()” function in the assignment.
So it is because we use EWA for mini-batches and we need to count mini-batches to calculate v_corrected and s_corrected. If we use batch gradient descent then it is ok to use just l. Right?
Thank you for the previous answer.
No, l and t are completely different quantities. The t value changes every “step” or “iteration”, right? This function is called multiple times, once per iteration. On each call, you process all the layers.
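Here is a rough sketch of how the two loops nest. It is not the assignment's code: the name `adam_step`, the "W1", "W2" keys, the toy gradient, and the hyperparameter values are all assumptions for illustration. The outer loop owns t and the inner loop owns l.

```python
import numpy as np


def adam_step(params, grads, v, s, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over all layers; t is supplied by the caller."""
    L = len(params)  # one weight matrix per layer in this simplified sketch
    for l in range(1, L + 1):  # l only selects which layer's entries to update
        key = "W" + str(l)
        v[key] = beta1 * v[key] + (1 - beta1) * grads[key]
        s[key] = beta2 * s[key] + (1 - beta2) * grads[key] ** 2
        # Bias correction uses t (how many updates so far), not l
        v_corrected = v[key] / (1 - beta1 ** t)
        s_corrected = s[key] / (1 - beta2 ** t)
        params[key] = params[key] - lr * v_corrected / (np.sqrt(s_corrected) + eps)
    return params, v, s


params = {"W1": np.ones((2, 2)), "W2": np.ones((2, 1))}
v = {k: np.zeros_like(p) for k, p in params.items()}
s = {k: np.zeros_like(p) for k, p in params.items()}

t = 0
for step in range(3):  # outer loop: one pass per iteration
    grads = {k: 2 * p for k, p in params.items()}  # toy gradient of sum(W ** 2)
    t += 1  # incremented once per update, whatever the batch size
    params, v, s = adam_step(params, grads, v, s, t)
```

Within a single call, every layer sees the same t. Switching from mini-batches to full-batch gradient descent only changes how often an update happens; t still counts those updates.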