Hello everyone! I have a question about training an RNN (with LSTM and dense layers). Forward propagation is clear to me. Backpropagation through time for a single LSTM layer (many-to-many), and for LSTM + dense with a single output at the last time step (many-to-one), is also clear. But I am not sure how to calculate the gradients for a model like lstm → dense, where the dense layer produces an output at every time step. My understanding is that for the output dense layer I need to calculate the gradients at every time step and then sum them, like this (I have also written the same thing as equations right after the code):
import numpy as np

# assuming Y_pred and Y have shape (dense_unit_num, T_x) and x (the lstm outputs) has shape (lstm_unit_num, T_x)
do = np.zeros((lstm_unit_num, T_x))  # T_x is the number of time steps; here we store the error to propagate back to the lstm layer
dWx = np.zeros((dense_unit_num, lstm_unit_num))  # gradient w.r.t. the weights applied to the lstm layer outputs
db = np.zeros((dense_unit_num,))  # gradient w.r.t. the dense layer biases
for t in range(T_x):
    dY = 2 * (Y_pred[:, t] - Y[:, t])  # derivative of the squared-error loss w.r.t. the predicted values
    dAct = ((Y_pred[:, t] > 0) + 0) * dY  # derivative of the relu activation
    db += dAct
    dWx += np.outer(dAct, x[:, t])  # x[:, t] is the output vector of the lstm layer at time step t
    do[:, t] = Wx.T @ dAct  # Wx is the weight matrix of the dense layer applied to the lstm outputs
###
# here I skip the part where I calculate the gradients for the lstm; there the error for the lstm cell at step t
# is do[:, t] + ds[t+1], where ds is the gradient passed back from time step t+1 (see the note after the code)
###
# next I update the weights of the dense layer (gradient descent, so we step against the gradient)
Wx -= alpha * dWx
b -= alpha * db
# and here I skip the part that updates the lstm cell weights
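In equation form (using my own notation: $h_t$ is the lstm output at step $t$, $z_t = W_x h_t + b$ the dense pre-activation, $\hat{y}_t = \mathrm{ReLU}(z_t)$ the prediction, and $L = \sum_t \lVert \hat{y}_t - y_t \rVert^2$ the total loss), what I believe the loop above computes is

$$\delta_t = 2(\hat{y}_t - y_t) \odot \mathbf{1}[z_t > 0], \qquad \frac{\partial L}{\partial W_x} = \sum_{t=1}^{T_x} \delta_t\, h_t^\top, \qquad \frac{\partial L}{\partial b} = \sum_{t=1}^{T_x} \delta_t.$$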
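And the note referenced in the comment above: the column do[:, t] is the error the dense layer passes back to the lstm at step $t$, i.e. $W_x^\top \delta_t$, and inside the lstm backward pass it gets added to the recurrent gradient coming from step $t+1$:

$$\frac{\partial L}{\partial h_t} = W_x^\top \delta_t + \left.\frac{\partial L}{\partial h_t}\right|_{\text{from step } t+1}.$$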
I skipped the LSTM calculation parts because they are clear to me. What I want to know is whether the dense-layer gradient calculation above is correct.
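To double-check this myself I was planning to run a small finite-difference test on random data; below is just a sketch with made-up sizes, where H stands in for the lstm outputs (none of the names or numbers come from my real model):

import numpy as np

# made-up sizes, just for the check
lstm_unit_num, dense_unit_num, T_x = 4, 3, 5
rng = np.random.default_rng(0)

H = rng.standard_normal((lstm_unit_num, T_x))    # stand-in for the lstm outputs, one column per time step
Y = rng.standard_normal((dense_unit_num, T_x))   # targets
Wx = rng.standard_normal((dense_unit_num, lstm_unit_num))
b = rng.standard_normal((dense_unit_num,))

def total_loss(Wx, b):
    Z = Wx @ H + b[:, None]            # dense pre-activations for all time steps
    Y_pred = np.maximum(Z, 0.0)        # relu
    return np.sum((Y_pred - Y) ** 2)   # squared error summed over units and time steps

# analytic gradients: same per-step formulas as in my loop, summed over t
Z = Wx @ H + b[:, None]
Y_pred = np.maximum(Z, 0.0)
dAct = 2 * (Y_pred - Y) * (Z > 0)      # shape (dense_unit_num, T_x)
dWx = dAct @ H.T                       # equals the sum over t of outer(dAct[:, t], H[:, t])
db = dAct.sum(axis=1)

# numerical gradient for one weight entry and one bias entry
eps = 1e-6
i, j = 1, 2
Wp, Wm = Wx.copy(), Wx.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
print(dWx[i, j], (total_loss(Wp, b) - total_loss(Wm, b)) / (2 * eps))  # should agree to ~1e-5

bp, bm = b.copy(), b.copy()
bp[i] += eps
bm[i] -= eps
print(db[i], (total_loss(Wx, bp) - total_loss(Wx, bm)) / (2 * eps))    # should agree to ~1e-5

If the printed pairs agree to around 1e-5, I would take that as evidence that summing the per-time-step gradients is the right thing to do.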
Thanks