# How to calculate gradients for dense and LSTM layers?

Hello everyone! I have a question about training an RNN (with LSTM and dense layers). Forward prop is clear to me. Backprop through time for a single LSTM layer (many-to-many), and for LSTM + dense with a single output at the last time step (many-to-one), is also clear. But I am not sure how to calculate the gradients for a model like LSTM → dense. My understanding is that for the output dense layer I need to calculate the gradients at every time step and then sum them, like this:

```python
do = np.zeros((lstm_unit_num, T_x))  # T_x is the number of time steps; here we store the error to propagate back to the LSTM layer
dWx = np.zeros((dense_unit_num, lstm_unit_num))  # gradient w.r.t. weights applied to the LSTM layer outputs
db = np.zeros((dense_unit_num,))  # gradient w.r.t. dense layer biases

for t in range(T_x):
    dY = 2 * (Y_pred[:, t] - Y[:, t])  # loss function derivative w.r.t. the predicted value
    dAct = (Y_pred[:, t] > 0) * dY     # derivative of the relu function
    db += dAct
    dWx += np.outer(dAct, x[:, t])     # x holds the LSTM layer outputs, one column per time step
    do[:, t] = Wx.T @ dAct             # Wx is the dense layer's weight matrix for the LSTM outputs

###

# here I skipped the part where I calculate the gradients for the LSTM,
# where the error for the LSTM cell is do[:, t] + ds[:, t + 1]
# (ds is the gradient passed back from time step t + 1)

###

# next I update the weights of the dense layer (alpha is the learning rate)
Wx -= alpha * dWx
b -= alpha * db

# and here I skip the part that updates the LSTM cell weights
```

I skipped the calculation parts for the LSTM as they are clear to me. But I want to know whether the dense layer gradient calculation above is correct or not.
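One way to settle this is a numerical check: compare the hand-derived dense-layer gradients against finite differences of the loss. Below is a self-contained sketch with made-up shapes (the names `Wx`, `b`, and `x` mirror the snippet above; the loss is assumed to be the squared error summed over time steps):

```python
import numpy as np

rng = np.random.default_rng(0)
lstm_units, dense_units, T_x = 4, 2, 5

x = rng.standard_normal((lstm_units, T_x))   # stand-in for the LSTM outputs, one column per time step
Y = rng.standard_normal((dense_units, T_x))  # targets
Wx = rng.standard_normal((dense_units, lstm_units))
b = rng.standard_normal(dense_units)

def loss(Wx, b):
    # squared error summed over all time steps, relu output activation
    Y_pred = np.maximum(Wx @ x + b[:, None], 0.0)
    return np.sum((Y_pred - Y) ** 2)

# analytic gradients, following the per-timestep sum from the snippet
Y_pred = np.maximum(Wx @ x + b[:, None], 0.0)
dAct = (Y_pred > 0) * 2 * (Y_pred - Y)
dWx = dAct @ x.T          # the matmul sums the per-timestep contributions
db = dAct.sum(axis=1)

# numerical gradient for one weight entry via central differences
eps = 1e-6
Wp = Wx.copy(); Wp[0, 0] += eps
Wm = Wx.copy(); Wm[0, 0] -= eps
num = (loss(Wp, b) - loss(Wm, b)) / (2 * eps)
print(abs(num - dWx[0, 0]) < 1e-4)
```

If the analytic and numerical values agree for every entry, the per-timestep summation is doing the right thing.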

Thanks

Please give an example Keras model of what you mean by this.

I mean structure like this.

```
Sequential([
    LSTM(...),
    Dense(...)
])
```

I want to implement it from scratch using plain numpy.

When doing the backward pass, the gradients flow to the LSTM layer via the dense layer. In the forward pass, the LSTM output at the last time step is passed to the dense layer. So there’s no need to deal with time steps in the dense layer as long as `return_sequences = False`. The derivatives of the loss with respect to the weights and biases of the dense layer are sufficient.
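For the `return_sequences = False` case, the dense-layer backward pass can be sketched in a few lines of numpy. This assumes a squared-error loss and a linear regression head; the variable names are illustrative, not from any particular implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
lstm_units, dense_units = 8, 1

h_T = rng.standard_normal(lstm_units)  # LSTM output at the last time step
y = rng.standard_normal(dense_units)   # target
W = rng.standard_normal((dense_units, lstm_units))
b = np.zeros(dense_units)

# forward: dense layer applied to the final hidden state only
y_pred = W @ h_T + b                   # linear output for a regression head

# backward: plain dense-layer gradients, no timestep loop needed
dy = 2 * (y_pred - y)                  # d(squared error)/d(y_pred)
dW = np.outer(dy, h_T)                 # gradient w.r.t. dense weights
db = dy                                # gradient w.r.t. dense bias
dh_T = W.T @ dy                        # error handed back into the LSTM at step T
```

`dh_T` is then combined with the LSTM's own recurrent gradient at the last step and backpropagated through time as usual.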

Frameworks track gradients based on the forward pass. To get a better insight into how this happens, see micrograd.
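To make that concrete, here is a stripped-down sketch of the idea behind micrograd (not its actual API): each operation records its inputs during the forward pass, and the backward pass walks the recorded graph in reverse topological order applying the chain rule.

```python
class Value:
    """Tiny reverse-mode autodiff node, in the spirit of micrograd."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def backprop(self):
        # build a topological order, then apply the chain rule in reverse
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b + a      # dc/da = b + 1 = 4, dc/db = a = 2
c.backprop()
print(a.grad, b.grad)  # 4.0 2.0
```

The key point: the forward pass defines the graph, and the backward pass is derived from it mechanically; no hand derivation per architecture is required.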

Hi @balaji.ambresh, thank you for the reply. The `return_sequences=False` case is clear now. What if `return_sequences=True`? Are the calculations above correct for that case?

I don’t know off the top of my head.

It’s possible that the dense layer will calculate the gradient based on the sum of losses from outputs of each time step. Please reply to this thread with what you find.

Well, in that case think about what the output of the LSTM layer looks like: it’s a 2D tensor, not counting the “samples” dimension, right? So you’ll need a Flatten layer between LSTM and Dense, which takes care of how the gradients will propagate.

I admire your ambition to write all this directly in numpy, but I’m sure you realize that you don’t really need to do that. Just use TF and the lower level mechanics are all taken care of for you. Doing this “by hand” will be an educational experience, I’m sure.

Notice that if you had done the above in TF it would have thrown a dimension mismatch at runtime at the junction between LSTM and Dense without the Flatten layer.

Hi @paulinpaloalto, thanks for your reply. Yes, I want to solve the problem manually because it is just interesting for me to do that. It is easy to implement using TF, but I want to understand the math.

That’s great! Learning is the point here. You only responded to my general comments there. Did the actual point about the nature of the output tensor make sense? If you have an LSTM and you output the full time sequence, then for each input sample the output will have two dimensions: the label dimension (which is typically a one hot vector) and then the timestep dimension. Of course if you are doing things in a vectorized way, you also have the “samples” dimension, but that’s easy to deal with from the point of view of back propagation.

Yes, the output tensor has 3 dimensions: output, samples, and time steps. The outputs are regression values. But I am confused about how to calculate the gradients for the dense layer and how to propagate them to the LSTM correctly.

The point is that if you understood it in the `return_sequences = False` case, then it’s the same in the `return_sequences = True` case, right? The only difference is that there is a “Flatten” layer between the LSTM and the Dense layer in the “True” case. A “Flatten” layer doesn’t change any values (it just rearranges them), so the derivatives are just 1, right?
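If a Flatten step is inserted as described, its backward pass is pure bookkeeping: the incoming gradient is reshaped back to the LSTM's output shape with no values changed. A small numpy illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
T, units = 3, 8

h = rng.standard_normal((T, units))  # LSTM output for the full sequence

# Flatten forward: just a reshape, no values change
flat = h.reshape(-1)                 # shape (T * units,)

# Flatten backward: whatever gradient arrives for `flat`
# is reshaped back to the LSTM's output shape, values untouched
dflat = rng.standard_normal(flat.shape)
dh = dflat.reshape(h.shape)

print(np.array_equal(dh.reshape(-1), dflat))
```

So the Flatten layer contributes a factor of 1 to every path in the chain rule; only the index layout changes.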

Yes, it’s clear. So the calculation I put in the description is correct, right? The question is: do I propagate the error from the dense layer to the LSTM?

Yes, as always, backward propagation is the mirror image of forward propagation. If you go from LSTM to Dense (or LSTM → Flatten → Dense) in the forward direction, then the gradients will propagate backward from Dense to Flatten (if present) to LSTM.

Thanks! It is clear now.

When `return_sequences=True`, it’s not as simple as flattening all the outputs and feeding them to the dense layer that follows.

Please look at the examples below, keeping the output shape and the shape of the dense layer parameters in mind. A dense layer with the same parameters is used across all timesteps:

Considering a sequence prediction task, the model below is used for predicting 3 timesteps into the future, where each timestep has 1 feature.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, input_shape=[3, 2], return_sequences=True),
    tf.keras.layers.Dense(1, activation='relu')
])
w, b = model.layers[-1].get_weights()
model.summary()
print(w.shape, b.shape)
```

Output:

```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                Output Shape              Param #
=================================================================
lstm_1 (LSTM)               (None, 3, 8)              352

dense_1 (Dense)             (None, 3, 1)              9

=================================================================
Total params: 361
Trainable params: 361
Non-trainable params: 0
_________________________________________________________________
(8, 1) (1,)
```
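As a sanity check, the parameter counts in the summary can be reproduced by hand: an LSTM layer has four gates, each with input weights, recurrent weights, and a bias; the dense layer has one weight per LSTM unit plus a bias.

```python
units, features = 8, 2

# 4 gates, each with: input weights + recurrent weights + bias
lstm_params = 4 * (units * features + units * units + units)

# one output unit: a weight per LSTM unit, plus one bias
dense_params = units * 1 + 1

print(lstm_params, dense_params)  # 352 9, matching the summary
```

Note the dense layer still has only 9 parameters even though its output spans 3 timesteps, which is the point being made: the same kernel is reused at every step.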

Here’s another model that predicts 1 timestep into the future (`return_sequences=False`), where each timestep has 1 feature:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(8, input_shape=[3, 2]),
    tf.keras.layers.Dense(1, activation='relu')
])
w, b = model.layers[-1].get_weights()
model.summary()
print(w.shape, b.shape)
```

Output:

```
Model: "sequential_2"
_________________________________________________________________
Layer (type)                Output Shape              Param #
=================================================================
lstm_3 (LSTM)               (None, 8)                 352

dense_3 (Dense)             (None, 1)                 9

=================================================================
Total params: 361
Trainable params: 361
Non-trainable params: 0
_________________________________________________________________
(8, 1) (1,)
```
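Note how the dense kernel is `(8, 1)` in both cases: with `return_sequences=True`, Keras’s `Dense` acts on the last axis of its input, so the same kernel is applied at every timestep. In numpy terms (illustrative shapes for a single sample):

```python
import numpy as np

rng = np.random.default_rng(3)
T, units = 3, 8

h = rng.standard_normal((T, units))  # one sample's LSTM sequence output
w = rng.standard_normal((units, 1))  # dense kernel, shape (8, 1) as above
b = rng.standard_normal(1)           # dense bias, shape (1,)

# dense on the sequence output: one matmul over the last axis...
out_all = h @ w + b                  # shape (T, 1)

# ...is the same as applying the dense layer timestep by timestep
out_steps = np.stack([h[t] @ w + b for t in range(T)])

print(np.allclose(out_all, out_steps))
```

Because the weights are shared across timesteps, their gradient in the backward pass is the sum of the per-timestep contributions, which is what the loop in the original question computes.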