While doing assignment C3W1, and specifically while working on Exercise 3 - GRULM, I found the explanation and implementation very poor. For instance, while defining the class GRULM, there are these two lines of code:
x, states = self.gru(x, initial_state=states, training=training)
# Predict the next tokens and apply log-softmax activation
x = self.dense(x, training=training)
While working through this, it is totally non-obvious what is going on here: what is being stored in x, and how it interacts with the next Dense layer. So far, we were taught that a Dense layer is fully connected, so its input should be a 2D tensor of shape (batch_size, n), meaning n scalars (activations) each fully interacting with every neuron of the Dense layer. However, the output of the GRU layer is 3D: (batch_size, sequence_length, rnn_units). How this is compatible with a Dense layer is completely unexplained.
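For anyone else stuck on this point: Keras Dense does accept inputs with rank greater than 2, and it applies the same weights independently at every time step, operating only on the last axis. Here is a minimal check I put together (the shapes here are my own example, not the assignment's actual dimensions):

```python
import tensorflow as tf

# Assumed toy shapes: batch_size=2, sequence_length=5, rnn_units=8
x = tf.random.normal((2, 5, 8))

# A Dense layer with 10 output units (standing in for a vocab size)
dense = tf.keras.layers.Dense(10)

# Dense contracts only the last axis; the batch and time axes pass through
y = dense(x)
print(y.shape)  # (2, 5, 10)
```

So the 3D GRU output is fine: the Dense layer is effectively applied per time step, producing one prediction vector per position in the sequence.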
Moreover, it's also nowhere clarified why this is the right thing to do. For instance, the output of the GRU, captured in the variable x, is the hidden state value at each time step (hence the 3D shape). But in the lectures we were shown that the hidden state is different from the output (y). Why the output isn't captured here is also not explained. It's assumed (as per my understanding) that the hidden states are the predictions of each GRU unit, and that's what interacts with the Dense layer.
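As far as I can tell (this is my own experiment, not something the assignment states), for a GRU the per-step "output" and the hidden state are the same tensor, unlike an LSTM, where the cell state is separate. A quick sketch that checks this with Keras:

```python
import tensorflow as tf

# return_sequences gives the output at every step;
# return_state additionally gives the final hidden state
gru = tf.keras.layers.GRU(4, return_sequences=True, return_state=True)

x = tf.random.normal((1, 6, 3))  # assumed toy shapes: batch=1, steps=6, features=3
seq, state = gru(x)

# For a GRU, the last step of the output sequence equals the final hidden state
print(tf.reduce_all(tf.equal(seq[:, -1, :], state)))
```

If that's right, then capturing x (the hidden states) and feeding it to the Dense layer is capturing the outputs, but the assignment never says so.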
Again, a very poorly written and explained assignment, in my opinion. It looks like the authors just hurried to complete this, which defeats the whole purpose of engaging with a course like this.