Unexplained, confusing and missing detail in Assignment C3W1

While doing assignment C3W1, specifically while working on Exercise 3 - GRULM, I found the explanation and implementation very poorly documented. For instance, while defining the class GRULM, there are these two lines of code:

  x, states = self.gru(x, initial_state=states, training=training)
  # Predict the next tokens and apply log-softmax activation
  x = self.dense(x, training=training) 

While working through this, it is not at all obvious what is going on. For instance, what is stored in x, and how does it interact with the next Dense layer? So far, we were taught that a Dense layer is fully connected, so its input should be a 2D tensor of shape (batch_size, n), meaning n scalars (activations) each interacting with every neuron of the Dense layer. However, the output of the GRU is 3D, with shape (batch_size, sequence_length, rnn_units). How this is compatible with a Dense layer is completely unexplained.

Moreover, it is nowhere clarified why this is the right thing to do. The output of the GRU, captured in the variable x, is the hidden state value at each time step (hence the 3D shape). But in the lectures we were shown that the hidden state is different from the output (y). Why the output is not captured here is also not explained. It is assumed (as per my understanding) that the hidden states are the predictions of each GRU unit, and that is what interacts with the Dense layer.
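To make my confusion concrete, here is a small standalone experiment I ran (a minimal sketch with made-up sizes, not the assignment code):

import tensorflow as tf

batch_size, seq_len, emb_dim, rnn_units = 2, 5, 8, 16
x = tf.random.normal((batch_size, seq_len, emb_dim))

# return_sequences=True: the GRU returns its hidden state at EVERY time step.
# return_state=True: it additionally returns the final hidden state.
gru = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)
outputs, final_state = gru(x)

print(outputs.shape)      # (2, 5, 16): hidden state at each time step
print(final_state.shape)  # (2, 16): hidden state after the last step
# For a GRU the per-step "output" is the hidden state itself, so the last
# slice of outputs matches the returned final state:
print(tf.reduce_all(outputs[:, -1, :] == final_state).numpy())  # True

# And a Dense layer does accept this 3D tensor, though the notebook never
# explains why or what it does with the extra time dimension:
dense = tf.keras.layers.Dense(10)
print(dense(outputs).shape)  # (2, 5, 10)

Both facts hold empirically, but neither is explained in the notebook, which is my complaint.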

Again, a very poorly written and explained assignment in my opinion. It looks like the authors hurried to complete it, which defeats the whole purpose of engaging with a course like this.

The Dense layer computes the dot product between the inputs and the kernel along the last axis of the inputs and axis 0 of the kernel. For example, if the input has shape (batch_size, d0, d1), then the layer creates a kernel with shape (d1, units), and the kernel operates along axis 2 of the input. More precisely, it implements the operation output = activation(dot(input, kernel) + bias), where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer. If you don’t specify an activation, none is applied (i.e. “linear” activation: a(x) = x). Let’s see the details:

>>> import tensorflow as tf
>>> dense = tf.keras.layers.Dense(units=3)

Here we created a Dense layer with 3 units, but at this point, its kernel and bias are not initialized:

>>> dense.get_weights()
[]

Now we create an input and pass it through this layer:

>>> x = tf.constant([[[1., 2., 3., 4.], [5., 6., 7., 8.]], [[9., 10., 11., 12.], [13., 14., 15., 16.]]])
>>> x.shape
TensorShape([2, 2, 4])
>>> dense(x)
<tf.Tensor: shape=(2, 2, 3), dtype=float32, numpy=
array([[[-0.60095024,  2.3085134 , -0.91536295],
        [-1.5299821 ,  2.8175893 , -2.2901776 ]],

       [[-2.459015  ,  3.326665  , -3.6649923 ],
        [-3.3880467 ,  3.835741  , -5.0398064 ]]], dtype=float32)>
>>> dense.get_weights()
[array([[-0.8197609 , -0.68349624, -0.56818855],
       [ 0.9160501 ,  0.14068007,  0.49996817],
       [ 0.29910052, -0.03030902,  0.24517775],
       [-0.62764776,  0.70039415, -0.520661  ]], dtype=float32), array([0., 0., 0.], dtype=float32)]

As you can see, the output has a shape of (2, 2, 3), and the layer’s kernel and bias were created with shapes (4, 3) and (3,) respectively. We get the same result if we apply the transformation I described earlier:

>>> weight = dense.get_weights()[0]
>>> bias = dense.get_weights()[1]
>>> tf.tensordot(x, weight, axes=[[2], [0]]) + bias
<tf.Tensor: shape=(2, 2, 3), dtype=float32, numpy=
array([[[-0.60095024,  2.3085134 , -0.91536295],
        [-1.5299821 ,  2.8175893 , -2.2901776 ]],

       [[-2.459015  ,  3.326665  , -3.6649923 ],
        [-3.3880467 ,  3.835741  , -5.0398064 ]]], dtype=float32)>

Note the axes argument: it specifies the axes along which tensordot sums the products of elements from the two tensors. You can think of it as multiplying each (2 x 4) matrix from the batch by the kernel and then stacking the results:

>>> x[0]
<tf.Tensor: shape=(2, 4), dtype=float32, numpy=
array([[1., 2., 3., 4.],
       [5., 6., 7., 8.]], dtype=float32)>
>>> tf.matmul(x[0], weight)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[-0.60095024,  2.3085134 , -0.91536295],
       [-1.5299821 ,  2.8175893 , -2.2901776 ]], dtype=float32)>
>>> tf.matmul(x[1], weight)
<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[-2.459015 ,  3.326665 , -3.6649923],
       [-3.3880467,  3.835741 , -5.0398064]], dtype=float32)>

Right now I don’t have access to the course materials, so I or the other mentors will address the remaining questions later.

Hi Abhishek,

The GRU output is passed to the dense layer; for that, return_sequences and return_state have to be set to True (these instructions are given before the graded cell). The dense layer then uses self.dense, defined in def __init__, where it is configured with the vocabulary size and the log-softmax activation to produce the required output.
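To make the shapes concrete, here is a small sketch (the layer sizes are made up, and this is not the exact assignment code) of how the GRU output feeds into the dense layer:

import tensorflow as tf

vocab_size, embedding_dim, rnn_units = 100, 16, 32

embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
gru = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)
dense = tf.keras.layers.Dense(vocab_size, activation=tf.nn.log_softmax)

ids = tf.constant([[3, 7, 2, 9]])   # (batch=1, seq_len=4) token ids
x = embedding(ids)                  # (1, 4, 16)  one vector per token
x, state = gru(x)                   # x: (1, 4, 32), state: (1, 32)
log_probs = dense(x)                # (1, 4, 100)  log-probabilities over the
                                    # vocabulary at every position
print(x.shape, state.shape, log_probs.shape)

So every position in the sequence gets its own log-probability distribution over the vocabulary, which is exactly what is needed to predict the next token at each step.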

Although I do agree that some of the instructions are not given directly in the graded cell, most of what you need can be found elsewhere in the assignment, usually in the instructions before the graded cell or in non-graded cells.

Just giving you a heads-up on where to look for the other assignments in this course.

There are some similar threads in the Discourse community which might help you.

Regards
DP

This:

“most of what you need can be found elsewhere in the assignment, usually in the instructions before the graded cell or in non-graded cells”

is, I think, a problem.

I’m not here to simply solve assignments based on the instructions given. My intent is to truly understand what is happening, in every bit of depth. This is what other courses from DL.ai do well: they explain exactly what is happening, with clarity. I expect the same from this course, especially for sufficiently complex concepts like GRUs and LSTMs.

@vaidabhishek Abhishek, if you had simply said that you wanted a better explanation, that would also have conveyed your discomfort, rather than the statement you used.

I only conveyed the solution to you because part of the query had already been addressed by the other mentor (though, as I see it, he explained the Trax version of the course). I will try my best to explain the GRU and dense layer language model. Please feel free to ask if you need more explanation or something is unclear.

So def __init__ sets up the layers the model needs, starting with an embedding layer (size 256) that maps token indices to embedding vectors, basically turning each token of a text sequence into a vector, with the embedding table sized according to the vocabulary.

The vocabulary used here is a tokenized one: the text data is converted into a sequence of integer values, and each integer value represents a specific entry in the vocabulary.
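As a toy example (a made-up vocabulary, not the one used in the assignment):

# Each word is mapped to an integer id from the vocabulary.
vocab = {"<pad>": 0, "<unk>": 1, "i": 2, "love": 3, "deep": 4, "learning": 5}
text = "i love deep learning"
token_ids = [vocab.get(word, vocab["<unk>"]) for word in text.split()]
print(token_ids)  # [2, 3, 4, 5]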

The GRU layer’s rnn_units process the embedding vectors in order, carrying a hidden state from one position to the next, so that when an input text is provided the model has the context it needs to predict the next word.

Now, in the def call function, the initial state for the GRU layer from __init__ is determined first, to make sure the sequence always starts from a well-defined initial state x0, with the training flag passed through.

Then the last dense layer, which has to predict the next token or word, uses the self.dense layer from __init__ to produce a score for every word in the vocabulary through the log-softmax activation.

For each input, returning the state means the final hidden state can be fed back in as the initial state of the next call, so the next token is predicted from the accumulated context rather than from scratch.
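Putting the pieces together, here is a rough sketch of this kind of model (my own simplified version with illustrative names and default sizes, assuming the TF 2 Keras API used in the notebook; it is not the graded solution):

import tensorflow as tf

class SimpleGRULM(tf.keras.Model):
    """Simplified GRU language model (illustrative sketch, not the assignment)."""

    def __init__(self, vocab_size, embedding_dim=256, rnn_units=512):
        super().__init__()
        # Maps each token id to a dense vector.
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # Returns the hidden state at every time step plus the final state.
        self.gru = tf.keras.layers.GRU(rnn_units,
                                       return_sequences=True,
                                       return_state=True)
        # Scores every word in the vocabulary at every position.
        self.dense = tf.keras.layers.Dense(vocab_size,
                                           activation=tf.nn.log_softmax)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = self.embedding(inputs, training=training)   # (batch, seq, emb)
        if states is None:
            # Start from the default initial state when none is supplied.
            states = self.gru.get_initial_state(x)
        x, states = self.gru(x, initial_state=states, training=training)
        x = self.dense(x, training=training)            # (batch, seq, vocab)
        return (x, states) if return_state else x

Calling the model with an integer tensor of shape (batch, seq_len) returns log-probabilities of shape (batch, seq_len, vocab_size), and with return_state=True also the final GRU state, which you can feed back in as states on the next call to keep generating.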

Please feel free to ask if anything is still unclear.

Regards
DP