I got an AssertionError in Ex8 - Transformer Assignment

I had almost finished the Transformer assignment and passed all of the previous exercises, but I got an AssertionError in the last one:

AssertionError Traceback (most recent call last)
----> 2 Transformer_test(Transformer, create_look_ahead_mask, create_padding_mask)

~/work/W4A1/public_tests.py in Transformer_test(target, create_look_ahead_mask, create_padding_mask)
286 assert np.allclose(translation[0, 0, 0:8],
287 [0.017416516, 0.030932948, 0.024302809, 0.01997807,
→ 288 0.014861834, 0.034384135, 0.054789476, 0.032087505]), "Wrong values in translation"
290 keys = list(weights.keys())

AssertionError: Wrong values in translation

Assuming that you implemented all the previous functions correctly, the most likely cause is the choice of input sentences. See my annotated version of the Transformer overview.

Have you selected the appropriate inputs for the Encoder and the Decoder?

Solved, thank you. I called the decoder with the input_sentence instead of the output_sentence.
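To make that wiring concrete, here is a toy sketch of the call pattern. The names `input_sentence`/`output_sentence` follow the thread, but the encoder/decoder bodies below are trivial stand-ins, not the assignment's implementation:

```python
# Trivial stand-ins to illustrate which sequence goes where, NOT the real layers.
def encoder(input_sentence):
    # stand-in for enc_output: "encode" the SOURCE sequence
    return [w.upper() for w in input_sentence]

def decoder(output_sentence, enc_output):
    # the decoder consumes the TARGET sequence plus the encoder output;
    # passing input_sentence here is exactly the bug from this thread
    return list(zip(output_sentence, enc_output))

input_sentence = ["je", "suis", "ici"]    # source
output_sentence = ["i", "am", "here"]     # target

enc_output = encoder(input_sentence)
dec_output = decoder(output_sentence, enc_output)   # not decoder(input_sentence, ...)
```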

Just two more questions that got me puzzled:

  • Why don’t we use scaled_dot_product_attention anywhere?
  • I’m a little confused by how we used word embeddings in this assignment, since the video lectures didn’t go into it. So basically, did we reduce everything we did with word embeddings in Week 2 Assignment 2 down to an Embedding layer from TensorFlow? Sorry if this question sounds stupid, but I’m trying to understand every little detail of this exercise.

Here is an overview of the “encoder” portion of the Transformer.

‘Scaled dot product’ attention is one of the key functions in the Transformer, but it is part of the Keras MultiHeadAttention layer that we used for this assignment, so the function you implemented is not actually called. Still, given its importance, I suppose that is why it is included in this exercise.
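For reference, the function you implemented computes softmax(QKᵀ/√d_k)·V. Here is a minimal NumPy sketch of that formula (not the assignment's TensorFlow version, just the same math):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = k.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)   # (..., seq_q, seq_k)
    if mask is not None:
        scores += mask * -1e9                        # masked positions -> ~0 after softmax
    # softmax over the key axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

q = k = v = np.eye(3)                                # tiny toy inputs
out, w = scaled_dot_product_attention(q, k, v)
```

Inside Keras MultiHeadAttention, this same computation runs once per head on the projected Q/K/V.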

Word embeddings are what we learned in the first exercise of the 2nd week, not the 2nd exercise. The idea is to represent the meaning of a “word” by a “multi-dimensional vector”, so that if the meanings of two words are close, their vectors point in similar directions. This is useful for measuring the similarity of two words, and also for representing the characteristics of a word.
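“Similar direction” is usually measured with cosine similarity, as in the Word Vectors exercise. A small sketch with made-up 3-d vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(u, v):
    # 1.0 = same direction, 0.0 = orthogonal (unrelated)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy "embeddings" -- the numbers are invented purely for illustration
king  = np.array([0.90, 0.80, 0.10])
queen = np.array([0.85, 0.75, 0.20])
apple = np.array([0.10, 0.20, 0.95])

royal_sim = cosine_similarity(king, queen)   # close to 1: related meanings
fruit_sim = cosine_similarity(king, apple)   # much smaller: unrelated
```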

If you look at the figure above, you see another important input, the positional encoding, which is merged with the word embedding before being fed to the Encoder. Since a word embedding carries no position-related information, we need to add position information in order to create “attentions”; that is what the positional encoding does. The merged data is then fed to a multi-head attention layer, which includes the ‘scaled dot product’. (The figure also illustrates how the input Q/K/V is distributed to the multiple “heads” in the multi-head attention layer.)
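The standard sinusoidal positional encoding from the paper (and the assignment) alternates sine and cosine across the embedding dimensions. A NumPy sketch:

```python
import numpy as np

def positional_encoding(positions, d_model):
    pos = np.arange(positions)[:, None]            # (seq_len, 1)
    i = np.arange(d_model)[None, :]                # (1, d_model)
    # each dimension pair gets a different wavelength, from 2*pi up to 10000*2*pi
    angle_rates = 1.0 / np.power(10000, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates                     # (seq_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])      # even indices: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])      # odd indices: cosine
    return angles

pe = positional_encoding(50, 16)
# in the model the encoding is simply added to the (scaled) embeddings:
# x = embedding(tokens) * sqrt(d_model) + pe[:seq_len]
```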

Hope this helps.

Sorry, I meant the second week. About word embeddings and positional encoding, I know what they are and what they do.

What I’m trying to figure out is whether the Embedding layer we imported from TensorFlow Keras does more or less what we accomplished in Week 2 Assignment 2.

Edit: BTW, I’d love to know how you got those amazing images.

I see your point, and it is actually a good one.

In our exercise for “Word Vectors”, we used GloVe vectors, which are among the most commonly used word vectors. (Another famous one is ‘word2vec’.)

On the other hand, in this Transformer exercise we used the Keras Embedding layer. Remember that we imported and set it up:

from tensorflow.keras.layers import Embedding, MultiHeadAttention, Dense, Input, Dropout, LayerNormalization

self.embedding = Embedding(input_vocab_size, self.embedding_dim)

So it is not a GloVe vector that we used. This Keras Embedding layer is “trainable”: if we were to use it in a real project, we would need to “train” it, or load its weights (the embedding matrix) from another technique such as GloVe or word2vec and make it non-trainable (trainable=False). Since this is not for commercial use but for learning, we simply used the Keras Embedding layer with its default settings.
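If you did want to plug pretrained vectors into the same Embedding layer, it would look roughly like this (the random matrix below is just a stand-in for a real GloVe matrix):

```python
import numpy as np
import tensorflow as tf

vocab_size, embedding_dim = 5, 4
# stand-in for a pretrained GloVe/word2vec matrix, one row per vocabulary word
pretrained = np.random.rand(vocab_size, embedding_dim).astype("float32")

frozen = tf.keras.layers.Embedding(
    vocab_size, embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=False)                         # weights stay fixed during training

out = frozen(tf.constant([1, 3]))            # lookup rows 1 and 3
```

With `trainable=False`, the layer is a pure table lookup into the pretrained matrix; with the assignment's default (trainable) setting, those rows are learned from scratch like any other weights.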

Hope this helps.

Thank you so much for the detailed explanations!