Specialization: Natural Language Processing Specialization
Course: Natural Language Processing with Attention models
Week: 2
Assignment: C4W2
Function: DecoderLayer.call / Block2
I am not able to code the “GRADED FUNCTION: DecoderLayer”, even though I understand the EncoderLayer from the previous steps.
I got as far as the first multi-head attention followed by the normalization, but I cannot work out what comes after that for the second multi-head attention.
The instruction is: “calculate self-attention using the Q from the first block and K and V from the encoder output.”
Parameter: enc_output (tf.Tensor): Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
From that instruction I understand what Q is, and I understand the “encoder output” is passed as a parameter to the DecoderLayer. But how do I get K and V from the encoder output? When “Encoder(…)” is called it returns a tensor of shape (batch_size, input_seq_len, embedding_dim); how do I derive K and V from this?
I also doubt I have coded the first MHA correctly. What values should I pass for query, key and value in the first MHA? There is a parameter “x”, and I don’t understand how to derive query/key/value from it.
x (tf.Tensor): Tensor of shape (batch_size, target_seq_len, fully_connected_dim)
From arXiv:1706.03762:
That is the architecture figure from the original Transformer paper, which is actually the same as the one in the assignment, but here it is easier to see how the encoder and decoder are connected and where K and V come from.
Getting K and V from the “encoder output” is as straightforward as the architecture depicts: the encoder output tensor itself is used for both K and V in the second attention block. Not much more is needed.
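As a hedged illustration only (not the graded solution), here is a minimal sketch of that second attention block. The layer and variable names (mha2, layernorm2, Q1) are assumptions for illustration; the key point is that Q comes from the output of the first block while K and V are both the encoder output.

```python
import tensorflow as tf

# Illustrative layers; in the assignment these are created in __init__
mha2 = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=64)
layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

def second_attention_block(Q1, enc_output, padding_mask=None):
    """Q1: output of the first (masked self-attention) block,
       shape (batch_size, target_seq_len, fully_connected_dim).
       enc_output: encoder output,
       shape (batch_size, input_seq_len, fully_connected_dim)."""
    # Q comes from the first block; K and V are both the encoder output.
    attn2, attn_weights2 = mha2(
        query=Q1,
        value=enc_output,
        key=enc_output,
        attention_mask=padding_mask,
        return_attention_scores=True,
    )
    # Residual connection + layer normalization, as in the encoder layer.
    out2 = layernorm2(attn2 + Q1)
    return out2, attn_weights2
```

So there is no extra transformation to “derive” K and V: the same encoder output tensor is simply passed for both arguments.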
“I also doubt I have coded the first MHA correctly. What values should I pass for query, key and value in the first MHA? There is a parameter “x”, and I don’t understand how to derive query/key/value from it.”
For this you can check the implementation in the EncoderLayer class; in the DecoderLayer class it should be done in a similar manner.
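For illustration, here is a similar minimal sketch of the first block (again not the graded solution; the names mha1 and layernorm1 are assumptions). As in the EncoderLayer, query, key and value all come from the same tensor, which in the decoder is the input x, and the look-ahead mask is passed as the attention mask:

```python
import tensorflow as tf

# Illustrative layers; in the assignment these are created in __init__
mha1 = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=64)
layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

def first_attention_block(x, look_ahead_mask=None):
    """x: decoder input,
       shape (batch_size, target_seq_len, fully_connected_dim)."""
    # Masked self-attention: x serves as query, key and value.
    attn1, attn_weights1 = mha1(
        query=x,
        value=x,
        key=x,
        attention_mask=look_ahead_mask,
        return_attention_scores=True,
    )
    # Residual connection + layer normalization.
    Q1 = layernorm1(attn1 + x)
    return Q1, attn_weights1
```

The output of this block (Q1 here) is then what you feed as the query into the second attention block described above.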