# I am stuck on Course 5 Week 4 Assignment 1 Ex 3

{moderator edit - solution code removed}

I wrote the code above and got the following results.

It seems to me that something is wrong with my code: the attention weights come out different from the expected result, but I can't tell where the mistake is. Please help.

I recommend you read the instructions for Exercise 3 carefully for how to apply the mask.

For the attention_weights, I recommend you use the tf.keras.activations.softmax(…) function.

For output, you should compute the matrix product of attention_weights and v.
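Putting the three hints above together (apply the mask, use `tf.keras.activations.softmax`, then multiply by `v`), here is a minimal sketch of the standard scaled dot-product attention computation. This is not the notebook's exact solution; the function name, the `(1. - mask) * -1e9` masking convention, and the tensor shapes are assumptions for illustration.

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Sketch of softmax(QK^T / sqrt(d_k) + mask_term) V.

    Shapes assumed: q (..., seq_len_q, depth), k (..., seq_len_k, depth),
    v (..., seq_len_k, depth_v). Not the graded solution -- an illustration.
    """
    # QK^T: matrix product with k transposed on its last two axes
    matmul_qk = tf.matmul(q, k, transpose_b=True)

    # scale by the square root of the key dimension
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)

    # push masked positions toward -inf before the softmax
    # (assumes mask uses 1 = keep, 0 = mask out)
    if mask is not None:
        scaled_logits += (1. - mask) * -1e9

    # softmax over the last axis gives the attention weights
    attention_weights = tf.keras.activations.softmax(scaled_logits)

    # weighted sum of the values
    output = tf.matmul(attention_weights, v)
    return output, attention_weights
```

Each row of `attention_weights` sums to 1, which is a quick sanity check if your own implementation produces unexpected values.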

Hello,

I apologize for reviving a solved topic, but I feel I need clarification on something and in my opinion the best way would be to ask right here so that the context is clear.

My doubt is about the `matmul_qk` and `dk` code. I had absolutely no clue how to do it until by God’s grace I stumbled upon this topic, where in the screenshot the code for those variables is given.

What I want to know is: given the instructions in the notebook, how would someone figure out how to write the code for those two variables, i.e., `matmul_qk` and `dk`? Sure, the `tf.matmul` hint might help with the first variable, but I feel it would be next to impossible to figure out the second one. So please explain how one is supposed to be able to code the second variable (`dk`) using only the instructions provided. Thank you, and sorry again for reviving an old topic.

On thinking about it more closely, it is clearer what `matmul_qk` and `dk` are. Yet it is still not clear how `dk` should be computed. Please shed light on that.

For `matmul_qk`, they literally write out the formula for you: QK^T. You just have to remember Prof Ng’s notational conventions, which have been absolutely consistent from the very beginning of DLS Course 1: when he writes two array, vector, or tensor arguments adjacent with no explicit operator, that means matrix multiplication (a true “dot product” style product). If he means elementwise multiplication, he consistently uses the * operator.
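To make that notational distinction concrete, here is a small example (the values are arbitrary, chosen just for illustration) contrasting the QK^T-style matrix product with the elementwise * operator:

```python
import tensorflow as tf

a = tf.constant([[1., 2.], [3., 4.]])
b = tf.constant([[5., 6.], [7., 8.]])

# Adjacent arguments in the formula (QK^T) mean a matrix product;
# transpose_b=True transposes b's last two axes, giving a @ b^T.
matmul_ab = tf.matmul(a, b, transpose_b=True)
# [[17., 23.],
#  [39., 53.]]

# The * operator is the elementwise (Hadamard) product -- a different operation.
elementwise_ab = a * b
# [[ 5., 12.],
#  [21., 32.]]
```

If your attention weights are wrong, confusing these two operations is a common cause.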

For d_k, the instructions say this:

```
d_k is the dimension of the keys, which is used to scale everything down so the softmax doesn't explode
```

So maybe there is a bit of ambiguity there about which dimension they mean, but it is the sequence length. I used index -2 for that, just to be careful about whether the “samples” dimension is present or not.
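Which axis of `k`'s shape you read off depends on the convention the notebook uses: in the Vaswani et al. formula, d_k is the key depth (the last axis, index -1), while the post above reads the sequence-length axis with index -2. Either way, negative indexing is what makes the code robust to a leading batch dimension. A small sketch, with shapes assumed purely for illustration:

```python
import tensorflow as tf

# Assumed shape: (batch, seq_len_k, depth) -- the leading batch axis
# may or may not be present depending on how the function is called.
k = tf.random.normal((2, 5, 8))

# Negative indices count from the end of the shape, so they pick out the
# same logical axis whether or not a batch ("samples") dimension exists:
depth = tf.cast(tf.shape(k)[-1], tf.float32)    # last axis
seq_len = tf.cast(tf.shape(k)[-2], tf.float32)  # second-to-last axis
```

The `tf.cast` to `tf.float32` matters because `tf.shape` returns integers, and the subsequent `tf.math.sqrt` needs a float input.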

Hi, I have some implementation doubts and have sent an invite to you for a DM. Please help me by looking at my code.

Sure, I will respond in about an hour; I’m away from my computer right now. But if you copied all the code from that other thread, there are a number of problems with it. E.g., there should be no calls to the normalize function.