C5 W4 A1 E3: Help me understand the dimensions in scaled_dot_product_attention

I’m assuming all these problems come up because the code I wrote here isn’t what was expected:

{moderator edit - solution code removed}

Anyway, I kept getting an error message like this:
ValueError: non-broadcastable output operand with shape (3,4) doesn’t match the broadcast shape (1,3,4)

Probably scaled_attention_logits isn’t broadcasting in the line below.

{moderator edit - solution code removed}

Anyway, the test passed, but it seems like no one else in the forum did it this way.
So I’m wondering why it wouldn’t broadcast without the [0,:,:] at first, and I’d really appreciate it if some generous and wise person could help me understand what’s going on with the dimensions of q, k, v, mask, dk, and scaled_attention_logits.
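
In case it helps, here is a tiny snippet, separate from the assignment code, that reproduces the same ValueError with those shapes (so my guess, and it is only a guess, is that an in-place numpy += on a (3, 4) array is what breaks):

import numpy as np

logits = np.zeros((3, 4), dtype=np.float32)   # same shape as scaled_attention_logits
mask = np.ones((1, 3, 4), dtype=np.float32)   # same shape as the mask

# numpy refuses to grow the (3, 4) output to the (1, 3, 4) broadcast shape in place
logits += (1.0 - mask) * -1e9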

For starters, we should be trying to use TF here. Notice in your first line that you’re mixing TF functions (tf.transpose) with NumPy functions (np.matmul). I think that is not a good idea. Why not use tf.matmul there?

Then we see np.sqrt a couple of lines later. This spells trouble, I fear …

If you consistently use TF everywhere, then type coercion and broadcasting work more smoothly.
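
As a tiny, shape-only illustration (random tensors, not the assignment code) of keeping everything in TF:

import tensorflow as tf

q = tf.random.uniform((3, 4))
k = tf.random.uniform((4, 4))

matmul_qk = tf.matmul(q, k, transpose_b=True)    # tf.matmul instead of np.matmul
scaled = matmul_qk / tf.math.sqrt(4.0)           # tf.math.sqrt instead of np.sqrt
print(type(matmul_qk))                           # EagerTensor, not np.ndarray
print(scaled.shape)                              # (3, 4)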

In addition to that, the dimension of k that you select for dk is different from the one I used.
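
For reference, the scaling in the original paper divides by the square root of the depth of k, i.e. its last dimension. A minimal sketch of just that step (variable names are only for illustration, and matmul_qk is assumed to hold the q·kᵀ product):

dk = tf.cast(tf.shape(k)[-1], tf.float32)                # last dimension of k, 4.0 in these tests
scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)   # stay in TF, no np.sqrt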

I added a lot of print statements to my code and then ran the test cell for scaled_dot_product_attention. Here’s what I see (note that there are several separate test cases):

q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
attention_weights.shape (3, 4)
attention_weights =
[[0.2589478  0.42693272 0.15705977 0.15705977]
 [0.2772748  0.2772748  0.2772748  0.16817567]
 [0.33620113 0.33620113 0.12368149 0.2039163 ]]
sum(attention_weights(axis = -1)) =
[[1.0000001]
 [1.       ]
 [1.       ]]
output.shape (3, 2)
output =
[[0.74105227 0.15705977]
 [0.7227253  0.16817567]
 [0.6637989  0.2039163 ]]
q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
mask.shape (1, 3, 4)
applying mask =
[[[1 1 0 1]
  [1 1 0 1]
  [1 1 0 1]]]
attention_weights.shape (1, 3, 4)
attention_weights =
[[[0.3071959  0.5064804  0.         0.18632373]
  [0.38365173 0.38365173 0.         0.23269653]
  [0.38365173 0.38365173 0.         0.23269653]]]
sum(attention_weights(axis = -1)) =
[[[1.]
  [1.]
  [1.]]]
output.shape (1, 3, 2)
output =
[[[0.6928041  0.18632373]
  [0.61634827 0.23269653]
  [0.61634827 0.23269653]]]
All tests passed
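
Notice in that second test case how the (3, 4) logits pick up the leading dimension of the (1, 3, 4) mask: the plain (non in-place) TF add broadcasts them to (1, 3, 4), the softmax keeps that shape, and the matmul with the (4, 2) v then gives the (1, 3, 2) output. A shape-only sketch, with random values and a dummy all-ones mask, not the assignment code:

import tensorflow as tf

logits = tf.random.uniform((3, 4))       # stands in for scaled_attention_logits
mask = tf.ones((1, 3, 4))                # same shape as the mask in the test case
v = tf.random.uniform((4, 2))

masked = logits + (1.0 - mask) * -1e9    # broadcasts (3, 4) + (1, 3, 4) -> (1, 3, 4)
weights = tf.nn.softmax(masked, axis=-1) # still (1, 3, 4)
output = tf.matmul(weights, v)           # (1, 3, 4) @ (4, 2) -> (1, 3, 2)
print(masked.shape, weights.shape, output.shape)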

Seems like I should improve my skills with TensorFlow functions. Thanks, that helped!