I'm assuming all this trouble comes about because the code I wrote here was not the expected one:
{moderator edit - solution code removed}
Anyway, I kept getting an error message like this:
ValueError: non-broadcastable output operand with shape (3,4) doesn’t match the broadcast shape (1,3,4)
Probably scaled_attention_logits doesn't broadcast in the step below.
{moderator edit - solution code removed}
Anyway, the test passed after that change, but it seems like no one else in the forum did it this way.
So I'm wondering why it didn't broadcast without the [0,:,:] indexing in the first place,
and I would really appreciate it if some generous and wise person could help me understand what's going on with the dimensions of k, q, v, mask, dk and scaled_attention_logits.
For starters, we should be using TF here. Notice in your first line that you're mixing TF functions (tf.transpose) with numpy functions (np.matmul). I think that is not a good idea. Why not use tf.matmul there?
Then we see np.sqrt a couple of lines later. This spells trouble, I fear …
If you consistently use TF everywhere, then I think the type coercion and broadcasting issues work more smoothly.
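To see what I mean about staying inside TF, here is a tiny side experiment (not the assignment code, and assuming TF 2.x eager mode): the numpy call silently hands you back a plain numpy array, while the tf call keeps everything as a tensor.

```python
import numpy as np
import tensorflow as tf

q = tf.constant([[1.0, 0.0], [0.0, 1.0]])
k = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# np.matmul converts the EagerTensors to numpy arrays and returns an ndarray,
# so the rest of the computation silently leaves TensorFlow.
print(type(np.matmul(q, tf.transpose(k))))      # <class 'numpy.ndarray'>

# tf.matmul keeps the result as a tensor, so dtype handling and broadcasting
# stay consistent through the later steps (scaling, masking, softmax).
print(type(tf.matmul(q, k, transpose_b=True)))  # EagerTensor
```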
In addition to that, the dimension of k you select for dk is different from the one I used.
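For reference, in the usual formulation softmax(q·kᵀ / sqrt(dk))·v, dk is the size of the last axis of k (the key depth), not the number of rows. A small sketch of just that step (illustrative, not the graded solution):

```python
import tensorflow as tf

q = tf.zeros((3, 4))
k = tf.zeros((4, 4))

dk = tf.cast(tf.shape(k)[-1], tf.float32)                          # 4.0: last axis of k
scaled_attention_logits = tf.matmul(q, k, transpose_b=True) / tf.sqrt(dk)
print(dk.numpy(), scaled_attention_logits.shape)                   # 4.0 (3, 4)
```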
I added a lot of print statements to my code and then ran the test cell for scaled_dot_product_attention. Here's what I see (note that there are several separate test cases):
q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk =
[[2. 3. 1. 1.]
[2. 2. 2. 1.]
[2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
attention_weights.shape (3, 4)
attention_weights =
[[0.2589478 0.42693272 0.15705977 0.15705977]
[0.2772748 0.2772748 0.2772748 0.16817567]
[0.33620113 0.33620113 0.12368149 0.2039163 ]]
sum(attention_weights(axis = -1)) =
[[1.0000001]
[1. ]
[1. ]]
output.shape (3, 2)
output =
[[0.74105227 0.15705977]
[0.7227253 0.16817567]
[0.6637989 0.2039163 ]]
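Just to spell out the shape arithmetic in that first (unmasked) case before the masked one, this is the kind of flow the shapes above imply (illustrative only, not the assignment code):

```python
import tensorflow as tf

q = tf.zeros((3, 4))   # (seq_len_q, depth)
k = tf.zeros((4, 4))   # (seq_len_k, depth)
v = tf.zeros((4, 2))   # (seq_len_k, depth_v)

logits  = tf.matmul(q, k, transpose_b=True)       # (3, 4): one score per (query, key) pair
weights = tf.nn.softmax(logits, axis=-1)          # (3, 4): each row sums to 1
output  = tf.matmul(weights, v)                   # (3, 2): weighted sum of the rows of v
print(logits.shape, weights.shape, output.shape)  # (3, 4) (3, 4) (3, 2)
```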
q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk =
[[2. 3. 1. 1.]
[2. 2. 2. 1.]
[2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
mask.shape (1, 3, 4)
applying mask =
[[[1 1 0 1]
[1 1 0 1]
[1 1 0 1]]]
attention_weights.shape (1, 3, 4)
attention_weights =
[[[0.3071959 0.5064804 0. 0.18632373]
[0.38365173 0.38365173 0. 0.23269653]
[0.38365173 0.38365173 0. 0.23269653]]]
sum(attention_weights(axis = -1)) =
[[[1.]
[1.]
[1.]]]
output.shape (1, 3, 2)
output =
[[[0.6928041 0.18632373]
[0.61634827 0.23269653]
[0.61634827 0.23269653]]]
All tests passed
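One last note on your original error: with the mask shaped (1, 3, 4) and the logits shaped (3, 4), adding the two broadcasts the result up to (1, 3, 4). That larger result cannot be written back into a (3, 4) array in place, which is exactly the "non-broadcastable output operand" message numpy gives. An out-of-place add (or doing the whole thing in TF) just produces a (1, 3, 4) tensor, and softmax over the last axis handles it without any [0,:,:] indexing. A minimal reproduction (shapes copied from the test output above; this only demonstrates the broadcasting, not the actual masking logic):

```python
import numpy as np
import tensorflow as tf

logits = np.zeros((3, 4), dtype=np.float32)
mask   = np.ones((1, 3, 4), dtype=np.float32)

# Out-of-place addition is fine: the (3, 4) logits broadcast against
# the (1, 3, 4) mask and the result takes the larger shape.
print((logits + mask).shape)          # (1, 3, 4)

# In-place addition fails, because the (1, 3, 4) result cannot be
# written back into the (3, 4) output operand:
try:
    logits += mask
except ValueError as e:
    print(e)   # non-broadcastable output operand with shape (3,4) ...

# With TF ops the same addition simply yields a (1, 3, 4) tensor,
# and softmax over the last axis handles it directly.
print(tf.nn.softmax(tf.constant(logits) + tf.constant(mask), axis=-1).shape)
```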
Seems like I should improve my skills with TensorFlow functions. Thanks, it helped!