I am lost here. If the angle calculation is angle(pos, i, d) = pos / (10000 * 2i/d) and the first couple of i's are zero, then you have division by zero. If i is a vector, then division of pos/i is not really defined, right?
Yup, there's a big difference between a * b and a ** b.
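To make that concrete, here is a small sketch (variable names are just illustrative): with the exponent, i = 0 gives a denominator of 10000 ** 0 == 1, so there is no division by zero; it is only the multiplication reading that blows up.

```python
import numpy as np

d = 16
i = np.array([0, 0, 1, 1])  # the first couple of i's are zero
pos = 3

# With the exponent (**): 10000 ** 0 == 1, so the denominator is never zero
angle_exp = pos / np.power(10000, 2 * i / d)

# With multiplication (*): 10000 * 0 == 0, which is the division-by-zero worry
denominator = 10000 * 2 * i / d

print(angle_exp)        # all finite values
print(denominator[:2])  # [0. 0.] -- the source of the confusion
```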
You can do "elementwise" division with a vector in the denominator, the same way there is elementwise multiplication.
That is a big revelation! You have to admit, it kind of looks like multiplication rather than an exponent. Now it works. Thank you!
Here's a "play" cell I added to my version of that notebook to show how elementwise divide works and "plays well" with broadcasting:
A = np.arange(4)[:, np.newaxis]
B = 2 * np.ones((1,6))
print(f"A.shape {A.shape}")
print(f"A {A}")
print(f"B.shape {B.shape}")
print(f"B {B}")
quotient = A / B
print(f"quotient.shape {quotient.shape}")
print(f"quotient {quotient}")
A.shape (4, 1)
A [[0]
[1]
[2]
[3]]
B.shape (1, 6)
B [[2. 2. 2. 2. 2. 2.]]
quotient.shape (4, 6)
quotient [[0. 0. 0. 0. 0. 0. ]
[0.5 0.5 0.5 0.5 0.5 0.5]
[1. 1. 1. 1. 1. 1. ]
[1.5 1.5 1.5 1.5 1.5 1.5]]
Cool! Here is another question on the next exercise. I don't get how to do the sin/cos on both sides of the equation for angle_rads[:, 0::2] = ??? I tried to use the same [:, 0::2] but apply sin and I get the following error: could not broadcast input array from shape (8,16) into shape (8,8)
I think my angles are ok:
pos = [[0]
[1]
[2]
[3]
[4]
[5]
[6]
[7]]
k = [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
i = [0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7]
angle = [[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
[1.00000000e+00 1.00000000e+00 3.16227766e-01 3.16227766e-01
1.00000000e-01 1.00000000e-01 3.16227766e-02 3.16227766e-02
1.00000000e-02 1.00000000e-02 3.16227766e-03 3.16227766e-03
1.00000000e-03 1.00000000e-03 3.16227766e-04 3.16227766e-04]
[2.00000000e+00 2.00000000e+00 6.32455532e-01 6.32455532e-01
2.00000000e-01 2.00000000e-01 6.32455532e-02 6.32455532e-02
2.00000000e-02 2.00000000e-02 6.32455532e-03 6.32455532e-03
2.00000000e-03 2.00000000e-03 6.32455532e-04 6.32455532e-04]
[3.00000000e+00 3.00000000e+00 9.48683298e-01 9.48683298e-01
3.00000000e-01 3.00000000e-01 9.48683298e-02 9.48683298e-02
3.00000000e-02 3.00000000e-02 9.48683298e-03 9.48683298e-03
3.00000000e-03 3.00000000e-03 9.48683298e-04 9.48683298e-04]
[4.00000000e+00 4.00000000e+00 1.26491106e+00 1.26491106e+00
4.00000000e-01 4.00000000e-01 1.26491106e-01 1.26491106e-01
4.00000000e-02 4.00000000e-02 1.26491106e-02 1.26491106e-02
4.00000000e-03 4.00000000e-03 1.26491106e-03 1.26491106e-03]
[5.00000000e+00 5.00000000e+00 1.58113883e+00 1.58113883e+00
5.00000000e-01 5.00000000e-01 1.58113883e-01 1.58113883e-01
5.00000000e-02 5.00000000e-02 1.58113883e-02 1.58113883e-02
5.00000000e-03 5.00000000e-03 1.58113883e-03 1.58113883e-03]
[6.00000000e+00 6.00000000e+00 1.89736660e+00 1.89736660e+00
6.00000000e-01 6.00000000e-01 1.89736660e-01 1.89736660e-01
6.00000000e-02 6.00000000e-02 1.89736660e-02 1.89736660e-02
6.00000000e-03 6.00000000e-03 1.89736660e-03 1.89736660e-03]
[7.00000000e+00 7.00000000e+00 2.21359436e+00 2.21359436e+00
7.00000000e-01 7.00000000e-01 2.21359436e-01 2.21359436e-01
7.00000000e-02 7.00000000e-02 2.21359436e-02 2.21359436e-02
7.00000000e-03 7.00000000e-03 2.21359436e-03 2.21359436e-03]]
Yes, you are taking a tensor and applying sin to it "elementwise" to get another tensor of the same shape. You must not have actually used the same tensor indices on both sides.
Notice that your angles_rad tensor is 8 x 16, so somehow you must have used the whole thing on the RHS instead of slicing it.
Maybe this is a case in which you typed new code in the cell and then just called it again without actually clicking "Shift-Enter" on the modified cell. So WYRINWYS (What You Ran Is Not What You See). Try:
Kernel -> Restart and Clear Output
Cell -> Run All Above
and see if that changes the behavior.
Here's a post that talks in a bit more detail about this phenomenon.
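For reference, the shape error goes away when the same slice appears on both sides of the assignment. A sketch, assuming your array is named angle_rads as in your error message:

```python
import numpy as np

# Stand-in for the (8, 16) angle_rads array from the notebook
angle_rads = np.arange(8 * 16, dtype=float).reshape(8, 16) / 100.0

# Same slice on both sides, so the shapes match: (8, 8) = (8, 8)
angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # sin on even columns
angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # cos on odd columns

print(angle_rads.shape)  # still (8, 16)
```

Using np.sin(angle_rads) on the right-hand side would produce an (8, 16) array, and assigning that into the (8, 8) slice gives exactly the broadcast error you saw.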
Yes, that was it. I think I am going to just run all above every so often just to keep things current.
BTW creating the pos column vector and k row vector was a new challenge for me. Not much guidance on that but I guess they are trying to wean us from the hand holding. I did some research and ended up using np.arange and .reshape. Was that the reasonable approach?
An alternative method is using np.arange() with np.newaxis.
This is shown in the test cell just prior to Section 1.2.
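For what it's worth, the two approaches come out the same; a quick sketch comparing them (the names pos_reshape and pos_newaxis are just illustrative):

```python
import numpy as np

position = 8

# Approach 1: build a column vector with reshape
pos_reshape = np.arange(position).reshape(position, 1)

# Approach 2: add the axis while indexing with np.newaxis
pos_newaxis = np.arange(position)[:, np.newaxis]

print(pos_reshape.shape, pos_newaxis.shape)  # (8, 1) (8, 1)
print(np.array_equal(pos_reshape, pos_newaxis))  # True
```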
Yes, this notebook is the culmination of 5 courses and it's the one that does the least hand-holding of any so far. Quite a radical step from anything we've seen up to this point, so our expectations need to be recalibrated a bit.
It sounds like you came up with a perfectly reasonable solution, but as Tom says, they actually gave you a "worked example" in the previous test cell:
from public_tests import *
get_angles_test(get_angles)
# Example
position = 4
d_model = 8
pos_m = np.arange(position)[:, np.newaxis]
dims = np.arange(d_model)[np.newaxis, :]
get_angles(pos_m, dims, d_model)
It's not necessary on general principles to understand the test cells, as long as your code passes, but we're here to learn and you never know what wisdom may lurk even in the parts you don't technically need to read.
I stand corrected!
I haven't been looking that closely at the test cells, but I will do so in the future.
In exercise 3 I am getting the wrong weights. I don't understand the instruction: Multiply (1. - mask) by -1e9 before applying the softmax. Here are my results:
q,k,v dims = (3, 4) (4, 4) (4, 2)
matmul_qk dim = (3, 4)
dk = 4
scaled_att_logits = tf.Tensor(
[[1. 1.5 0.5 0.5]
[1. 1. 1. 0.5]
[1. 1. 0. 0.5]], shape=(3, 4), dtype=float32)
att weights = tf.Tensor(
[[0.2589478 0.42693272 0.15705977 0.15705977]
[0.2772748 0.2772748 0.2772748 0.16817567]
[0.33620113 0.33620113 0.12368149 0.2039163 ]], shape=(3, 4), dtype=float32)
output = tf.Tensor(
[[0.74105227 0.15705977]
[0.7227253 0.16817567]
[0.6637989 0.2039163 ]], shape=(3, 2), dtype=float32)
q,k,v dims = (3, 4) (4, 4) (4, 2)
matmul_qk dim = (3, 4)
dk = 4
scaled_att_logits = tf.Tensor(
[[[2. 2.5 0.5 1.5]
[2. 2. 1. 1.5]
[2. 2. 0. 1.5]]], shape=(1, 3, 4), dtype=float32)
att weights = tf.Tensor(
[[[0.33333334 0.45186275 0.3071959 0.33333334]
[0.33333334 0.27406862 0.5064804 0.33333334]
[0.33333334 0.27406862 0.18632373 0.33333334]]], shape=(1, 3, 4), dtype=float32)
output = tf.Tensor(
[[[1.092392 0.33333334]
[1.1138824 0.33333334]
[0.7937257 0.33333334]]], shape=(1, 3, 2), dtype=float32)
(1 - mask) "reverses" the mask: the entries that are 0 in the mask become 1, and vice versa. Multiplying that by -10^9 gives a huge negative number, which when fed to softmax produces an output very close to 0. So if a mask is present, you add (1 - mask) * -1e9 to the scaled attention logits before the softmax. In other words, that has the effect of eliminating all entries not selected by the mask.
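In numpy terms (toy values, not the actual test case), that step looks like this:

```python
import numpy as np

# Toy logits and mask; a 0 in the mask marks a position to block
scaled_logits = np.array([[2.0, 3.0, 1.0, 1.0]])
mask = np.array([[1.0, 1.0, 0.0, 1.0]])

# (1 - mask) is 1 exactly at the blocked positions; * -1e9 makes those
# logits hugely negative before the softmax
masked_logits = scaled_logits + (1.0 - mask) * -1e9

# Numerically stable softmax over the last axis
shifted = masked_logits - masked_logits.max(axis=-1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
print(weights.round(4))  # the blocked position gets ~0 weight
```

The remaining weights renormalize among the unmasked positions, which is why the rows still sum to 1 in the printouts below.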
Here are my equivalent print outputs from that test cell:
q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk.shape (3, 4)
matmul_qk =
[[2. 3. 1. 1.]
[2. 2. 2. 1.]
[2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
attention_weights.shape (3, 4)
attention_weights =
[[0.2589478 0.42693272 0.15705977 0.15705977]
[0.2772748 0.2772748 0.2772748 0.16817567]
[0.33620113 0.33620113 0.12368149 0.2039163 ]]
sum(attention_weights(axis = -1)) =
[[1.0000001]
[1. ]
[1. ]]
output.shape (3, 2)
output =
[[0.74105227 0.15705977]
[0.7227253 0.16817567]
[0.6637989 0.2039163 ]]
q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk.shape (3, 4)
matmul_qk =
[[2. 3. 1. 1.]
[2. 2. 2. 1.]
[2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
mask.shape (1, 3, 4)
applying mask =
[[[1 1 0 1]
[1 1 0 1]
[1 1 0 1]]]
attention_weights.shape (1, 3, 4)
attention_weights =
[[[0.3071959 0.5064804 0. 0.18632373]
[0.38365173 0.38365173 0. 0.23269653]
[0.38365173 0.38365173 0. 0.23269653]]]
sum(attention_weights(axis = -1)) =
[[[1.]
[1.]
[1.]]]
output.shape (1, 3, 2)
output =
[[[0.6928041 0.18632373]
[0.61634827 0.23269653]
[0.61634827 0.23269653]]]
All tests passed
Notice that your weights and outputs agree with mine for the first test case, but that's the one with no mask. The second test, with the mask, gives different results, so it's clear where to look.
…
