Exercise 1 angle calculation

I am lost here. If the angle calculation is θ(pos, i, d) = pos/(10000*2i/d) and the first couple of i's are zero, then you have division by zero. If i is a vector, then the division pos/i is not really defined, right?

Youโ€™re missing part of the equation, โ€˜iโ€™ is used as an exponent, not multiplication.

1 Like

Yup, thereโ€™s a big difference between a * b and a ** b. :nerd_face:

You can do โ€œelementwiseโ€ division with a vector in the denominator, the same way there is elementwise multiplication.

:joy: That is a big revelation! You have to admit, it kind of looks like multiplication rather than an exponent. Now it works. Thank you!

Hereโ€™s a โ€œplayโ€ cell I added to my version of that notebook to show how elementwise divide works and โ€œplays wellโ€ with broadcasting:

A = np.arange(4)[:, np.newaxis]
B = 2 * np.ones((1,6))
print(f"A.shape {A.shape}")
print(f"A {A}")
print(f"B.shape {B.shape}")
print(f"B {B}")
quotient = A / B
print(f"quotient.shape {quotient.shape}")
print(f"quotient {quotient}")
A.shape (4, 1)
A [[0]
 [1]
 [2]
 [3]]
B.shape (1, 6)
B [[2. 2. 2. 2. 2. 2.]]
quotient.shape (4, 6)
quotient [[0.  0.  0.  0.  0.  0. ]
 [0.5 0.5 0.5 0.5 0.5 0.5]
 [1.  1.  1.  1.  1.  1. ]
 [1.5 1.5 1.5 1.5 1.5 1.5]]

Cool! Here is another question, on the next exercise. I don’t get how to apply sin/cos on both sides of the assignment angle_rads[:, 0::2] = ??? I tried to use the same [:, 0::2] slice and apply sin, but I get the following error: could not broadcast input array from shape (8,16) into shape (8,8)

I think my angles are ok:

pos =  [[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]]
k =  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
i =  [0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7]
angle =  [[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.00000000e+00 1.00000000e+00 3.16227766e-01 3.16227766e-01
  1.00000000e-01 1.00000000e-01 3.16227766e-02 3.16227766e-02
  1.00000000e-02 1.00000000e-02 3.16227766e-03 3.16227766e-03
  1.00000000e-03 1.00000000e-03 3.16227766e-04 3.16227766e-04]
 [2.00000000e+00 2.00000000e+00 6.32455532e-01 6.32455532e-01
  2.00000000e-01 2.00000000e-01 6.32455532e-02 6.32455532e-02
  2.00000000e-02 2.00000000e-02 6.32455532e-03 6.32455532e-03
  2.00000000e-03 2.00000000e-03 6.32455532e-04 6.32455532e-04]
 [3.00000000e+00 3.00000000e+00 9.48683298e-01 9.48683298e-01
  3.00000000e-01 3.00000000e-01 9.48683298e-02 9.48683298e-02
  3.00000000e-02 3.00000000e-02 9.48683298e-03 9.48683298e-03
  3.00000000e-03 3.00000000e-03 9.48683298e-04 9.48683298e-04]
 [4.00000000e+00 4.00000000e+00 1.26491106e+00 1.26491106e+00
  4.00000000e-01 4.00000000e-01 1.26491106e-01 1.26491106e-01
  4.00000000e-02 4.00000000e-02 1.26491106e-02 1.26491106e-02
  4.00000000e-03 4.00000000e-03 1.26491106e-03 1.26491106e-03]
 [5.00000000e+00 5.00000000e+00 1.58113883e+00 1.58113883e+00
  5.00000000e-01 5.00000000e-01 1.58113883e-01 1.58113883e-01
  5.00000000e-02 5.00000000e-02 1.58113883e-02 1.58113883e-02
  5.00000000e-03 5.00000000e-03 1.58113883e-03 1.58113883e-03]
 [6.00000000e+00 6.00000000e+00 1.89736660e+00 1.89736660e+00
  6.00000000e-01 6.00000000e-01 1.89736660e-01 1.89736660e-01
  6.00000000e-02 6.00000000e-02 1.89736660e-02 1.89736660e-02
  6.00000000e-03 6.00000000e-03 1.89736660e-03 1.89736660e-03]
 [7.00000000e+00 7.00000000e+00 2.21359436e+00 2.21359436e+00
  7.00000000e-01 7.00000000e-01 2.21359436e-01 2.21359436e-01
  7.00000000e-02 7.00000000e-02 2.21359436e-02 2.21359436e-02
  7.00000000e-03 7.00000000e-03 2.21359436e-03 2.21359436e-03]]

Yes, you are taking a tensor and applying sin to it โ€œelementwiseโ€ to get another tensor of the same shape. You must not have actually used the same tensor indices on both sides.

Notice that your angle_rads tensor is 8 x 16, so somehow you must have used the whole thing on the RHS instead of slicing it.
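Here’s a minimal sketch of the idea (the angle values are arbitrary, not the assignment’s): use the same slice on both sides so the shapes match.

```python
import numpy as np

# Toy angle matrix, shape (2, 4); values are made up
angle_rads = np.array([[0.0, 0.5, 1.0, 1.5],
                       [2.0, 2.5, 3.0, 3.5]])

# Same slice on both sides: an (2, 2) slice gets a (2, 2) result, so no
# broadcast error
angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # even columns -> sin
angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # odd columns  -> cos
```

If you apply np.sin to the whole array on the RHS instead, you’re trying to stuff an (2, 4) result into an (2, 2) slot, which is exactly the broadcast error you saw.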

Maybe this is a case in which you typed new code into the cell and then called the function again without actually pressing “Shift-Enter” on the modified cell. So WYRINWYS (What You Ran Is Not What You See). Try:

Kernel -> Restart and Clear Output
Cell -> Run All Above

and see if that changes the behavior.

Hereโ€™s a post that talks in a bit more detail about this phenomenon.

Yes, that was it. I think I am going to just run all above every so often just to keep things current.

BTW creating the pos column vector and k row vector was a new challenge for me. Not much guidance on that, but I guess they are trying to wean us off the hand-holding. I did some research and ended up using np.arange and .reshape. Was that a reasonable approach?

An alternative method is using np.arange() and np.newaxis.

This is shown in the test cell just prior to Section 1.2.
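For what it’s worth, the two approaches produce identical arrays, so either is fine. A quick sketch (n = 8 is just an example size):

```python
import numpy as np

n = 8
# Two equivalent ways to build an (n, 1) column vector:
pos_reshape = np.arange(n).reshape(n, 1)       # explicit reshape
pos_newaxis = np.arange(n)[:, np.newaxis]      # insert a new axis while indexing

assert pos_reshape.shape == pos_newaxis.shape == (n, 1)
assert (pos_reshape == pos_newaxis).all()
```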

1 Like

Yes, this notebook is the culmination of 5 courses and itโ€™s the one that does the least hand-holding of any so far. Quite a radical step from anything weโ€™ve seen up to this point, so our expectations need to be recalibrated a bit.

It sounds like you came up with a perfectly reasonable solution, but as Tom says, they actually gave you a โ€œworked exampleโ€ in the previous test cell:

from public_tests import *

get_angles_test(get_angles)

# Example
position = 4
d_model = 8
pos_m = np.arange(position)[:, np.newaxis]
dims = np.arange(d_model)[np.newaxis, :]
get_angles(pos_m, dims, d_model)

Itโ€™s not necessary on general principles to understand the test cells, as long as your code passes, but weโ€™re here to learn and you never know what wisdom may lurk even in the parts you donโ€™t technically need to read. :nerd_face:

I stand corrected!

1 Like

I havenโ€™t been looking that closely at the test cells, but I will be do so in the future.

1 Like

In exercise 3 I am getting the wrong weights. I donโ€™t understand the instruction: Multiply (1. - mask) by -1e9 before applying the softmax. Here are my results:

q,k,v dims =  (3, 4) (4, 4) (4, 2)
matmul_qk dim =  (3, 4)
dk =  4
scaled_att_logits =  tf.Tensor(
[[1.  1.5 0.5 0.5]
 [1.  1.  1.  0.5]
 [1.  1.  0.  0.5]], shape=(3, 4), dtype=float32)
att weights =  tf.Tensor(
[[0.2589478  0.42693272 0.15705977 0.15705977]
 [0.2772748  0.2772748  0.2772748  0.16817567]
 [0.33620113 0.33620113 0.12368149 0.2039163 ]], shape=(3, 4), dtype=float32)
output =  tf.Tensor(
[[0.74105227 0.15705977]
 [0.7227253  0.16817567]
 [0.6637989  0.2039163 ]], shape=(3, 2), dtype=float32)
q,k,v dims =  (3, 4) (4, 4) (4, 2)
matmul_qk dim =  (3, 4)
dk =  4
scaled_att_logits =  tf.Tensor(
[[[2.  2.5 0.5 1.5]
  [2.  2.  1.  1.5]
  [2.  2.  0.  1.5]]], shape=(1, 3, 4), dtype=float32)
att weights =  tf.Tensor(
[[[0.33333334 0.45186275 0.3071959  0.33333334]
  [0.33333334 0.27406862 0.5064804  0.33333334]
  [0.33333334 0.27406862 0.18632373 0.33333334]]], shape=(1, 3, 4), dtype=float32)
output =  tf.Tensor(
[[[1.092392   0.33333334]
  [1.1138824  0.33333334]
  [0.7937257  0.33333334]]], shape=(1, 3, 2), dtype=float32)

(1 - mask) โ€œreversesโ€ the mask to select the entries that are 0 in the mask and convert them to 1. If you then multiply by -10^9, it gives you a number which when fed to softmax will give an output that is very very close to 0. Then you add those values to the scaled attention logit values if the mask is present. In other words, that will have the effect of eliminating all entries not selected by the mask.

Here are my equivalent print outputs from that test cell:

q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk.shape (3, 4)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
attention_weights.shape (3, 4)
attention_weights =
[[0.2589478  0.42693272 0.15705977 0.15705977]
 [0.2772748  0.2772748  0.2772748  0.16817567]
 [0.33620113 0.33620113 0.12368149 0.2039163 ]]
sum(attention_weights(axis = -1)) =
[[1.0000001]
 [1.       ]
 [1.       ]]
output.shape (3, 2)
output =
[[0.74105227 0.15705977]
 [0.7227253  0.16817567]
 [0.6637989  0.2039163 ]]
q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk.shape (3, 4)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
mask.shape (1, 3, 4)
applying mask =
[[[1 1 0 1]
  [1 1 0 1]
  [1 1 0 1]]]
attention_weights.shape (1, 3, 4)
attention_weights =
[[[0.3071959  0.5064804  0.         0.18632373]
  [0.38365173 0.38365173 0.         0.23269653]
  [0.38365173 0.38365173 0.         0.23269653]]]
sum(attention_weights(axis = -1)) =
[[[1.]
  [1.]
  [1.]]]
output.shape (1, 3, 2)
output =
[[[0.6928041  0.18632373]
  [0.61634827 0.23269653]
  [0.61634827 0.23269653]]]
All tests passed

Notice that your weights and outputs agree with mine for the first test case, but thatโ€™s the one with no mask. But then the second test with the mask gives different results, so itโ€™s clear where to look :nerd_face: โ€ฆ