Exercise 1 angle calculation

I am lost here. If the angle calculation is θ(pos, i, d) = pos/(10000*2i/d) and the first couple of i's are zero, then you have division by zero. If i is a vector, then the division pos/i is not really defined, right?

Youโ€™re missing part of the equation, โ€˜iโ€™ is used as an exponent, not multiplication.

1 Like

Yup, thereโ€™s a big difference between a * b and a ** b. :nerd_face:

You can do โ€œelementwiseโ€ division with a vector in the denominator, the same way there is elementwise multiplication.

:joy: That is a big revelation! You have to admit, it kind of looks like multiplication rather than an exponent. Now it works. Thank you!

Hereโ€™s a โ€œplayโ€ cell I added to my version of that notebook to show how elementwise divide works and โ€œplays wellโ€ with broadcasting:

A = np.arange(4)[:, np.newaxis]
B = 2 * np.ones((1,6))
print(f"A.shape {A.shape}")
print(f"A {A}")
print(f"B.shape {B.shape}")
print(f"B {B}")
quotient = A / B
print(f"quotient.shape {quotient.shape}")
print(f"quotient {quotient}")
A.shape (4, 1)
A [[0]
 [1]
 [2]
 [3]]
B.shape (1, 6)
B [[2. 2. 2. 2. 2. 2.]]
quotient.shape (4, 6)
quotient [[0.  0.  0.  0.  0.  0. ]
 [0.5 0.5 0.5 0.5 0.5 0.5]
 [1.  1.  1.  1.  1.  1. ]
 [1.5 1.5 1.5 1.5 1.5 1.5]]

Cool! Here is another question, on the next exercise. I don’t get how to apply sin/cos on both sides of the assignment angle_rads[:, 0::2] = ??? I tried to use the same [:, 0::2] slice and apply sin, but I get the following error: could not broadcast input array from shape (8,16) into shape (8,8)

I think my angles are ok:

pos =  [[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]]
k =  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
i =  [0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7]
angle =  [[0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.00000000e+00 1.00000000e+00 3.16227766e-01 3.16227766e-01
  1.00000000e-01 1.00000000e-01 3.16227766e-02 3.16227766e-02
  1.00000000e-02 1.00000000e-02 3.16227766e-03 3.16227766e-03
  1.00000000e-03 1.00000000e-03 3.16227766e-04 3.16227766e-04]
 [2.00000000e+00 2.00000000e+00 6.32455532e-01 6.32455532e-01
  2.00000000e-01 2.00000000e-01 6.32455532e-02 6.32455532e-02
  2.00000000e-02 2.00000000e-02 6.32455532e-03 6.32455532e-03
  2.00000000e-03 2.00000000e-03 6.32455532e-04 6.32455532e-04]
 [3.00000000e+00 3.00000000e+00 9.48683298e-01 9.48683298e-01
  3.00000000e-01 3.00000000e-01 9.48683298e-02 9.48683298e-02
  3.00000000e-02 3.00000000e-02 9.48683298e-03 9.48683298e-03
  3.00000000e-03 3.00000000e-03 9.48683298e-04 9.48683298e-04]
 [4.00000000e+00 4.00000000e+00 1.26491106e+00 1.26491106e+00
  4.00000000e-01 4.00000000e-01 1.26491106e-01 1.26491106e-01
  4.00000000e-02 4.00000000e-02 1.26491106e-02 1.26491106e-02
  4.00000000e-03 4.00000000e-03 1.26491106e-03 1.26491106e-03]
 [5.00000000e+00 5.00000000e+00 1.58113883e+00 1.58113883e+00
  5.00000000e-01 5.00000000e-01 1.58113883e-01 1.58113883e-01
  5.00000000e-02 5.00000000e-02 1.58113883e-02 1.58113883e-02
  5.00000000e-03 5.00000000e-03 1.58113883e-03 1.58113883e-03]
 [6.00000000e+00 6.00000000e+00 1.89736660e+00 1.89736660e+00
  6.00000000e-01 6.00000000e-01 1.89736660e-01 1.89736660e-01
  6.00000000e-02 6.00000000e-02 1.89736660e-02 1.89736660e-02
  6.00000000e-03 6.00000000e-03 1.89736660e-03 1.89736660e-03]
 [7.00000000e+00 7.00000000e+00 2.21359436e+00 2.21359436e+00
  7.00000000e-01 7.00000000e-01 2.21359436e-01 2.21359436e-01
  7.00000000e-02 7.00000000e-02 2.21359436e-02 2.21359436e-02
  7.00000000e-03 7.00000000e-03 2.21359436e-03 2.21359436e-03]]

Yes, you are taking a tensor and applying sin to it โ€œelementwiseโ€ to get another tensor of the same shape. You must not have actually used the same tensor indices on both sides.

Notice that your angle_rads tensor is 8 x 16, so somehow you must have used the whole thing on the RHS instead of slicing it.
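Here’s a minimal sketch of the idea (the angle values are arbitrary, not the assignment’s): use the same slice on both sides so the shapes match.

```python
import numpy as np

# Toy angle matrix, shape (2, 4); values are made up
angle_rads = np.array([[0.0, 0.5, 1.0, 1.5],
                       [2.0, 2.5, 3.0, 3.5]])

# Same slice on both sides: an (2, 2) slice gets a (2, 2) result, so no
# broadcast error
angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # even columns -> sin
angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # odd columns  -> cos
```

If you apply np.sin to the whole array on the RHS instead, you’re trying to stuff an (2, 4) result into an (2, 2) slot, which is exactly the broadcast error you saw.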

Maybe this is a case in which you typed new code into the cell and then called the function again without actually pressing “Shift-Enter” on the modified cell. So WYRINWYS (What You Ran Is Not What You See). Try:

Kernel -> Restart and Clear Output
Cell -> Run All Above

and see if that changes the behavior.

Hereโ€™s a post that talks in a bit more detail about this phenomenon.

Yes, that was it. I think I am going to just run all above every so often just to keep things current.

BTW creating the pos column vector and k row vector was a new challenge for me. Not much guidance on that, but I guess they are trying to wean us off the hand-holding. I did some research and ended up using np.arange and .reshape. Was that a reasonable approach?

An alternative method is using np.arange() and np.newaxis.

This is shown in the test cell just prior to Section 1.2.
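For what it’s worth, the two approaches produce identical arrays, so either is fine. A quick sketch (n = 8 is just an example size):

```python
import numpy as np

n = 8
# Two equivalent ways to build an (n, 1) column vector:
pos_reshape = np.arange(n).reshape(n, 1)       # explicit reshape
pos_newaxis = np.arange(n)[:, np.newaxis]      # insert a new axis while indexing

assert pos_reshape.shape == pos_newaxis.shape == (n, 1)
assert (pos_reshape == pos_newaxis).all()
```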

1 Like

Yes, this notebook is the culmination of 5 courses and itโ€™s the one that does the least hand-holding of any so far. Quite a radical step from anything weโ€™ve seen up to this point, so our expectations need to be recalibrated a bit.

It sounds like you came up with a perfectly reasonable solution, but as Tom says, they actually gave you a โ€œworked exampleโ€ in the previous test cell:

from public_tests import *

get_angles_test(get_angles)

# Example
position = 4
d_model = 8
pos_m = np.arange(position)[:, np.newaxis]
dims = np.arange(d_model)[np.newaxis, :]
get_angles(pos_m, dims, d_model)

Itโ€™s not necessary on general principles to understand the test cells, as long as your code passes, but weโ€™re here to learn and you never know what wisdom may lurk even in the parts you donโ€™t technically need to read. :nerd_face:

I stand corrected!

1 Like

I havenโ€™t been looking that closely at the test cells, but I will be do so in the future.

1 Like

In exercise 3 I am getting the wrong weights. I donโ€™t understand the instruction: Multiply (1. - mask) by -1e9 before applying the softmax. Here are my results:

q,k,v dims =  (3, 4) (4, 4) (4, 2)
matmul_qk dim =  (3, 4)
dk =  4
scaled_att_logits =  tf.Tensor(
[[1.  1.5 0.5 0.5]
 [1.  1.  1.  0.5]
 [1.  1.  0.  0.5]], shape=(3, 4), dtype=float32)
att weights =  tf.Tensor(
[[0.2589478  0.42693272 0.15705977 0.15705977]
 [0.2772748  0.2772748  0.2772748  0.16817567]
 [0.33620113 0.33620113 0.12368149 0.2039163 ]], shape=(3, 4), dtype=float32)
output =  tf.Tensor(
[[0.74105227 0.15705977]
 [0.7227253  0.16817567]
 [0.6637989  0.2039163 ]], shape=(3, 2), dtype=float32)
q,k,v dims =  (3, 4) (4, 4) (4, 2)
matmul_qk dim =  (3, 4)
dk =  4
scaled_att_logits =  tf.Tensor(
[[[2.  2.5 0.5 1.5]
  [2.  2.  1.  1.5]
  [2.  2.  0.  1.5]]], shape=(1, 3, 4), dtype=float32)
att weights =  tf.Tensor(
[[[0.33333334 0.45186275 0.3071959  0.33333334]
  [0.33333334 0.27406862 0.5064804  0.33333334]
  [0.33333334 0.27406862 0.18632373 0.33333334]]], shape=(1, 3, 4), dtype=float32)
output =  tf.Tensor(
[[[1.092392   0.33333334]
  [1.1138824  0.33333334]
  [0.7937257  0.33333334]]], shape=(1, 3, 2), dtype=float32)

(1 - mask) โ€œreversesโ€ the mask to select the entries that are 0 in the mask and convert them to 1. If you then multiply by -10^9, it gives you a number which when fed to softmax will give an output that is very very close to 0. Then you add those values to the scaled attention logit values if the mask is present. In other words, that will have the effect of eliminating all entries not selected by the mask.

Here are my equivalent print outputs from that test cell:

q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk.shape (3, 4)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
attention_weights.shape (3, 4)
attention_weights =
[[0.2589478  0.42693272 0.15705977 0.15705977]
 [0.2772748  0.2772748  0.2772748  0.16817567]
 [0.33620113 0.33620113 0.12368149 0.2039163 ]]
sum(attention_weights(axis = -1)) =
[[1.0000001]
 [1.       ]
 [1.       ]]
output.shape (3, 2)
output =
[[0.74105227 0.15705977]
 [0.7227253  0.16817567]
 [0.6637989  0.2039163 ]]
q.shape (3, 4)
k.shape (4, 4)
v.shape (4, 2)
matmul_qk.shape (3, 4)
matmul_qk =
[[2. 3. 1. 1.]
 [2. 2. 2. 1.]
 [2. 2. 0. 1.]]
dk 4.0
type(scaled_attention_logits) <class 'tensorflow.python.framework.ops.EagerTensor'>
scaled_attention_logits.shape (3, 4)
mask.shape (1, 3, 4)
applying mask =
[[[1 1 0 1]
  [1 1 0 1]
  [1 1 0 1]]]
attention_weights.shape (1, 3, 4)
attention_weights =
[[[0.3071959  0.5064804  0.         0.18632373]
  [0.38365173 0.38365173 0.         0.23269653]
  [0.38365173 0.38365173 0.         0.23269653]]]
sum(attention_weights(axis = -1)) =
[[[1.]
  [1.]
  [1.]]]
output.shape (1, 3, 2)
output =
[[[0.6928041  0.18632373]
  [0.61634827 0.23269653]
  [0.61634827 0.23269653]]]
All tests passed

Notice that your weights and outputs agree with mine for the first test case, but thatโ€™s the one with no mask. But then the second test with the mask gives different results, so itโ€™s clear where to look :nerd_face: โ€ฆ