I just finished the final assignment for Course 5. It is more difficult than the other assignments, since fewer hints are provided to the students, but I think it is still reasonable.
Here are a few things I found confusing, along with suggestions for possible revisions. Please let me know if I misunderstood something.
2.1 - Padding Mask
There is a block of code that tries to demonstrate the effect of the masking:
print(tf.keras.activations.softmax(x))
print(tf.keras.activations.softmax(x + (1 - create_padding_mask(x)) * -1.0e9))
The code produces the following output. Note the shapes of the two outputs.
tf.Tensor(
[[7.2876644e-01 2.6809821e-01 6.6454901e-04 6.6454901e-04 1.8064314e-03]
[8.4437378e-02 2.2952460e-01 6.2391251e-01 3.1062774e-02 3.1062774e-02]
[4.8541026e-03 4.8541026e-03 4.8541026e-03 2.6502505e-01 7.2041273e-01]], shape=(3, 5), dtype=float32)
tf.Tensor(
[[[7.2973627e-01 2.6845497e-01 0.0000000e+00 0.0000000e+00 1.8088354e-03]
[2.4472848e-01 6.6524094e-01 0.0000000e+00 0.0000000e+00 9.0030573e-02]
[6.6483547e-03 6.6483547e-03 0.0000000e+00 0.0000000e+00 9.8670328e-01]]
[[7.3057163e-01 2.6876229e-01 6.6619506e-04 0.0000000e+00 0.0000000e+00]
[9.0030573e-02 2.4472848e-01 6.6524094e-01 0.0000000e+00 0.0000000e+00]
[3.3333334e-01 3.3333334e-01 3.3333334e-01 0.0000000e+00 0.0000000e+00]]
[[0.0000000e+00 0.0000000e+00 0.0000000e+00 2.6894143e-01 7.3105860e-01]
[0.0000000e+00 0.0000000e+00 0.0000000e+00 5.0000000e-01 5.0000000e-01]
[0.0000000e+00 0.0000000e+00 0.0000000e+00 2.6894143e-01 7.3105860e-01]]], shape=(3, 3, 5), dtype=float32)
It was quite weird to see the data shape change from (3, 5) to (3, 3, 5) after the masking. Presumably this happens because create_padding_mask returns a mask of shape (3, 1, 5), which broadcasts against x of shape (3, 5) to give (3, 3, 5). It may be better to rewrite the code as
print(tf.keras.activations.softmax(x))
mask = tf.reshape((1 - create_padding_mask(x)) * -1.0e9, x.shape)
print(tf.keras.activations.softmax(x + mask))
This produces the following output, which keeps the (3, 5) shape:
tf.Tensor(
[[7.2876644e-01 2.6809821e-01 6.6454901e-04 6.6454901e-04 1.8064314e-03]
[8.4437378e-02 2.2952460e-01 6.2391251e-01 3.1062774e-02 3.1062774e-02]
[4.8541026e-03 4.8541026e-03 4.8541026e-03 2.6502505e-01 7.2041273e-01]], shape=(3, 5), dtype=float32)
tf.Tensor(
[[7.2973627e-01 2.6845497e-01 0.0000000e+00 0.0000000e+00 1.8088354e-03]
[9.0030573e-02 2.4472848e-01 6.6524094e-01 0.0000000e+00 0.0000000e+00]
[0.0000000e+00 0.0000000e+00 0.0000000e+00 2.6894143e-01 7.3105860e-01]], shape=(3, 5), dtype=float32)
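For completeness, here is a minimal, self-contained sketch of the same demonstration. The values of x and the body of create_padding_mask are my own reconstruction from the outputs above, not necessarily the notebook's definitions; the point is only that removing the extra axis from the mask keeps the softmax output at shape (3, 5).

import tensorflow as tf

# Toy input; 0 is assumed to be the padding token id.
x = tf.constant([[7., 6., 0., 0., 1.],
                 [1., 2., 3., 0., 0.],
                 [0., 0., 0., 4., 5.]])

def create_padding_mask(seq):
    # Assumed behavior: 1.0 for real tokens, 0.0 for padding, with an extra
    # axis so the mask can broadcast against attention logits -> shape (3, 1, 5).
    return (1.0 - tf.cast(tf.math.equal(seq, 0), tf.float32))[:, tf.newaxis, :]

# Drop the broadcast axis so the mask matches x, keeping the result at (3, 5).
mask = tf.squeeze(create_padding_mask(x), axis=1)
print(tf.keras.activations.softmax(x + (1.0 - mask) * -1.0e9))

tf.squeeze avoids hard-coding x.shape, but either way the key point is that the mask and x have the same rank before the addition.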
Exercise 3 - scaled_dot_product_attention
The shapes of the function arguments are labeled as (..., seq_len_q, depth) and so on:
def scaled_dot_product_attention(q, k, v, mask):
    """
    Arguments:
        q -- query shape == (..., seq_len_q, depth)
        k -- key shape == (..., seq_len_k, depth)
        v -- value shape == (..., seq_len_v, depth_v)
        mask: Float tensor with shape broadcastable
              to (..., seq_len_q, seq_len_k). Defaults to None.
    """
The use of “…” is a bit confusing and also inconsistent with the rest of the assignment. Maybe use (batch_size, seq_len_q, depth) instead?
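For reference, here is a rough sketch of scaled dot-product attention (my own reconstruction, not the assignment's reference code) that shows what the "..." stands for: the function only touches the last two axes, so it works unchanged whether the leading dimensions are (batch_size,) or (batch_size, num_heads).

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: (..., seq_len_q, depth), k: (..., seq_len_k, depth), v: (..., seq_len_k, depth_v)
    matmul_qk = tf.matmul(q, k, transpose_b=True)              # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        scaled_logits += (1.0 - mask) * -1.0e9                 # assumed convention: 1 = keep, 0 = pad
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    return tf.matmul(attention_weights, v), attention_weights  # (..., seq_len_q, depth_v)

# (batch_size, seq_len, depth) works ...
q = tf.random.normal((2, 4, 8))
k = v = tf.random.normal((2, 6, 8))
print(scaled_dot_product_attention(q, k, v)[0].shape)          # (2, 4, 8)

# ... and so does (batch_size, num_heads, seq_len, depth).
q = tf.random.normal((2, 3, 4, 8))
k = v = tf.random.normal((2, 3, 6, 8))
print(scaled_dot_product_attention(q, k, v)[0].shape)          # (2, 3, 4, 8)

My guess is the "..." is there because the same function is reused inside multi-head attention, where an extra num_heads axis appears, but for this exercise (batch_size, seq_len_q, depth) would read more clearly.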
Exercise 8 - Transformer
The function comment is confusing:
class Transformer(tf.keras.Model):
    def call(self, input_sentence, output_sentence, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        """
        Forward pass for the entire Transformer
        Arguments:
            input_sentence -- Tensor of shape (batch_size, input_seq_len, fully_connected_dim)
                              An array of the indexes of the words in the input sentence
            output_sentence -- Tensor of shape (batch_size, target_seq_len, fully_connected_dim)
                               An array of the indexes of the words in the output sentence
Shouldn’t input_sentence have the shape (batch_size, input_seq_len), since it is a batch of sequences of word indices? The same issue also applies to output_sentence.
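A quick, hypothetical sketch of what I think the shapes should be (the vocabulary size and dimensions below are made up): the call receives integer word indices of shape (batch_size, input_seq_len), and the Embedding layer inside the Encoder/Decoder is what produces the (batch_size, seq_len, embedding_dim) tensors.

import tensorflow as tf

batch_size, input_seq_len = 2, 7
vocab_size, embedding_dim = 350, 64   # hypothetical sizes

# What goes into Transformer.call: integer word indices, shape (batch_size, input_seq_len)
input_sentence = tf.random.uniform((batch_size, input_seq_len),
                                   maxval=vocab_size, dtype=tf.int32)
print(input_sentence.shape)             # (2, 7)

# The embedding layer is what adds the last dimension.
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
print(embedding(input_sentence).shape)  # (2, 7, 64)

If that is right, the fully_connected_dim in the docstring shapes should not appear at all when the tensors enter call.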