C5_W4 Encoder/EncoderLayer - Question about when setting training=training is important

In the Encoder/EncoderLayer class, the test passes whether or not I set training=training inside each layer's call. For which types of layers is setting training=training in the layer call actually required?

“training=training” is only used in the dropout layers.
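To make this concrete, here is a minimal sketch (the layer and variable names are illustrative, not the assignment's exact code) showing that only the Dropout sub-layer's behaviour depends on the flag:

```python
import tensorflow as tf

class TinyEncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model=4, rate=0.1):
        super().__init__()
        self.dense = tf.keras.layers.Dense(d_model)
        self.dropout = tf.keras.layers.Dropout(rate)
        self.layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x, training=False):
        out = self.dense(x)
        out = self.dropout(out, training=training)  # dropout is only active when training=True
        return self.layernorm(x + out)              # layernorm ignores the flag

x = tf.random.uniform((2, 3, 4))
layer = TinyEncoderLayer()
print(layer(x, training=True).shape)   # (2, 3, 4), dropout applied
print(layer(x, training=False).shape)  # (2, 3, 4), dropout skipped
```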


I removed training=training from the dropout layer and it still passed the Encoder test. Also, why is training=training not required in the normalization layers and the MHA layer (which has dropout)?

The unit test for EncoderLayer() doesn’t test for both training values. It is being updated.
I don’t know the answer to your other question.


LayerNormalization is different from BatchNormalization: it doesn't keep track of batch statistics, so its behavior during training and inference is the same (aside from the fact that weights aren't updated during inference). Hence there is no need for the training flag; the output is identical for training=True and training=False.
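A quick way to convince yourself of this (a small illustrative snippet, not part of the assignment):

```python
import tensorflow as tf

x = tf.random.normal((8, 4))

ln = tf.keras.layers.LayerNormalization(epsilon=1e-6)
# LayerNormalization normalizes over the feature axis of each example,
# so its output does not depend on the training flag.
print(tf.reduce_max(tf.abs(ln(x, training=True) - ln(x, training=False))).numpy())  # ~0.0

bn = tf.keras.layers.BatchNormalization()
# BatchNormalization uses batch statistics when training=True and moving
# averages when training=False, so the two modes produce different outputs.
print(tf.reduce_max(tf.abs(bn(x, training=True) - bn(x, training=False))).numpy())  # > 0
```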

MultiHeadAttention doesn’t contain dropout?


Thanks for the clarification regarding LayerNormalization! In the MultiHeadAttention documentation there is a dropout parameter in the constructor, and the documentation says this about training:

training: Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout). Defaults to either using the training mode of the parent layer/model, or False (inference) if there is no parent layer.

This seems to imply that if I don’t specify training for the first MHA layer, it defaults to training=False?
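For what it's worth, a small experiment (illustrative only) shows that tf.keras.layers.MultiHeadAttention does take a dropout argument, and that when the layer is called outside any parent layer, omitting training behaves like inference:

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4, dropout=0.5)
x = tf.random.uniform((1, 3, 8))

out_train = mha(x, x, training=True)    # attention dropout active
out_infer = mha(x, x, training=False)   # no dropout
out_default = mha(x, x)                 # no parent layer: defaults to inference
print(tf.reduce_max(tf.abs(out_infer - out_default)).numpy())  # ~0.0
```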

One thing that looks strange at first is that dropout is applied directly to the attention weights. This means the attention weights will most likely no longer sum to 1, and a query may pay full attention to a token whose weight is then set to 0 by dropout. This is never explained, or even mentioned, in the paper, but it is used in the official implementation and in virtually every Transformer implementation, including BERT.
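To illustrate (a rough sketch of scaled dot-product attention, not the official code), applying dropout to the softmaxed weights breaks the rows-sum-to-one property:

```python
import tensorflow as tf

q = tf.random.uniform((1, 3, 4))
k = tf.random.uniform((1, 3, 4))
v = tf.random.uniform((1, 3, 4))

scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(4.0)
weights = tf.nn.softmax(scores, axis=-1)       # each row sums to 1
weights = tf.nn.dropout(weights, rate=0.5)     # rows generally no longer sum to 1
output = tf.matmul(weights, v)
print(tf.reduce_sum(weights, axis=-1))         # typically != 1 after dropout
```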

I checked Keras’ implementation now and you should of course set training=training here as well. They use dropout, and I didn’t know that. I will report this upstream. Good catch @LuBinLiu :1st_place_medal:

Edit: See below for the correct answer.


Doesn’t this imply that training should also be passed in UNQ_C6 DecoderLayer mha1 and mha2?

Thanks!

Oh, and as a follow-up, aside from whether or not the grader tests for this, the hint for UNQ_C4 EncoderLayer pretty explicitly states that training should not be set:
" Let the default values for return_attention_scores and training

Why is that? Is it really correct?

Thanks again!


Actually, as the course staff was so kind to point out to me, training=training is the default behaviour for the multi-head attention layer when used as a child layer. In other words, it is not necessary to pass this to the call method.

  • training : Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (no dropout). Defaults to either using the training mode of the parent layer/model, or False (inference) if there is no parent layer.
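To see that propagation in action, here is a small sketch (names are illustrative): the parent layer's training flag is forwarded to the child MultiHeadAttention even though we never pass it explicitly.

```python
import tensorflow as tf

class Wrapper(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4, dropout=0.5)

    def call(self, x, training=False):
        # No explicit training=... here; Keras forwards the parent's flag to the child.
        return self.mha(x, x)

x = tf.random.uniform((1, 3, 8))
w = Wrapper()
# Inference: deterministic, so two calls match.
print(tf.reduce_max(tf.abs(w(x, training=False) - w(x, training=False))).numpy())  # ~0.0
# Training: the child MHA picks up training=True and applies dropout,
# so repeated calls generally differ.
print(tf.reduce_max(tf.abs(w(x, training=True) - w(x, training=True))).numpy())    # > 0
```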

I hope my new reply regarding training=training helps answer your question as well. You can pass training=training, but do not need to, since that is the default behaviour anyway. I did not know that at the time.

Hi Jonas,

Thanks, that’s pretty much what I’d concluded too.
Still, the explicit hint to leave training at the default strikes me as strange and superfluous.

Regards,
-jh-

Yes, I agree. We should improve the instructions.