In the TransformerEncoder function of A3, why do we need to add tl.Select([0], n_in=2)? I learned that this line pops the top two elements off the stack and places back only the original top one. In the context of this model structure, what exactly did we pop out, and why do we do this?
Hello @YIHUI!
With this we are dropping the mask. Does that make sense to you?
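If it helps, here is a quick standalone check (just a sketch, not assignment code; the shapes below are placeholders):

import numpy as np
import trax.layers as tl

# Select([0], n_in=2) consumes two stack items and keeps only item 0.
drop_mask = tl.Select([0], n_in=2)

activations = np.ones((2, 5, 8))        # stand-in for encoder activations
mask = np.ones((2, 5), dtype=np.int32)  # stand-in for the padding mask

out = drop_mask((activations, mask))    # only the activations come back
print(out.shape)                        # (2, 5, 8) -- the mask is gone

So whatever sits second on the stack — here, the padding mask — is simply discarded before the layers that follow.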
Best regards,
Wesley P.
Hi Wesley,
Just want to make sure that my understanding is correct: For each encoder block:
encoder_block = [
    # add `Residual` layer
    tl.Residual(
        # add norm layer
        tl.LayerNorm(),
        # add attention
        attention,
        # add dropout
        dropout_,
    ),
    # add another `Residual` layer
    tl.Residual(
        # add feed forward
        feed_forward,
    ),
]
in the attention part, the output is the activation and the mask, and only the activation goes into the feed-forward part, leaving the mask unchanged. Thus the final output of an encoder layer is feed_forward(activation) together with the untouched mask, and, as you mentioned, the mask is then dropped with tl.Select. Am I right?
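A quick way to check this (a sketch using the library layers directly; d_feature=512 and the other sizes are made-up values, not necessarily the assignment's):

import trax.layers as tl

attention = tl.Attention(d_feature=512, n_heads=8, mode='train')
feed_forward = tl.Serial(tl.Dense(2048), tl.Relu(), tl.Dense(512))

print(attention.n_in, attention.n_out)        # 2 2 -> consumes and returns (activations, mask)
print(feed_forward.n_in, feed_forward.n_out)  # 1 1 -> touches only the activations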
Then what about the case where encoder_blocks in TransformerEncoder consists of more than one encoder block? How do the outputs (activation + mask) of the first encoder block pass to the next encoder block? Only the activation, or the activation + mask… I am thinking that if only the activation goes to the next encoder block, then after the chain of encoder blocks there would be multiple masks left on the stack instead of only one…
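One way I tried to sanity-check this (a sketch with made-up hyperparameters, following the block structure pasted above):

import trax.layers as tl

def encoder_block(d_model=512, d_ff=2048, n_heads=8, dropout=0.1, mode='train'):
    # Placeholder sizes; the structure mirrors the block above.
    attention = tl.Attention(d_model, n_heads=n_heads, mode=mode)
    feed_forward = tl.Serial(
        tl.LayerNorm(),
        tl.Dense(d_ff),
        tl.Relu(),
        tl.Dense(d_model),
        tl.Dropout(rate=dropout, mode=mode),
    )
    return tl.Serial(
        tl.Residual(tl.LayerNorm(), attention, tl.Dropout(rate=dropout, mode=mode)),
        tl.Residual(feed_forward),
    )

# A chain of blocks still takes and returns exactly (activations, mask):
encoder_blocks = tl.Serial([encoder_block() for _ in range(6)])
print(encoder_blocks.n_in, encoder_blocks.n_out)  # 2 2

If n_in and n_out stay at 2 for the whole chain, that would suggest each block passes the single mask along unchanged rather than piling up extra masks.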
Hope you can understand my questions. Thanks!