Transformer Encoder tl.Select

In the Transformer Encoder function of A3, why do we need to add tl.Select([0], n_in=2)? I learned that this line pops the top two elements off the stack and places back only the original top one. In the context of this model's structure, what exactly are we popping out, and why do we do this?


Hello @YIHUI!

With this we are dropping the masks. Does that make sense to you?
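To make the stack behavior concrete, here is a minimal plain-Python sketch of what tl.Select([0], n_in=2) does to Trax's data stack (this simulates the semantics; it does not use trax itself, and the item names are just illustrative):

```python
# Hypothetical simulation of tl.Select(indices, n_in) stack semantics:
# pop n_in items off the top of the stack, then push back only the
# items at the requested indices, in order.
def select(stack, indices, n_in):
    taken, rest = stack[:n_in], stack[n_in:]   # pop the top n_in items
    return [taken[i] for i in indices] + rest  # keep only the selected ones

stack = ["activations", "mask", "other_item"]
print(select(stack, indices=[0], n_in=2))  # ['activations', 'other_item']
```

So Select([0], n_in=2) consumes the (activations, mask) pair and keeps only the activations, which is why the mask disappears after this layer.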

Best regards,
Wesley P.

Hi Wesley,

Just want to make sure that my understanding is correct: For each encoder block:

encoder_block = [
    # add `Residual` layer
        # add norm layer
        # add attention
        # add dropout
    # add another `Residual` layer
        # add feed forward
]

In the attention part, the output is the activations plus the mask, and only the activations go through the feed-forward part, leaving the mask unchanged. Thus the final output of an encoder layer is feedforward(activations) plus the mask, and as you mentioned, the mask is then dropped with tl.Select. Am I right?

Then what about the case where the encoder_blocks in TransformerEncoder consist of more than one encoder block? How will the outputs (activations + mask) of the first encoder block pass to the next one? Only the activations, or the activations + mask? I am thinking that if only the activations go to the next encoder block, then after the chain of encoder blocks there would be multiple masks left on the stack instead of only one…

I hope you can understand my questions. Thanks!
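A small sketch of how I picture the data flow across stacked blocks (hypothetical plain Python, not trax; the arithmetic is just a stand-in for the real sublayers): if each encoder block consumes and re-emits both the activations and the mask, then exactly one mask travels alongside the activations through the whole chain, and a single Select at the end is enough to drop it.

```python
# Hypothetical data flow: each encoder block takes (activations, mask)
# and returns (activations, mask), passing the mask through unchanged.
def encoder_block(activations, mask):
    # attention would consume the mask; feed-forward only touches activations
    activations = activations + 1   # stand-in for attention + feed-forward
    return activations, mask        # the same single mask is re-emitted

def encoder(activations, mask, n_blocks=3):
    for _ in range(n_blocks):
        activations, mask = encoder_block(activations, mask)
    return activations              # like tl.Select([0], n_in=2): drop the mask

print(encoder(0, "mask"))  # 3
```

Under this picture there is never more than one mask on the stack, so no extra masks accumulate regardless of how many encoder blocks are chained.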