How does attention work?

Hey, can you tell me if my understanding of how the attention mechanism works is correct?

First, after all the words in an input sequence have passed through the encoder, we get all the hidden states, say from 0 to i. That’s why I guess the “encoder_states” variable has shape (5, 16), where 5 is the number of tokens in the sequence. These hidden states, along with the previous state of the decoder, are passed to the attention block, where we concat them to get a (5, 32) input. That is then passed through layer_1 and a tanh activation to get the activations, and finally through a softmax layer to get the attention weights, which tell us which encoder hidden states are more important for the current decoder hidden state.

What I don’t get is what is meant by the “attention_size” variable, and why after layer_1 the shape of the activations would be (5, 10). I understand that 5 refers to the number of tokens in the sequence and 10 is attention_size, but what does this matrix signify?

Hi @God_of_Calamity

This is a very good question. I’m not sure what the course creators’ intention was (why use 10 instead of 16, the original embedding size), but I believe “attention_size” refers to how much the “embeddings” get “compressed” when calculating attention.

I had a similar question when taking this course a while ago, and I still have the explicit calculations (which helped me understand the underlying mechanism). So, here they are:

The inputs:

In this case these are random numbers, concatenated:

    # First, concatenate the encoder states and the decoder state.
    inputs = np.concatenate((encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1)
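
By the way, if you want to run these snippets yourself, here is a minimal setup with random stand-ins for the states and weights (the seed and values are my own; only the shapes match the assignment: 5 tokens, hidden size 16, attention_size 10). Run it before the concatenation above:

    import numpy as np

    np.random.seed(42)

    input_length = 5     # number of tokens in the input sequence
    hidden_size = 16     # size of each encoder/decoder hidden state
    attention_size = 10  # the "compressed" size you asked about

    # Random stand-ins for the real hidden states
    encoder_states = np.random.randn(input_length, hidden_size)  # (5, 16)
    decoder_state = np.random.randn(1, hidden_size)              # (1, 16)

    # Random stand-ins for the trained attention weights
    layer_1 = np.random.randn(2 * hidden_size, attention_size)   # (32, 10)
    layer_2 = np.random.randn(attention_size, 1)                 # (10, 1)

After concatenation, inputs has shape (5, 32): each row is encoder state i glued to the (repeated) decoder state.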

First linear transformation and tanh:

In this case, the inputs are dot-multiplied with the layer_1 weights, and the result has shape (number_of_tokens, attention_size), the “attention size” being the one you are asking about.

Then the tanh function is applied to squash the values into the range -1 to +1.

These values correspond to this code:

    # Matrix multiplication of the concatenated inputs and the first layer, with tanh activation
    activations = np.tanh(np.matmul(inputs, layer_1))
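
If you print the shape at this point, you can confirm the (5, 10) you asked about:

    print(activations.shape)  # (5, 10): one row per token, attention_size columns

In other words, each row is a compressed joint representation of encoder state i and the current decoder state, which layer_2 then reduces to a single score.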

Second linear transformation to get the “scores” (alignment):


The layer_2 weights have shape (attention_size, 1).

When the activations (the tanh output of layer_1) are dot-multiplied with them, the result is one alignment score per input token.

These values correspond to this code:

    # Matrix multiplication of the activations with the second layer. Remember that you don't need tanh here
    scores = np.matmul(activations, layer_2)
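
Again, the shape tells the story:

    print(scores.shape)  # (5, 1): one alignment score per input token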

So, these would be the outputs of alignment(), which are used in attention().

For completeness, here are the attention calculations:

They correspond to this code:

    # Then take the softmax of those scores to get a weight distribution
    weights = softmax(scores)
    
    # Multiply each encoder state by its respective weight
    weighted_scores = encoder_states * weights
    
    # Sum up the weighted encoder states
    context = np.sum(weighted_scores, axis=0)
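
One note: softmax here is a helper from the assignment, not a NumPy function. A minimal stand-in that works with the shapes above (softmax over the token axis; my own sketch, not necessarily the course’s exact implementation) would be:

    def softmax(x, axis=0):
        # Numerically stable softmax along the given axis
        e = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e / np.sum(e, axis=axis, keepdims=True)

With that in place, weights has shape (5, 1), weighted_scores broadcasts to (5, 16), and the final context is a single (16,) vector: a weighted average of the encoder states, which is what the decoder consumes next.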

Cheers

P.S. Don’t forget that this is Bahdanau et al. (2014) attention, which is different from the attention in Attention Is All You Need (2017).
