Hi @God_of_Calamity
This is a very good question. I'm not sure what the course creators' intention was (why use 10 instead of 16, the original embedding size), but I believe "attention_size" refers to how much the embeddings get "compressed" when the attention scores are calculated.
I had a similar question when taking this course a while ago and I still have the explicit calculations (which help me understand the underlying mechanism). So, here they are:
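Before the calculations, here is a minimal setup sketch so the shapes below are concrete. The exact numbers are assumptions on my part (hidden_size = 16, attention_size = 10, and 5 input tokens); your notebook may use different values:
import numpy as np

hidden_size = 16     # size of each encoder/decoder state (assumed)
attention_size = 10  # the "compressed" dimension inside the attention scorer
input_length = 5     # number of encoder states / input tokens (assumed)

np.random.seed(42)

# Layer 1 maps the concatenated (encoder, decoder) pair of size 2 * hidden_size = 32
# down to attention_size = 10; layer 2 maps those 10 features down to a single score.
layer_1 = np.random.uniform(size=(2 * hidden_size, attention_size))
layer_2 = np.random.uniform(size=(attention_size, 1))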
The inputs:
In this case the inputs are just the random numbers, concatenated:
# First, concatenate the encoder states and the decoder state.
inputs = np.concatenate((encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1)
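As a rough sketch (the actual values are random, so only the shapes matter here), assuming encoder_states has shape (input_length, hidden_size) and decoder_state has shape (1, hidden_size):
encoder_states = np.random.uniform(size=(input_length, hidden_size))
decoder_state = np.random.uniform(size=(1, hidden_size))

# Repeat the single decoder state once per input token, then concatenate along
# the feature axis: the result has shape (input_length, 2 * hidden_size).
inputs = np.concatenate((encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1)
print(inputs.shape)  # (5, 32) with the assumed sizes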
First linear transformation and tanh:
In this case the layer_1 weights are on the left. When matrix-multiplied with the inputs, they produce the values on the top right, of shape (number_of_tokens x attention_size - the size you are asking about).
Then the tanh function is applied to squash the values into the range -1 to +1.
These values correspond to this code:
# Matrix multiplication of the concatenated inputs and the first layer, with tanh activation
activations = np.tanh(np.matmul(inputs, layer_1))
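Shape-wise (under the same assumed sizes), this step takes (input_length, 2 * hidden_size) down to (input_length, attention_size):
# (input_length, 32) @ (32, 10) -> (input_length, 10); tanh then squashes every
# entry into the range (-1, 1).
activations = np.tanh(np.matmul(inputs, layer_1))
print(activations.shape)  # (5, 10) with the assumed sizes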
Second linear transformation to get the "scores" (alignment):

The layer_2 weights are on the left, with shape (attention_size, 1).
When matrix-multiplied with the tanh activations from the first layer, they produce the values on the right.
These values correspond to this code:
# Matrix multiplication of the activations with the second layer. Remember that you don't need tanh here
scores = np.matmul(activations, layer_2)
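Again shape-wise, this collapses the attention_size features into a single score per input token:
# (input_length, 10) @ (10, 1) -> (input_length, 1): one alignment score per encoder state.
scores = np.matmul(activations, layer_2)
print(scores.shape)  # (5, 1) with the assumed sizes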
So, these would be the output of alignment(), which is used in attention().
For completeness, here are the attention calculations:
Which would correspond to this code:
# Then take the softmax of those scores to get a weight distribution
weights = softmax(scores)
# Multiply each encoder state by its respective weight
weighted_scores = encoder_states * weights
# Sum up the weighted encoder states
context = np.sum(weighted_scores, axis=0)
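If you want to run the whole thing end to end, here is a minimal sketch with a simple softmax helper (the notebook's own softmax may differ in details such as the axis handling):
def softmax(x, axis=0):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

weights = softmax(scores)                   # (input_length, 1), sums to 1 over the tokens
weighted_scores = encoder_states * weights  # broadcasts over the hidden dimension
context = np.sum(weighted_scores, axis=0)   # (hidden_size,) context vector
print(context.shape)  # (16,) with the assumed sizes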
Cheers
P.S. Don't forget that this is the Bahdanau et al. (2014) attention, which is different from the attention in Attention Is All You Need (2017).