Hi @God_of_Calamity

This is a very good question. I’m not sure what the course creators’ intention was (why use 10 instead of 16, the original embedding size), but I believe “attention_size” is how much the embeddings get “compressed” when calculating attention.

I had a similar question when taking this course a while ago and I still have the explicit calculations (which help me understand the underlying mechanism). So, here they are:

# The inputs:

In this case the inputs are the (random) encoder states and the repeated decoder state, concatenated:

```
# First, concatenate the encoder states and the decoder state.
inputs = np.concatenate((encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1)
```
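To make the shapes concrete, here is a small runnable sketch. The hidden size of 16 comes from the post above; the input length of 3 is just an assumed number of tokens for illustration:

```python
import numpy as np

np.random.seed(0)
input_length = 3   # assumed number of input tokens (for illustration)
hidden_size = 16   # the original embedding/hidden size

encoder_states = np.random.randn(input_length, hidden_size)
decoder_state = np.random.randn(1, hidden_size)

# Each row: one encoder state glued to a copy of the decoder state
inputs = np.concatenate(
    (encoder_states, decoder_state.repeat(input_length, axis=0)), axis=1
)
print(inputs.shape)  # (3, 32) — i.e. (input_length, 2 * hidden_size)
```

So each token’s row doubles in width: its own encoder state plus the same decoder state.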

### First linear transformation and tanh:

In this case the Linear weights are on the left. When matrix-multiplied with the inputs, they produce the values on the top right, of shape (number_of_tokens x “**attention_size**” - the one you are asking about).

Then the “tanh” function is applied to squash the values into the range -1 to +1.

These values correspond to this code:

```
# Matrix multiplication of the concatenated inputs and the first layer, with tanh activation
activations = np.tanh(np.matmul(inputs, layer_1))
```
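A shape sketch for this step too (attention size 10 and hidden size 16 are from the post; random weights stand in for the learned `layer_1`):

```python
import numpy as np

np.random.seed(0)
input_length, hidden_size, attention_size = 3, 16, 10

inputs = np.random.randn(input_length, 2 * hidden_size)   # concatenated states
layer_1 = np.random.randn(2 * hidden_size, attention_size)

# Project down to attention_size, then squash with tanh
activations = np.tanh(np.matmul(inputs, layer_1))
print(activations.shape)  # (3, 10) — one attention_size vector per token
```

Every entry of `activations` is guaranteed to lie in [-1, 1] because of the tanh.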

### Second linear transformation to get the “scores” (alignment):

Layer 2 weights are on the left, with shape (**attention_size**, 1).

When matrix-multiplied with the tanh(layer_1) output, they produce the values on the right.

These values correspond to this code:

```
# Matrix multiplication of the activations with the second layer. Remember that you don't need tanh here
scores = np.matmul(activations, layer_2)
```
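Again as a sketch (same assumed sizes as above; `layer_2` is random here in place of the learned weights):

```python
import numpy as np

np.random.seed(0)
input_length, attention_size = 3, 10

activations = np.tanh(np.random.randn(input_length, attention_size))
layer_2 = np.random.randn(attention_size, 1)

# Collapse each attention_size vector down to a single alignment score
scores = np.matmul(activations, layer_2)
print(scores.shape)  # (3, 1) — one score per input token
```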

So, these would be the outputs of the `alignment()` function, which are used in `attention()`.

For completeness, here are the attention calculations, which correspond to this code:

```
# Then take the softmax of those scores to get a weight distribution
weights = softmax(scores)
# Multiply each encoder state by its respective weight
weighted_scores = encoder_states * weights
# Sum up the weighted encoder states
context = np.sum(weighted_scores, axis=0)
```
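Putting the last step together as a self-contained sketch (the softmax here is a standard numerically stable version, an assumption in place of the course’s helper; sizes as above):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax: subtract the max before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

np.random.seed(0)
input_length, hidden_size = 3, 16

encoder_states = np.random.randn(input_length, hidden_size)
scores = np.random.randn(input_length, 1)   # stand-in alignment scores

weights = softmax(scores)                    # (3, 1), sums to 1
weighted_scores = encoder_states * weights   # broadcast: (3, 16) * (3, 1)
context = np.sum(weighted_scores, axis=0)    # (16,) — weighted average
print(context.shape)
```

The context vector ends up with the same size as a single encoder state: it is just a weighted average of all the encoder states, with the weights coming from the alignment scores.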

Cheers

P.S. Don’t forget that this is the Bahdanau et al. (2014) attention, which is different from the attention in Attention Is All You Need (2017).