Attention model video formula error?

In the Attention model video, Andrew Ng shows the representation of alpha (the attention weights) and the context.


  • it's a French translation example
  • the French (source) sequence timestep is represented by t'
  • the translated (target) sequence timestep is represented by t.

But it doesn't add up to me: if the sum of all alpha over t' (t-prime) is 1, and all the alphas are used in the equation for c^<t>, then how would the context be different for words at other t's (c^<1>, c^<2>, etc.)?

I guess Andrew omitted mentioning a window limit on the t' chosen for the context, perhaps?

Although the sum of α<t,t'> over t' is 1, the context is composed of the sum of α<t,t'>·a<t'>. Intuitively, it's a weighted sum of the a<t'>. Each target word has a different α distribution, and thus a different context.

But as long as it is summed across all of t', the formula doesn’t make sense to me. It is just like passing in a^<t'> directly, without the alpha part. Right?

Because for every target step, the alphas are always going to sum up to 1 anyway. I hope I'm making sense to you.

Suppose we have an example as below.
Source sentence: Jane visite l'Afrique en septembre
Target sentence: Jane visits Africa in September

We calculate the attention context for each target word as below.

C<Jane> = α<1,1>a<1> + α<1,2>a<2> + ... + α<1,5>a<5>
C<visits> = α<2,1>a<1> + α<2,2>a<2> + ... + α<2,5>a<5>
C<Africa>= α<3,1>a<1> + α<3,2>a<2> + ... + α<3,5>a<5>
C<in> = α<4,1>a<1> + α<4,2>a<2> + ... + α<4,5>a<5>
C<September> = α<5,1>a<1> + α<5,2>a<2> + ... + α<5,5>a<5>

constraint: α<t,1> + α<t,2> + ... + α<t,5> = 1 for each context C<t>.

which also means
C<Jane> != C<visits> != C<Africa> != C<in> != C<September>

Do you see that each context C<t> has a different α sequence? For instance,

[α<3,1>, α<3,2>, ..., α<3,5>] is probably [0.01, 0.15, 0.82, 0.01, 0.01] for C<Africa>
[α<5,1>, α<5,2>, ..., α<5,5>] may be [0.002, 0.003, 0.005, 0.16, 0.83] for C<September>

It's because Africa might pay more attention to l'Afrique, and September might focus on septembre.
Does that convince you that the context of each target word is different, even though each set of alphas sums to 1?
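To see why this is not "just like passing in a<t'> directly": that would only be true if the alphas were uniform. A tiny sketch with made-up 2-D activations, contrasting uniform weights (every context collapses to the same average) with peaked weights (contexts differ, even though every row still sums to 1):

```python
import numpy as np

# Hypothetical 2-D activations for two source words (made-up values).
a = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# Uniform attention: each row sums to 1, and every context collapses
# to the same plain average of the a<t'> vectors.
uniform = np.array([[0.5, 0.5],
                    [0.5, 0.5]])
print(uniform @ a)  # both rows are [0.5, 0.5]

# Peaked attention: rows still sum to 1, yet the contexts differ,
# because the *distribution* of each row differs.
peaked = np.array([[0.9, 0.1],
                   [0.1, 0.9]])
print(peaked @ a)  # row 0 is [0.9, 0.1], row 1 is [0.1, 0.9]
```

The sum-to-1 constraint fixes the total weight, not where the weight goes; that is what makes each context distinct.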