Week 3 - Attention Model Lecture

Hi,

In the Week 3 lecture titled ‘Attention Model’, it is mentioned that a small neural network is trained to compute e<t, t'>. I wanted to know some details about it. Even if we ‘trust’ backpropagation and gradient descent to find the correct values for e<t, t'>, what are we training against as the target, i.e. the ‘ground truth’? The ‘ground truth’ value of e<t, t'> isn’t known beforehand, so I am failing to understand how the training works.
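To make my question concrete, here is a minimal sketch of what I understand the small network to be. This is only my own illustration in NumPy; the layer sizes, variable names, and the single tanh hidden layer are assumptions on my part, not taken from the course code.

```python
import numpy as np

# Hypothetical sizes -- my own assumption, not from the lecture.
n_s = 64    # size of the decoder hidden state s<t-1>
n_a = 128   # size of one encoder activation a<t'>
n_h = 10    # hidden units of the small network

rng = np.random.default_rng(0)
# Parameters of the small network; these are the weights that
# backpropagation would somehow have to learn.
W = rng.standard_normal((n_h, n_s + n_a)) * 0.01
b = np.zeros((n_h, 1))
v = rng.standard_normal((1, n_h)) * 0.01

def energy(s_prev, a_t):
    """Compute the scalar e<t, t'> from s<t-1> and a<t'>."""
    x = np.concatenate([s_prev, a_t], axis=0)  # shape (n_s + n_a, 1)
    h = np.tanh(W @ x + b)                     # one small hidden layer
    return (v @ h).item()                      # scalar score e<t, t'>

def attention_weights(s_prev, a_list):
    """Softmax over e<t, t'> across all input positions t'."""
    e = np.array([energy(s_prev, a) for a in a_list])
    e = np.exp(e - e.max())                    # numerically stable softmax
    return e / e.sum()                         # the alpha<t, t'> values
```

My confusion is about how W, b, and v above get trained when there is no target value for the output of energy().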

In the lecture, the attention model is used for translating a sentence from French (input) to English (output).
Since this is a supervised learning problem, the training data consists of French sentences paired with their English translations.
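For reference, the way I understood the lecture, the scores e<t, t'> only affect the output indirectly, through the attention weights and the context vector:

$$
\alpha^{\langle t, t' \rangle} = \frac{\exp\!\big(e^{\langle t, t' \rangle}\big)}{\sum_{t''=1}^{T_x} \exp\!\big(e^{\langle t, t'' \rangle}\big)},
\qquad
c^{\langle t \rangle} = \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle}\, a^{\langle t' \rangle}
$$

So I assume the gradient of the translation loss must reach the small network through c<t> and the softmax, rather than through any direct target on e<t, t'>, but I would appreciate confirmation that this is how the training actually works.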