Hi,
In the week 3 lecture titled ‘Attention Model’, it is said that a small neural network is trained to compute e<t,t'>. I wanted to know some details about it. Even if we ‘trust’ backpropagation and gradient descent to find the correct value of e<t,t'>, what are we training against as the target, i.e. the ‘ground truth’? The ‘ground truth’ value of e<t,t'> isn’t known beforehand, so I am failing to understand how the training works.
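To make my question concrete, here is a minimal NumPy sketch of how I understand the small alignment network from the lecture: a one-hidden-layer net that maps [s<t-1>; a<t'>] to a scalar score e<t,t'>, followed by a softmax over t'. All shapes and weights below are made-up placeholders, not the course's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, n_a, n_s, n_e = 5, 4, 3, 8           # encoder steps, activation/state/hidden sizes

a = rng.standard_normal((Tx, n_a))       # encoder activations a<t'>, t' = 1..Tx
s_prev = rng.standard_normal(n_s)        # previous decoder state s<t-1>

# Hypothetical weights of the "small neural network" (randomly initialised;
# in training they would be updated by backprop, which is what my question is about)
W = rng.standard_normal((n_e, n_s + n_a))
v = rng.standard_normal(n_e)

# One scalar score e<t,t'> per encoder position t'
e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, a_tp])) for a_tp in a])

# Softmax over t' turns the scores into attention weights alpha<t,t'>
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

context = alpha @ a                      # context vector c<t> fed to the decoder
```

My confusion is exactly about the `W` and `v` above: there is no separate target for `e`, so I assume the only learning signal reaching them would have to come through `context` from the decoder's loss.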