Attention model: training data for finding e^<t, t'>

Andrew suggested that the scores e^<t, t'> used to compute the attention weights are themselves computed by a small neural network. Where do the ground truth values for training that neural network come from?

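You don’t need ground truth for e^<t, t'>. The small network that computes it is just one part of the full model and is trained jointly with everything else, like any hidden layer (here the “layer” happens to be a small network). The only labels are the final outputs; gradients of the output loss flow back through the attention weights into that small network’s parameters.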
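To make the joint training concrete, here is a minimal NumPy sketch, not the course's implementation. The names (`attention_context`, `task_loss`, `W`, `v`), the layer sizes, the squared-error loss, and the single-hidden-layer scoring network are all assumptions chosen for illustration. It shows that a loss defined only on the final output still has a nonzero gradient with respect to the scoring network's parameters, so no labels for e^<t, t'> are needed. The gradient is estimated by finite differences just to keep the example dependency-free; a real framework would use backpropagation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract max for numerical stability
    return z / z.sum()

def attention_context(params, s_prev, a):
    """Score every encoder activation a[t'] against the decoder state s_prev."""
    W, v = params
    # Small scoring network: e[t'] = v . tanh([s_prev; a[t']] @ W)
    inputs = np.concatenate([np.tile(s_prev, (a.shape[0], 1)), a], axis=1)
    e = np.tanh(inputs @ W) @ v      # unnormalized scores, shape (Tx,)
    alpha = softmax(e)               # attention weights, sum to 1
    context = alpha @ a              # weighted sum of encoder activations
    return context, alpha

def task_loss(params, s_prev, a, target):
    # Supervision exists only on the final output, never on e or alpha.
    context, _ = attention_context(params, s_prev, a)
    return float(np.sum((context - target) ** 2))

def numerical_grad_v(W, v, s_prev, a, target, eps=1e-5):
    """Central-difference estimate of d(task_loss)/dv."""
    grad = np.zeros_like(v)
    for i in range(v.size):
        v_plus, v_minus = v.copy(), v.copy()
        v_plus[i] += eps
        v_minus[i] -= eps
        loss_plus = task_loss((W, v_plus), s_prev, a, target)
        loss_minus = task_loss((W, v_minus), s_prev, a, target)
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
Tx, n_a, n_s, n_h = 5, 4, 3, 6
a = rng.normal(size=(Tx, n_a))       # encoder activations
s_prev = rng.normal(size=n_s)        # previous decoder state
target = rng.normal(size=n_a)        # stand-in for the task's label
W = rng.normal(size=(n_s + n_a, n_h)) * 0.5
v = rng.normal(size=n_h) * 0.5

context, alpha = attention_context((W, v), s_prev, a)
grad_v = numerical_grad_v(W, v, s_prev, a, target)
```

Because `grad_v` is nonzero, an optimizer can update `v` (and likewise `W`) using only the output loss, which is exactly what happens to any hidden layer.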
In Week 4 you’ll learn another way of calculating attention weights, called “scaled dot-product attention”.
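For comparison, here is a rough NumPy sketch of scaled dot-product attention. It follows the standard formula softmax(Q K^T / sqrt(d_k)) V; the function name and the toy shapes are assumptions for illustration. In a real model, Q, K, and V typically come from learned projections, which are again trained jointly from the output loss rather than from labeled attention scores.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = K.shape[-1]
    # Scores are plain dot products, scaled so they don't grow with d_k.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (max subtracted for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy inputs with hypothetical sizes.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 3))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Here the score itself has no parameters of its own, unlike the small scoring network used for e^<t, t'>.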
