Training Attention Weights


I have a question about training the attention weights. In the lecture we are told that the alignment scores are computed by a small network with a single hidden layer that takes s^<t-1> and a^<t'> as inputs. Is this small network part of the larger attention model, and does information pass through it during both forward and backward propagation?
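For concreteness, here is a minimal NumPy sketch of what I understand that small network to be (all dimensions and weight values below are made up for illustration): one shared one-hidden-layer net scores s^<t-1> against each encoder activation a^<t'>, and a softmax over those scores gives the attention weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical dimensions: decoder state size, encoder activation size,
# number of input timesteps, and hidden-layer size of the alignment net.
n_s, n_a, Tx, n_h = 4, 6, 5, 10

rng = np.random.default_rng(0)
s_prev = rng.normal(size=(n_s,))   # decoder state s^<t-1>
a = rng.normal(size=(Tx, n_a))     # encoder activations a^<t'>, t' = 1..Tx

# Parameters of the small single-hidden-layer alignment network.
# They are learned jointly with the rest of the model by backprop.
W1 = rng.normal(size=(n_h, n_s + n_a))
b1 = np.zeros(n_h)
W2 = rng.normal(size=(1, n_h))
b2 = np.zeros(1)

# The SAME small network scores every encoder timestep t'.
e = np.array([
    (W2 @ np.tanh(W1 @ np.concatenate([s_prev, a_t]) + b1) + b2).item()
    for a_t in a
])

alphas = softmax(e)   # attention weights alpha^<t,t'>, sum to 1
context = alphas @ a  # context vector c^<t> fed to the decoder
```

Since `context` feeds into the decoder's output, gradients from the loss flow back through `alphas` into `W1`, `W2`, `b1`, and `b2`, which is what I meant by information passing through the network in both directions.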

Thank you.


Do you still need an answer to this question?