In the video, Andrew calculates the energies from the a’s and s using a single layer, but in the assignment two layers (densor1 and densor2) are used to get to the energies tensor.
Is there a reason why we need two linear layers of processing? Can we go directly to energies from the concatenated tensor of a and s?
As the comment says, intermediate energies are computed using a tanh layer before computing the final energies.
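The reason I’m okay with that explanation is that a deeper network can learn to fit the data more effectively than a shallower one. Note also that the two layers are not purely linear: the tanh between them is a nonlinearity, so they cannot be collapsed into a single linear layer.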
If you’d like to remove the intermediate step, please do so after passing the assignment.
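
For anyone experimenting afterwards, here is a rough NumPy sketch of the two variants side by side. It is not the assignment code: the sizes, the hidden width of 10, the random weights, and the ReLU on the final layer are all assumptions made for illustration, and the names other than a and s are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
Tx, n_a, n_s, hidden = 5, 4, 3, 10

a = rng.normal(size=(Tx, 2 * n_a))   # encoder states, one row per input time step
s_prev = rng.normal(size=(n_s,))     # previous decoder state

# Repeat s_prev for every time step and concatenate it with a.
concat = np.concatenate([a, np.tile(s_prev, (Tx, 1))], axis=1)  # (Tx, 2*n_a + n_s)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


# Variant 1: two layers, as described in the thread.
# A tanh layer gives intermediate energies; a second layer maps them to one scalar per step.
W1, b1 = rng.normal(size=(concat.shape[1], hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)
intermediate = np.tanh(concat @ W1 + b1)                # (Tx, hidden)
energies_two = np.maximum(0.0, intermediate @ W2 + b2)  # (Tx, 1), ReLU is an assumption

# Variant 2: a single layer straight from the concatenated tensor.
W, b = rng.normal(size=(concat.shape[1], 1)), np.zeros(1)
energies_one = np.maximum(0.0, concat @ W + b)          # (Tx, 1)

# Either way, the attention weights are a softmax over the time axis.
alphas_two = softmax(energies_two[:, 0])
alphas_one = softmax(energies_one[:, 0])
```

Both variants produce one energy per input time step, so the rest of the attention step does not need to change. The only difference is how expressive the mapping from the concatenated tensor to the energies is.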