Re alphas: I am referring to this part of one_step_attention(): alphas = activator(energies).
Why am I assuming there must be many alphas? Because we later use dotor, which computes a sum (a dot product, in fact) over many a’s and alphas, in accordance with that equation.
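For reference, here is a minimal sketch of how those pieces fit together, assuming the usual globally-defined layers from this assignment. The dimensions Tx, n_a, n_s are made-up example values, and I use Keras’s Softmax(axis=1) in place of the assignment’s custom softmax over the time axis (they normalize the same way):

```python
# Minimal sketch of one_step_attention(); Tx, n_a, n_s are assumed example values.
from tensorflow.keras.layers import RepeatVector, Concatenate, Dense, Softmax, Dot

Tx, n_a, n_s = 30, 32, 64   # assumed: input length, encoder units, decoder units

repeator = RepeatVector(Tx)                             # copy s_prev across all Tx steps
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation="tanh")
densor2 = Dense(1, activation="relu")                   # one "energy" per time step
activator = Softmax(axis=1, name="attention_weights")   # normalize over the Tx axis
dotor = Dot(axes=1)                                     # contracts the Tx axis

def one_step_attention(a, s_prev):
    # a:      (m, Tx, 2*n_a)  encoder activations
    # s_prev: (m, n_s)        previous decoder hidden state
    s_prev = repeator(s_prev)             # (m, Tx, n_s)
    concat = concatenator([a, s_prev])    # (m, Tx, 2*n_a + n_s)
    e = densor1(concat)                   # (m, Tx, 10)
    energies = densor2(e)                 # (m, Tx, 1)
    alphas = activator(energies)          # (m, Tx, 1)  -- one alpha per t'
    context = dotor([alphas, a])          # (m, 1, 2*n_a) = sum over t' of alpha<t'> * a<t'>
    return context
```

So there is indeed one alpha for every input time step t', and dotor collapses the Tx axis by weighting each a<t'> with its alpha.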
The “energies” are just the input to the softmax layer.
The output (the “alphas”) is a probability for each value of ‘t’.
But I understand your question - softmax is typically only used when there are multiple outputs, and you want to re-scale them so they sum to 1 for each example.
Maybe what they’re doing is, since the previous layer uses ReLU activation, they’re using softmax to re-scale those values across the Tx time steps, so each one ends up between 0 and 1 and they sum to 1.
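To make that re-scaling concrete, here is a tiny made-up example: softmax applied across the Tx time steps turns the non-negative, unbounded ReLU energies into weights between 0 and 1 that sum to 1 for each example:

```python
import numpy as np

# Made-up ReLU outputs ("energies") for one example with Tx = 4 time steps.
energies = np.array([2.0, 0.0, 1.0, 3.0])

# Softmax across the time-step axis.
alphas = np.exp(energies) / np.exp(energies).sum()

print(alphas)        # approximately [0.24 0.03 0.09 0.64]
print(alphas.sum())  # 1.0 -- each alpha is in (0, 1) and together they sum to 1
```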
Info for other learners who may struggle with the topic above: look very carefully at the dimensions of each and every layer in the function one_step_attention(). It is very instructive to see how those dimensions change from layer to layer (see the shape trace below).
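One quick way to do that, continuing the sketch above (this reuses its layer definitions and example dimensions): wrap a single attention step in a Model and let summary() list the output shape of every layer:

```python
from tensorflow.keras import Input, Model

# Symbolic inputs with the assumed example dimensions from the sketch above.
a_in = Input(shape=(Tx, 2 * n_a))    # (None, 30, 64)
s_in = Input(shape=(n_s,))           # (None, 64)

Model(inputs=[a_in, s_in], outputs=one_step_attention(a_in, s_in)).summary()
# The summary shows how the shape changes layer by layer, roughly:
#   repeat_vector       (None, 30, 64)
#   concatenate         (None, 30, 128)
#   dense (tanh)        (None, 30, 10)
#   dense (relu)        (None, 30, 1)
#   attention_weights   (None, 30, 1)   <- the alphas
#   dot                 (None, 1, 64)   <- the context vector
```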