C5 W3 A1 Neural Machine Translation

Hi, I am stuck on the dimensions of the dense layers in one_step_attention(). Concretely:

  1. densor1 has 10 units (neurons), which means that its output consists of 10 values.
  2. densor2 has 1 unit, which means that its output is 1 value.
  3. The output of densor2 is passed to activator, which produces alphas.

I know there must be many alphas (not just one), but how is that possible if the output of densor2 is just one value?

I’m not sure what you mean by “alphas” here, or why you say there must be more than one.

Re alphas: I am referring to this part of one_step_attention(): alphas = activator(energies).
Why am I assuming there must be many alphas? Because we later use dotor, which computes a sum (a dot product, in fact) of many a’s and alphas in accordance with that equation

The “alphas” are just the input to the softmax layer.
The output is a probability for each value of ‘t’.

But I understand your question - softmax is typically only used when there are multiple outputs, and you want to re-scale them so they sum to 1 for each example.

Maybe what they’re doing is, since the previous layer uses ReLU activation, they’re using softmax to just re-scale it to the range 0 to 1.

Info for other learners who may struggle with the topic above: look very carefully at the dimensions of each layer in the function one_step_attention(). It is very instructive to see how those dimensions change from layer to layer.
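
For anyone who wants to follow that dimension walk without opening the notebook, here is a rough NumPy sketch. It is not the assignment code. The sizes (m, Tx, n_a, n_s), the tanh activation in the first layer, and taking the softmax over the time axis are assumptions for illustration; the `dense` and `softmax` helpers are hypothetical stand-ins. What it relies on is that a Keras Dense layer acts only on the last axis of its input, so a layer with 1 unit applied to a (m, Tx, features) tensor yields one value per time step, i.e. Tx values per example, not one.

```python
import numpy as np

# Assumed sizes for illustration only (not taken from the thread).
m, Tx, n_a, n_s = 2, 30, 32, 64
rng = np.random.default_rng(0)

def dense(x, units, activation):
    """Hypothetical stand-in for a Dense layer: the kernel acts on the last axis only."""
    kernel = rng.standard_normal((x.shape[-1], units)) * 0.1
    bias = np.zeros(units)
    return activation(x @ kernel + bias)

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

a = rng.standard_normal((m, Tx, 2 * n_a))   # one encoder state per input time step
s_prev = rng.standard_normal((m, n_s))      # previous decoder state

s_rep = np.repeat(s_prev[:, None, :], Tx, axis=1)    # (m, Tx, n_s)
concat = np.concatenate([a, s_rep], axis=-1)         # (m, Tx, 2*n_a + n_s)
e = dense(concat, 10, np.tanh)                       # (m, Tx, 10): 10 values per time step
energies = dense(e, 1, lambda z: np.maximum(z, 0))   # (m, Tx, 1): 1 value per time step
alphas = softmax(energies, axis=1)                   # (m, Tx, 1): normalized across the Tx steps
context = np.einsum("mti,mtj->mij", alphas, a)       # (m, 1, 2*n_a): weighted sum of the a's

for name, arr in [("concat", concat), ("e", e), ("energies", energies),
                  ("alphas", alphas), ("context", context)]:
    print(name, arr.shape)
```

Under these assumptions each example gets Tx alphas that sum to 1 along axis 1. Taking the softmax over the last axis instead (which has size 1) would return all ones, which is probably the source of the confusion in this thread.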