Regarding the e^{<t,t'>} values in the figure below: are these values and their corresponding \alpha^{<t,t'>} values scalars?

Yes, they are scalars. For an in-depth understanding, please revisit the Neural Machine Translation assignment. There we have

```
repeator = RepeatVector(Tx)                  # copies s_prev across all Tx time steps
concatenator = Concatenate(axis=-1)          # joins a and s_prev along the feature axis
densor1 = Dense(10, activation = "tanh")     # 10-unit layer for the "intermediate energies"
densor2 = Dense(1, activation = "relu")      # collapses those 10 values to one scalar energy
activator = Activation(softmax, name='attention_weights') # We are using a custom softmax(axis = 1) loaded in this notebook
dotor = Dot(axes = 1)                        # weighted sum over Tx -> context vector
```

and

```
### START CODE HERE ###
# Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
s_prev = repeator(s_prev)
# Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
# For grading purposes, please list 'a' first and 's_prev' second, in this order.
concat = concatenator([a, s_prev])
# Use densor1 to propagate concat through a small fully-connected neural network to compute the "intermediate energies" variable e. (≈ 1 line)
e = densor1(concat)
# Use densor2 to propagate e through a small fully-connected neural network to compute the "energies" variable energies. (≈ 1 line)
energies = densor2(e)
# Use "activator" on "energies" to compute the attention weights "alphas" (≈ 1 line)
alphas = activator(energies)
# Use dotor together with "alphas" and "a", in this order, to compute the context vector to be given to the next (post-attention) LSTM-cell (≈ 1 line)
context = dotor([alphas, a])
### END CODE HERE ###
```

You see that we first use 10 units (`densor1`), then 1 unit (`densor2`), so each energy ends up as a single scalar.
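
To make the shapes concrete, here is a self-contained sketch of the same flow on dummy tensors. The dimensions `Tx = 30`, `n_a = 32`, `n_s = 64`, `m = 5` are assumptions for illustration, and the lambda stands in for the notebook's custom softmax:

```
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.layers import RepeatVector, Concatenate, Dense, Activation, Dot

Tx, n_a, n_s, m = 30, 32, 64, 5  # hypothetical sizes, for illustration only

repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation="tanh")
densor2 = Dense(1, activation="relu")
activator = Activation(lambda x: K.softmax(x, axis=1), name="attention_weights")
dotor = Dot(axes=1)

a = tf.random.normal((m, Tx, 2 * n_a))  # pre-attention bi-LSTM hidden states
s_prev = tf.random.normal((m, n_s))     # previous post-attention LSTM state

s_rep = repeator(s_prev)           # (m, Tx, n_s)
concat = concatenator([a, s_rep])  # (m, Tx, 2*n_a + n_s)
e = densor1(concat)                # (m, Tx, 10): intermediate energies
energies = densor2(e)              # (m, Tx, 1): one scalar energy per t'
alphas = activator(energies)       # (m, Tx, 1): sums to 1 over the Tx axis
context = dotor([alphas, a])       # (m, 1, 2*n_a): weighted sum of the a's
print(context.shape)               # (5, 1, 64)
```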

For some time I couldn’t understand why we needed `densor2`, but now it makes sense: it takes a weighted combination of the 10 outputs from `densor1` and passes it through a ReLU to get a single scalar energy for each time step.
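
A quick way to see this is to apply such a layer to a dummy tensor with the same shape as `densor1`'s output (the sizes below are made up):

```
import tensorflow as tf
from tensorflow.keras.layers import Dense

densor2 = Dense(1, activation="relu")
e = tf.random.normal((5, 30, 10))  # hypothetical (m, Tx, 10) output of densor1
energies = densor2(e)              # weighted sum of the 10 values, then ReLU
print(energies.shape)              # (5, 30, 1): one scalar per time step
```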

Same here, I had to write down the details a couple of times to notice what was going on. As I understand it, the concatenation step yields `(m, Tx, n_s + 2*n_a)` → this propagates through `densor1` to yield `(m, Tx, 10)` → which then propagates through `densor2` to yield `(m, Tx, 1)`, i.e. one scalar weight (energy) value associated with each input time step t'. We then apply the softmax over the Tx axis to get weights that sum to 1.
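
For reference, a minimal sketch of a softmax over axis 1 (the notebook's custom implementation may differ), plus a sanity check with made-up sizes:

```
import tensorflow as tf
from tensorflow.keras import backend as K

def softmax(x, axis=1):
    # Normalize over the Tx axis, so the weights for each example sum to 1
    return K.softmax(x, axis=axis)

energies = tf.random.normal((5, 30, 1))  # hypothetical (m, Tx, 1) energies
alphas = softmax(energies)
print(tf.reduce_sum(alphas, axis=1))     # ~1.0 for every example
```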
