Question about attention weights

Regarding the e^{<t,t'>} values in the attention model figure: are these values and their corresponding \alpha^{<t,t'>} values scalars?

Yes, they are scalars. For a deeper understanding, revisit the Neural Machine Translation assignment. Here we have

repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # We are using a custom softmax(axis = 1) loaded in this notebook
dotor = Dot(axes = 1)

and

def one_step_attention(a, s_prev):
    ### START CODE HERE ###
    # Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
    s_prev = None
    # Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
    # For grading purposes, please list 'a' first and 's_prev' second, in this order.
    concat = None
    # Use densor1 to propagate concat through a small fully-connected neural network to compute the "intermediate energies" variable e. (≈ 1 line)
    e = None
    # Use densor2 to propagate e through a small fully-connected neural network to compute the "energies" variable energies. (≈ 1 line)
    energies = None
    # Use "activator" on "energies" to compute the attention weights "alphas" (≈ 1 line)
    alphas = None
    # Use dotor together with "alphas" and "a", in this order, to compute the context vector to be given to the next (post-attention) LSTM cell (≈ 1 line)
    context = None
    ### END CODE HERE ###
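
For reference, here is one way those placeholders are typically filled in, using the layer objects defined above. Treat it as a sketch of the intended data flow rather than the graded solution verbatim; the shape comments assume a has shape (m, Tx, 2*n_a) and s_prev has shape (m, n_s).

def one_step_attention(a, s_prev):
    s_prev = repeator(s_prev)           # (m, n_s) -> (m, Tx, n_s)
    concat = concatenator([a, s_prev])  # (m, Tx, n_s + 2*n_a)
    e = densor1(concat)                 # (m, Tx, 10) "intermediate energies"
    energies = densor2(e)               # (m, Tx, 1)  one scalar per input time step
    alphas = activator(energies)        # (m, Tx, 1)  softmax over the Tx axis
    context = dotor([alphas, a])        # (m, 1, 2*n_a) weighted sum of the a's
    return context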

You can see that we first use 10 units (densor1) and then 1 unit (densor2), which collapses the energies to a single scalar per time step.

For some time I couldn’t understand why we needed densor2, but now it makes sense: it combines the 10 outputs of densor1 for each time step and passes the result through ReLU to get a single scalar energy.
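
Concretely, per time step densor2 is just a Dense(1) layer: a weighted sum of the 10 densor1 outputs followed by ReLU. A toy NumPy sketch with made-up weights:

import numpy as np

e_t = np.random.randn(10)                # densor1 output for one time step
W = np.random.randn(10, 1)               # Dense(1) weights (illustrative)
b = np.zeros(1)
energy_t = np.maximum(0.0, e_t @ W + b)  # ReLU(e_t W + b): a single scalar energy
print(energy_t.shape)                    # (1,)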

Same here; I had to write the details down a couple of times to see what was going on.

As I understand it, the concatenation step yields (m, Tx, n_s + 2*n_a) → this propagates through densor1 to yield (m, Tx, 10) → which then propagates through densor2 to yield (m, Tx, 1), i.e. one scalar weight (energy) value associated with each input time step t'.

We then apply the softmax (over the Tx axis) to get attention weights that sum to 1.
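
This walkthrough can be checked end to end in plain NumPy. The sizes below (m = 2, Tx = 5, n_a = 32, n_s = 64) are illustrative assumptions, and the two Dense layers are emulated with random weights:

import numpy as np

m, Tx, n_a, n_s = 2, 5, 32, 64                 # assumed toy sizes

a = np.random.randn(m, Tx, 2 * n_a)            # Bi-LSTM hidden states
s_prev = np.random.randn(m, Tx, n_s)           # s^{<t-1>} after RepeatVector(Tx)

concat = np.concatenate([a, s_prev], axis=-1)  # (m, Tx, n_s + 2*n_a)

W1, b1 = np.random.randn(n_s + 2 * n_a, 10), np.zeros(10)
e = np.tanh(concat @ W1 + b1)                  # densor1: (m, Tx, 10)

W2, b2 = np.random.randn(10, 1), np.zeros(1)
energies = np.maximum(0.0, e @ W2 + b2)        # densor2: (m, Tx, 1)

# Softmax over the Tx axis (axis=1), matching the custom softmax(axis=1) above
exp = np.exp(energies - energies.max(axis=1, keepdims=True))
alphas = exp / exp.sum(axis=1, keepdims=True)  # (m, Tx, 1)

context = (alphas * a).sum(axis=1, keepdims=True)  # what Dot(axes=1) computes

print(alphas.shape, alphas.sum(axis=1).ravel())    # (2, 5, 1) [1. 1.]
print(context.shape)                               # (2, 1, 64)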
