Question about attention weights

Regarding the e^{<t,t'>} values in the attention model figure: are these values and their corresponding \alpha^{<t,t'>} values scalars?

Yes, they are scalars. For a deeper understanding, revisit the Neural Machine Translation assignment. Here we have

repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # We are using a custom softmax(axis = 1) loaded in this notebook
dotor = Dot(axes = 1)

and

def one_step_attention(a, s_prev):
    ### START CODE HERE ###
    # Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
    s_prev = None
    # Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
    # For grading purposes, please list 'a' first and 's_prev' second, in this order.
    concat = None
    # Use densor1 to propagate concat through a small fully-connected neural network to compute the "intermediate energies" variable e. (≈ 1 line)
    e = None
    # Use densor2 to propagate e through a small fully-connected neural network to compute the "energies" variable energies. (≈ 1 line)
    energies = None
    # Use "activator" on "energies" to compute the attention weights "alphas" (≈ 1 line)
    alphas = None
    # Use dotor together with "alphas" and "a", in this order, to compute the context vector to be given to the next (post-attention) LSTM cell (≈ 1 line)
    context = None
    ### END CODE HERE ###
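
For reference, here is one way those placeholders are typically filled in, using the layer objects defined above. Treat it as a sketch of the intended data flow rather than the graded solution verbatim; the shape comments assume a has shape (m, Tx, 2*n_a) and s_prev has shape (m, n_s).

def one_step_attention(a, s_prev):
    s_prev = repeator(s_prev)           # (m, n_s) -> (m, Tx, n_s)
    concat = concatenator([a, s_prev])  # (m, Tx, n_s + 2*n_a)
    e = densor1(concat)                 # (m, Tx, 10) "intermediate energies"
    energies = densor2(e)               # (m, Tx, 1)  one scalar per input time step
    alphas = activator(energies)        # (m, Tx, 1)  softmax over the Tx axis
    context = dotor([alphas, a])        # (m, 1, 2*n_a) weighted sum of the a's
    return context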

You can see that we first use 10 units (densor1) and then 1 unit (densor2), which collapses the energies to a single scalar per time step.

For some time I couldn’t understand why we needed densor2, but now it makes sense: it combines the 10 outputs of densor1 for each time step and passes the result through ReLU to get a single scalar energy.
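
Concretely, per time step densor2 is just a Dense(1) layer: a weighted sum of the 10 densor1 outputs followed by ReLU. A toy NumPy sketch with made-up weights:

import numpy as np

e_t = np.random.randn(10)                # densor1 output for one time step
W = np.random.randn(10, 1)               # Dense(1) weights (illustrative)
b = np.zeros(1)
energy_t = np.maximum(0.0, e_t @ W + b)  # ReLU(e_t W + b): a single scalar energy
print(energy_t.shape)                    # (1,)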

Same here; I had to write the details down a couple of times to see what was going on.

As I understand it, the concatenation step yields (m, Tx, n_s + 2*n_a) → this propagates through densor1 to yield (m, Tx, 10) → which then propagates through densor2 to yield (m, Tx, 1), i.e. one scalar weight (energy) value associated with each input time step t'.

We then apply the softmax (over the Tx axis) to get attention weights that sum to 1.
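
This walkthrough can be checked end to end in plain NumPy. The sizes below (m = 2, Tx = 5, n_a = 32, n_s = 64) are illustrative assumptions, and the two Dense layers are emulated with random weights:

import numpy as np

m, Tx, n_a, n_s = 2, 5, 32, 64                 # assumed toy sizes

a = np.random.randn(m, Tx, 2 * n_a)            # Bi-LSTM hidden states
s_prev = np.random.randn(m, Tx, n_s)           # s^{<t-1>} after RepeatVector(Tx)

concat = np.concatenate([a, s_prev], axis=-1)  # (m, Tx, n_s + 2*n_a)

W1, b1 = np.random.randn(n_s + 2 * n_a, 10), np.zeros(10)
e = np.tanh(concat @ W1 + b1)                  # densor1: (m, Tx, 10)

W2, b2 = np.random.randn(10, 1), np.zeros(1)
energies = np.maximum(0.0, e @ W2 + b2)        # densor2: (m, Tx, 1)

# Softmax over the Tx axis (axis=1), matching the custom softmax(axis=1) above
exp = np.exp(energies - energies.max(axis=1, keepdims=True))
alphas = exp / exp.sum(axis=1, keepdims=True)  # (m, Tx, 1)

context = (alphas * a).sum(axis=1, keepdims=True)  # what Dot(axes=1) computes

print(alphas.shape, alphas.sum(axis=1).ravel())    # (2, 5, 1) [1. 1.]
print(context.shape)                               # (2, 1, 64)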
