Question on hidden state size for large sequence

Natural Language Processing with Attention Models

As mentioned in C4W1_Assignment:

To produce the next prediction, the attention layer will first receive all the encoder hidden states (i.e. orange rectangles) as well as the decoder hidden state when producing the word “como”

I am wondering: the number of encoder hidden states is limited, so for a very long sequence they cannot hold information about all the words. Won't this still degrade effectiveness?

Could you please help me understand?

hi @dsong99

Your understanding is correct!

In a basic RNN encoder-decoder architecture without attention, the encoder processes the entire input sequence and compresses all information into a final, fixed-size hidden state.

The decoder then uses only this single vector to generate the entire output sequence.

As the input sequence length increases, it becomes impossible for this fixed-size vector to retain all the necessary information, especially from the earlier parts of the sequence.

This issue is addressed by the attention mechanism, which changes how the encoder and decoder interact. Here is how it works:

  1. Access to all encoder states:
    Instead of the encoder passing only its final hidden state, it passes all its hidden states (the orange rectangles in your diagram) to the decoder. Each hidden state is a contextual representation of the corresponding word and its surroundings in the input sequence.

  2. Computing a dynamic context vector: At each step where the decoder generates an output word (for example, como), the attention mechanism computes a new context vector. This new vector is a weighted sum of all the encoder’s hidden states.

  3. Focused selection: The weights in this sum are determined by an “alignment score” between the current decoder hidden state and each encoder hidden state. This allows the model to dynamically focus on the most relevant parts of the input sequence for producing the current output word.
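The three steps above can be sketched in plain Python. This is a minimal illustration that uses simple dot-product alignment scores; real models use learned score functions (e.g. Bahdanau or Luong attention), and the states here are just made-up toy vectors:

```python
import math

def softmax(scores):
    """Normalize alignment scores into weights that sum to 1."""
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_context(decoder_state, encoder_states):
    """Dynamic context vector: a weighted sum of ALL encoder hidden
    states, weighted by alignment with the current decoder state."""
    # 1. Alignment score between the decoder state and each encoder state
    scores = [dot(decoder_state, h) for h in encoder_states]
    # 2. Turn scores into attention weights
    weights = softmax(scores)
    # 3. Weighted sum of the encoder hidden states
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# Toy example: 4 input words, hidden size 3
encoder_states = [[0.1, 0.0, 0.2],
                  [0.9, 0.5, 0.1],
                  [0.3, 0.8, 0.4],
                  [0.0, 0.2, 0.7]]
decoder_state = [0.8, 0.4, 0.1]   # decoder state when producing "como"
context, weights = attention_context(decoder_state, encoder_states)
```

Note that a *new* context vector is computed at every decoding step, so the model is never forced to squeeze the whole input into one fixed vector.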

The same attention mechanism was used to build the famous Transformer via self-attention, allowing every word in a sequence to attend to every other word and capturing long-range dependencies efficiently and in parallel.
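Self-attention is the same idea applied within one sequence: each position plays the role of the "decoder state" and attends over all positions. A rough sketch in plain Python (dot-product scores only; the real Transformer adds learned query/key/value projections, scaling, and multiple heads):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(states):
    """Each position attends over every position of the SAME sequence:
    output[i] is a weighted sum of all states, weighted by dot products."""
    out = []
    for q in states:
        scores = [sum(a * b for a, b in zip(q, k)) for k in states]
        w = softmax(scores)
        out.append([sum(wi * k[d] for wi, k in zip(w, states))
                    for d in range(len(q))])
    return out

# Toy sequence of 3 word vectors, hidden size 2
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(seq)   # 3 contextualized vectors, computed in parallel
```

Every output vector mixes information from the whole sequence, which is how long-range dependencies are captured without stepping through the sequence one token at a time.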

Regards
DP

Thanks, I guess these words matter: "Each hidden state is a contextual representation of the corresponding word and its surroundings in the input sequence."

So each hidden state is not a representation of a single word.

BTW: I may need to read more on attention model, do you have a few books to recommend?

Correct. When it comes to longer sequences, sentences are tokenized into chunks (groups of subword tokens), especially when training a model on billions of tokens, as with the famous LLMs.

When it comes to the attention mechanism, there are many techniques that build on it.

For example, there is the BERT model (Bidirectional Encoder Representations from Transformers),

and ColBERT (contextualized late interaction over BERT).

If you are interested in NLP techniques in particular, try some of the short courses. I recently tested the RAG course, which covers many of these techniques, but take it only if you are already well versed in the basic courses.

I don’t know of books, but the research paper that introduced the Transformer explains a lot about the attention mechanism. Sharing it here:

Transformer.pdf (2.1 MB)

Thanks
