Sharing: Transformer is like a show

Damon · June 23, 2021, 1:08am

Transformer is so powerful, and here I try to make it colorful, by interpreting it with a metaphor: Encoder = Stage Show, Decoder = Oscar Award Assessment.

In this new perspective, we see the Encoder process as performing a stage show, and the Decoder process is just like the Oscar Award Committee assessing the show.

Below is the graph and interpretation:

Part 1: Encoder = Stage Show

Phase:

P1. Ideas → Conceiving_Story → Script

A performance is based on a script, a script is inspired by a story, and a story starts with ideas.
Likewise, at the beginning of the Transformer we have a raw sequence, which we preprocess to fit the shape the model expects.
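
To make P1 concrete, here is a minimal NumPy sketch of the preprocessing that the pseudo code lists as Embedding → Scale → Pos_encoding. The sinusoidal position table, the sqrt(d_model) scale factor, the toy sizes, and the names positional_encoding and preprocess are my assumptions for illustration, and dropout is left out (inference view).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position table of shape (seq_len, d_model) (assumed scheme)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model)[None, :]              # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    table = np.zeros((seq_len, d_model))
    table[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions use sine
    table[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions use cosine
    return table

def preprocess(token_ids, embedding, d_model):
    """Embedding -> Scale -> Pos_encoding (dropout omitted)."""
    x = embedding[token_ids]                     # one vector per token
    x = x * np.sqrt(d_model)                     # scale (assumed factor)
    return x + positional_encoding(len(token_ids), d_model)

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8                      # hypothetical toy sizes
embedding = rng.normal(size=(vocab_size, d_model))
x = preprocess(np.array([3, 1, 4]), embedding, d_model)
print(x.shape)  # (3, 8)
```

The result x is the "script" that the later phases work on: one row per position, one column per feature.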

P2. Script Interpretation → Rehearsal → Adjustment

Script Interpretation: x is the performance script, which can be interpreted from 3 different aspects:
- Q = plots, K = roles, V = characteristics
- A script consists of a series of plots; each plot is performed by different roles, and each role has its own characteristics.
- In other words, different roles show different characteristics depending on the plot, just like different keys match different queries.
- Further, the same role may show a different characteristic in a different plot, just like the same word may have a different meaning at a different position in the sequence.
- If Q=K=V=x, that is plots=roles=characteristics=script, it is a solo performance, and the actor can perform at will to express the theme of the story. In the Keras Multi-Head Attention API, this is called “self-attention”.
Rehearsal: Multi-Head Attention is interpreted as rehearsal, since it processes the script interpretation (Q, K, V), just like actors rehearse after understanding the script.
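
As a rough illustration of the Q=K=V=x case, here is a single-head scaled dot-product attention in NumPy. It is a simplification, not the Keras layer: real Multi-Head Attention also applies learned projections to Q, K, V and runs several heads in parallel. The function names and the mask convention (1 marks a hidden position) are my assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is 1 where a key must be hidden."""
    scores = q @ k.T / np.sqrt(k.shape[-1])      # how well each plot matches each role
    if mask is not None:
        scores = scores + mask * -1e9            # push hidden positions to ~zero weight
    weights = softmax(scores)
    return weights @ v, weights                  # mix characteristics by weight

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # the script: 4 positions, 8 features
out, weights = attention(x, x, x)                # solo performance: Q = K = V = x
print(out.shape, weights.shape)  # (4, 8) (4, 4)
```

Each row of weights sums to 1, so every position's output is a weighted blend of all positions in the script.
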
Adjustment: Dropout → ResAdd → LayerNorm
- Dropout: randomly cut some plots from the story, so the show does not depend too heavily on a few actors’ personal performances or bet everything on the climax
- ResAdd: keep the connection between plots rather than isolating them
- LayerNorm: normalize the actors’ performances, so that no actor brings in too much personal character
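
The Adjustment step (Dropout → ResAdd → LayerNorm) can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the function name adjust, the dropout rate, and the inverted-dropout scaling are mine, and the learnable gain and bias that a real LayerNorm layer has are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position to zero mean and unit variance (no gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adjust(sublayer_out, residual, rate=0.1, training=False, rng=None):
    """Dropout -> ResAdd -> LayerNorm."""
    if training:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(sublayer_out.shape) >= rate        # cut some plots at random
        sublayer_out = sublayer_out * keep / (1.0 - rate)    # inverted dropout scaling
    return layer_norm(residual + sublayer_out)               # ResAdd, then LayerNorm

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))          # input to the sub-layer
sub = 0.5 * x                        # stand-in for the attention output
out1 = adjust(sub, x)                # inference: dropout is skipped
print(out1.shape)  # (3, 8)
```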

P3. Crew Discussion → Adjustment

Crew Discussion: FullyConnected is interpreted as a crew discussion that abstracts the features and flaws of the performance.
Adjustment: same as in P2 above.

P4. Repeat P2 - P3 to practice N rounds

After repeated practice, the real_show (enc_output) goes live on stage.

P5. Real Show Performing

The assessment members will watch & record this stage show, just like enc_output is passed to the Decoder.

Pseudo code:

P1: input_x → preprocess(Embedding → Scale → Pos_encoding → Dropout) → x
P2: Query=x, Value=x, Key=x, enc_padding_mask → Multi_Head_Attention → adjust(Dropout → ResAdd → LayerNorm) → out1
P3: out1 → FC → adjust(Dropout → ResAdd → LayerNorm) → out2
P4: loop(P2 - P3) → update(out2)
P5: enc_output = out2
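
The encoder pseudo code above can be turned into a small runnable NumPy sketch. Please read it as an illustration under assumptions, not the course implementation: attention is single-head without learned projections, FC is assumed to be Dense(relu) followed by Dense, dropout is skipped (inference view), the padding mask uses 1 for padded positions, and all names and sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, pad_mask=None):
    # Query = Key = Value = x (single head, no projections: a simplification)
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if pad_mask is not None:
        scores = scores + pad_mask[None, :] * -1e9   # hide padded keys
    return softmax(scores) @ x

def feed_forward(x, w1, b1, w2, b2):
    # Assumed FC shape: Dense(relu) -> Dense
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def encoder(x, layers, pad_mask=None):
    for p in layers:                                          # P4: N rounds
        out1 = layer_norm(x + self_attention(x, pad_mask))    # P2: rehearsal + adjust
        out2 = layer_norm(out1 + feed_forward(out1, **p))     # P3: discussion + adjust
        x = out2
    return x                                                  # P5: enc_output

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_layers = 4, 8, 16, 2                # toy sizes
layers = [dict(w1=rng.normal(size=(d_model, d_ff)) * 0.1, b1=np.zeros(d_ff),
               w2=rng.normal(size=(d_ff, d_model)) * 0.1, b2=np.zeros(d_model))
          for _ in range(n_layers)]
x = rng.normal(size=(seq_len, d_model))
pad_mask = np.array([0.0, 0.0, 0.0, 1.0])                     # last position is padding
enc_output = encoder(x, layers, pad_mask)
print(enc_output.shape)  # (4, 8)
```

Because the padded position is hidden as a key in every round, the outputs of the three real positions do not change when the padded row changes.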

Part 2: Decoder = Oscar Award Assessment

Phase:

P1. Rumors → Dig into the story

At the very beginning, the show is not on yet, but rumors (SOS) are already spreading around, attracting people’s attention and preparing for the premiere.
When assessment members hear these rumors, they start to dig into the story by reading the introduction, comments, etc.
Likewise, at the beginning of the Decoder we have no real input target, only a sign of it, which we then preprocess.

P2. Ask Questions → Adjustment

Ask Questions:
- The first Multi-Head Attention outputs a Query, which is like the questions people think of and ask about what they heard & read.
- So, in code, the input of the first Multi-Head Attention (MHA) is: target, target, target. Because when curiosity is first triggered, all you are thinking is: more, more, more on the topic.
Adjustment: see P3 below.
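
One detail worth showing for this first MHA: the decoder pseudo code passes a look_ahead_mask along with target, target, target. Below is a small NumPy sketch of such a mask and its effect on the attention weights. The convention that 1 marks a hidden position, the single-head attention without projections, and the function names are my assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def look_ahead_mask(size):
    """1 marks a future position that must stay hidden."""
    return np.triu(np.ones((size, size)), k=1)

def masked_self_attention(target):
    # Query = Key = Value = target
    scores = target @ target.T / np.sqrt(target.shape[-1])
    scores = scores + look_ahead_mask(len(target)) * -1e9   # no peeking ahead
    weights = softmax(scores)
    return weights @ target, weights

print(look_ahead_mask(3))
# [[0. 1. 1.]
#  [0. 0. 1.]
#  [0. 0. 0.]]

rng = np.random.default_rng(0)
target = rng.normal(size=(3, 8))             # 3 target positions, 8 features
out1, weights = masked_self_attention(target)
```

In the metaphor, a jury member can only ask questions about what has been heard so far, so position i gets zero weight on every position after i.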

P3. Watch the Show & Answer the Questions → Adjustment

Watch the Show & Answer the Questions:
- In turn, the second MHA answers the questions asked in the first MHA by watching the show.
- So, in code, the Query from the first MHA and the enc_output from the Encoder are passed to the second MHA as input. Since both the Key and the Value information come from the show, we set Key=enc_output, Value=enc_output.
Adjustment: adjustment in assessment is interpreted differently from adjustment in performing
- Dropout: randomly delete some opinions of the jury, to guard against manipulation by authority
- ResAdd: evaluate the show comprehensively together with the previous assessment, not in isolation
- LayerNorm: normalize the assessment to a standard criterion
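
Here is a hedged NumPy sketch of the second MHA, where the Query comes from the first MHA and Key=Value=enc_output. As before it is single-head without learned projections, the mask uses 1 for padded source positions, and the names and sizes are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(query, enc_output, dec_padding_mask=None):
    """Query from the decoder, Key = Value = enc_output."""
    scores = query @ enc_output.T / np.sqrt(enc_output.shape[-1])
    if dec_padding_mask is not None:
        scores = scores + dec_padding_mask[None, :] * -1e9   # ignore padded source positions
    weights = softmax(scores)
    return weights @ enc_output, weights

rng = np.random.default_rng(0)
out1 = rng.normal(size=(2, 8))                 # questions: 2 target positions
enc_output = rng.normal(size=(5, 8))           # the show: 5 source positions
dec_padding_mask = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
answers, weights = cross_attention(out1, enc_output, dec_padding_mask)
print(answers.shape, weights.shape)  # (2, 8) (2, 5)
```

Note the shapes: there is one answer per question (target position), and each answer is a blend of scenes from the show (source positions).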

P4. Assessment Discussion → Adjustment

Assessment Discussion: this FullyConnected is interpreted as a discussion about the evaluation and its criteria.
Adjustment: same as in P3 above.

P5. Repeat P2 - P4 to assess N rounds

Assess for several rounds, ensuring the show is well and fairly understood and evaluated.

P6. Voting → Oscar Awards Ceremony

Softmax is like voting, disclosing the final winner of the Oscar Award.
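
The voting step can be sketched as a dense projection to vocabulary size followed by softmax, then picking the candidate with the most votes. The weights here are random stand-ins and the sizes are hypothetical, so this only shows the mechanics.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
tgt_len, d_model, vocab_size = 3, 8, 10          # toy sizes
dec_output = rng.normal(size=(tgt_len, d_model)) # stand-in for the decoder output
w = rng.normal(size=(d_model, vocab_size))       # Dense layer weights (random here)
b = np.zeros(vocab_size)

votes = softmax(dec_output @ w + b)              # share of votes per candidate token
winners = votes.argmax(axis=-1)                  # the winner at each position
print(votes.shape, winners.shape)  # (3, 10) (3,)
```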

Pseudo code:

P1: SOS → preprocess(Embedding → Pos_encoding) → target
P2: Query=target, Value=target, Key=target, look_ahead_mask → Multi-Head Attention → adjust(Dropout → ResAdd → LayerNorm) → out1
P3: Query=out1, Value=enc_output, Key=enc_output, dec_padding_mask → Multi-Head Attention → adjust(Dropout → ResAdd → LayerNorm) → out2
P4: out2 → FC → adjust(Dropout → ResAdd → LayerNorm) → out3
P5: loop(P2 - P4) → update(out3) → dec_output = out3
P6: dec_output → Dense(‘softmax’) → ŷ

Part 3: Summary

What makes the Transformer special compared to other models is just like what makes a stage show different from a film:

In a stage show, all acts can be performed together at the same time, as long as the imagination and the stage are big enough, whereas a film is a fixed time sequence that can only be played one scene at a time.

Transformer is like a show, attention is all you need.

TMosh · June 30, 2021, 4:54am

Thanks for your summary.
