Multi-Task Training Strategy questions

In C4W3 lecture video “Multi-Task Training Strategy” I had a few questions.

In the data training strategies slide, examples-proportional and equal mixing are introduced. What’s the difference between Data 1 and Data 2 in the diagram? Do they each refer to a different type of task? This was unclear.

What does unfreezing mean? In the Gradual Unfreezing vs. Adapter Layers slide, the video explains that gradual unfreezing means you “unfreeze one layer at a time”. But it doesn’t explain what unfreezing actually means. Why would you want to unfreeze layers?

On the Fine-Tuning slide, Younes says “they do the training in 2^18 steps”. Who is “they”? Is this slide about a specific model?

Hi @esav

You got that right.
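To make the two mixing strategies concrete: examples-proportional mixing samples each task in proportion to its dataset size (with each size capped at some limit K, as in the T5 paper), while equal mixing samples every task with the same probability. Here is a small sketch; the dataset sizes and the value of K below are made-up numbers for illustration:

```python
def examples_proportional_rates(sizes, K):
    # Cap each dataset size at K, then normalize to get sampling rates.
    capped = [min(s, K) for s in sizes]
    total = sum(capped)
    return [c / total for c in capped]

def equal_mixing_rates(sizes):
    # Every task is sampled with the same probability, regardless of size.
    return [1 / len(sizes) for _ in sizes]

# Hypothetical dataset sizes for three tasks (e.g. Data 1, Data 2, Data 3)
sizes = [1_000_000, 50_000, 10_000]

print(examples_proportional_rates(sizes, K=100_000))  # big sets get capped
print(equal_mixing_rates(sizes))                      # uniform across tasks
```

With the cap, the huge first dataset no longer dominates the mixture the way raw proportional sampling would.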

Unfreezing here means that the “unfrozen” layer’s weights can now be updated; while they are “frozen”, gradient descent cannot change them.
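As a toy illustration of what that means in code (frameworks like PyTorch do this with flags such as `requires_grad`; this is just a bare-bones sketch with made-up numbers):

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """One gradient-descent update; frozen parameters are left untouched."""
    return [p if is_frozen else p - lr * g
            for p, g, is_frozen in zip(params, grads, frozen)]

params = [1.0, 2.0, 3.0]
grads  = [0.5, 0.5, 0.5]
frozen = [True, True, False]   # only the last parameter is "unfrozen"

# The frozen parameters stay put; only the unfrozen one moves by lr * grad.
print(sgd_step(params, grads, frozen))
```

Gradual unfreezing just flips those flags one layer at a time, starting from the top, so early fine-tuning steps only adjust the last layers.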

Here “they” refers to the T5 authors (or creators, not sure which word to use :slight_smile:). The slide describes what the authors did in the paper; in particular, during the fine-tuning phase they used 2^18 steps. From the paper:

> Recall that our baseline model consists of 12 layers each in the encoder and decoder and is fine-tuned for 2^18 steps. As such, we subdivide the fine-tuning process into 12 episodes of 2^18 / 12 steps each and train from layers 12 − n to 12 in the nth episode.
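Taking the quote literally, the schedule works out as below (a small sketch assuming 12 layers and 2^18 total steps as quoted; the paper’s exact layer indexing may differ):

```python
# 12 episodes of 2**18 / 12 steps each; in episode n, layers 12 - n
# through 12 have their weights updated (everything else stays frozen).
TOTAL_STEPS = 2 ** 18        # 262144 fine-tuning steps
NUM_EPISODES = 12

steps_per_episode = TOTAL_STEPS // NUM_EPISODES   # integer division
schedule = {n: list(range(12 - n, 13)) for n in range(1, NUM_EPISODES + 1)}

print(steps_per_episode)   # 21845
print(schedule[1])         # [11, 12]: only the top layers are unfrozen
print(schedule[12])        # [0, 1, ..., 12]: everything is unfrozen
```

So each episode unfreezes one more layer from the top, until the final episode trains the whole stack.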


Thanks for the help!