In the week 3 Distillation lecture, we see how to make a model smaller for runtime, using a process that is fairly complex compared to ordinary training with backpropagation.
Since the training data used to train the teacher model is required, it seems like there would be no need for a teacher: just train the smaller model on that data! Obviously there must be more to it.
The key advantage over just training a smaller model on the same dataset was only implied. Reading elsewhere, it seems the student model has the advantage of seeing the teacher's label probabilities, which imparts more information than the usual (X, Y) training set. Obviously the student needs the same set of labels for the distillation loss math to work out.
Questions:
How well does distillation typically work in terms of size reduction and performance?
Only classification is discussed, but can next-token prediction also work for an encoder-only model?
Size is definitely reduced, and that is the reason for training the student. Performance could drop, although sometimes it might not, because the optimal state of a neural network does not depend entirely on its size!
Hey @Paul_Baclace,
On top of what @gent.spah said, let me add a few more notes.
As @gent.spah said, the size is reduced, and the reduction in model size can vary depending on how aggressive the compression is.
There is usually a trade-off between model size and performance. Smaller student models may not perform as well as the larger teacher model, especially on complex tasks. However, the goal is to strike a balance where a minor drop in performance is acceptable in exchange for the reduction in model size.
Also note that the effectiveness of distillation can vary depending on the task. For tasks where the teacher model performs very well, distillation tends to work better. It’s less effective for tasks where the teacher model is not very accurate.
One of the key advantages of distillation is that it can be more data-efficient than training a smaller model from scratch. The student model learns not only from the hard labels (ground truth) but also from the soft labels (teacher’s predictions), which contain more information about the relationships between classes or tokens.
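To make the soft-vs-hard label point concrete, here is a minimal PyTorch-style sketch of a distillation loss (the function name and the `T` and `alpha` values are just illustrative assumptions, not the exact formulation from the lecture):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine soft-label (teacher) and hard-label (ground truth) losses.

    student_logits, teacher_logits: (batch, num_classes) -- same label space
    labels: (batch,) integer class ids
    T: temperature that softens both distributions
    alpha: illustrative weight between the two loss terms
    """
    # Soft targets: KL divergence between temperature-scaled distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

The temperature `T` softens both distributions so the student can see how the teacher spreads probability across the wrong-but-plausible classes, which is exactly the extra information you pointed out.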
Regarding your question about next-token prediction: strictly speaking, next-token prediction is a decoder-style (causal) objective, while encoder-only models such as BERT are usually trained with masked-token prediction. In either case, distillation can indeed be applied. The key idea is the same: the teacher model (usually a larger one) produces a probability distribution over the vocabulary at each position, and a smaller student model is trained to mimic those predictions. The student can be an encoder-only model or any other architecture that suits the task.
In this setting, the teacher model's per-token predictions provide valuable information about the relationships between tokens, helping the student model improve its performance. This approach can lead to a smaller model that retains most of the teacher's language modeling capabilities while being more computationally efficient.
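For token-level distillation, the same loss is simply applied independently at every sequence position. A rough sketch, assuming the student and teacher share the same tokenizer and vocabulary (the tensor names here are hypothetical):

```python
import torch.nn.functional as F

def token_distillation_loss(student_logits, teacher_logits, attention_mask, T=2.0):
    """Per-token distillation for language models.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # KL divergence between teacher and student distributions over the
    # vocabulary, computed independently at each position.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="none",
    ).sum(dim=-1) * (T * T)          # shape: (batch, seq_len)

    # Average only over non-padding positions.
    mask = attention_mask.float()
    return (kl * mask).sum() / mask.sum()
```

In practice this term is usually combined with the regular language-modeling loss on the ground-truth tokens, just as the classification example above combines soft and hard labels.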
I hope it makes sense for you now and feel free to ask for more clarifications.
Regards,
Jamal