In the week 3 Distillation lecture, we see how to make a model smaller for runtime, using a process that is fairly complex compared to ordinary training with backpropagation.
Since the training data used to train the teacher model is required, it seems like there would be no need for a teacher: just train the smaller model on that data! Obviously there must be more to it.
The key advantage over just training a smaller model on the same dataset was only implied. Reading elsewhere, it seems the student model has the advantage of seeing the teacher's label probabilities, which imparts more information than the usual (X, Y) training set. Obviously the student needs the same set of labels for the distillation loss math to work out.
Questions:
How well does distillation typically work in terms of size reduction and performance?
Only classification is discussed, but can next-token prediction also work for an encoder-only model?
Size is definitely reduced, and that is the reason for training the student. Performance could drop, although sometimes it might not, because the optimal state of a neural network does not depend entirely on its size!
Hey @Paul_Baclace,
On top of what @gent.spah said, let me add a few more notes.
As @gent.spah said, the size is reduced, and the reduction in model size can vary depending on how aggressive the compression is.
There is usually a trade-off between model size and performance. Smaller student models may not perform as well as the larger teacher model, especially on complex tasks. However, the goal is to strike a balance where a minor drop in performance is acceptable in exchange for the reduction in model size.
Also note that the effectiveness of distillation can vary depending on the task. For tasks where the teacher model performs very well, distillation tends to work better. It’s less effective for tasks where the teacher model is not very accurate.
One of the key advantages of distillation is that it can be more data-efficient than training a smaller model from scratch. The student model learns not only from the hard labels (ground truth) but also from the soft labels (teacher’s predictions), which contain more information about the relationships between classes or tokens.
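To make the soft-vs-hard label point concrete, here is a minimal PyTorch-style sketch of a distillation loss (the function name and the `T` and `alpha` values are just illustrative assumptions, not the exact formulation from the lecture):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine soft-label (teacher) and hard-label (ground truth) losses.

    student_logits, teacher_logits: (batch, num_classes) -- same label space
    labels: (batch,) integer class ids
    T: temperature that softens both distributions
    alpha: illustrative weight between the two loss terms
    """
    # Soft targets: KL divergence between temperature-scaled distributions.
    # The T*T factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

The temperature `T` softens both distributions so the student can see how the teacher spreads probability across the wrong-but-plausible classes, which is exactly the extra information you pointed out.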
Regarding your question about next-token prediction: strictly speaking, next-token prediction is a decoder-style (causal) objective, while encoder-only models such as BERT are usually trained with masked-token prediction. In either case, distillation can indeed be applied. The key idea is the same: the teacher model (usually a larger one) produces a probability distribution over the vocabulary at each position, and a smaller student model is trained to mimic those predictions. The student can be an encoder-only model or any other architecture that suits the task.
In this setting, the teacher model's per-token predictions provide valuable information about the relationships between tokens, helping the student model improve its performance. This approach can lead to a smaller model that retains most of the teacher's language modeling capabilities while being more computationally efficient.
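For token-level distillation, the same loss is simply applied independently at every sequence position. A rough sketch, assuming the student and teacher share the same tokenizer and vocabulary (the tensor names here are hypothetical):

```python
import torch.nn.functional as F

def token_distillation_loss(student_logits, teacher_logits, attention_mask, T=2.0):
    """Per-token distillation for language models.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    # KL divergence between teacher and student distributions over the
    # vocabulary, computed independently at each position.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="none",
    ).sum(dim=-1) * (T * T)          # shape: (batch, seq_len)

    # Average only over non-padding positions.
    mask = attention_mask.float()
    return (kl * mask).sum() / mask.sum()
```

In practice this term is usually combined with the regular language-modeling loss on the ground-truth tokens, just as the classification example above combines soft and hard labels.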
I hope it makes sense for you now and feel free to ask for more clarifications.
Regards,
Jamal