In Week 3, when learning about deploying performant models, one of the methods discussed was “distillation”.
In order to get a smaller, less expensive model, we take our trained, fine-tuned, RLHF-trained model and use it as a “teacher” to train a smaller “student” model to produce the same completions…
but as part of this we also balance getting the student model to make the same predictions as the teacher model against training the student model directly on the training data…
Perhaps I’m being reductive or don’t have a deep enough understanding, but this seems very convoluted and inefficient to me. It doesn’t seem elegant.
Going to the effort of training a big model on training data, going through various tuning steps, only to conclude that the model needs to be smaller, and then training a smaller model to produce the same output using the big model, plus training the smaller model with the data directly? It just seems like there are redundant steps here?
Why would we not just directly do our initial training on the smaller, deployable model? What advantage does the “student”/“teacher” method confer over training the smaller model directly with the same training data? Particularly since this technique seems predicated on the assumption that the smaller model is capable of producing the same outputs as the bigger model?
I asked ChatGPT about this and it spoke of how the larger model is more capable of discovering patterns and relationships: whilst the smaller model is capable of representing these patterns, it’s less capable of ‘discovering’ them on its own… this kind of helps with my understanding.
Did anyone else find this bit confusing? Anyone got an analogy which would help me understand the benefit?
I would be interested in getting a better understanding as well. The lesson didn’t really provide any use cases beyond saying that it was mostly used on encoder-only models. I assumed it would be used for a smaller and more specific set of tasks, as the teacher would normally be trained for many more tasks than your application needs.
I agree with both of you. The “student-teacher” method called distillation may seem convoluted, but it has a clear purpose rooted in the differences in how large and small models learn and generalize.
An analogy to illustrate why this approach is valuable is to imagine a master chef (teacher) who has mastered every cooking technique and knows how to create the most complex dishes. Now imagine you’re opening a small bakery where you only need to make a few types of bread and pastries (student). You don’t need to learn every cooking technique from scratch. Instead, the master chef can guide you and show you the most relevant techniques for making those specific pastries. You don’t have to spend years perfecting every cooking skill; you just need the shortcuts and insights of the master chef to make your bakery successful.
This is how distillation works: instead of having to learn all the intricacies from scratch, you “distill” the most relevant knowledge from the larger model, allowing the smaller model to specialize in a more focused task. If you train the smaller model directly, it has less capacity to “see” and capture the intricacies of the data, especially in highly complex tasks; the teacher model essentially provides guidance on what patterns are important, streamlining the student’s learning process.
The analogy here is that the teacher provides a “guide” or “cheat sheet” for the student, allowing the student to focus on the important parts of the task without having to discover everything from scratch. This is especially important in real-world applications where we want to reduce the computational cost (smaller models) while maintaining performance. Distillation helps bridge the gap between a model that’s too large for practical use and a model that’s too small to capture the complexity of the task. It acts as a way to efficiently transfer knowledge, ensuring that a smaller model can still perform well in specific applications while being more lightweight and efficient. It’s not just about reducing model size - it’s about improving the smaller model’s ability to perform well by standing on the shoulders of a giant.
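To make the balancing act concrete, here is a minimal sketch of the classic distillation loss in PyTorch. The names (student_logits, teacher_logits) and the temperature and alpha values are just illustrative assumptions, not anything the course prescribes: the student is pulled toward the teacher’s softened output distribution and toward the ground-truth labels at the same time.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of 'imitate the teacher' and 'learn from the data'.

    Illustrative sketch only; temperature and alpha are assumed values.
    """
    # Soft targets: the student mimics the teacher's softened probability
    # distribution (KL divergence between the two, scaled by T^2).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: the student also trains on the ground-truth labels,
    # exactly as it would if we trained it directly on the data.
    hard = F.cross_entropy(student_logits, labels)
    # alpha balances the two objectives from the original question.
    return alpha * soft + (1 - alpha) * hard

# Toy usage: 4 examples, 10-way classification.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)            # from the frozen teacher
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The alpha knob is exactly the trade-off described in the original question: at 1.0 the student only imitates the teacher, and at 0.0 it is just ordinary training on the data.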