Week 3: Distillation: Train a big model, then use that to train a small model, seems convoluted?

In Week 3, while learning about deploying performant models, one of the methods discussed was “distillation”.

To get a smaller, less expensive model, we take our trained, fine-tuned, RLHF-trained model and use it as a “teacher” to train a smaller “student” model to produce the same completions.
As part of this, we also balance two objectives: getting the student model to match the teacher model’s predictions versus training the student model directly on the training data.
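
For concreteness, my understanding is that the “balance” is literally a weighted sum of two loss terms: one pushing the student’s output distribution towards the teacher’s softened predictions, and one being the ordinary cross-entropy against the labels. A rough PyTorch sketch of what I mean (my own paraphrase, not code from the course; the alpha and temperature values are just placeholders):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0):
    """Combine 'match the teacher' with 'match the training labels'."""
    # Soft-target term: KL divergence between the temperature-softened
    # student and teacher distributions (the "copy the teacher" part).
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Hard-label term: ordinary cross-entropy on the training data
    # (the "train on the data directly" part).
    hard_loss = F.cross_entropy(student_logits, labels)

    # The "balance" is just this weighting between the two terms.
    return alpha * soft_loss + (1 - alpha) * hard_loss
```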

Perhaps I’m being reductive or don’t have a deep enough understanding, but this seems very convoluted and inefficient to me. It doesn’t seem elegant.

Going to the effort of training a big model on the training data and taking it through various tuning steps, only to conclude that the model needs to be smaller, and then training a smaller model to reproduce the big model’s outputs while also training it on the data directly? It just seems like there are redundant steps here.

Why not just do our initial training directly on the smaller, deployable model? What advantage does the “student”/“teacher” method confer over training the smaller model directly on the same training data? Particularly since this technique seems predicated on the assumption that the smaller model is capable of producing the same outputs as the bigger model.

I asked ChatGPT about this, and it spoke of how the larger model is more capable of discovering patterns and relationships: whilst the smaller model is capable of representing those patterns, it is less capable of ‘discovering’ them on its own. That kind of helps with my understanding.

Did anyone else find this bit confusing? Anyone got an analogy which would help me understand the benefit?

I would be interested in getting a better understanding as well. The lesson didn’t really provide any use cases beyond saying that it is mostly used on encoder-only models. I assumed it would be used for a smaller and more specific set of tasks, since the teacher would normally have been trained for many more tasks than your application needs.

I agree with both of you. The “student-teacher” method called distillation may seem convoluted, but it has a clear purpose rooted in the differences in how large and small models learn and generalize.

An analogy to illustrate why this approach is valuable: imagine a master chef (the teacher) who has mastered every cooking technique and knows how to create the most complex dishes. Now imagine you (the student) are opening a small bakery where you only need to make a few types of bread and pastries. You don’t need to learn every cooking technique from scratch. Instead, the master chef can guide you and show you the most relevant techniques for making those specific pastries. You don’t have to spend years perfecting every cooking skill; you just need the shortcuts and insights of the master to make your bakery successful.

This is how distillation works. Instead of having the student learn all the intricacies from scratch, which the smaller model may not have the capacity to do (if you train the smaller model directly, it has less capacity to “see” and capture the intricacies of the data, especially on highly complex tasks), you “distill” the most relevant knowledge from the larger model. The teacher essentially provides guidance on which patterns are important, streamlining the student’s learning and allowing the smaller model to specialize in a more focused task.
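
To make that “guidance” concrete with a toy example (the numbers below are invented purely for illustration): a hard label only tells the student “the answer is class 2”, while the teacher’s softened output distribution also tells it which wrong answers are nearly right and which are hopeless. That ranking of alternatives is exactly the knowledge the student would otherwise have to discover on its own.

```python
import torch
import torch.nn.functional as F

# Invented teacher logits for one example with four classes.
teacher_logits = torch.tensor([1.0, 0.5, 3.0, -2.0])

# The hard label only says "it's class 2" and nothing more.
hard_label = torch.tensor(2)

# The teacher's softened distribution carries much more information:
# class 0 is a plausible alternative, class 3 is clearly wrong, etc.
soft_targets = F.softmax(teacher_logits / 2.0, dim=-1)  # temperature = 2.0
print(soft_targets)  # ≈ tensor([0.2119, 0.1650, 0.5759, 0.0473])
```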

The analogy here is that the teacher provides a “guide” or “cheat sheet” for the student, allowing the student to focus on the important parts of the task without having to discover everything from scratch.

This is especially important in real-world applications where we want to reduce computational cost (smaller models) while maintaining performance. Distillation helps bridge the gap between a model that’s too large for practical use and a model that’s too small to capture the complexity of the task on its own. It’s an efficient way to transfer knowledge, ensuring that a smaller model can still perform well in a specific application while being more lightweight and efficient. It’s not just about reducing model size - it’s about improving the smaller model’s ability to perform well by standing on the shoulders of a giant.