Prediction types during distillation optimization

What is the difference between soft and hard predictions by the student LLM during LLM optimization by distillation?

Hey @saileshbaidya,

In the context of LLM (large language model) optimization through distillation, “soft” and “hard” predictions refer to two different ways of transferring knowledge from a teacher model to a student model. Let me explain the differences between them:

  1. Hard Predictions:

    • Hard predictions are binary or discrete values. In the context of natural language processing, they often represent the most likely output for a given input.

    • When a teacher model makes hard predictions, it means it directly produces discrete values, such as specific words or tokens. These predictions are not probabilistic or continuous.

    • They give the student model clear, deterministic targets, but they carry no information about how the teacher ranked the alternatives.

    • The teacher model’s hard predictions are often used as labels to train the student model to mimic the teacher’s decision-making process.

  2. Soft Predictions:

    • Soft predictions are continuous or probabilistic values. They represent the likelihood or distribution of different outcomes.

    • In the context of language models, soft predictions can be a probability distribution over the entire vocabulary for each position in a sequence. They provide information about the model’s uncertainty and confidence in various token choices.

    • In distillation, the teacher’s probability distribution serves as the training target, so the student learns not only the most likely choice but also how much probability the teacher assigns to the alternatives.

Here is an example that illustrates both:

Hard Predictions:
In a text completion task, the teacher model is a language model, and the input is an incomplete sentence like, “The capital of France is ____.” A hard prediction from the teacher model might look like this:

  • Teacher Model’s Hard Prediction: “The capital of France is Paris.”

In this example, the teacher model directly produces a complete and discrete answer, “Paris,” without indicating any level of uncertainty.

Soft Predictions:
In the same text completion task, the teacher model can instead produce a probability distribution over possible words for the blank. The soft prediction might look like this:

  • Teacher Model’s Soft Prediction:
    • “Paris: 0.8”
    • “Lyon: 0.1”
    • “Marseille: 0.05”
    • “Toulouse: 0.03”

In this case, the teacher model provides a probability distribution showing that it’s highly confident that “Paris” is the correct completion (probability 0.8), while still assigning some probability to other candidates (the remaining 0.02 is spread over the rest of the vocabulary). This conveys the model’s level of uncertainty.
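
To make the difference concrete, here is a minimal sketch in plain Python. The scores (`logits`) and the four-token vocabulary are made-up numbers for illustration, not the output of a real model, and the `temperature` parameter is a standard distillation knob that this thread has not discussed so far. The soft prediction is the full softmax distribution; the hard prediction is just its argmax.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate tokens and teacher scores (illustrative only).
vocab = ["Paris", "Lyon", "Marseille", "Toulouse"]
logits = [5.0, 2.9, 2.2, 1.7]

# Soft prediction: the whole distribution over the candidates.
soft_prediction = dict(zip(vocab, softmax(logits)))

# Hard prediction: only the single most likely token.
hard_prediction = max(soft_prediction, key=soft_prediction.get)

# A higher temperature flattens the distribution (assumption: T = 2).
softened = softmax(logits, temperature=2.0)

print("hard:", hard_prediction)
for token, p in soft_prediction.items():
    print(f"soft: {token}: {p:.3f}")
```

Note that the hard prediction can always be recovered from the soft one, but not the other way around.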

I hope this is clearer now. Feel free to ask for further clarification anytime.

Thanks Jamal. The distinction between hard and soft predictions is clear. The slides and videos mention that both are outputs of the student model, not the teacher model. Are they both then fed to the student model for a training epoch? One is discrete and the other is probabilistic. How are they fed into the student model during training, and what retraining algorithm is used?

Hey @saileshbaidya,

In LLM optimization through distillation, both hard and soft predictions are used during the training of the student model.

  • Hard Predictions: These are typically used as direct labels. The student model’s objective is to replicate the exact outputs provided by the teacher model as hard predictions.
  • Soft Predictions: These are used to provide additional information to the student model. Instead of serving as direct labels, they guide the student model towards producing outputs that align with the probability distribution provided by the teacher model.

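The specific training objective can vary. A common choice is a KL-divergence (or cross-entropy) loss between the temperature-softened teacher and student distributions for the soft predictions, and a cross-entropy loss for the hard predictions. Mean squared error (MSE) on the logits is also sometimes used for the soft part. The choice depends on the implementation and the specific goals of the distillation process.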
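
As a rough sketch of how the two signals could enter training, the snippet below computes one loss per signal for a single token position. Everything here is an assumption for illustration: the logits are made-up numbers, the temperature `T = 2.0` is arbitrary, the soft loss is a KL divergence between temperature-softened distributions (one common choice, not necessarily what the course uses), and `target_index` stands for the hard label (the ground truth, or the teacher's argmax if no ground truth is available).

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])

# Hypothetical scores over the same four candidate tokens.
teacher_logits = [5.0, 2.9, 2.2, 1.7]
student_logits = [3.0, 2.5, 1.0, 0.5]
target_index = 0   # hard label, e.g. "Paris"
T = 2.0            # assumed temperature for the soft comparison

# Soft part: match the student's softened distribution to the teacher's.
soft_loss = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))

# Hard part: ordinary cross-entropy of the student at temperature 1.
hard_loss = cross_entropy(softmax(student_logits), target_index)

print(f"soft loss: {soft_loss:.4f}, hard loss: {hard_loss:.4f}")
```

Both values are ordinary scalars, so the discrete and probabilistic targets do not need to be "fed in" differently: each one only changes what the student's output distribution is compared against.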

Agreed, the choice of loss function will vary depending on the problem. However, here we are trying to optimize both the hard and soft predictions. It’s easy to define a convergence criterion for a single objective, but I wonder how we define the stopping or convergence criterion when we are optimizing against both. In other words, how can we reach the global minimum for both at the same time? Or is there a tolerance that we define?

Hello @saileshbaidya,

Optimizing both hard and soft predictions in a machine learning or optimization problem can be a complex task, and the criteria for convergence or stopping conditions will depend on the specific problem and optimization techniques being used. To address this, you typically need to carefully define your objectives, constraints, and trade-offs.
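
One common way to handle the trade-off is to fold the two losses into a single weighted objective and then apply an ordinary stopping rule to that one number, so there is no need to reach a separate minimum for each. The sketch below assumes a weight `alpha`, a `patience` of 3 epochs, and a tolerance `min_delta`; these names and values are hypothetical choices for illustration, and the validation history is made up.

```python
def combined_loss(soft_loss, hard_loss, alpha=0.5):
    """Weighted sum of the two losses; alpha trades soft against hard."""
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

def should_stop(val_losses, patience=3, min_delta=1e-3):
    """Early stopping on a single scalar.

    Returns True when the best loss in the last `patience` epochs is not
    at least `min_delta` better than the best loss seen before them.
    """
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta

# Hypothetical per-epoch validation values of the combined loss.
history = [1.0, 0.8, 0.7, 0.7, 0.71, 0.705]

print(combined_loss(2.0, 1.0, alpha=0.5))
print(should_stop(history))
```

Here `min_delta` plays the role of the tolerance asked about above, and `alpha` is a hyperparameter to tune: the two terms generally cannot both be at their individual minima, so the weight decides which compromise the student settles on.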