“Distillation refers to the student model outputs as the hard predictions and hard labels.”
From the video (~4:30): “model-optimizations-for-deployment”
The description of the distillation process is confusing to me. The video describes the output of the Teacher as “soft labels.” From what I understand, the Student is then trained to minimize the loss between the “soft labels” and the “soft predictions.”
I’m not sure where the “hard labels” that are “output” from the Student model come from. My belief is that the “hard labels” are actually the Teacher’s “soft labels” once they are done being used to train the Student, and that the “hard predictions” are the Student’s predictions once training is complete.
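For concreteness, here is a minimal sketch of the part I think I follow, assuming a PyTorch-style setup (the function name, the temperature value, and the T² scaling are my own assumptions, not from the video):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """The soft-label part of distillation as I understand it from the video.

    "Soft labels"      = the Teacher's temperature-softened probabilities.
    "Soft predictions" = the Student's temperature-softened probabilities.
    """
    soft_labels = F.softmax(teacher_logits / temperature, dim=-1)
    soft_predictions = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the soft labels and the soft predictions;
    # scaling by T^2 is a common convention to keep gradient magnitudes
    # comparable across temperatures (my assumption, not from the video).
    return F.kl_div(soft_predictions, soft_labels,
                    reduction="batchmean") * temperature ** 2

# Where I get lost: I don't see where the "hard labels" would enter here --
# are they a separate loss term computed from the Student's ordinary (T=1)
# outputs, or the Teacher's soft labels after training, as I guessed above?
```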
Can anyone confirm this?