C3W3 - Knowledge distillation - Weighing of losses


Regarding the KL divergence slide, the given formula is L = (1-alpha) L_H + alpha L_KL.

I understand that:

  • L_KL is the KL divergence loss, which should correspond to the distillation loss.

  • L_H is the loss from the hard labels, which should correspond to the student loss.

In that case, shouldn't the formula be L = alpha L_H + (1-alpha) L_KL, so that it is consistent with the code given in the Knowledge_Distillation lab?
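To make the two conventions concrete, here is a minimal PyTorch sketch of the combined loss. The function name, the temperature `T`, and the defaults are my own assumptions, not taken from the lab; only the weighting formulas come from the discussion above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Hypothetical sketch of a combined distillation loss (not the lab's exact code)."""
    # Hard-label (student) loss: standard cross-entropy against the ground truth.
    l_hard = F.cross_entropy(student_logits, labels)
    # Soft-label (distillation) loss: KL divergence between temperature-scaled
    # student and teacher distributions, scaled by T^2 as in Hinton et al.
    l_kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Slide convention:  L = (1 - alpha) * L_H + alpha * L_KL
    # Lab convention:    L = alpha * L_H + (1 - alpha) * L_KL  (alpha swapped)
    return (1 - alpha) * l_hard + alpha * l_kl
```

Note that the two conventions are equivalent up to relabeling alpha as 1-alpha; the confusion is only about which weight the symbol `alpha` refers to.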

I was thinking about the same thing.
It causes some confusion when answering the quiz: the quiz is based on the lecture slides, but I was recalling the code from the lab.
Either convention is fine, to be honest, but please be consistent.