I may not fully follow when Sharon talks about teacher forcing and fused kernels, although I did ask ChatGPT for the meaning of those two terms:
- Teacher forcing - directly pass in the true (ground-truth) tokens from the previous steps as inputs during training, instead of the model's own predictions
- Fused kernel - an example is cross-entropy loss, which usually has a fused kernel combining the softmax and log-prob operations for more efficient processing
I’m a bit confused about where/how Sharon shows how those two relate to the example she’s going through in her video. Did anyone catch that?
I don’t recall the specific video, but teacher forcing is the standard technique used during pre-training and supervised fine-tuning. It works by feeding the ground-truth “teacher” tokens from your training data back into the model at each step, rather than letting the model’s own predictions guide the next word.
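To make that concrete, here's a minimal PyTorch sketch of teacher forcing (the `model` argument is just a placeholder for any decoder that maps token ids to next-token logits):

```python
import torch
import torch.nn.functional as F

def training_step(model, token_ids):
    """One teacher-forced training step.

    token_ids: (batch, seq_len) ground-truth token ids from the dataset.
    """
    inputs = token_ids[:, :-1]   # model always sees the TRUE previous tokens...
    targets = token_ids[:, 1:]   # ...and is trained to predict the next one
    logits = model(inputs)       # (batch, seq_len - 1, vocab_size)
    # Note: at every position the input was the ground-truth token, never
    # the model's own earlier prediction -- that is teacher forcing.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    return loss
```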
A fused kernel is a backend optimization used to make training and inference run faster. It combines multiple mathematical operations into a single GPU kernel launch, which cuts the memory traffic from intermediate results and speeds up the computation.
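Here's a toy PyTorch comparison to show the idea. At the framework level, `F.cross_entropy` already combines log-softmax and NLL loss in a single call; true kernel fusion on the GPU backend goes further by merging the math into one kernel launch, but the motivation is the same (this is an illustrative sketch, not the course's exact example):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 32000)            # (batch, vocab_size)
targets = torch.randint(0, 32000, (4,))

# Unfused: three separate ops, each materializing an intermediate tensor
probs = F.softmax(logits, dim=-1)
logp = torch.log(probs)
loss_unfused = F.nll_loss(logp, targets)

# Fused: log-softmax + NLL computed together, skipping the intermediates
# (and numerically more stable than softmax followed by log)
loss_fused = F.cross_entropy(logits, targets)

print(torch.allclose(loss_unfused, loss_fused, atol=1e-5))  # True, up to fp error
```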
I believe both are relevant terms for this course.