I may not fully follow when Sharon talks about teacher forcing and fused kernels, although I did ask ChatGPT for the meaning of those two terms:
- Teacher forcing - during training, directly feed the true (ground-truth) tokens as the decoder's inputs at each step, instead of the model's own previous predictions.
- Fused kernel - one example is cross-entropy loss, which often has a fused kernel that combines the softmax and log-prob operations into a single pass for efficiency.
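To make sure I understood teacher forcing, here's a tiny sketch I put together (a hypothetical toy decoder with random weights, not the model from Sharon's video). The key line is the last one in the loop: the next input is the ground-truth token, not the model's own prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 5, 8
# hypothetical tiny decoder parameters (random, untrained)
W_in = rng.normal(size=(vocab, hidden)) * 0.1
W_h = rng.normal(size=(hidden, hidden)) * 0.1
W_out = rng.normal(size=(hidden, vocab)) * 0.1

def step(token, h):
    """One decoder step: consume a token id, return logits and new state."""
    h = np.tanh(W_in[token] + h @ W_h)
    logits = h @ W_out
    return logits, h

target = [3, 1, 4, 2]   # ground-truth token ids for a toy sequence
h = np.zeros(hidden)
inp = 0                  # start token (assumption: id 0)
nll = 0.0
for gold in target:
    logits, h = step(inp, h)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    nll -= np.log(probs[gold])
    inp = gold  # teacher forcing: feed the TRUE token, not argmax(probs)
print(nll)
```

At inference time you'd replace that last line with `inp = int(np.argmax(probs))`, which is exactly the difference teacher forcing is about.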
I’m still a bit confused about where/how Sharon connects those two concepts to the example she’s working through in her video. Did anyone catch that?