How do we calculate the loss in transformer architectures like BERT, T5, etc., since the labels are categorical?
Perhaps by using a categorical cross-entropy cost function.
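For reference, a minimal sketch of the categorical cross-entropy for a single predicted token position, where $p_c$ is the model's predicted probability for token (class) $c$ and $y$ is the one-hot target over the vocabulary $V$:

$$
\mathcal{L} = -\sum_{c=1}^{|V|} y_c \log p_c = -\log p_{\text{target}}
$$

The total loss is then averaged over all labeled token positions in the batch.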
As Tom correctly answered, language models are trained by minimizing cross-entropy loss, whether they are transformers or not. So yes, both T5 and BERT, as well as other transformer architectures for language modeling, minimize cross-entropy loss, and the reason is exactly what you mentioned: the labels are categorical, and the model outputs a probability distribution over those categories (the vocabulary).
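In case it helps, here is a minimal PyTorch sketch (with made-up tensor sizes, not the exact Hugging Face internals) of how the token-level cross-entropy is typically computed from the LM head logits:

```python
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 2, 8, 32000  # hypothetical sizes

logits = torch.randn(batch_size, seq_len, vocab_size)        # LM head scores per token
labels = torch.randint(0, vocab_size, (batch_size, seq_len))  # target token ids
labels[0, :3] = -100  # common convention: positions marked -100 are ignored
                      # (e.g. unmasked tokens in BERT, padding in T5)

# cross_entropy expects (N, C) inputs, so flatten batch and sequence dims
loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # (batch*seq, vocab)
    labels.view(-1),              # (batch*seq,)
    ignore_index=-100,            # skip the ignored positions
)
print(loss)
```

This is the same categorical cross-entropy as in any classifier; the "classes" are just the vocabulary tokens, and the loss is averaged over all non-ignored positions.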
Cheers