Loss functions

How do we calculate the loss in transformer architectures such as BERT, T5, etc., given that the labels are categorical?

Perhaps by using a categorical cross-entropy cost function.


Hi @Hafsa_Farooq

As Tom correctly answered, language models are trained by minimizing cross-entropy loss, whether or not they are transformers. So yes, T5, BERT, and other transformer architectures for language modeling all minimize cross-entropy loss, and the reason is the one you mentioned: the labels are categorical (the model outputs a probability for each category, i.e. each token in the vocabulary).
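To make that concrete, here is a minimal sketch in plain Python of how cross-entropy could be computed over token predictions. The function names (`softmax`, `cross_entropy`), the toy vocabulary of 4 tokens, and the logit values are all made up for illustration. Real training code would normally use a framework's built-in loss rather than something like this.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_per_position, target_ids):
    """Mean negative log-probability assigned to the correct token.

    logits_per_position: one list of vocabulary scores per token position
    target_ids: the index of the correct token at each position
    """
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical example: vocabulary of 4 tokens, 2 positions to predict.
logits = [
    [2.0, 0.5, 0.1, -1.0],  # model favours token 0
    [0.0, 0.0, 0.0, 0.0],   # model is completely unsure
]
targets = [0, 3]  # the correct token id at each position

loss = cross_entropy(logits, targets)
print(f"mean cross-entropy loss: {loss:.4f}")
```

The loss shrinks as the model puts more probability on the correct token. For the second position, where all 4 tokens are equally likely, the per-token loss is log(4), which is what you would expect from a uniform guess.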

