Hi,
I need help understanding how training samples are created. For example, given an English-German sentence pair, how many training samples are created from it? And what are the model input, output, and label for each training sample?
Hi @Peixi_Zhu
If you are talking about C4 W1, then you can consider a pair (one sentence in English and one sentence in German) as a single example. You can play around with the third cell from the top to get an idea of what kind of sentences these are:
train data (en, de) tuple: ('In the pregnant rat the AUC for calculated free drug at this dose was approximately 18 times the human AUC at a 20 mg dose.\n', 'Bei trächtigen Ratten war die AUC für die berechnete ungebundene Substanz bei dieser Dosis etwa 18-mal höher als die AUC beim Menschen bei einer 20 mg Dosis.\n')
eval data (en, de) tuple: ('Subcutaneous use and intravenous use.\n', 'Subkutane Anwendung und intravenöse Anwendung.\n')
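If you want to reproduce this inspection yourself, the stream can be created roughly like this (a minimal sketch; the dataset name, data_dir and holdout size are my assumptions about what the assignment uses):

```python
import trax

# Assumption: the assignment streams an English-German dataset
# (e.g. opus/medical) through trax's TFDS wrapper.
train_stream_fn = trax.data.TFDS('opus/medical',
                                 data_dir='data/',
                                 keys=('en', 'de'),
                                 eval_holdout_size=0.01,  # fraction held out for eval
                                 train=True)

train_stream = train_stream_fn()
print('train data (en, de) tuple:', next(train_stream))
```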
Each sentence in a pair is tokenized into numbers. For example:
Single tokenized example input: [2538 2248 30 12114 23184 16889 5 2 20852 6456 20592 5812 3932 96 5178 3851 30 7891 3550 30650 4729 992 1]
Single tokenized example target: [1872 11 3544 39 7019 17877 30432 23 6845 10 14222 47 4004 18 21674 5 27467 9513 920 188 10630 18 3550 30650 4729 992 1]
Which is equivalent to:
Single detokenized example input: During treatment with olanzapine, adolescents gained significantly more weight compared with adults.
Single detokenized example target: Während der Behandlung mit Olanzapin nahmen die Jugendlichen im Vergleich zu Erwachsenen signifikant mehr Gewicht zu.
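If you want to try this conversion yourself, trax ships a subword tokenizer; here is a sketch assuming the ende_32k.subword vocab file from the assignment (the two helper names are mine):

```python
import trax

VOCAB_FILE = 'ende_32k.subword'  # assumption: the subword vocab used in C4 W1
VOCAB_DIR = 'data/'

def tokenize(sentence):
    # trax.data.tokenize operates on streams, so wrap the sentence in an iterator.
    return next(trax.data.tokenize(iter([sentence]),
                                   vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))

def detokenize(token_ids):
    return trax.data.detokenize(token_ids,
                                vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)

tokens = tokenize('During treatment with olanzapine, adolescents gained '
                  'significantly more weight compared with adults.')
print(tokens, '->', detokenize(tokens))
```

(The trailing 1 in the tokenized examples above is the end-of-sentence (EOS) marker that the pipeline appends to every sentence.)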
Cheers
Thanks, @arvyzukai
So the input is two lists of tokens, corresponding to the English and German sentences, correct?
What is the output of the model for one training example? In your example, the German sentence has 27 tokens, so the output would be a matrix with 27 rows and 33000 columns, each row holding 33000 elements corresponding to the probability of every word in the vocabulary. Is my understanding correct?
Also could you explain how the mask is applied during training? Is it effectively not doing anything?
Yes @Peixi_Zhu, you can say that the input is two lists of tokens. One additional thing to mention: training is usually done with mini-batches, meaning there are, for example, 32 pairs of lists for each model weight update. The assignment builds these batches with a data pipeline, sketched below.
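For concreteness, the batching pipeline looks roughly like this (a sketch; the exact length boundaries and batch sizes are assumptions, and train_stream_fn is the stream function from the earlier sketch):

```python
import trax

data_pipeline = trax.data.Serial(
    trax.data.Tokenize(vocab_file='ende_32k.subword', vocab_dir='data/'),
    trax.data.Shuffle(),
    trax.data.FilterByLength(max_length=256, length_keys=[0, 1]),
    # Group sentences of similar length so each batch is padded
    # only up to its bucket boundary (e.g. 27 tokens -> 32).
    trax.data.BucketByLength(boundaries=[8, 16, 32, 64, 128, 256],
                             batch_sizes=[256, 128, 64, 32, 16, 8, 4],
                             length_keys=[0, 1]),
    # Produce the loss mask: 0 wherever the token id is the padding id 0.
    trax.data.AddLossWeights(id_to_mask=0),
)

train_batches_stream = data_pipeline(train_stream_fn())
```

BucketByLength is what pads the 27 tokens up to the 32 boundary, and AddLossWeights is what produces the mask discussed next.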
Well, because of the needed padding, the 27 tokens would become 32, which would make the output 32 x 33000. But otherwise your understanding is correct.
The mask in this case would be 0 for the 5 padding tokens that were added to get from 27 to 32. So there would be 27 ones followed by 5 zeros ([1, 1, 1, …, 0, 0]).
When the model makes predictions (for all 32 positions), the per-position losses are multiplied by the mask, which essentially makes the loss on padding tokens 0 (the model is neither penalized nor rewarded for predicting padding tokens). This way the model only “trains” to correctly predict the tokens whose mask is 1.
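Here is a tiny numpy illustration of that masked loss (random numbers standing in for the model, with the shapes from the example above):

```python
import numpy as np

vocab_size = 33_000
seq_len, n_real = 32, 27          # 27 real tokens padded to the 32 boundary

rng = np.random.default_rng(0)
logits = rng.normal(size=(seq_len, vocab_size))    # fake model output
targets = rng.integers(1, vocab_size, size=seq_len)
mask = np.array([1.0] * n_real + [0.0] * (seq_len - n_real))

# Numerically stable log-softmax over the vocabulary dimension.
m = logits.max(axis=-1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))

# Cross-entropy per position, zeroed out on the 5 padding positions.
per_token_loss = -log_probs[np.arange(seq_len), targets] * mask

# Average over the 27 real tokens only.
loss = per_token_loss.sum() / mask.sum()
print(loss)
```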
Cheers
P.S. You might be interested in this post, which explains the next_symbol function in more detail.