UNQ_C9 About the next_symbol and the model

As mentioned above, I want to ask about 3 problems:
1. We take the article and a part of the summary as input to the model and wait for the probability of the next word. So why does the output have the shape (1, padded_length, vocab_size) instead of (1, vocab_size)?
2. Can we do parallel computation when testing? If we do that, will the size be (batch_size, padded_length, vocab_size)?
3. The decoder has to run once for each word, so how could it be faster than a CNN, comparing speed alone? And what about the “true transformer” (encoder-decoder), instead of the model we implement in this assignment, given that it has to encode before it decodes?
That’s the image I cropped from the notebook.


Thank you all.

Hey @20020069_Le_Thai_S_n,
Welcome, and we are glad that you could become a part of our community :partying_face:

First, I would like to thank you for creating this thread. I also learnt something new while trying to curate the answer to your thread.


This is because the model outputs the probabilities corresponding to each of the positions, i.e., if we have padded_length = 100, it will output the probabilities corresponding to each of the 100 positions (or tokens). And the model has been structured to do exactly this, since in order to compute the loss for each of the tokens during training, we need the probabilities corresponding to each of the tokens. I hope this makes sense now.
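To make the shape concrete, here is a minimal NumPy sketch; all names and sizes (padded_length, vocab_size, cur_index) are hypothetical stand-ins, not the assignment’s actual code:

```python
import numpy as np

# Hypothetical sizes; the model produces one distribution per position.
batch_size, padded_length, vocab_size = 1, 100, 33300

# Stand-in for the model output: a probability row for every position.
probs = np.random.rand(batch_size, padded_length, vocab_size)
probs /= probs.sum(axis=-1, keepdims=True)   # normalize each position
print(probs.shape)                           # (1, 100, 33300)

# Greedy decoding (the next_symbol idea): read only the slice at the
# position of the token being predicted; the other positions are ignored.
cur_index = 57                               # hypothetical position
next_token = int(np.argmax(probs[0, cur_index, :]))
```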


I don’t think it should be an issue. It’s quite analogous to CNNs being used on multiple examples simultaneously during inference. We just need to make sure that the padded_length is the same for each of the examples in a single batch. And I guess that should be it.
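For instance, here is a sketch of padding a batch to a common length before a single forward pass (a hypothetical helper, not from the assignment):

```python
import numpy as np

# Hypothetical helper: pad every sequence in a batch to the same length
# so the model can process all of them in one forward pass.
def pad_batch(sequences, pad_id=0):
    padded_length = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), padded_length), pad_id, dtype=np.int32)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq
    return batch

batch = pad_batch([[11945, 7, 5], [263, 357, 372, 108, 18]])
print(batch.shape)  # (2, 5) -> the output would then be (2, 5, vocab_size)
```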


As for the speed comparison between transformers and CNNs, I am not sure whether this is a question we should even think about. Note that CNNs are designed to exploit spatial information, which is not the defining attribute of natural language. On the other hand, sequence models and transformers are designed to exploit temporal information, which is a key attribute of every natural language application.

If you really think this is a question of concern, feel free to train a CNN-based architecture and a transformer-based architecture on any natural language application, and you can decide for yourself whether you want a faster CNN-based architecture with a huge drop in performance or not. Please do share your results with the community.


Honestly speaking, I had never thought about this at all prior to your question. I checked this article out, and in it, Text Summarization was included as an application of the Encoder-Decoder architecture. So, what are we missing here? It turns out @arvyzukai has already posted an answer to this question, which you can find here.


Let us know if this helps.

Cheers,
Elemento


Thank you for your previous answers; you did help me a lot. But still, there is some cloud left in my mind.
On question 1, I still don’t understand the shape. Why do we need to compute the probability for all the tokens that we already know? And if we do that, it means we only use the article for the computation; so what is the summary for?

Another question:
I just noticed that we need to pass a huge article through the embedding layer many times. Is this because of the lack of an encoder?
I also remember that Jonas said something about the loss weight being 0 for the input. What does that mean?
Cheers.
Son

Hey @20020069_Le_Thai_S_n,

I believe that my previous response to this was a bit ambiguous. Let’s say that the padded_length = 100 for all the examples in our dataset. Now, consider the training process. Say the first example has article_length = 50 and summary_length = 20; the second example has article_length = 40 and summary_length = 15; and so on.

Since the model needs to work with a batch of examples, most of which are likely to have articles and summaries of different lengths, it can’t produce probabilities from one fixed position (in the first example it would be the 51st, in the second example the 41st). Hence, we structure the model to produce probabilities for each of the positions; however, we only use the probabilities corresponding to the summary tokens to compute the loss.

So, the thing to note here is that although we compute the probabilities for the article tokens as well, we don’t use them anywhere; hence the shape of the output. I recommend you go through section 1.2 of the assignment again; it discusses this point in great depth.
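As a rough NumPy sketch of the idea (the toy sizes, variable names, and plain cross-entropy here are my assumptions, not the assignment’s exact code):

```python
import numpy as np

# Hypothetical toy sizes: padded_length = 6, vocab_size = 10.
probs = np.random.rand(6, 10)
probs /= probs.sum(axis=-1, keepdims=True)  # per-position probabilities
targets = np.array([3, 1, 4, 1, 5, 9])      # next-token id at each position

# Mask: 0 over the article tokens, 1 over the summary tokens.
mask = np.array([0, 0, 0, 1, 1, 1])

# Cross-entropy is computed at every position, but only the summary
# positions contribute to the loss.
token_loss = -np.log(probs[np.arange(6), targets])
loss = (token_loss * mask).sum() / mask.sum()
print(loss)
```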


If you understand the answer to the previous question, this becomes trivial to answer. Here, I believe that by “input”, Jonas was referring to the articles. Let me quote something from the assignment as well:

The loss is taken only on the summary using cross_entropy as loss function.


Let me pose you a question, and I guess the answer to this will become trivial to you: in the presence of an encoder, if we have huge articles, do you think that we won’t need to pass them through an embedding layer?

If you are still confused about this, then I recommend you read about embedding layers and what they do once again.

Hint: Embedding layers are used both in sequential networks like RNNs, GRUs, and LSTMs, and in Transformer-based networks.
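If it helps, here is a minimal sketch of what an embedding layer does in either kind of architecture; the sizes and the random table are hypothetical:

```python
import numpy as np

# An embedding layer is essentially a trainable lookup table mapping
# token ids to dense vectors, whatever the surrounding architecture is.
vocab_size, d_model = 33300, 512              # hypothetical sizes
embedding_table = np.random.randn(vocab_size, d_model) * 0.01

token_ids = np.array([11945, 7, 5, 263])      # first tokens of an article
embedded = embedding_table[token_ids]         # one lookup per token
print(embedded.shape)                         # (4, 512)
```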

Cheers,
Elemento

Hi.

I reread the whole assignment again, and I realized that the shape of the output is (1, padded_length, vocab_size), or (batch_size, padded_length, vocab_size), where token_length = article_length + the summary tokens we output from the previous steps + pad tokens, and we choose the prediction at the token_length position because it is a part of teacher forcing.

By better guessing the article, we can better guess the summary. By using backpropagation at every output, we improve the model for every next output before it drifts too far from the correct answer; generating all the outputs first, computing the cost, and only then backpropagating would take a lot of time for gradient descent.

Therefore, the shape (1, token_length, vocab_size) at test and evaluation time may be redundant.

It has nothing to do with the input length you mentioned in the previous answer. That’s what I think; correct me if I’m wrong.

But the question of masking the “generated article part” is still there. Is that for the sake of training time in this small assignment, or is this the correct way to train this model, since we only care about the summary?

Cheers.
Son.

Hi @20020069_Le_Thai_S_n

That is correct.

That is not entirely correct. In this (C4 W2) assignment, the mask for the article is 0. But in general, learning from the article as well is a good idea, and what people usually do is assign some small weight (like 0.05) instead of 0. However, in this assignment there is a statement (before part 2):

The loss is taken only on the summary using cross_entropy as loss function.
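To make the soft-mask alternative concrete, here is a tiny sketch (the 0.05 value mirrors the example above; the lengths are made up):

```python
import numpy as np

# A "soft" loss mask: a small weight (e.g. 0.05) on the article tokens
# instead of 0, so the model also learns a little from the article.
article_len, summary_len = 5, 3               # hypothetical toy lengths
weights = np.concatenate([np.full(article_len, 0.05),
                          np.ones(summary_len)])
print(weights)  # [0.05 0.05 0.05 0.05 0.05 1.   1.   1.  ]
```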

I usually understand better with actual examples (part 1.2 - Preprocessing for Language Models: Concatenate It!):

Single example mask:

 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

while the input is:

array([11945,     7,     5,   263,   357,   372,   108,    18,   375,
         320, 15201,  1248,   213, 22549,   117, 12385,  8070,  5956,
          80,   384,     2,    35,  5904, 13559,  7189,  5435,    41,
        5290,     7,    26,   188,   412,   227,   412,   213, 11945,
         384,  1838,   137,    91,  1008,     3, 12547,  8006,  4402,
        7227,     7,     5,   384,  1435,   950,   290,   694,   869,
         809,   213,   448,   527,   213,  6268,   951,     2,    35,
         102, 17518,  2769, 16167,   214,   148,  3389,   324,   186,
         257,     2,    31, 14077, 17638,     4,    23,    46,   316,
          71,   883,  1782,   223,    28,   248,  1435,   416,   320,
        1151,   932,  1019,    28,    32,     6,   121,   146,    33,
         499,   172,   103,    64,  3898, 13559,  7189,   127,    78,
        2613,  3480,  2181,  4172, 27439,  9275,  7583,     7,   312,
          13,   980, 12547,  8006,  4402,  7227,    72,  1492,  1008,
          22,  7157,  2685,   213,   618,     8, 11945,    12,   248,
         186,     8,  1349,    12,   213,   248,    22,    40,   146,
         320,   213,   248,    22,    23,   169,   186,    22,   127,
         213, 14077, 17638,     4,     7,     5,  3706,     3, 11945,
          18,  4208,    44,   694,  1838,  1436,  2194,   824,   357,
          74,    41,   206,   132,   213,   769,   527,   645,    37,
        2255,   379, 11945,   436,   213,   789,   214,   148,  3389,
         257,   186,  3389,   324,     2,    35,  7121,   148,  3771,
       11969,     7,   312,    13,   615,   809,   213,  4692,    41,
        1435,  7797, 26096,    70,   705,   450,    41,     8,   213,
         645,    37,  2255,   248,    12,  2696,    60,     8,   132,
        6268,   951,  3771,    48,   728,   213,   357,   102,   186,
          41,    86,  4208,    72,   186,   290,   694,     8,  2119,
        5305,     7,    56,   248,   824,   357,     2,   188,   536,
          41,     7,   165,   809,    28,   456,   208,   338,     2,
          18,  2696,    60,  1147,   450,   186,   674,  4208,   290,
         694,     3,   207,     7,   183,   540,   320,   211,   320,
         285,   382,   338, 10220,   312,     8,  3389,    12,   324,
         533,   246,   320,   137,   313,    13,   742, 11945,   953,
         105,   236,   213, 22062,   186,  7356,   809,    32,     6,
         121,    61,    13,   362, 11945,   953,   257,   236,   213,
       22062,  1782,   198,     7,     5,    28, 24425,   360,  3761,
           3,  5904, 13559,  7189,  1353,  2990,    78,  9729,  2392,
          80,  2613,  3480,  2181,   344,  1248, 26431,  3072,     7,
           5, 17129,  6845,  5571,    68,   379, 11190,  2309,  9854,
         528,  4390,  3389,   257,     7,     5,  2918,     6,    55,
       11707,  9634,    58,   214, 11945,   809,  1795,  3334, 25379,
           4,   379,     9,   463,   384,    40,  1472,   320,  1151,
         132,   490,   527,   213,   282,    35,    25, 20585,   311,
        1248,   141,  4961,  2267, 11969,     7,   348,  3389,   324,
          41,   533,  1838,  1874,   318,  1641,  6259,  1019,   213,
         137,  1170,   171,   213,  1321,     2,   186,   213,   137,
        1170,   102,    41,   533,   320,   708,   318,  1641,  6259,
           2,   186,   324,    40,   137,   313,     3,   449,    49,
           7,    26,  1151,   163,  7513,  1838,   213,  1959,     3,
         449,     7,     5,    28,  3761,   132,   213,   807,  1782,
        4817,  6703,   194,     8,   214,  3389,   257,    12,    41,
         533,  1838,  2354,   318,  1641,     8,  6259,   171,  4032,
          12,   320,  1210,   318,  1641,     8,   102,  4032,    24,
         207,  8371,   236,  1782,    56,   229,    19,   213,  1959,
        2388,   103,     3,     9,   807,  1779,    18,   806,   880,
          71,    28,    32,     6,   121,   789,    18,   146,  7485,
        8352, 11585,     1,     0, 11945,  1435,   290,   694,   869,
         809,   213,   448,   527,   213,  6268,   951, 16346, 27439,
        6774,  1628, 12547,  8006,  4402,  7227,     7,     5,   384,
          18,  3187,   880,   320,  1151,   263,   878, 12891,  9001,
       16346, 27439,  6774,  1628,   200,  5904, 13559,  7189,  5435,
          77,   229,   234,   796,  1019,  3805, 16346, 27439,  6774,
        1628,     9,   492,  3389,   257, 16903, 14257,    17,    31,
        1430,   527, 14077, 17638,     4, 16346, 27439,  6774,  1628,
       11945,  4208,   694,   214,   148, 27439,  9275,  1628,  3389,
       27439,  9275,  1628,  3450,  2104,     1])

detokenized version:

Single example:

Chelsea’s early season form may have led to comparisons with the
Arsenal ‘Invincibles’ side, but Gary Neville believes they aren’t even
as good as the Chelsea side from 10 years ago. Jose Mourinho’s side
are currently four points clear at the top of the Premier League, but
after letting leads slip against both Manchester City and United,
their killer instinct has been called into question. ‘If a team are
going to be playing for a 1-0 then you better see it out,’ Neville
said on Monday Night Football. 'When I saw Jose Mourinho two weeks ago
he talked about the 2005 (Chelsea) team and (compared) the team he had
then to the team he has now and he said the killer instinct’s missing.
Chelsea have dropped more points from winning positions this season
than they did in the whole of 2004/05 . Chelsea took the lead against
both Manchester United and Manchester City, but drew both matches .
'When I look at the statistics they are staggering - 28 times they
(the 2004/05 team) scored first (in Premier League matches), 27 the
season after and they only dropped two and four points (respectively).
‘This team this season, even though they’re at a really high level,
have scored first seven times and already dropped four points. They’ve
got to get to that next level.’ 'When (Manchester) City went down to
10 men I thought Chelsea let them off the hook and yesterday at 1-0 up
I think Chelsea let United off the hook. ‘There’s a mentality shift.
Gary Neville was talking on Sky Sports’ Monday Night Football show
with Sportsmail’s Jamie Carragher . Robin van Persie scores Manchester
United’s injury-time equaliser against Chelsea at Old Trafford . The
away side had appeared to be in control of the game but were undone
with just moments remaining . 'At Manchester City they went from 55
per cent possession for the 10 minutes before the goal, and the 10
minutes after they went to 26 per cent possession, and City had 10
men. That can’t be an instruction from the manager. That’s a shift in
the players. 'Yesterday (against Manchester United) they went from 64
per cent (possession before scoring) to 45 per cent (after scoring).
They switch off. ‘This is not the manager changing it. The players who
have worked themselves into a 1-0 lead have then sat
deeper.’<EOS><pad>Chelseaare four points clear at the top of the
Premier League . Jose Mourinho’s side have proved themselves to be
early title favourites . But Gary Neville believes there is still room
for improvement . The former Manchester United defender criticised
their lack of killer instinct . Chelsea dropped points against
both Manchester clubs .<EOS>

You can see that the 1s appear only for the summary.
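In code, constructing such a mask could look like this (a hypothetical helper, not the notebook’s actual preprocessing):

```python
import numpy as np

# Hypothetical helper: zeros over the article tokens, ones over the
# summary tokens, zeros again over any trailing padding.
def build_loss_mask(article_len, summary_len, padded_length):
    mask = np.zeros(padded_length, dtype=np.int32)
    mask[article_len:article_len + summary_len] = 1
    return mask

print(build_loss_mask(article_len=50, summary_len=20, padded_length=100))
```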

As for the input_length: we need to determine the input_length for batching. If the batch size were 1, that would be irrelevant, but training is usually done with mini-batches (some internet explanation). For that, the assignment has an entire section (part "1.3 - Batching with Bucketing") on saving compute and fitting into memory.
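A bare-bones sketch of the bucketing idea from that section (the boundaries here are made up; the notebook’s actual bucketing is handled by the library):

```python
# Hypothetical bucket boundaries: each sequence is routed to the smallest
# bucket it fits in, so every batch is padded to a similar length.
boundaries = [64, 128, 256, 512]

def bucket_id(seq_len):
    for i, bound in enumerate(boundaries):
        if seq_len <= bound:
            return i
    return len(boundaries)  # overflow bucket for very long sequences

lengths = [50, 300, 120, 60, 700]
print([bucket_id(n) for n in lengths])  # [0, 3, 1, 0, 4]
```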

Cheers

P.S. I would suggest addressing one question at a time :slight_smile: That way, we can find the answers more efficiently.