I don't understand what the code snippet `tokenizer.texts_to_sequences([line])[0]` in Lab 2 of Course 3, Week 4 is doing. According to the TensorFlow API docs, the `tokenizer.texts_to_sequences()` method should return a list of sequences, i.e. a list of lists of tokenized sentences. However, when I printed out what it returns, I got:
- when the first element is indexed (i.e. extracted with `[0]`), it looks like it generates a bunch of tokenized sentences, each packaged in its own list, or, in other words, 1D vectors of shape (n,);
- when not indexed, it generates a bunch of tokenized sentences, each packaged in its own 2D vector of shape (1, n).
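Here is a minimal sketch, using a hypothetical toy corpus, that reproduces both cases:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical toy corpus, just for illustration
corpus = ["the cat sat", "the dog ran"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

line = "the cat ran"
# Not indexed: a list of token lists, one per input string -> 2D, shape (1, n)
print(tokenizer.texts_to_sequences([line]))     # [[1, 2, 5]]
# Indexed with [0]: the single inner token list -> 1D, shape (n,)
print(tokenizer.texts_to_sequences([line])[0])  # [1, 2, 5]
```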
Also, it is not clear why we need to tokenize and then extract only one element (i.e. the first element, at index `[0]`) with:

```python
# Tokenize the current line
token_list = tokenizer.texts_to_sequences([line])[0]
```
The 0th dimension of the input to the tokenizer represents the batch size. So, think of invoking the tokenizer with input `[text1, text2, text3, ...]`. The output format is then:
```python
[ [text1_token1, text1_token2, text1_token3, ...],
  [text2_token1, text2_token2, text2_token3, ...],
  [text3_token1, text3_token2, text3_token3, ...],
  ...
]
```
Since we have only one line, we wrap it in a list to make it a batch of size 1. From the output, we get the tokens for the single line by indexing into the 0th position of the list of token lists.
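For instance, continuing the hypothetical toy tokenizer from above, a batch of three texts yields three token lists, and a batch of one text yields one token list that we unwrap with `[0]`:

```python
# Batch of 3 texts in -> 3 token lists out
print(tokenizer.texts_to_sequences(["the cat sat", "the dog ran", "the cat ran"]))
# [[1, 2, 3], [1, 4, 5], [1, 2, 5]]

# Batch of 1 text in -> 1 token list out, unwrapped by indexing position 0
token_list = tokenizer.texts_to_sequences(["the cat sat"])[0]
print(token_list)  # [1, 2, 3]
```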
Thanks, balaji.ambresh! I have just started the programming assignment for Week 4 and found out that it is also very well explained there! I wish it were explained like that in the labs or in the videos!
But I am still not sure I understand correctly why we need to get the tokens for a single line by indexing into the 0th position of the list of token lists.
Is it because we are iterating over the corpus one element ("line") at a time, so the returned list of lists of sequences/tokens (i.e. `[ [ ... ] ]`) contains only one list, at position `[0]`, anyway? By indexing into the 0th position we are effectively getting rid of the outer list wrapping (i.e. the outer brackets `[ ]`)!?
You got it. We want to generate a dataset for predicting the next word given the previous words. We use a sliding-window method to create training/test examples from each line, hence this way of doing things.
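As a rough sketch of that sliding-window idea (assuming, as in the lab, a fitted `tokenizer` and a `corpus` list of strings):

```python
input_sequences = []
for line in corpus:
    # texts_to_sequences expects a batch, so wrap the single line in a list
    # and unwrap the single resulting token list with [0]
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Slide a growing window over the line: every prefix of length >= 2
    # becomes one example (last token = label, preceding tokens = input)
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i + 1])
```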