I don't understand what the code snippet `tokenizer.texts_to_sequences([line])[0]` in Lab 2 of Course 3, Week 4 is doing. According to the TensorFlow API docs, the `tokenizer.texts_to_sequences()` method should return a list of sequences, i.e. a list of lists of tokenized sentences. However, when I printed out what it returns, I got:
- when the first element is indexed (i.e. extracted with `[0]`), it looks like it generates a bunch of tokenized sentences, each packaged in its own list, or, in other words, 1D vectors of shape (n,);
- when not indexed, it generates a bunch of tokenized sentences, each packaged in its own 2D vector of shape (1, n).
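Here is a minimal sketch, using a hypothetical toy corpus, that reproduces both cases:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Hypothetical toy corpus, just for illustration
corpus = ["the cat sat", "the dog ran"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

line = "the cat ran"
# Not indexed: a list of token lists, one per input string -> 2D, shape (1, n)
print(tokenizer.texts_to_sequences([line]))     # [[1, 2, 5]]
# Indexed with [0]: the single inner token list -> 1D, shape (n,)
print(tokenizer.texts_to_sequences([line])[0])  # [1, 2, 5]
```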
Also, it is not clear why we need to tokenize and then extract only one element (i.e. the first element, at index `[0]`) with:

```python
# Tokenize the current line
token_list = tokenizer.texts_to_sequences([line])[0]
```
The 0th dimension of the input to the tokenizer represents the batch size. So, think of invoking the tokenizer with input `[text1, text2, text3, ...]`. The output format is then:
```python
[ [text1_token1, text1_token2, text1_token3, ...],
  [text2_token1, text2_token2, text2_token3, ...],
  [text3_token1, text3_token2, text3_token3, ...],
  ...
]
```
Since we have only one line, we wrap it in a list to make it a batch of size 1. From the output, we get the tokens for the single line by indexing into the 0th position of the list of token lists.
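For instance, continuing the hypothetical toy tokenizer from above, a batch of three texts yields three token lists, and a batch of one text yields one token list that we unwrap with `[0]`:

```python
# Batch of 3 texts in -> 3 token lists out
print(tokenizer.texts_to_sequences(["the cat sat", "the dog ran", "the cat ran"]))
# [[1, 2, 3], [1, 4, 5], [1, 2, 5]]

# Batch of 1 text in -> 1 token list out, unwrapped by indexing position 0
token_list = tokenizer.texts_to_sequences(["the cat sat"])[0]
print(token_list)  # [1, 2, 3]
```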
Thanks, balaji.ambresh! I have just started the programming assignment for Week 4 and found out that it is also very well explained there! I wish it were explained like that in the labs or in the videos!
But I am still not sure I understand correctly why we need to get the tokens for a single line by indexing into the 0th position of the list of token lists.
Is it because we are iterating over the corpus one element ("line") at a time, so the returned list of lists of sequences/tokens (i.e. `[ [ ... ] ]`) contains only one list, at position `[0]`, anyway? By indexing into the 0th position we are effectively getting rid of the outer list wrapping (i.e. the outer brackets `[ ]`)!?
You got it. We want to generate a dataset for predicting the next word given the previous words. We use a sliding-window method to create training/test examples from each line, hence this way of doing things.
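As a rough sketch of that sliding-window idea (assuming, as in the lab, a fitted `tokenizer` and a `corpus` list of strings):

```python
input_sequences = []
for line in corpus:
    # texts_to_sequences expects a batch, so wrap the single line in a list
    # and unwrap the single resulting token list with [0]
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Slide a growing window over the line: every prefix of length >= 2
    # becomes one example (last token = label, preceding tokens = input)
    for i in range(1, len(token_list)):
        input_sequences.append(token_list[:i + 1])
```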