Hello everyone,
I got stuck on the first assignment. The parse_data_from_file function works as expected, but fit_tokenizer gave me an output of
Vocabulary contains 30888 words
token included in vocabulary
Inside the function, the Tokenizer gets only the OOV parameter.
As a consequence, the next function, get_padded_sequences, gave me this output:
First padded sequence looks like this:
[ 0 0 0 … 87 87 8]
Numpy array of all sequences has shape: (2225, 2439)
This means there are 2225 sequences in total, and each one has a size of 2439.
So there is a mismatch both in the number sequence and in the size of the array (2439 instead of the expected 2438). Even if I try post padding, the numbers are not correct, and obviously the shape doesn't change.
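For reference, here is a minimal pure-Python sketch of what pre-padding does (pad_pre is a hypothetical helper I wrote for illustration, not the Keras API): the second dimension of the padded array is simply the length of the longest tokenized sentence, so a width of 2439 means at least one sentence tokenizes to 2439 tokens.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Sketch of pre-padding: prepend `value` until every sequence
    reaches `maxlen`; when maxlen is None, the longest sequence wins."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    # keep at most the last `maxlen` tokens, then left-pad with `value`
    return [[value] * (maxlen - len(s[-maxlen:])) + list(s[-maxlen:])
            for s in sequences]

seqs = [[5, 7], [1, 2, 3, 4]]
print(pad_pre(seqs))  # [[0, 0, 5, 7], [1, 2, 3, 4]]
```

If the expected width is 2438, it might be worth printing the length of the longest entry in your tokenized sequences to see which sentence is one token too long.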
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token="")  # the OOV token is not displayed here, but it is in the code.
I've checked the parse data function. I called remove_stopwords only on the sentences, not on the labels. Just for testing, I added the function call for the labels as well, with no effect on the vocab size or the shape.