I have tried all the different layer options (LSTM, Convnet, GRU) and have tested various hyperparameters, but I cannot get it to pass the slope test. I can get very close with GRU (0.0011), but it doesn’t pass the test, which requires 0.0005.
I’ve reviewed my previous cells for mistakes. All of them pass and match the expected output, except for the vocab size: the length of tokenizer.word_index should be 128293, but for me it returns 118760. I imagine this contributes to the overfitting, since a shorter vocab list causes more words to be out-of-vocabulary (OOV). My loss curve was getting very close with GRU, so I suspect that if I can fix this vocab size issue, I can get it to pass the slope test.
Am I going in the right direction here? If so, I would appreciate any pointers you can provide to fix the vocab size issue.
It looks like I misunderstood what the tokenizer does. I did not initially pass num_words when initializing the tokenizer, so this time I used num_words=10000. Unfortunately, setting num_words to 10000 doesn’t get me to the slope value of 0.0005. Does the num_words value matter for reaching the desired slope? I would imagine that the more words there are in the training corpus, the more the top 10000 words the tokenizer uses would get updated accordingly, but it’s not clear how much that actually helps lower validation loss, since the validation set has not been seen during training. So is num_words something that needs to be tweaked to reach the 0.0005 slope, or should I be looking somewhere else entirely?
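One detail worth knowing here: in Keras, num_words does not change the size of word_index at all. The word_index dictionary always stores every word seen during fit_on_texts; num_words only caps which indices are emitted by texts_to_sequences. A minimal sketch on a toy corpus:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = ["the cat sat", "the dog sat", "a bird flew"]

# num_words only limits the indices used by texts_to_sequences;
# word_index always contains every word seen during fit_on_texts.
small = Tokenizer(num_words=3, oov_token="<OOV>")
small.fit_on_texts(corpus)

full = Tokenizer(oov_token="<OOV>")
full.fit_on_texts(corpus)

# Both contain all 7 unique words plus the <OOV> token.
print(len(small.word_index), len(full.word_index))  # 8 8
```

So a wrong len(tokenizer.word_index) cannot be caused by num_words; it means the corpus fed to fit_on_texts is different from the expected one (e.g., wrong split or preprocessing upstream).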
Are you talking about the cell below?
```python
# GRADED FUNCTION: fit_tokenizer
def fit_tokenizer(train_sentences, oov_token):
```
Can you share an image that includes your output alongside the expected output, without sharing any code?
I second @Deepti_Prasad on sharing an image of the output and expected output up to the model. (Note: Please do not share any code)
Other than that,
Here is the sequence of steps to get the required length of tokenizer.word_index:
First, extract the sentences and labels:
There are 160000 sentences and 160000 labels after random sampling
Second, split the dataset into training and validation sets:
There are 144000 sentences for training.
There are 144000 labels for training.
There are 16000 sentences for validation.
There are 16000 labels for validation.
Vocabulary contains 128293 words
<OOV> token included in vocabulary
index of word 'i' should be 2
Please share your output of these steps first; then we can guide you toward the required slope.
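The counts listed above imply a 90/10 split of the 160000 sampled examples. A minimal sketch of that arithmetic, using a hypothetical train_val_split helper (the real assignment's function may differ in signature and details):

```python
def train_val_split(sentences, labels, training_split=0.9):
    # Hypothetical helper: take the first `training_split` fraction
    # for training and the remainder for validation.
    n_train = int(len(sentences) * training_split)
    return (sentences[:n_train], labels[:n_train],
            sentences[n_train:], labels[n_train:])

# Synthetic stand-in for the 160000 sampled sentences/labels.
sentences = [f"tweet {i}" for i in range(160000)]
labels = [i % 2 for i in range(160000)]

tr_s, tr_l, va_s, va_l = train_val_split(sentences, labels)
print(len(tr_s), len(va_s))  # 144000 16000
```

If your split sizes match but the vocabulary size does not, the discrepancy is in which sentences reach fit_on_texts, not in the split proportions.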
Hello, here are the requested outputs, with my output at the top followed by the expected output.
This is the output of parse_data_from_file
This is the output of train_val_split
This is the output of fit_tokenizer
Hi Deepti, Yes that is the cell where I am getting an output that is different from the expected output. I’ve shared screenshots of my output below as a reply to alotaibit.
Make sure you are implementing the step below with the correct code:
Fit the tokenizer to the training sentences
Check that you used the correct Tokenizer call (even a minor difference in spelling matters) and that you fit the tokenizer to the training sentences.
If you are not able to find the issue, share a screenshot of the code via personal DM. Do not post code here. Click my name and then Message.
Also, do not include num_words when initializing the tokenizer; pass only the oov_token.
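To make that last tip concrete: the first positional argument of Keras's Tokenizer is num_words, so passing the OOV token positionally silently sets num_words instead. A hedged sketch of a generic fit_tokenizer (not the graded solution, just the standard pattern the thread describes):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

def fit_tokenizer(train_sentences, oov_token):
    # Pass oov_token by keyword: Tokenizer's first positional
    # argument is num_words, so Tokenizer(oov_token) would be wrong.
    tokenizer = Tokenizer(oov_token=oov_token)
    tokenizer.fit_on_texts(train_sentences)
    return tokenizer

tok = fit_tokenizer(["i love my dog", "i love my cat"], "<OOV>")
print(tok.word_index["<OOV>"])  # 1
print(tok.word_index["i"])      # 2
```

With the OOV token at index 1 and the most frequent word at index 2, this matches the expected checks above ("index of word 'i' should be 2").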