Replicating C3W3 and C3W3 assignments

Hi there and happy New Year to everyone. In the last two days I tried replicating the W2 and W3 assignments (locally and on Kaggle), which I passed in the Coursera environment, with publicly available datasets and a super simple Bi-LSTM. However, I systematically and very badly overfit with these replicated datasets. No amount of hyperparameter tuning or changing the architecture makes any difference. Any idea why that might be the case? Maybe I am just missing something silly? Here's a sample Kaggle notebook: notebookdcd4eb2652 | Kaggle
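
For context, the model is something along these lines (the vocabulary size, embedding dimension, and layer widths here are just illustrative, not the exact settings from the notebook):

```python
import tensorflow as tf

# Illustrative sizes only -- not the actual notebook settings.
VOCAB_SIZE = 10000
EMBEDDING_DIM = 16

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # binary sentiment output
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```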

The Deep Learning Specialization goes into techniques for avoiding overfitting. Please check it out.

How did you pick 160K records as the subset to fit your model on?

Given that there are 1.6 million rows, a test split of 20% seems like a lot. A smaller split of, say, 1% is a good place to start.

The way you’ve divided the original dataset doesn’t include the stratify parameter. Passing the label field to this parameter will give a balanced distribution of labels across both the train and test sets; see the sketch below.
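
A minimal sketch of both points, assuming the full Sentiment140 dataframe is called `df` with `text` and `label` columns (those names are assumptions for illustration):

```python
from sklearn.model_selection import train_test_split

# df is assumed to be the full 1.6M-row Sentiment140 dataframe;
# the 'text' and 'label' column names are assumptions, not the notebook's.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['text'], df['label'],
    test_size=0.01,        # ~16K held-out rows is plenty for evaluation at this scale
    stratify=df['label'],  # keeps the label distribution identical in both splits
    random_state=42,
)
```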

Another way to improve model performance is to use pre-trained word embeddings instead of training the embedding layer from scratch. Check out transfer learning and fine-tuning.
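
A minimal sketch of that idea with GloVe, assuming a Keras Tokenizer has already been fit so that `word_index` exists; the file path and embedding dimension below are assumptions:

```python
import numpy as np
import tensorflow as tf

EMBEDDING_DIM = 100  # must match the GloVe file used (assumed here)

# Load GloVe vectors into a dict: word -> vector.
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:  # path is an assumption
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Build an embedding matrix aligned with the tokenizer's word index.
vocab_size = len(word_index) + 1  # word_index from your fitted Tokenizer
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# Use the matrix in a frozen Embedding layer instead of training one from scratch.
embedding_layer = tf.keras.layers.Embedding(
    vocab_size, EMBEDDING_DIM,
    weights=[embedding_matrix],
    trainable=False,
)
```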

I am just unsure why exactly the same hyperparameters that worked well in the Coursera environment would fail so badly on a supposedly similar dataset.

It seems like you are referring to the C3W2 and C3W3 assignments. If so, please fix the title of this topic.

Mentors don’t have access to the grading infrastructure, so I don’t know how randomness is accounted for at grading time. In TensorFlow 2.7, tf.random.set_seed is the closest you can get to reproducibility.
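
If you want runs that are as repeatable as possible, a minimal sketch (the seed value itself is arbitrary, it just needs to stay fixed across runs):

```python
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary value; only needs to be the same on every run

random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy ops such as shuffling / splitting
tf.random.set_seed(SEED)  # TensorFlow weight init, dropout masks, etc.
```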

While the model architecture plays a major role in balancing the bias / variance tradeoff, we should look at the underlying dataset as well:

  1. The C3W2 assignment deals with BBC news text. This doesn’t resemble a tweet dataset, which is quite different in terms of vocabulary and word usage.
  2. C3W3 is about using pre-trained embeddings as a way to avoid overfitting. That said, the passing criterion for this assignment is that the slope of the val_loss curve should not be too high; there is no explicit check on overfitting (a rough sketch of estimating that slope is shown after this list). Your Kaggle notebook makes no use of pre-trained embeddings.
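
One way to eyeball that slope from the History object returned by model.fit (this is just an estimate via a linear fit; the actual grader check may differ):

```python
import numpy as np

# history = model.fit(...)  # the Keras History object from training
val_loss = history.history['val_loss']
epochs = np.arange(len(val_loss))

# Fit a straight line to the validation-loss curve; the leading coefficient is its slope.
slope, intercept = np.polyfit(epochs, val_loss, deg=1)
print(f"val_loss slope per epoch: {slope:.4f}")  # a large positive slope suggests overfitting
```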

The suggestions I gave you earlier were for building a model using the whole dataset.

A few things to keep in mind for NLP problems are:

  1. Vocabulary size matters: if most of the important terms in the test set aren’t replaced by OOV tokens, the odds of a good prediction are high (see the tokenizer sketch after this list).
  2. Quality of embeddings: Good embeddings reduce training time.
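
A small sketch of point 1 using the Keras Tokenizer (whether your notebook uses Tokenizer or TextVectorization, the idea is the same; the toy sentences and num_words value here are made up):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

train_sentences = ["the movie was great", "the plot was terrible"]  # toy training data
test_sentences = ["the acting was wonderful"]                       # contains unseen words

# num_words caps the vocabulary; unseen or out-of-vocabulary words map to the OOV token.
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(train_sentences)

print(tokenizer.texts_to_sequences(test_sentences))
# "acting" and "wonderful" become the <OOV> index (1), so the model loses that signal.
```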

Thank you, I fixed the title. I just don’t understand how this could go so badly locally, is all. The datasets differ only in their source, I assume.

The Course 3 Week 3 assignment uses GloVe embeddings.

It’d be helpful if you shared your notebook via a direct message to me.


Thanks for the notebook. Couple of points:

  1. Please fix the topic title.
  2. The Kaggle link you’ve shared uses the Sentiment140 dataset, which is part of the C3W3 assignment. The notebook you’ve sent me is for C3W2. Is this a mistake?
  3. Moving forward, don’t send data if it’s part of the assignment.

Hi, thanks. I did fix the title.

I sent the data because that’s the only difference from the original assignment. As I explained, I passed the assignment, but I am trying to recreate it locally.
Basically, I am trying to do all the assignments from the TF specialization (which I completed) locally.
You are correct that I did not also send the notebook from C3W3, which I tried to recreate on Kaggle. I thought you might want to start from C3W2.

Thanks again

C3W3 is present twice in the title. Please fix it.
Since your Kaggle notebook is closer to C3W3, message me your C3W3 work.

Did you read my note about your Kaggle notebook skipping the pre-trained word embeddings?

Never mind, I managed to fix both. There were a couple of problems in the second one, and for the first, I found another version of the BBC news dataset that worked well. Maybe I will try my hand with the other version of that dataset to see what was wrong. Thanks a bunch, nonetheless!