Course 2 Week 1 Programming Assignment Regularization

In this programming assignment, after implementing dropout, the train accuracy becomes less than the test accuracy. Is this acceptable in general? My perception so far is that test accuracy can at most reach train accuracy, but never exceed it. Or is there an acceptable range in which test accuracy can be greater than train accuracy?

Any help in this regard would be appreciated.

Hi @kavuriananthasai,

Welcome to the forums!

Interesting question. The way I see it, when using dropout the training loss can be higher (and accuracy lower) because you are zeroing out random units, making it harder for the network to fit the training samples. This kinda makes sense, because dropout helps avoid overfitting without really tweaking the training data at all. Fighting overfitting will lead to lower accuracy on the training set but higher accuracy on the test set because:

  1. at test time you use the full network (no dropout), and
  2. the model generalizes better because you addressed the overfitting.

There might be some other ancillary reasons why this would happen, like when the training examples are just harder than the test samples, but I don’t think this is super common. With dropout, though, I think it is common.
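
To make that train/test asymmetry concrete, here’s a rough sketch of inverted dropout on a single layer’s activations (just an illustration with made-up sizes and keep_prob, not the assignment’s actual code):

```python
import numpy as np

def dropout_forward(A, keep_prob=0.8, training=True):
    """Inverted dropout: random units are zeroed only during training."""
    if not training:
        # Test/dev time: the full network is used, no units are dropped.
        return A
    # Sample a fresh random mask on every call (i.e., every iteration).
    D = (np.random.rand(*A.shape) < keep_prob)
    A = A * D          # silence a random ~20% of the units
    A = A / keep_prob  # rescale so the expected activation stays the same
    return A

# Training pass drops random units; evaluation pass uses all of them.
A_hidden = np.random.randn(100, 64)   # 100 units, batch of 64 (toy sizes)
A_train = dropout_forward(A_hidden, 0.8, training=True)
A_test  = dropout_forward(A_hidden, 0.8, training=False)
```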

As for the acceptable range, well, I think one of the takeaways of this course is that there aren’t 100% accurate recipes and all cases are different. If your model performs well on the data distribution you need it to, then I think you can overlook train accuracy. If it doesn’t perform well, try getting more data and make sure you re-sample the datasets so that the train/dev/test sets are better balanced.

Hope that helps!


Hi @neurogeek ,

Thanks for your quick, crisp and simple answer.

Based on your reply, I understand that we need not be surprised to see a neural network performing better on the dev/test set, especially when we use dropout.

@neurogeek’s excellent answer has completely covered the issue here, but there is one caveat about any conclusions you derive from running experiments using the code that we have written here in the Regularization notebook:

Notice that the template code that they gave us sets the random seed in the forward propagation with dropout function. That means that we actually end up dropping exactly the same neurons in every iteration, so what we have implemented here is not really dropout as it was intended. The whole point of dropout is that you don’t drop the same neurons in every iteration, so that you get statistical behavior w.r.t. weakening the connections. So if you want to run experiments using the code as we wrote it here, it would be a good idea to remove the setting of the random seed so that you really see the full effect of dropout.

With the fixed seed, our dropout implementation basically just subsets the network in a fixed way, which is not the real point of dropout. In fact, thinking a little more deeply about this, the “fixed dropout” is actually worse than just subsetting the network. The fixed set of neurons we drop have weights that never get updated, because the way the dropout logic works on back propagation is that it zaps the gradients for those nodes. But when we then use the trained model, those weights are still there with whatever random initial values they happened to get, completely unaffected by training. So it seems pretty important that we remove the setting of the random seed for any real work we intend to do with this code.
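
To illustrate the fixed-seed issue, here is a rough numpy sketch (it only mimics the mask-generation step of the notebook’s forward propagation with dropout; the function name, sizes, and keep_prob are made up for the example):

```python
import numpy as np

def dropout_mask(shape, keep_prob, seed=None):
    # With a fixed seed, the "random" mask is identical on every call,
    # so the same neurons are silenced in every iteration of training.
    if seed is not None:
        np.random.seed(seed)
    return (np.random.rand(*shape) < keep_prob)

A1_shape = (20, 5)   # toy hidden-layer activation shape

# Notebook-style behaviour: the mask never changes between iterations.
fixed_masks = [dropout_mask(A1_shape, 0.8, seed=1) for _ in range(3)]
print(all((m == fixed_masks[0]).all() for m in fixed_masks))   # True

# Real dropout: a fresh mask is sampled each iteration.
fresh_masks = [dropout_mask(A1_shape, 0.8) for _ in range(3)]
print(all((m == fresh_masks[0]).all() for m in fresh_masks))   # almost certainly False
```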

Of course this point is completely independent of the “big picture” answer that @neurogeek gave us above.


Hi @paulinpaloalto ,

Thanks for your answer.

Actually, I had the following question about dropout while watching Professor Ng’s lectures:

Let’s say we have a 100-neuron layer and we drop 20% of the neurons every time we run an iteration, ending up with 95% dev set accuracy. Can we expect 95% dev set accuracy if we train a model with 80 neurons and no dropout? (I know it’s slightly off my actual question, but it had been bothering me for quite some time until I read your answer.)

However, your answer made it clear that model generalization improves only when we drop different neurons in every iteration. Otherwise, generalization would be either worse than that of the 80-neuron model or, at best, similar to it.

When I implement dropout on a real problem, I will make sure it is implemented the way you described.

Best regards.


Interesting questions! I don’t definitively know the answer to your question about 20% dropout versus a non-regularized network with 80% of the neurons. My intuition is that the two cases are not equivalent, and I would not expect the 80% non-regularized network to work as well as the 100-neuron regularized network. The point is that dropout (or any other form of regularization) only happens during training. When we actually use the network to make predictions, we are not doing dropout, which means we are using all 100 of the trained neurons. But they are trained in such a way that their weights have been adjusted so that they do not “overfit” on the training data. So the 100-neuron network is a more powerful network, but with the benefit of reduced overfitting. I would expect it to perform better on the test or dev set than the 80-neuron unregularized network.

But this is just my intuition. This is an experimental science: you could actually try this and see what happens. If you do, it would be really interesting to know what conclusions you can draw. Science! :nerd_face:
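
If you do try it, here is roughly how the comparison could be set up with tf.keras, just as a sketch that assumes an MNIST-style input and the layer sizes from your example:

```python
import tensorflow as tf

# Toy comparison: a 100-unit hidden layer trained with 20% dropout
# vs. an 80-unit hidden layer with no regularization.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0
x_test = x_test.reshape(-1, 784) / 255.0

def build_model(hidden_units, dropout_rate=0.0):
    layers = [tf.keras.Input(shape=(784,)),
              tf.keras.layers.Dense(hidden_units, activation="relu")]
    if dropout_rate > 0:
        # Dropout is only active during training; model.evaluate uses all units.
        layers.append(tf.keras.layers.Dropout(dropout_rate))
    layers.append(tf.keras.layers.Dense(10, activation="softmax"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

for name, model in [("100 units + 20% dropout", build_model(100, 0.2)),
                    ("80 units, no dropout", build_model(80))]:
    model.fit(x_train, y_train, epochs=5, batch_size=128, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"{name}: test accuracy = {acc:.4f}")
```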


@paulinpaloalto - I also definitely agree with your point that it’s worth experimenting along these lines. I will surely post my results here once I have a definitive understanding of them.

I am planning to test this understanding on common open-source datasets like MNIST / CIFAR-10. Any additional inputs on the approach or which datasets to start working on would really help.

Best regards.

It’s a great idea to start with one of the standard datasets as a basis for this kind of learning and experimentation. There are lots of choices in addition to the ones you mention: ImageNet and Kaggle are also rich sources of such data. One big consideration is how much compute resource you have available. One nice thing about MNIST is that the images are small enough (grayscale 28 x 28) that it’s a lot more tractable if you’re going to be running the training on your own computer without a big GPU setup. For most of those datasets (MNIST in particular), you’ll need to implement softmax for the output layer if you’re using the Python implementation of the nets from Course 1 Week 4. Of course that’s no problem if you’re using TF or PyTorch.
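
For reference, a numerically stable softmax output layer is only a few lines in numpy if you go the Course 1 Week 4 route (this is just a sketch, not the course’s own implementation):

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for a (num_classes, batch_size) matrix of logits Z."""
    Z_shifted = Z - np.max(Z, axis=0, keepdims=True)  # subtract the max for numerical stability
    expZ = np.exp(Z_shifted)
    return expZ / np.sum(expZ, axis=0, keepdims=True)

logits = np.array([[2.0, 1.0], [1.0, 3.0], [0.1, 0.2]])  # 3 classes, batch of 2
print(softmax(logits).sum(axis=0))                        # each column sums to 1
```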