I have watched the video multiple times now, and I understand that layers with more neurons and connections have more parameters, which makes the model more prone to overfitting.
But my question is: if I have many big layers with the same number of neurons, how would I know which layer is causing the model to overfit? Is there a way or a tool that monitors layers individually and calculates their overfitting tendency?
It is an interesting question, but I have never heard of any technique that can directly measure something like that on a “per layer” basis. The point is that you can only measure overfitting by using the final output of the whole network. But now that you mention this idea, maybe there is an indirect way to infer this kind of information. It’s been a while since I watched these lectures, so I forget exactly what Prof Ng says, but I do remember that he mentions that you don’t have to use dropout on all the layers and you probably wouldn’t on the output layer.

So here’s an idea that occurs to me based on your question: implement dropout on only one layer at a time, run experiments with the “keep probability”, and see whether one layer has more influence than the others. Or use dropout on all but the output layer and experiment with higher or lower “keep” values on different layers, and see which ones need the lower keep values to be effective at reducing overfitting. Of course that sounds like a lot of work, since there are lots of degrees of freedom there and you need to run the training every time to assess the results.
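Here is a rough sketch of the kind of sweep I have in mind, assuming a simple Keras Sequential model. The layer sizes, output size, number of epochs, and the `x_train`/`x_val` data are all placeholders, not anything specific to your model:

```python
import tensorflow as tf

def build_model(dropout_layer_idx=None, keep_prob=1.0):
    """Small dense network with dropout applied to at most one hidden layer."""
    hidden_sizes = [128, 128, 128]  # placeholder: three equally sized hidden layers
    layers = []
    for i, units in enumerate(hidden_sizes):
        layers.append(tf.keras.layers.Dense(units, activation="relu"))
        if i == dropout_layer_idx:
            # Keras Dropout takes the fraction to DROP, i.e. 1 - keep_prob
            layers.append(tf.keras.layers.Dropout(rate=1.0 - keep_prob))
    layers.append(tf.keras.layers.Dense(10, activation="softmax"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Apply dropout to one hidden layer at a time and sweep a few keep_prob values.
# The layer whose dropout most reduces the train/validation accuracy gap is the
# one that seems to contribute most to overfitting.
# (x_train, y_train) and (x_val, y_val) are assumed to be defined elsewhere.
for layer_idx in range(3):
    for keep_prob in (0.9, 0.8, 0.6):
        model = build_model(dropout_layer_idx=layer_idx, keep_prob=keep_prob)
        history = model.fit(x_train, y_train, epochs=10,
                            validation_data=(x_val, y_val), verbose=0)
        gap = (history.history["accuracy"][-1]
               - history.history["val_accuracy"][-1])
        print(f"dropout on layer {layer_idx}, keep_prob {keep_prob}: "
              f"train/val accuracy gap = {gap:.3f}")
```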
If you are inspired to try anything like that, it would be really interesting to hear what you discover. Thanks for bringing this up!
Your idea got me really excited, so I used a model I built before in TensorFlow to do some testing and write up observations and comments on the behavior and effect of dropout. You can find more details in the file attached, and when I get to the dropout assignment of the course I am planning to do some more tests.
Results.pdf (14.4 KB)
Interesting! Thanks very much for sharing your results. Just a couple of questions: in the first paragraph, you list the dropout rate as 0.2. I assume you mean keep_prob = 0.8, right? But in the other places where you use 0.7 and 0.6, I assume that is also keep_prob and not 1 - keep_prob.
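For reference, in case it helps when labeling results: the Keras Dropout layer takes the fraction of units to drop rather than to keep, so the two conventions relate like this (a trivial sketch, assuming tf.keras):

```python
import tensorflow as tf

# tf.keras.layers.Dropout takes the fraction of units to DROP,
# so rate = 1 - keep_prob in the course's notation.
keep_prob = 0.8
dropout = tf.keras.layers.Dropout(rate=1.0 - keep_prob)  # drops 20% of activations
```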
It looks like in the last paragraph you get the most balanced results with keep_prob = 0.6 on just your layer DP3. At least in that case, you get the closest match between training accuracy and validation accuracy. But then the problem is that 95% accuracy is probably not going to be good enough. So maybe you need to adjust some other hyperparameters like layer sizes and then do the dropout tuning. Just a thought. Mind you, I have not yet had time to look at your full model on GitHub. I probably won’t have time to do that for a while (travelling at the moment).
Thanks for your time and ideas. In the results file, dropout rate == 1 - keep_prob wherever it is mentioned. And you are correct: I am getting the best result when the dropout rate = 0.6, but now I am thinking that maybe I have an excessive number of neurons in that layer, which is why such a high dropout rate is useful.
The points you are raising are interesting, and I will be doing more tests and updating the topic whenever I have new results.
Thanks, and have a safe trip.
Yes, I would agree that using a dropout rate of 0.6 (keep_prob = 0.4) seems pretty extreme. If that works well in the DP3 layer only, your theory that it’s actually telling you that layer has too many neurons sounds like the most plausible explanation. Worth some experimentation anyway. Let us know if you get any further insights on this. Thanks!
In the past two days, I have worked on more tests, and here is what I found:
- For DP3, cutting the size of the convolutional layer before it in half makes a dropout rate of 0.3 more effective than 0.6.
- I also found that if I do not use dropout after each and every Dense layer, a single, lone dropout layer does not affect the model output (roughly illustrated in the sketch below), which I find surprising.
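To make the second point concrete, the comparison was roughly between placements like these two (the layer sizes here are placeholders, not the actual model from the attached file):

```python
import tensorflow as tf

# Variant A: a single "lonely" dropout layer after one Dense layer.
single_dropout = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),  # the only dropout layer in the model
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Variant B: dropout after each and every Dense hidden layer.
dropout_after_every_dense = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```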
So if you know of a research paper or another reference that explains the results above, supports them, or contradicts them, please mention it.
FurtherTesting.pdf (14.2 KB)