Greetings to everyone
Hi @paulinpaloalto and @sjfischer (and of course everyone else).
I've been playing with the Week 4 (Assignment 2) deep NN, experimenting with both parameters and hyperparameters, and I have some questions. First, I tried this 6-layer NN: [12288, 100, 100, 20, 75, 10, 1]. The pred_train accuracy was 0.9999999999999998, but the pred_test accuracy dropped to 0.74 (worse than before changing anything). Are we getting close to overfitting? (Or have we reached it already?)
Regarding the learning rate now: I changed it to 1 in two_layer_model, and when I run plot_costs(costs, learning_rate), I get this:
Cost after iteration 0: 0.693049735659989
Cost after iteration 100: 0.6439737380528059
Cost after iteration 200: 0.6439737380528059
Cost after iteration 300: 0.6439737380528059
.
.
.
Cost after iteration 2300: 0.6439737380528059
Cost after iteration 2400: 0.6439737380528059
Cost after iteration 2499: 0.6439737380528059
We notice that the cost function gets stuck at the 0.6439 value. Why is that?
Thank you.
Hi, @Ayoub. It is great that you are trying these kinds of experiments! You always learn something when you try things like this yourself.
Yes, the first case of using the 6-layer model is the textbook definition of “overfitting”. You’ve got basically perfect predictions on the training data and relatively poor predictions on the test data. In fact, your test accuracy is actually worse than we get with the 4-layer model they give us. I’m not sure that makes sense, so it’s possible that you need to fiddle with the learning rate and number of iterations. Of course, if your train accuracy is already perfect, it’s an open question whether you can actually get any better results on the test data. One question would be whether the cost value continues to go down if you run more iterations or try a slightly more aggressive learning rate. But the bottom line may be that your 6-layer network is somehow “overkill” for this problem, or (maybe more likely) that there’s no way to get better results without more training data.
The other experiment that would be interesting is to try a 5-layer network with [12288, 100, 20, 7, 5, 1] and compare the results you get between the 4-, 5- and 6-layer nets.
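To make the comparison systematic, a loop over the layer configurations could look roughly like the sketch below. This is only a sketch: it assumes you run it inside the assignment notebook where train_x, train_y, test_x, test_y, L_layer_model and predict are already defined, and that your version of L_layer_model returns (parameters, costs); adjust the names and unpacking to match your copy.

# Candidate architectures to compare (12288 = 64 * 64 * 3 flattened image input).
configs = {
    "4-layer": [12288, 20, 7, 5, 1],
    "5-layer": [12288, 100, 20, 7, 5, 1],
    "6-layer": [12288, 100, 100, 20, 75, 10, 1],
}

for name, layers_dims in configs.items():
    print(f"--- {name}: {layers_dims} ---")
    # Assumes L_layer_model returns (parameters, costs); older notebook
    # versions return only parameters, so adjust the unpacking if needed.
    parameters, costs = L_layer_model(train_x, train_y, layers_dims,
                                      learning_rate=0.0075,
                                      num_iterations=2500, print_cost=False)
    print("Train set:")
    predict(train_x, train_y, parameters)   # the notebook's helper prints accuracy
    print("Test set:")
    predict(test_x, test_y, parameters)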
For the 2 layer experiment, that’s a pretty radical change to go from the learning rate of 0.0075 (the default) that they used in the notebook to an LR of 1. If you want to experiment with LR, try a range of values between 0.0075 and 1 using factors of approximately 3. E.g. 0.02, 0.06, 0.18, 0.54.
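A sweep over those values might look something like this (same caveats as above: it assumes the notebook's two_layer_model, predict and plot_costs helpers, and that two_layer_model returns (parameters, costs)):

# Learning rates spaced by roughly a factor of 3, starting from the default.
learning_rates = [0.0075, 0.02, 0.06, 0.18, 0.54]
layers_dims = (12288, 7, 1)   # (n_x, n_h, n_y) from the two-layer part of the notebook

for lr in learning_rates:
    print(f"--- learning rate = {lr} ---")
    parameters, costs = two_layer_model(train_x, train_y, layers_dims,
                                        learning_rate=lr,
                                        num_iterations=2500, print_cost=False)
    plot_costs(costs, lr)                   # compare the convergence curves visually
    predict(train_x, train_y, parameters)   # train accuracy
    predict(test_x, test_y, parameters)     # test accuracy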
But first, try to see why the cost doesn’t change with LR = 1. Is something wrong in your code? If you passed all the tests and the grader, then that is probably not the problem. What are the gradient values you are getting? What happens in the “update parameters” step: are the weights actually changing? You need to dig a little deeper to understand the behavior you are seeing.
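One simple way to check the “are the weights changing” question is to snapshot a parameter right before the update step and compare it right after. Something like this inside the training loop of two_layer_model (just a sketch; the names parameters, grads, i, learning_rate and update_parameters follow the notebook's code, so adapt as needed, and np is already imported there):

W1_before = parameters["W1"].copy()   # snapshot before the update step
parameters = update_parameters(parameters, grads, learning_rate)
if np.allclose(W1_before, parameters["W1"]):
    print(f"W1 did not change at iteration {i}")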
First, I should say that my experiments were just about playing with the parameters and hyperparameters (nothing more). But now that there are some serious things to be learned, I shall try the experiments you suggested (and more).
Now, about the learning rate: I will try the values you proposed later (and report back). On the other hand, there are no mistakes in my code (I already got a 100% score and all the tests pass); I ran it again with the 0.0075 value to double-check, and everything is OK. So there is something about LR = 1 that produces a constant cost! What is that thing? What’s going on? I don’t know yet. But mathematically speaking, I think that in every iteration of gradient descent the parameters stay the same, which means that the quantity LR * gradient (used in the update rule) is always 0. But LR = 1 (not 0), which tells us that either the gradients are 0 (which is bizarre) or they are very small. Yes? No? Is my intuition right? I don’t know.
Now, about the [12288, 100, 20, 7, 5, 1] architecture you proposed: I tried it, and the results I got are pred_train = 0.9999999999999998 and pred_test = 0.74, which are the same values I obtained with my (arbitrarily chosen) 6-layer NN. So apparently the best choice for this particular problem is the 4-layer NN! (But after I learn about regularization, I shall come back to this assignment and retry any (>4)-layer NN I want.)
Finally, I’m trying to feed the dataset with new training examples (the ones I’m currently using as tests and that are incorrectly classified), but I haven’t yet managed to do it (problems working with h5 files and so on)!
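Roughly what I am aiming for, in case someone spots where I go wrong: read the original h5 file with h5py, concatenate the extra images and labels, and write an augmented copy. The path and dataset keys below ("train_set_x", "train_set_y", "list_classes") are what I believe the assignment's file uses, but they should be double-checked with f.keys() before relying on this sketch:

import h5py
import numpy as np

# Load the original training set (check the path and keys in your environment).
with h5py.File("datasets/train_catvnoncat.h5", "r") as f:
    train_x = np.array(f["train_set_x"])    # shape (m, 64, 64, 3), uint8 images
    train_y = np.array(f["train_set_y"])    # shape (m,), 0/1 labels
    classes = np.array(f["list_classes"])

# Placeholders for the extra examples (replace with real 64x64x3 images and labels).
new_images = np.zeros((1, 64, 64, 3), dtype=train_x.dtype)
new_labels = np.zeros((1,), dtype=train_y.dtype)

aug_x = np.concatenate([train_x, new_images], axis=0)
aug_y = np.concatenate([train_y, new_labels], axis=0)

# Write an augmented copy rather than overwriting the original file.
with h5py.File("datasets/train_catvnoncat_augmented.h5", "w") as f:
    f.create_dataset("train_set_x", data=aug_x)
    f.create_dataset("train_set_y", data=aug_y)
    f.create_dataset("list_classes", data=classes)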
In order to understand why the cost does not change, you need to delve deeper into the details of what is happening. Exactly as you say, LR = 1 is not 0, so the update formulas should work fine and change the values. The only way I can see that happening is if the gradients turn out to be 0 in that case. So why would that happen? It’s possible that you are “saturating” the sigmoid because of the high learning rate, but the only way to find out is to dig into this and put in some instrumentation to figure out why the parameters are not changing. E.g. add an assertion in the backprop code that the dW and db values are not all zeros and see if that “throws” or not.
" add an assertion in the backprop code that the dw and db values are not all zeros and see if that “throws” or not", I didn’t actually understand that.
The theory for why the cost doesn’t change is that the gradients are zero, so the question is how to confirm that is the cause. Note that in two_layer_model, we actually have the 4 gradient values as local variables after the back prop steps. So you can insert the following statements as an example:
ndW1 = np.linalg.norm(dW1)
assert not np.isclose(ndW1, 0.0), f"Near zero gradient for W1 norm {ndW1} at iteration {i}"
That takes the 2-norm of the gradient for W1 and checks if it is close to 0 or not. If it is close to zero, the assert will throw an exception, which stops everything and prints the iteration number. Here’s what I get when I run the two layer training with LR = 1.0:
AssertionError: Near zero gradient for W1 norm 0.0 at iteration 56
So that confirms the theory that the problem is that the gradients are zero at least for W1. This just “peels the onion” by one layer and now we have to dig deeper to figure out why the LR value causes zero gradients.
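To see the mechanism in isolation, here is a standalone sketch (independent of the notebook code) of how gradients can die: a sigmoid unit’s local gradient is s * (1 - s), which collapses toward zero once the pre-activation is large in magnitude, and a ReLU unit passes no gradient at all where its pre-activation is negative. One overly large update can push units into those flat regions, after which the dW and db values come out numerically zero:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sigmoid local gradient s * (1 - s) shrinks toward zero as |z| grows ("saturation").
for z in [0.0, 5.0, 20.0, 50.0]:
    s = sigmoid(z)
    print(f"z = {z:5.1f}  sigmoid = {s:.6f}  derivative = {s * (1 - s):.2e}")

# A ReLU unit passes zero gradient wherever its pre-activation is negative,
# so a layer pushed entirely below zero contributes zero to dW and db.
Z1 = np.array([[-3.0, -1.5, -0.2]])
relu_grad_mask = (Z1 > 0).astype(float)
print("ReLU gradient mask:", relu_grad_mask)   # [[0. 0. 0.]]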
Mind you, it’s not clear how much we will learn by digging deeper on this. At a higher level, the point is just that there is such a thing as too high a learning rate, so the straightforward thing we can say is that 1.0 is not a good choice of LR for this problem. It’s more productive just to try some other values on the spectrum and see if you can find another that performs better than the default 0.0075.
Oh, sorry, I thought you were making a joke by using 1 and “one” to mean two different things: We should spend our time investigating other LR values besides this one (meaning the current value) which happens to actually be 1.
I meant it as a joke.
Now I understand the meaning of “numerical pun”! (In fact, I’m not a native speaker, so I sometimes have difficulty understanding some “spoken language”.)
That’s great! So I correctly “caught” your joke. In English a “pun” is a type of play on words where you make a joke from the fact that the same word can mean different things in different contexts, so that’s the “technical” name in English for that type of joke.
Aha! Great! I even know now the translation of the word “pun” in my language!
A fruitful conversation, both scientifically and linguistically! I shall buy you another coffee.