Can the random initialization of weights return very small values using np.random.randn(x, y) * 0.01?

Please correct me if I am wrong, but np.random.randn(x, y) can generate very small values, as mentioned here.

If we were to multiply such a small value by 0.01, it would lead to g(z) being small, which is a problem we are trying to avoid by using random initialization. This problem would be more pronounced in NNs with fewer hidden units. Is that correct? If yes, how do we avoid this situation? If not, can someone please explain?

I’m not sure I understand the question. np.random.randn samples from the Normal (Gaussian) distribution with \mu = 0 and \sigma = 1. That means that 99.7% of the values are in the range (-3, 3), although a few may be outside that range. If we then multiply by 0.01, then 99.7% of the values will be in the range (-0.03, 0.03).
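For concreteness, here is a quick check you can run yourself (a minimal sketch; the shapes and the 0.01 scale are just illustrative):

```python
import numpy as np

np.random.seed(1)                       # for reproducibility
W = np.random.randn(4, 3) * 0.01        # 4x3 weight matrix, scaled standard normal
print(W)

# 3 sigma of N(0, 0.01^2) is 0.03, so roughly 99.7% of entries should
# fall in (-0.03, 0.03). Verify that empirically on a larger sample:
sample = np.random.randn(1_000_000) * 0.01
print(np.mean(np.abs(sample) < 0.03))   # ~0.997
```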

We need to randomly initialize the weights of our Neural Network before starting the training in order to achieve the required Symmetry Breaking. Here’s a thread which explains what that is and why it is required.
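As a quick illustration of what goes wrong without symmetry breaking (a toy sketch, not the course code), here is a tiny tanh → sigmoid network where every hidden unit starts with the same weights. Every row of the gradient comes out identical, so the hidden units can never become different from one another:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(2, 5)                 # 2 input features, 5 examples
Y = (np.random.rand(1, 5) > 0.5) * 1.0    # binary labels

# Symmetric initialization: every hidden unit starts with the SAME weights
W1 = np.full((3, 2), 0.5); b1 = np.zeros((3, 1))
W2 = np.full((1, 3), 0.5); b2 = np.zeros((1, 1))

# One forward/backward pass
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = 1 / (1 + np.exp(-Z2))

m = X.shape[1]
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
dW1 = dZ1 @ X.T / m

print(dW1)
# All three rows of dW1 are identical, so after any number of gradient
# steps the three hidden units remain copies of each other. Random
# initialization is what breaks this symmetry.
```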

If you are going to break symmetry, there is some advantage in doing it with relatively small values. It is less likely that you’ll have problems with divergence or saturation of the output values if you start small. But it turns out that initialization is not such a simple and straightforward thing as one might wish: there is no one magic recipe that works best in all cases. It is somewhat situation dependent, so the choice of initialization algorithm is yet another hyperparameter (a choice that needs to be made by the system designer). Prof Ng will talk more about this in Course 2 of this series, so please stay tuned for that.
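To make "the choice of initialization algorithm is another hyperparameter" concrete, here is a sketch of a few common choices (the layer sizes are just placeholders; the scaled variants are the kind of thing covered in Course 2):

```python
import numpy as np

n_prev, n_curr = 64, 32   # hypothetical layer sizes

# Simple small-constant scaling, as taught in Course 1
W_simple = np.random.randn(n_curr, n_prev) * 0.01

# Xavier/Glorot-style scaling, often used with tanh activations
W_xavier = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)

# He-style scaling, often used with ReLU activations
W_he = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)

for name, W in [("0.01", W_simple), ("Xavier", W_xavier), ("He", W_he)]:
    print(f"{name:>6}: std of entries = {W.std():.4f}")
```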


Thanks for your response.

I’m just wondering whether some of the sampled values could turn out so close to zero that they render the random initialization ineffective. I guess the answer is yes, but it is very unlikely.

The thread you provided was very useful. I will experiment with the programming exercise in Week 4.

Yes, the point is that the values are random with the Gaussian distribution, so it’s possible that some of them end up being very close to zero. But not all of them will be at the same time. Or at least the probability of that happening is so low that it’s not worth worrying about. There is literally a non-zero probability that, exactly one second from now, the random motion of all the atmospheric particles in the room you are sitting in will leave them compressed into one cubic centimeter of space up in the upper corner of the room, and you’ll suddenly be in a vacuum and your lungs will explode. According to the Laws of Physics, that could happen. But should you be worried about that? Probably not. :scream_cat:
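If you want to put a rough number on it (the threshold and layer size here are chosen arbitrarily, just for illustration):

```python
import math

# Probability that a single N(0, 1) draw lies within +/- 0.001
# (i.e. the scaled weight is within +/- 0.00001 after the * 0.01 scaling)
p_single = math.erf(0.001 / math.sqrt(2))
print(p_single)                     # roughly 8e-4

# Chance that ALL weights of even a modest 20 x 10 layer (200 weights)
# are simultaneously that close to zero, computed in log10 to avoid underflow
log10_p_all = 200 * math.log10(p_single)
print(log10_p_all)                  # about -619, i.e. ~1e-619: not worth losing sleep over
```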

The Application assignment in Week 4 is actually a really interesting case for this. It turns out that in the L layer case, the simplistic initialization they teach us in the Step by Step exercise just doesn’t work very well: you get really poor convergence. So they ended up giving us one of the more sophisticated algorithms that Prof Ng will teach us about in Course 2. But they do this silently, just to avoid a) confusing us with more advanced information and b) revealing that they have just given us the worked answers to the Step by Step exercise. :laughing:

Give it a try and you can see this result for yourself! Just be careful not to give the function the same name as the “real” one or the grader will reject your code.
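For example, here is a rough sketch of the kind of side-by-side comparison you can set up. The function names are deliberately different from the graded ones, and the exact scaled formula the notebook actually uses may differ, so treat the 1/sqrt(n_prev) version as just one representative of the "more sophisticated" family:

```python
import numpy as np

def init_params_simple(layer_dims, seed=3):
    """Small-constant scaling: W[l] = randn * 0.01 (the Step by Step recipe)."""
    np.random.seed(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

def init_params_scaled(layer_dims, seed=3):
    """Scale by 1/sqrt(n_prev) instead, which tends to converge much better
    for deeper networks (one of the schemes discussed in Course 2)."""
    np.random.seed(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                / np.sqrt(layer_dims[l - 1]))
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

# Hypothetical 4-layer network shape, e.g. 12288 inputs -> 20 -> 7 -> 5 -> 1
layer_dims = [12288, 20, 7, 5, 1]
simple = init_params_simple(layer_dims)
scaled = init_params_scaled(layer_dims)
print(simple["W1"].std(), scaled["W1"].std())  # the scaled version starts much larger
```

You can then train the L-layer model once with each set of parameters and compare the cost curves to see the convergence difference for yourself.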