Hi, I have a couple of questions about initializing weights.
- Looking at several variations of weight initialization, it seems that the general formula to implement them is the following. Is that true?
np.random.randn(shape) * standard deviation(w_i)
- Suppose we have a network with four layers, where ReLU is used for layers 1 and 2, tanh for layer 3, and sigmoid for layer 4 (as a list of activation functions per layer, that would be [ReLU, ReLU, tanh, sigmoid]). Do we apply He initialization to all layers (even though layers 3 and 4 don’t use ReLU), or only to the first and second layers (W_1 and W_2)?
Judging from the programming assignment, I think He initialization is applied to all layers, but I’m not sure.
For 1), I assume you mean "the desired standard deviation of W^{[l]}". I guess you could think of it that way: the output of randn is drawn from a Gaussian (normal) distribution with \mu = 0 and \sigma = 1, so multiplying by the desired standard deviation rescales it.
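Here’s a minimal sketch of that idea (the layer sizes and the 0.01 are just made-up examples, not from the course):

import numpy as np

n_prev, n_curr = 400, 300                      # made-up fan-in and fan-out for one layer
sigma = 0.01                                   # the desired standard deviation of W^{[l]}
W = np.random.randn(n_curr, n_prev) * sigma    # N(0, 1) samples rescaled to std sigma
print(W.std())                                 # roughly 0.01 for a matrix this large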
For 2), yes, the initialization must be done for all layers; this is required for symmetry breaking. Whether you might want to use a different initialization method when the activation functions differ is an interesting question. I don’t remember Prof Ng saying anything about that, but it’s been a while since I watched the lectures. If they are fresh in your mind, did you notice him bringing up that topic?
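For concreteness, here’s a rough sketch of He initialization applied to every layer of your [ReLU, ReLU, tanh, sigmoid] example (the layer sizes are made up, not from the assignment):

import numpy as np

layer_dims = [5, 4, 4, 3, 1]        # made-up sizes: [n_x, n_1, n_2, n_3, n_4]
parameters = {}
for l in range(1, len(layer_dims)):
    # He initialization: scale by sqrt(2 / fan-in) at every layer,
    # regardless of which activation that layer uses
    parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) \
                               * np.sqrt(2.0 / layer_dims[l - 1])
    # Biases can start at zero; symmetry is already broken by the random W
    parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))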
In the video lecture, the context is a one-layer network in which that layer has only one neuron, so a really simple network.
For 1), it’s actually written as w_i in the lecture video (my apologies for writing it as plain “w_i”; I didn’t know how to type it properly). But I guess we would have W^{[l]} if we had a deeper network. Anyway, your answer about \sigma just reminded me of why we multiply the matrix by the standard deviation rather than the variance: the standard deviation has the same units as our data, whereas the units of the variance are squared. I don’t know how to explain this clearly, but I learned it in statistics.
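A quick numerical check of what I mean (the 0.01 factor is just an arbitrary example):

import numpy as np

samples = np.random.randn(100000) * 0.01   # rescale N(0, 1) samples by 0.01
print(samples.std())    # ~0.01   -> the standard deviation scales linearly
print(samples.var())    # ~0.0001 -> the variance scales with the square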
For 2), well… since the network being discussed has only one layer, Prof Ng only talked about the initialization we should use when the activation function is ReLU. He didn’t say anything about what happens if we have more than one layer.
I’m glad to see that you found the information about how to use LaTeX for mathematical expressions. Of course w_i just refers to one element of the weight vector w in the case of logistic regression. W^{[l]} refers to the entire weight matrix for layer l of a multilayer network.
The initialization applies at all layers in a multi-layer network. In all the examples I’ve seen Prof Ng discuss, he uses the same technique at all layers. But the point is that there are different methods of initialization and there is no guarantee that any given method will give good results on any particular problem. So the choice of initialization method is yet another “hyperparameter”, meaning a decision that you need to make as the system designer and then verify whether you have made a good choice or not. This is one of the major themes of Week 1 of Course 2.
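Just to illustrate that point (this helper and its method names are my own made-up sketch, not code from the course), you could treat the method as a switch:

import numpy as np

def initialize_parameters(layer_dims, method="he"):
    # Hypothetical helper: the scaling factor is the only thing that changes
    parameters = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        if method == "he":           # often suggested for ReLU layers
            scale = np.sqrt(2.0 / fan_in)
        elif method == "xavier":     # often suggested for tanh/sigmoid layers
            scale = np.sqrt(1.0 / fan_in)
        else:                        # plain small random values
            scale = 0.01
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], fan_in) * scale
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

Then training the same network with each setting and comparing the results on your dev set is exactly the kind of hyperparameter check Prof Ng describes.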