In all the examples, the inputs from layer 0 (the n^{[0]} input features) to the nodes in hidden layer 1 are the same. If, say, we have two inputs and 4 nodes in hidden layer 1, then each node in layer 1 takes the same two inputs. If we run the model for more iterations, will that eventually make all the rows of W^{[1]} the same?
Is the purpose of initializing W randomly, and of using multiple nodes and layers in a neural network, to get the revised values of W in less time?
Is my understanding correct that all the functions in the nodes are doing the same thing but the inputs are different, and hence multiple neurons reduce the training time of the model? If we used simple logistic regression instead of a neural network, would it take much longer to learn the values of W and b?
The weights are randomly initialized so they differ from neuron to neuron!
A neural network combines several logistic regressions (or other units too, like ReLU, tanh, etc.) in parallel and in sequence, not just for speed’s sake but also to be able to fit complex models that a logistic regression cannot accomplish by itself, because the curve of a logistic regression is very limited!
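To picture what “in parallel and in sequence” means, here is a rough numpy sketch (the layer sizes and random seed are just illustrative assumptions, not anything from the course notebooks): each row of W1 behaves like one small unit reading the same inputs, and the output unit then works on all of their results.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h, m = 2, 4, 5                          # 2 inputs, 4 hidden units, 5 examples

X  = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x)) * 0.01    # each row = one hidden unit
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01
b2 = np.zeros((1, 1))

A1 = np.tanh(W1 @ X + b1)                      # 4 units "in parallel", all seeing the same X
A2 = sigmoid(W2 @ A1 + b2)                     # one more unit "in sequence": the output layer
print(A2.shape)                                # (1, 5): one prediction per example
```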
Does that mean that in the next classes we will learn to use different activation functions on different nodes to improve the model?
Will it also change the derivative formula?
You will see in the classes!
You’re right about the high-level point: in each layer, every neuron gets all the outputs from the previous layer as inputs. So if we started out with all the same weights, then every neuron would give the same answer and the derivatives would be the same as well. That would mean there would literally be no point in having multiple output neurons.
We need to randomly initialize all the weight values so that they all start out being different and then will continue to learn different things as training takes place. This is called “Symmetry Breaking”. Prof Ng mentions this in the lectures but doesn’t go into a lot of detail. Here’s a thread which discusses it in more detail.
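To make the symmetry-breaking point concrete, here is a small toy experiment (my own illustration with arbitrary sizes, not code from the course notebooks). If every row of W1 starts out identical, the gradient for every row is identical too, so the rows never diverge no matter how many iterations you run; with random initialization the rows differ from the start and keep learning different things.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_x, n_h, m = 2, 4, 10
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)

def one_step(W1, b1, W2, b2, lr=0.1):
    # forward pass: tanh hidden layer, sigmoid output
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    # backward pass (sigmoid + cross-entropy at the output)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)          # tanh'(Z1) = 1 - tanh(Z1)^2
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

# Symmetric start: every hidden unit is identical
W1, b1 = np.full((n_h, n_x), 0.5), np.zeros((n_h, 1))
W2, b2 = np.full((1, n_h), 0.5), np.zeros((1, 1))
for _ in range(1000):
    W1, b1, W2, b2 = one_step(W1, b1, W2, b2)
print(np.allclose(W1, W1[0]))   # True: all rows of W1 are still identical

# Random start: symmetry is broken from iteration zero
W1, b1 = rng.standard_normal((n_h, n_x)) * 0.01, np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((1, n_h)) * 0.01, np.zeros((1, 1))
for _ in range(1000):
    W1, b1, W2, b2 = one_step(W1, b1, W2, b2)
print(np.allclose(W1, W1[0]))   # False: each row has learned something different
```

The first print comes out True precisely because of the symmetry you asked about: every row gets the same gradient, so extra hidden units buy you nothing. Random initialization is what makes the second case come out False.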
We’ve discussed this before, haven’t we? Yes, the gradients are the derivatives of the actual functions that are being used in the various layers, so if you change the activation function, then it changes one of the factors in that Chain Rule calculation. Remember we were talking about formulas like this one on that thread yesterday:
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$$
The term $g^{[1]\prime}(Z^{[1]})$ is the derivative of the activation function at layer 1, right? So, yes, if you change the function, it changes the derivative term, which then changes the overall gradients. That’s the reason we write that formula in a general way: we have a choice of activation functions at the hidden layers of the network. In the output layer, we have the advantage that we know what the activation is if it’s a binary classification: it must be sigmoid, so we can just apply the derivative of sigmoid and the cross entropy loss function at the output layer.
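Just to illustrate where that factor shows up in code, here is a tiny sketch (the shapes and random values are made-up placeholders, not anything from the assignments): swapping the hidden activation only swaps the $g^{[1]\prime}$ factor in that formula, while the output layer keeps the simple $dZ^{[2]} = A^{[2]} - Y$ form that comes from sigmoid plus cross-entropy.

```python
import numpy as np

def tanh_prime(Z):
    return 1.0 - np.tanh(Z) ** 2     # derivative of tanh

def relu_prime(Z):
    return (Z > 0).astype(float)     # derivative of ReLU

rng = np.random.default_rng(2)
n_h, m = 4, 5
Z1  = rng.standard_normal((n_h, m))            # pre-activations of layer 1
W2  = rng.standard_normal((1, n_h))
A2  = rng.random((1, m))                       # stand-in output activations
Y   = (rng.random((1, m)) > 0.5).astype(float)

dZ2 = A2 - Y                                   # sigmoid + cross-entropy at the output

# Same general formula, different g^{[1]'} factor depending on the hidden activation:
dZ1_tanh = (W2.T @ dZ2) * tanh_prime(Z1)
dZ1_relu = (W2.T @ dZ2) * relu_prime(Z1)
print(dZ1_tanh.shape, dZ1_relu.shape)          # both (4, 5)
```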
Thanks a lot for the explanation. It’s clear.