Okay, I’ll be honest. I’m starting this course late in the week because I just finished the previous course. Fortunately, I already have some familiarity with the material from the previous course, and I’ve heard talks about some of the content in this one. But I’ve never actually implemented a neural network… so I’m finding myself surprised as I think this through.
I only just finished the first practice quiz in week 1.
But I was thinking about it, and I’m not sure I understand something. If all the neurons in the first layer receive the same input vector x, why don’t they end up computing the same weights and the same activation (the probability, or f(z), or whatever)? Does this rely on the neural network using stochastic processes, such as stochastic gradient descent, or something like that? If it were deterministic, the minimum of the cost function should be the same, right? Or is this a way of finding different local minima by starting from different seed values for the parameters in the w vector?
Let’s suppose that $\mathbf{x}$ is the input vector, $\mathbf{W}$ is the matrix of weights, and $\mathbf{b}$ is the bias vector. A single layer of the neural network performs the computation $\mathbf{a} = g(\mathbf{x}^\top \mathbf{W} + \mathbf{b})$. Although the input $\mathbf{x}$ is the same for every neuron in the layer (each column of $\mathbf{W}$ is one neuron), the activations $\mathbf{a}$ differ because each neuron has its own weights and bias. The weights are initialized to random values (the biases are often simply set to zero), which ensures that the neurons in a layer start with different parameters. Without this randomness (e.g., if all weights were initialized to the same value), every neuron in the layer would perform the identical computation during forward propagation and would receive the identical update during backward propagation. That would collapse the layer into a single effective neuron, defeating the purpose of having multiple neurons. UPD: fixed the notation to match the lectures.
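To make that concrete, here is a minimal NumPy sketch (my own illustration, not code from the course) of one dense layer: the same input $\mathbf{x}$ multiplied by randomly initialized columns of $\mathbf{W}$ gives a different activation for each neuron. The sizes and the sigmoid activation are just assumptions for the example.

```python
# Minimal sketch of a = g(x^T W + b): same input, different neurons.
import numpy as np

rng = np.random.default_rng(0)

n_features, n_units = 4, 3                          # sizes chosen just for illustration
x = rng.normal(size=(1, n_features))                # one input example, as a row vector
W = rng.normal(size=(n_features, n_units)) * 0.01   # each column = one neuron's weights
b = np.zeros((1, n_units))                          # biases (often just initialized to zero)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = sigmoid(x @ W + b)   # one activation per neuron
print(a)                 # three different values, because the columns of W differ
```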
Okay, so I guess this is maybe what was confusing me… I think what you’re saying is that the whole thing is “solved” at once, rather than each neuron being minimized separately with gradient descent before moving “forward” in the network? I don’t think I actually know how forward propagation and back propagation work yet; maybe I should keep watching. But it sounds like there’s some sort of simultaneous process going on with the different values, stepping them all at the same time.
All of the weights in an NN are initialized to small random values. This is called “symmetry breaking”; if it isn’t performed, then every hidden layer unit does in fact learn exactly the same weights.
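To illustrate the symmetry point, here is a toy NumPy sketch (my own, not course code) of one backprop step in a tiny two-layer network: with identical initial weights, every hidden unit receives the same gradient, so the units would stay identical after every update, while random initialization gives each unit a different gradient. The network shape and squared-error loss are just assumptions for the demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_gradient(W1, W2, x, y):
    """One backprop step for a tiny 2-layer net with squared-error loss."""
    h = sigmoid(x @ W1)                 # hidden activations, shape (1, n_hidden)
    y_hat = h @ W2                      # linear output, shape (1, 1)
    d_out = y_hat - y                   # dL/dy_hat
    d_h = (d_out @ W2.T) * h * (1 - h)  # backprop through the sigmoid
    return x.T @ d_h                    # dL/dW1, one column per hidden unit

x = np.array([[1.0, 2.0]])
y = np.array([[1.0]])

# Identical initialization: every column (neuron) gets the same gradient.
W1_same = np.full((2, 3), 0.5)
W2_same = np.full((3, 1), 0.5)
print(hidden_gradient(W1_same, W2_same, x, y))   # all 3 columns identical

# Random initialization breaks the symmetry.
rng = np.random.default_rng(1)
W1_rand = rng.normal(size=(2, 3)) * 0.01
W2_rand = rng.normal(size=(3, 1)) * 0.01
print(hidden_gradient(W1_rand, W2_rand, x, y))   # columns differ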
@s-dorsher, please watch the Vectorization section. It shows how to implement neural networks efficiently using matrix and vector operations, so there is no need to minimize each neuron separately.
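As a taste of what that section covers, here is a small sketch (my own, with assumed shapes, not the course implementation) contrasting a per-neuron loop with the single matrix product that computes every neuron in the layer, for all examples, at once.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))         # 5 examples, 4 features
W = rng.normal(size=(4, 3)) * 0.01  # 3 neurons in the layer (one per column)
b = np.zeros((1, 3))

# Loop version: compute each neuron's activation separately.
A_loop = np.zeros((5, 3))
for j in range(3):
    A_loop[:, j] = sigmoid(X @ W[:, j] + b[0, j])

# Vectorized version: the whole layer, for all examples, in one expression.
A_vec = sigmoid(X @ W + b)

print(np.allclose(A_loop, A_vec))   # True: same result, no per-neuron loop
```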