Understanding how Neural Networks Learn

I’m having trouble understanding how exactly the neural networks we build learn. For example, the Coffee Roasting neural network we built: if all of the neurons get all of the input data, how do the neurons know which feature to look at specifically? If every neuron gets the same data and every neuron has the same activation function, why isn’t the output the same for all of the neurons?


It is a great point. You’ve given the justification for why we need what is called “Symmetry Breaking”: we start by randomly initializing all the weights. That way each neuron starts out giving a different output and then back propagation pushes them all towards better and better solutions. Since they start out different, they end up differently as well. The next level of subtlety is that if you run the experiment again with different random initializations, what is learned by a given neuron may not be the same, but it is pretty likely that the same things will be learned as in the previous training run, just by different neurons.
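To make that concrete, here is a minimal NumPy sketch (my own illustration, not code from the course): with all-zero weights the two neurons compute exactly the same output and receive exactly the same gradients, so they can never diverge; random initialization breaks that tie.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(5, 3))   # 5 samples, 3 input features

# Zero initialization: both neurons compute the same thing, so their
# gradients are identical and back propagation can never separate them.
W_zero = np.zeros((3, 2))
print(np.tanh(X @ W_zero))    # the two output columns are identical

# Random initialization ("symmetry breaking"): the neurons start out
# different, produce different outputs, and so get different gradients.
W_rand = rng.normal(scale=0.1, size=(3, 2))
print(np.tanh(X @ W_rand))    # the two output columns differ
```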


Thank you! I haven’t thought about the randomly assigned initial weights. I hope you don’t mind a follow-up question on this:
In the Multiclass lab we see that classifying multiple classes (here the “blobs”) is done by the first-layer neurons each drawing a line that separates two of the four blobs from the other two. Is there a way to understand why a neuron chose to do it this way? Each neuron gets all the data, so why does it not try to separate all four blobs in one go, but rather divide the graph into two areas with two blobs each?

Edit: I think I may have found the answer myself: could it be that the error function is simply minimal when each neuron divides the blobs into two groups of two? Looking at the graph, I don’t think there is a better way to divide it in terms of the error function… That would leave two options, one horizontal line and one vertical. But then I wonder how we avoid both neurons picking similar lines in 50% of the cases, since errors are evaluated for each neuron separately, right? Or does TensorFlow also compare the overall errors of different neuron constellations in each run?

Hi, Niclas.

Sure, follow-up questions are great. The only problem is that I may have overstepped my bounds here a little bit: I’m a mentor for DLS, but I have not actually taken MLS yet, meaning that I have not seen that material and don’t really know what Prof Ng is saying there. I can only guess that he’s giving intuition about how the learning might happen, not saying that it necessarily does exactly that every time you run the training with a different random starting point.

Sorry if I’m a bit off the mark here. Maybe one of the real MLS mentors can give a more specific answer based on having actually watched that lecture.

A second answer based on your “edit” section:

The losses are computed “per sample” and then averaged to get the scalar cost, and the gradients are derivatives of that scalar cost. At that level, for each sample, you only know the final answer and how far it is from the correct label, but you don’t have any visibility into how much each neuron contributed to that answer. What happens in back propagation is that gradients get calculated at each layer of the network, and that’s the point at which the contribution of each individual neuron plays into the answer, one step at a time backwards through the layers. Of course there are typically several or many layers in a network, so things keep getting divided and “projected” backwards from the outputs to the inputs at each layer.
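Here is a small TensorFlow sketch of that flow (the tiny model and random data are made up for illustration): the per-sample losses collapse into one scalar cost, yet back propagation still yields a separate gradient for every weight in every layer.

```python
import tensorflow as tf

# A hypothetical 2-layer binary classifier, just to show the mechanics.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(2, activation="sigmoid"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()

X = tf.random.normal((8, 2))
y = tf.cast(tf.random.uniform((8, 1)) > 0.5, tf.float32)

with tf.GradientTape() as tape:
    y_hat = model(X)
    cost = loss_fn(y, y_hat)  # per-sample losses averaged to one scalar

# One scalar cost, but a gradient for every weight in every layer:
# this is where each neuron's contribution is "projected" backwards.
grads = tape.gradient(cost, model.trainable_variables)
for v, g in zip(model.trainable_variables, grads):
    print(v.name, g.shape)
```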


Thank you @paulinpaloalto!!

Hello @Niclas_B,

I think you are talking about these plots from the Course 2 Week 2 Multiclass lab:

[Plots from the Multiclass lab showing the learned decision boundaries]

The NN has a layer of 2 neurons and a layer of 4 neurons. @paulinpaloalto has explained how the neurons become different (or, in this reply’s context, express different boundaries), and the limitation here is that each of the 2 neurons in the first layer can only express a straight boundary line, so the best they can do is divide the 4 blobs into 2 groups.

These divisions turn out to be helpful for the output layer, which takes the first layer’s output! Those on the right-hand side (RHS) of the first boundary line AND on the left-hand side (LHS) of the second line are orange points. Similarly, those on the LHS of the first line AND on the RHS of the second line are green points.

Since the 2 boundary lines in the first layer reduce the cost by helping the output layer distinguish the four blobs in that way, it makes sense that they look the way they do!
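In case it helps to experiment, here is a rough sketch of a network like that one: a 2-unit first layer followed by a 4-unit output layer. The dataset, activations, and optimizer settings below are my assumptions for illustration, not necessarily exactly what the lab uses.

```python
import tensorflow as tf
from sklearn.datasets import make_blobs

# Four 2-D blobs, similar in spirit to the lab's dataset.
X, y = make_blobs(n_samples=200, centers=4, cluster_std=1.0, random_state=1)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation="relu"),    # 2 neurons -> 2 straight boundary lines
    tf.keras.layers.Dense(4, activation="linear"),  # 4 logits, one per blob
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.01),
)
model.fit(X, y, epochs=200, verbose=0)

# Each column of W1, with its bias, defines a line w.x + b = 0 in the
# plane; the output layer combines which side of each line a point is on.
W1, b1 = model.layers[0].get_weights()
print(W1, b1)
```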

Cheers,
Raymond


Why is it “pretty likely that the same things will be learned as in the previous training run, just by different neurons”? Is that due to the dataset (the same set of feature values limiting the NN’s possible combinations of weights)?

Yes, the training data is the same and the architecture of the network (number of layers and number of neurons) and the loss function are all the same, so the same things (features) are there to be learned. But since you’re starting with random weights, you can’t predict which exact neuron gets pushed towards learning a given thing as the back propagation happens. Of course it could end up exactly the same, but it may not.
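A quick way to see this is to train the same tiny network twice from different random seeds and compare the first-layer weights. In a sketch like the one below (again, all settings are assumptions for illustration), you will often find similar boundary lines in both runs, just assigned to different neurons or with flipped orientation.

```python
import tensorflow as tf
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=4, random_state=1)

def first_layer_weights(seed):
    tf.keras.utils.set_random_seed(seed)  # different random starting weights
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(2, activation="relu"),
        tf.keras.layers.Dense(4, activation="linear"),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(0.01),
    )
    model.fit(X, y, epochs=200, verbose=0)
    return model.layers[0].get_weights()[0]

# Compare the columns: the same kind of line is often learned in both
# runs, but not necessarily by the same neuron.
print(first_layer_weights(seed=1))
print(first_layer_weights(seed=2))
```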
