I do not understand the concept of incrementally building the curve. The lab seems to suggest that each node in layer 1 is responsible for a piece or segment of the graph, and that each node's output builds on the previous node's output.
How does each node know where to start on the graph?
There is also talk of ‘cut off’ and of blocking a node's contribution.
Can someone explain this lab to me or point me at something I can read?
Hello @Paddy,
Of course the nodes cannot know anything in advance, but there is a driving force that makes it happen: gradient descent, which minimizes the loss during training.
To minimize the loss, the curve should resemble the truth as closely as possible, because the loss function measures exactly the error between the curve and the truth: less error means a closer resemblance.
How gradient descent works is beyond the scope of this answer.
ReLU is a piecewise linear function with this shape → " _/ ", so there are not many ways for such shapes to end up resembling the truth. What was shown in the lab is one such way, and it turns out to be a way that can match the truth perfectly.
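To make the "the _/ pieces add up to a curve" idea concrete, here is a toy NumPy sketch with hand-picked weights (my own illustrative values, not trained ones and not from the lab):

```python
import numpy as np

x = np.linspace(0, 3, 7)

# Three hand-picked ReLU pieces ("_/" shapes whose cut-offs sit at x = 0, 1, and 2)
pieces = [np.maximum(0, w * x + b) for w, b in [(1.0, 0.0), (2.0, -2.0), (3.0, -6.0)]]

# Their sum is itself piecewise linear: the slope changes at x = 1 and x = 2,
# which is the kind of shape gradient descent can bend toward the truth
curve = pieces[0] + pieces[1] + pieces[2]
print(curve)
```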
The nodes are usually randomly initialized before training starts, so which node ends up responsible for which “segment” of the truth depends largely on its own initial values and cannot be known in advance. You can imagine it like this: initially, the nodes have random parameters, so the curve is nothing like the truth, but over gradient descent, the parameters in all the nodes keep being tuned so that, eventually, the curve looks more and more like the truth.
Again, how gradient descent works is beyond the scope of this answer.
Having said all of the above, there is also a chance that training ends up with a different version of the curve that does not overlap well with the truth: we call that getting stuck at a local minimum. Therefore, nothing actually guarantees a perfect curve, or even a good curve, only a curve that gets the loss minimized, even if just locally. In other words, even though the lab gives us that idea, we cannot take it for granted that the trained curve is always good, let alone that the nodes can “know” anything.
What would the curve look like if we had one node fewer in the lab? Note that gradient descent is still doing the same job: minimizing the error. In that case, can each node still be responsible for just one segment? What if we had two nodes fewer? We can keep asking ourselves questions like these.
Cheers,
Raymond
I have been curious to learn the details of exactly how this works, so I did an experiment.
I created a training set that is a parabolic curve, where x goes from -5 to +4, and y = x^2.
I set up a 2-layer NN, with one input unit, 5 hidden layer units with ReLU activation, and one output unit with linear activation.
It converged nicely, and here is a plot of ‘y’ and ‘y-hat’.
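For anyone who wants to reproduce this, here is a minimal sketch of the setup in TensorFlow/Keras. The layer sizes and the training data match the description above, but the number of samples, optimizer, learning rate, and epoch count are my own assumptions, and the layer names 'hidden' and 'output' are just labels I added for later inspection:

```python
import numpy as np
import tensorflow as tf

# Training set: a parabola, with x from -5 to +4 and y = x^2
x_train = np.linspace(-5, 4, 200).reshape(-1, 1)
y_train = x_train ** 2

# 2-layer network: one input, 5 ReLU hidden units, one linear output unit
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(5, activation='relu', name='hidden'),
    tf.keras.layers.Dense(1, activation='linear', name='output'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss='mse')
model.fit(x_train, y_train, epochs=2000, verbose=0)

y_hat = model.predict(x_train)   # compare y_hat against y_train to reproduce the plot
```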
Here’s what is inside each ReLU unit:
z = max(0, w*x + b)
So each ReLU unit can optimize two values: ‘w’, the slope of its line segment, and ‘b’, the bias, which together with ‘w’ sets the point where the output is cut off to 0 (see the sketch after the list below).
- If ‘w’ is negative, then the curve looks like " \_ ".
- If ‘w’ is positive, then it looks like " _/ ".
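Here is a tiny NumPy sketch (hand-picked ‘w’ and ‘b’ values, not the trained ones) showing how the sign of ‘w’ sets the direction and how ‘w’ and ‘b’ together set the cut-off point at x = -b/w:

```python
import numpy as np

def relu_unit(x, w, b):
    # One hidden unit: the line w*x + b, clipped at zero by ReLU
    return np.maximum(0, w * x + b)

x = np.linspace(-5, 4, 10)
print(relu_unit(x, w=2.0, b=-3.0))   # positive w: "_/" shape, cut-off at x = -b/w = 1.5
print(relu_unit(x, w=-2.0, b=-3.0))  # negative w: "\_" shape, cut-off at x = -1.5
```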
Looking just at what the ReLU units are learning, here is a plot that shows the output of each ReLU unit (1 through 5).
You can see that two units have negative ‘w’, and three have positive. All five units have different bias values, which allows each curve to shift vertically. Because of the shape of this training set (all y values are positive), all of the bias values are negative.
All of the units in this example have slightly different slope values - it’s subtle but evident. Here are the biases and weights for the hidden layer:
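If you want to poke at those numbers yourself, one way to pull the hidden-layer parameters and per-unit outputs out of the Keras sketch above looks like this (the layer name 'hidden' comes from that sketch, not from the lab; each column of hidden_out is the curve of one ReLU unit, which is what the per-unit plot shows):

```python
# Inspect the trained hidden layer (continuing the sketch above)
W1, b1 = model.get_layer('hidden').get_weights()   # W1 has shape (1, 5), b1 has shape (5,)
print("hidden weights:", W1.ravel())
print("hidden biases: ", b1)

# Output of each ReLU unit over the whole input range, one column per unit
hidden_out = np.maximum(0, x_train @ W1 + b1)      # shape (200, 5)
```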
At the output layer, each of these ReLU outputs is multiplied by an output weight, the products are summed, and the output bias is added.
So there is again a chance to re-scale each ReLU output before they are all summed together in the output unit.
Here are the weights and bias for the output unit:
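Since the output layer is just a weighted sum plus a bias, the model's prediction can be rebuilt by hand from the per-unit outputs (again using the names from the sketch above):

```python
# Reconstruct the model output by hand (continuing the sketch above)
W2, b2 = model.get_layer('output').get_weights()   # W2 has shape (5, 1), b2 has shape (1,)

# Each ReLU output is scaled by its output weight, then all are summed
# and the output bias is added; this matches model.predict(x_train)
y_hat_manual = hidden_out @ W2 + b2
```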
Conclusion:
It is incorrect to say that each ReLU unit learns one segment of a piecewise linear function. Each unit does contribute a linear segment, but the final shape of the model output also depends on the weighted sum of all of the ReLU outputs.
That is wonderful! Thanks, Tom!
Cheers,
Raymond
impressive. thank you