C2_W2_Relu activation lab

I have been curious to learn the details of exactly how this works, so I did an experiment.

I created a training set that is a parabolic curve, where x goes from -5 to +4, and y = x^2.

I set up a 2-layer NN, with one input unit, 5 hidden layer units with ReLU activation, and one output unit with linear activation.
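
In case it helps anyone reproduce this, here is a minimal sketch of that setup, assuming TensorFlow/Keras (the layer names, optimizer, and epoch count are my own choices, not necessarily what the lab uses):

```python
import numpy as np
import tensorflow as tf

# Parabolic training set: x from -5 to +4, y = x^2
X = np.linspace(-5, 4, 100).reshape(-1, 1)
Y = X ** 2

# 2-layer network: 5 ReLU hidden units, 1 linear output unit
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(5, activation='relu', name='hidden'),
    tf.keras.layers.Dense(1, activation='linear', name='output'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss='mse')
model.fit(X, Y, epochs=500, verbose=0)
```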

It converged nicely, and here is a plot of ‘y’ and ‘y-hat’.

Here’s what is inside each ReLU unit:
output = max(0, w*x + b)
So each ReLU unit can learn two values: the slope of its line segment (w), and the bias (b), which determines where the output is clipped to 0.

  • If ‘w’ is negative, then the curve looks like '\_'.
  • If ‘w’ is positive, then it looks like '_/'.
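
To make those two shapes concrete, here is a tiny sketch of a single ReLU unit; the weight and bias values are made up purely for illustration:

```python
import numpy as np

x = np.linspace(-5, 4, 10)

def relu_unit(x, w, b):
    """Output of one ReLU unit: max(0, w*x + b)."""
    return np.maximum(0, w * x + b)

# Negative w: active (non-zero) only for small x, so the curve looks like '\_'
print(relu_unit(x, w=-1.0, b=-2.0))
# Positive w: active only for large x, so the curve looks like '_/'
print(relu_unit(x, w=1.0, b=-2.0))
```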

Looking just at what the ReLU units are learning, here is a plot that shows the output of each ReLU unit (1 through 5).

You can see that two units have negative ‘w’ and three have positive ‘w’. All five units have different bias values, which shifts each unit’s line segment vertically and so moves the point where the unit turns on. Because of the shape of this training set (all y values are positive), all of the bias values are negative.

All of the units in this example have slightly different slope values - it’s subtle but evident. Here are the biases and weights for the hidden layer:
[image: hidden-layer weights and biases]
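
For anyone who wants to recreate that plot, this is roughly how the hidden-layer parameters and per-unit outputs can be pulled out (continuing the Keras sketch above; the layer name 'hidden' is my own):

```python
# W1 has shape (1, 5) and b1 has shape (5,): one weight and one bias per hidden unit.
W1, b1 = model.get_layer('hidden').get_weights()
print("hidden weights:", W1.flatten())
print("hidden biases: ", b1)

# Recompute each unit's ReLU output over the training inputs.
# Each column of A1 is one of the '\_' or '_/' curves in the plot.
A1 = np.maximum(0, X @ W1 + b1)   # shape (num_samples, 5)
```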

At the output layer, each of these ReLU outputs is multiplied by an output weight, and the results are summed together with the output bias.
So again there is a chance to re-scale each ReLU output before they are all combined in the output unit.

Here are the weights and bias for the output unit:
[image: output-layer weights and bias]
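
Putting it together, the model’s prediction can be rebuilt by hand from the ReLU outputs and these output-layer parameters (again continuing the sketch above; 'output' is my own layer name):

```python
# W2 has shape (5, 1) and b2 has shape (1,): one weight per ReLU output, plus one bias.
W2, b2 = model.get_layer('output').get_weights()
print("output weights:", W2.flatten(), "output bias:", b2)

# y-hat is just the weighted sum of the ReLU outputs, plus the output bias.
y_hat_manual = A1 @ W2 + b2
y_hat_model = model.predict(X, verbose=0)
print(np.allclose(y_hat_manual, y_hat_model, atol=1e-5))   # should print True
```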

Conclusion:
It is incorrect to say that each ReLU unit learns one segment of a piecewise linear function. Each unit does contribute a linear segment, but the final shape of the model output also depends on the weighted sum of all of the ReLU outputs.
