In the video “Choosing activation functions”, Andrew outlines his recommendations for which activation function to use for the output layer depending on the range of possible values. This makes sense to me.
However, he then proceeds to recommend that “… then for the hidden layers, I would recommend just using ReLU as the default activation function.”
I’m struggling to understand how this could optimise the model performance if the range of possible values extends into the negative space. Wouldn’t the hidden layers effectively filter out/diminish all negative inputs before they even arrive at the output layer?
“Wouldn’t the hidden layers effectively filter out/diminish all negative inputs before they even arrive at the output layer?”
A ReLU unit outputs 0 when the linear combination of its inputs, z = w·x + b, is zero or negative, not when the raw inputs themselves are negative. A negative weight applied to a negative input still produces a positive contribution, so the ReLU unit can still output a non-zero value.
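Here is a small numeric sketch (the input, weight, and bias values are made up purely for illustration) showing how a negative input can still pass through a ReLU hidden unit when its weight is also negative:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hypothetical single hidden unit: negative input, negative weight.
x = np.array([-2.0])   # negative input feature
w = np.array([-1.5])   # negative weight learned by the unit
b = 0.5                # bias

z = np.dot(w, x) + b   # z = (-1.5)(-2.0) + 0.5 = 3.5, which is positive
a = relu(z)            # ReLU passes the positive value through: a = 3.5
print(z, a)
```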
I keep forgetting that the activation function is applied to z (the linear combination of inputs, weights, and bias), so a negative input can still produce a positive activation as long as, as you mention, the weight is negative (or, I suppose, the bias is a large enough positive number).
Since a ReLU unit outputs zero (with zero gradient) whenever its pre-activation z is negative, units can get stuck during training, so typically one has to use more ReLU units in a hidden layer than if another activation (like sigmoid) were used.
The benefit of ReLU is that the activation (and its gradient) is extremely cheap to compute.
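For comparison, here is a minimal sketch (assuming NumPy) of the two activations side by side: ReLU is just a threshold, while sigmoid requires an exponential for every element.

```python
import numpy as np

def relu(z):
    # One comparison per element; the gradient is simply 0 or 1.
    return np.maximum(0, z)

def sigmoid(z):
    # Requires an exponential per element, which is noticeably more expensive.
    return 1.0 / (1.0 + np.exp(-z))
```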