This video discussed activation functions other than the sigmoid.
When explaining the preference for ReLU over the sigmoid activation function, Professor Ng said that with sigmoid we are effectively assuming awareness can only be 0 or 1, whereas in reality a person can be anywhere from only slightly aware to fully aware.
For this reason, as he explained, the ReLU function was chosen, since it can take values over a large range.
But my doubt is: why is sigmoid restricted to only 0 or 1? I thought the loss function might be the issue, -y log(x) - (1-y) log(1-x), but its minimum is attained at x = y, and that works for all y in [0, 1].
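To make that claim concrete, here is the quick check I had in mind (writing x for the sigmoid output and y for the label):

```latex
L(x) = -y\log(x) - (1-y)\log(1-x)
\quad\Rightarrow\quad
\frac{dL}{dx} = -\frac{y}{x} + \frac{1-y}{1-x} = 0
\quad\Rightarrow\quad
x = y .
```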
I think the sigmoid function gives an output that can be related to the "amount" of awareness an individual has about the product. So, a 0.3 activation would imply that the person is not that aware of this product, whereas 0.8 would mean that he is fairly knowledgeable about it.
So, shouldn't the sigmoid function also be a good activation for awareness?
I don't fully comprehend the intuitive explanation that Andrew uses here.
A definitive reason for using ReLU is that its partial derivative is extremely easy to compute: it's either 1 (for values >= 0) or 0 (for values < 0). This makes minimizing a cost function that uses ReLU units computationally inexpensive.
It's more efficient than having to evaluate an exponential, as the sigmoid requires.
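As a rough sketch of that comparison (the function names below are just illustrative, not from any particular framework):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative is 1 for z >= 0 and 0 otherwise (the value at exactly 0 is a convention).
    return (z >= 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)          # requires evaluating an exponential first
    return s * (1.0 - s)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu_grad(z))         # [0. 0. 1. 1. 1.] -- just a comparison against zero
print(sigmoid_grad(z))      # peaks at 0.25 near z = 0 and shrinks for large |z|
```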
ReLU also has the handy characteristic that the output value can be any real non-negative number.
The drawback is that the output cannot be negative. So, to learn outputs with negative values, you need more ReLU units, some of which the next layer combines with negative weights so that its inputs can be less than zero.
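A minimal sketch of that idea, assuming a toy layer of two ReLU units followed by a linear combination (the weights here are hand-picked for illustration):

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 5)      # inputs that include negative values

# Two ReLU units: one responds to positive x, the other to negative x.
h1 = np.maximum(0.0, x)
h2 = np.maximum(0.0, -x)

# The next layer attaches a negative weight to h2, so its output can go below zero
# even though each ReLU output is non-negative.
out = 1.0 * h1 - 1.0 * h2
print(out)                          # [-2. -1.  0.  1.  2.] -- reproduces x, negatives included
```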
There are a few problems with sigmoid activation. For large positive or negative inputs, the gradient of the sigmoid becomes very small, so during backpropagation the gradients shrink layer by layer, making it hard for deep networks to learn. Inputs with high magnitude saturate the output (push it close to 0 or 1), where the gradient is nearly zero, so neurons become "stuck" and stop learning. Because of these issues, sigmoid is rarely used in hidden layers today, but it is commonly used in the output layer to produce a probability between 0 and 1.
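A quick numerical sketch of that saturation effect, assuming the standard sigmoid and its derivative s(1 - s):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(z)
    grad = s * (1.0 - s)            # derivative of the sigmoid at z
    print(f"z = {z:5.1f}   sigmoid = {s:.5f}   gradient = {grad:.2e}")

# The gradient is 0.25 at z = 0 but roughly 4.5e-05 at z = 10:
# a saturated unit passes back almost no gradient.
```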
ReLU activation is not perfect either. If a neuron gets stuck in the negative region, it always outputs 0 and its gradient is 0, so it stops learning and can be permanently "dead" if it never activates again. Using Leaky ReLU, Parametric ReLU, or SELU activations helps mitigate this problem. ReLU can also output very large values for large inputs, which can cause exploding activations in deep networks and make optimization unstable; batch normalization helps mitigate that.
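For reference, a minimal sketch of Leaky ReLU and its gradient (alpha = 0.01 is just an illustrative choice):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Keep a small slope alpha for z <= 0 instead of flattening to zero.
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-5.0, -1.0, 0.5, 3.0])
print(leaky_relu(z))        # [-0.05 -0.01  0.5   3.  ]
print(leaky_relu_grad(z))   # [0.01  0.01  1.    1.  ] -- never exactly zero, so the unit cannot go fully "dead"
```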