In the demand prediction problem, what is the use of ReLU in finding the activation of 'awareness'?

Activation functions other than the sigmoid activation function were discussed in this video.

In the explanation of the preference for ReLU over the sigmoid activation function, Professor Ng mentioned that this was because, in the latter case, we were assuming that awareness could only be 0 or 1, whereas in reality a person can be anywhere from only slightly aware to fully aware, with every degree in between.

For this reason, as the professor explained, we have chosen the ReLU function, which can take any non-negative value rather than being squashed between 0 and 1.
But my doubt is: why is sigmoid restricted to only 0 or 1? I thought the loss function might be the issue, -y log(x) - (1-y) log(1-x), but its minimum is at x = y, and that works for every y in [0, 1].
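As a quick sanity check on that claim (my own working, with x the sigmoid output and y the label), setting the derivative of the loss to zero gives

$$\frac{d}{dx}\Big[-y\log(x) - (1-y)\log(1-x)\Big] = -\frac{y}{x} + \frac{1-y}{1-x} = 0 \;\Longrightarrow\; x = y,$$

so the loss itself doesn't force the output toward 0 or 1.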
I think the sigmoid function gives an output that can be related to the 'amount' of awareness an individual has about the product. So, a 0.3 activation would imply that the person is not that aware of this product, whereas 0.8 would mean that he is fairly knowledgeable about it.

So, shouldn’t the sigmoid function also be a good activation for awareness?

I don’t fully comprehend the intuitive explanation that Andrew uses here.

A definitive reason for using ReLU is that the partial derivative is extremely easy to compute: it's 1 for values >= 0 and 0 for values < 0 (strictly speaking, the derivative at exactly 0 is undefined, but implementations simply pick a convention). This makes minimizing a cost function that uses ReLU units computationally inexpensive.

It’s more efficient than having to evaluate an exponential.
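A minimal NumPy sketch (my own illustration, not from the course) of both activations and their gradients, just to make the comparison concrete:

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z), elementwise
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 where z >= 0, 0 where z < 0 -- just a comparison, no exponential
    return (z >= 0).astype(float)

def sigmoid(z):
    # Sigmoid: 1 / (1 + e^(-z)) -- requires evaluating an exponential
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # sigma(z) * (1 - sigma(z)) -- also needs the exponential
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(relu_grad(z))     # [0. 0. 1. 1. 1.]
print(sigmoid_grad(z))  # approximately [0.0066 0.235 0.25 0.235 0.0066]
```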

ReLU also has the handy characteristic that the output value can be any real non-negative number.

The drawback is that the output cannot be negative. So in order to represent outputs with negative values, you need more ReLU units together with negative weights in the following layer, so that the inputs to that next layer can still be less than zero.
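A tiny sketch of that last point (hypothetical numbers, just for illustration): the ReLU outputs themselves are non-negative, but the next layer can still weight them negatively.

```python
import numpy as np

a = np.array([0.0, 2.0])          # outputs of two ReLU units -- never negative
W_next = np.array([[0.5, -1.5]])  # the next layer's weights can be negative
b_next = np.array([-0.2])

z_next = W_next @ a + b_next      # 0.5*0.0 + (-1.5)*2.0 - 0.2 = -3.2
print(z_next)                     # [-3.2]
```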

There are a few problems with sigmoid activation. For large positive or negative inputs, the gradient of the sigmoid becomes very small. During backpropagation, gradients shrink layer by layer, making it hard for deep networks to learn. Inputs with high magnitude saturate the output (push it close to 0 or 1), where the gradient is nearly zero, so neurons become "stuck" and stop learning. Because of these issues, sigmoid is rarely used in hidden layers today, but it is commonly used in the output layer to produce a probability between 0 and 1.
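A rough illustration of the shrinking-gradient point (my own back-of-the-envelope numbers): the sigmoid derivative sigma(z) * (1 - sigma(z)) peaks at 0.25, so a chain of sigmoid layers multiplies the backpropagated gradient by at most 0.25 per layer.

```python
# Upper bound on the gradient surviving a stack of sigmoid layers:
# each layer contributes a factor of at most 0.25 (the sigmoid derivative's peak).
max_grad = 0.25
for depth in [1, 5, 10, 20]:
    print(depth, max_grad ** depth)
# 1 0.25
# 5 0.0009765625
# 10 9.5367431640625e-07
# 20 9.094947017729282e-13
```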

ReLU activation is not perfect either. If a neuron gets stuck in the negative region, it always outputs 0, and its gradient is 0. The neuron stops learning and can be permanently "dead" if it never activates again. Using Leaky ReLU, Parametric ReLU, or SELU activations helps to mitigate the problem. ReLU can also output very large values for large inputs, which can cause exploding activations in deep networks, making optimization unstable. Batch normalization can help to mitigate this problem.
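A minimal sketch of Leaky ReLU (my own illustration; the 0.01 slope is a common default, not a value from the course):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Keeps a small slope alpha for z < 0, so negative inputs still produce
    # a small nonzero output instead of a flat 0
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # 1 for z > 0, alpha elsewhere -- never exactly 0, so the neuron can't 'die'
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))       # [-0.03  -0.005  0.  2.]
print(leaky_relu_grad(z))  # [0.01  0.01  0.01  1.]
```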
