Week 1: ReLU vs mod(x)

Prof Ng said that ReLU is faster and solves the vanishing gradient problem. I found the following arguments for why it is faster:

  1. For some of the neurons, the derivative will be 0 and they will not be trained, hence reducing the training cost. But we should have the same advantage for sigmoid too.
  2. Calculating the derivative is easier.

Also, I don’t really understand how ReLU solves the vanishing gradient problem when the entire negative x range gives a 0 derivative. It only solves the problem for the positive x range. I have seen some variations of ReLU that address this.

But what if we just use modulus(x) instead of ReLU? It has a simple derivative and also gives non-zero, sufficiently large derivative values for negative x. So it wouldn’t have the vanishing gradient problem for any range of x.
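
Here’s a quick NumPy sketch of what I mean (the sample z values are just arbitrary points I picked to compare the three derivatives):

```python
import numpy as np

z = np.array([-5.0, -1.0, -0.1, 0.1, 1.0, 5.0])

# Derivative of ReLU: 0 for z < 0, 1 for z > 0
relu_grad = (z > 0).astype(float)

# Derivative of sigmoid: sigma(z) * (1 - sigma(z))
sig = 1.0 / (1.0 + np.exp(-z))
sig_grad = sig * (1.0 - sig)

# Derivative of modulus / absolute value: -1 for z < 0, +1 for z > 0
mod_grad = np.sign(z)

print("ReLU'   :", relu_grad)
print("sigmoid':", np.round(sig_grad, 4))
print("|z|'    :", mod_grad)
```

The ReLU derivative is 0 over the whole negative range, while |z| has a slope of magnitude 1 everywhere (except at 0), which is why I thought it might avoid the vanishing gradient problem entirely.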


Hey @Rajat_Goyal,
Welcome to the community. Firstly, let me highlight why we don’t have the same advantage with the sigmoid function, as you mentioned. The tails of the sigmoid function are indefinitely long, so no matter how large or small the input is, the derivative is never exactly zero; every sigmoid neuron always contributes some (possibly tiny) gradient and keeps being updated. And as to why ReLU is faster, your arguments are very much correct, in my view.
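
For example, a quick check of \sigma'(z) = \sigma(z)(1 - \sigma(z)) at a few inputs (the values below are just illustrative) shows that the sigmoid derivative gets tiny but never reaches exactly zero, whereas the ReLU derivative is exactly zero for every negative input:

```python
import numpy as np

def sigmoid_grad(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

for z in [-20.0, -10.0, -2.0, 0.0, 2.0, 10.0, 20.0]:
    relu_grad = 1.0 if z > 0 else 0.0
    print(f"z = {z:6.1f}   sigmoid' = {sigmoid_grad(z):.2e}   ReLU' = {relu_grad}")
```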

Your next point is also correct, i.e., the ReLU function only solves the vanishing gradient problem for the positive range, and hence there exist some variations such as Parametric ReLU, Leaky ReLU, etc., but you will find that in most practical scenarios, plain ReLU works quite well.
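
Just to make those variations concrete, here is a minimal sketch of Leaky ReLU (the slope 0.01 is just the commonly used default; in Parametric ReLU, alpha would instead be learned during training):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Same as ReLU for z > 0, but keeps a small non-zero slope alpha for z < 0
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Derivative is 1 for z > 0 and alpha for z < 0, so the gradient never dies completely
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 2.0])
print(leaky_relu(z))       # roughly [-0.03, -0.005, 2.0]
print(leaky_relu_grad(z))  # [0.01, 0.01, 1.0]
```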

As for modulus(x) as an activation function, that’s something I had never thought of, and I would very much like to know the answer myself :innocent:

Regards,
Elemento

Hi, @Rajat_Goyal. Take a look at the response by @paulinpaloalto in a similar discussion. But as to your first assertion, we do not have that advantage for sigmoid, because its slope is everywhere positive. It is only zero in the limits z \rightarrow -\infty and z \rightarrow +\infty. That’s exactly the problem: gradient descent runs the risk of becoming very slow, spending a lot of time at very small and/or very large values of z where the gradient is changing only slowly. With ReLU, by contrast, when the gradient is pegged at zero for negative z's, there is no change at all.
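
To see that slow-down numerically, here is a toy single-weight example (entirely made up for illustration) where the gradient-descent update collapses once the sigmoid is saturated:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup: one input x, one weight w, squared loss against target y
x, y, lr = 1.0, 1.0, 0.1

for w in [0.0, 2.0, 10.0, -10.0]:
    z = w * x
    a = sigmoid(z)
    # dL/dw = 2 * (a - y) * sigmoid'(z) * x, with sigmoid'(z) = a * (1 - a)
    grad = 2.0 * (a - y) * a * (1.0 - a) * x
    print(f"w = {w:6.1f}   z = {z:6.1f}   |update| = {abs(lr * grad):.2e}")
```

At z = 0 the update is on the order of 10^{-2}, but out at z = \pm 10 it shrinks to the order of 10^{-6} or smaller, which is exactly the crawl I’m describing.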

As for the modulus (i.e. absolute value) function, I do not have a definitive answer. Just because I have never seen it applied is not a basis for rejecting the idea out of hand for all applications! :thinking: Intuitively, the negatively-sloped part, i.e. z < 0, “turns the volume down” on the neuron as z increases, rather than up. In gradient descent, the shape of the activation works in concert with the shape of the cost function, so it’s hard for me to imagine the result in the general case. Worth figuring out and pondering!


Ken and Vishesh have covered the important issues, but maybe the one additional thing to point out is that another way ReLU is faster is not just that it gives better convergence in a lot of cases, but that it is literally cheap to compute. Activation functions like sigmoid, tanh, and swish, which involve the exponential function, take a lot more computation: you’re doing something equivalent to expanding a Taylor series to compute e^z. That’s pretty heavyweight compared to a simple comparison z > 0. Training a deep neural network with lots of parameters is a pretty expensive proposition already, so if you can save CPU cycles and still get good convergence, that’s a big win.
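
If you want to see the cost difference on your own machine, a rough timing comparison looks something like this (the numbers will vary a lot by hardware and library, so treat them as ballpark only):

```python
import timeit
import numpy as np

z = np.random.randn(1_000_000)

relu_t = timeit.timeit(lambda: np.maximum(0.0, z), number=100)
sigmoid_t = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-z)), number=100)
tanh_t = timeit.timeit(lambda: np.tanh(z), number=100)

print(f"ReLU:    {relu_t:.3f} s")
print(f"sigmoid: {sigmoid_t:.3f} s")
print(f"tanh:    {tanh_t:.3f} s")
```

You’ll generally find the exponential-based functions noticeably slower than the simple elementwise max, and that gap gets multiplied across every unit and every training iteration.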

On the question about using absolute value, I hadn’t thought of that either. You could always try it and see how it works. People frequently observe that it’s not differentiable at z = 0, but neither is ReLU and that doesn’t seem to be a problem: just use one of the limit values for the derivative at z = 0. But one thing to note is that all the other functions we use as activations are monotonic (not strictly in the case of ReLU, but monotonic non-decreasing) or very close to it; in the case of swish, there’s one region where it dips slightly. I think I asked once upon a time whether it was technically required that an activation function be monotonic, and someone on the course staff said that it was not strictly required and pointed to swish, but I didn’t pursue the issue any further. Maybe worth a little Google searching!
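
If you do want to try it, Keras accepts any elementwise callable as an activation, so a sketch along these lines (the layer sizes and the 20-dimensional input are arbitrary placeholders, and I haven’t run this particular experiment myself) would let you swap |z| in for ReLU and compare:

```python
import tensorflow as tf

def make_model(activation):
    # Two identical architectures that differ only in the hidden-layer activation
    return tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(64, activation=activation),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

relu_model = make_model("relu")
abs_model = make_model(tf.abs)   # modulus / absolute value as the activation

for model in (relu_model, abs_model):
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# ...then call fit() on both models with the same data and compare the learning curves
```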

If you try modulus, please let us know what you learn. This is an experimental science! :nerd_face:
