Week 1 ReLU vs mod(x)

Hi, @Rajat_Goyal. Take a look at the response by @paulinpaloalto in a similar discussion. But to your first assertion: we do not have that advantage for sigmoid, because its slope is everywhere positive. It only goes to zero in the limits z \rightarrow -\infty and z \rightarrow +\infty. That's exactly the problem: gradient descent runs the risk of becoming very slow, because it spends a lot of time at very negative and/or very positive values of z, where the gradient is tiny and the parameters change only slowly. With ReLU, whose gradient is pegged at exactly zero for negative z, there is no change there at all.
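
If it helps to see the numbers, here is a small sketch (my own toy code, not from the course) that prints the sigmoid and ReLU derivatives at a few values of z. The sigmoid gradient collapses toward zero as |z| grows, while ReLU's derivative is exactly 1 for positive z and exactly 0 for negative z:

```python
import math

def sigmoid_grad(z):
    # Derivative of sigmoid: s(z) * (1 - s(z)); peaks at 0.25 when z = 0,
    # and shrinks toward 0 as |z| grows.
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: exactly 1 for z > 0, exactly 0 for z < 0.
    return 1.0 if z > 0 else 0.0

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"z = {z:6.1f}   sigmoid' = {sigmoid_grad(z):.6f}   relu' = {relu_grad(z):.1f}")
```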

As for the modulus (i.e. absolute value) function, I do not have a definitive answer. Just because I have never seen it applied is not a basis for rejecting the idea out of hand for all applications! :thinking: Intuitively, the negatively-sloped part, i.e. z < 0, "turns the volume down" on the neuron as z increases, rather than up. In gradient descent, the shape of the activation works in concert with the shape of the cost function, so it's hard for me to imagine the result for the general case. Worth experimenting with and pondering!
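
For anyone who wants to poke at this, here is a minimal one-neuron sketch (my own toy example, with a squared-error loss and made-up numbers, not anything from the course) contrasting the gradient you get from ReLU vs. |z| when z is negative. ReLU contributes exactly zero, while the |z| branch contributes a gradient through its slope of -1, which is the "volume down" behavior I mean:

```python
def neuron_grad_w(x, y, w, b, g, g_prime):
    """Gradient of L = (g(z) - y)^2 w.r.t. w for a single example, z = w*x + b."""
    z = w * x + b
    a = g(z)
    dL_da = 2.0 * (a - y)
    return dL_da * g_prime(z) * x

relu       = lambda z: max(z, 0.0)
relu_grad  = lambda z: 1.0 if z > 0 else 0.0
abs_act    = lambda z: abs(z)
abs_grad   = lambda z: 1.0 if z > 0 else -1.0   # slope -1 on the z < 0 branch

# A point where z = w*x + b is negative:
x, y, w, b = 1.0, 0.5, -2.0, 0.0
print("ReLU dL/dw:", neuron_grad_w(x, y, w, b, relu, relu_grad))      # 0.0 -> no update
print("|z|  dL/dw:", neuron_grad_w(x, y, w, b, abs_act, abs_grad))    # nonzero update
```

How those nonzero updates interact with a real cost surface over a whole network is exactly the part I can't predict in general.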
