DL and NN Course 1, Week 3: Understanding activation functions

Hi all,

I have a couple of questions about the different activation functions and how they compare against each other. I’m trying to get a feel for how the nature of the activation function helps the network learn faster/better. I’d appreciate “intuitive reasoning” if the math is too hard. But hey, I’m up for some math too :wink:

  1. What do we mean when we say tanh almost always works better than sigmoid because it centers the mean of the activations at 0? How does that actually lead to faster learning? Does the steeper slope of tanh near zero also help learning?
  2. How does the ReLU activation give better performance when it is almost linear? If the goal was to have non-zero gradients away from zero, why not use a non-linear function for positive z values instead of a linear one, say a square or exponential? Would their higher gradients lead to vanishing/exploding gradient issues?
  3. Also, when we keep w small (scaling by 0.01) so that we don’t end up in the low-gradient regions of the sigmoid function in the subsequent layers, aren’t we restricting the function to the region near zero where it’s almost linear? Does that reduce the model’s ability to capture the non-linearity of the NN? (This could be related to the answer to 2, but I wanted to make sure I get this right.)

Thanks,
Hari

tanh isn’t necessarily faster in convergence.

Since its output is symmetric around zero, it is well matched to outputs that take real values. Sigmoid activation, by contrast, gives values between 0 and 1, so it is not well suited to real values (which could be both negative and positive).
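For intuition, here is a minimal NumPy sketch (my own illustration, not course code) comparing the two functions and their derivatives: tanh outputs are centered on 0 and its slope at 0 is 1, while sigmoid outputs sit in (0, 1) and its slope tops out at 0.25.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def dtanh(z):
    return 1.0 - np.tanh(z) ** 2  # peaks at 1.0 when z = 0

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print("sigmoid(z): ", np.round(sigmoid(z), 3))   # all in (0, 1), mean above 0
print("tanh(z):    ", np.round(np.tanh(z), 3))   # in (-1, 1), centered on 0
print("sigmoid'(z):", np.round(dsigmoid(z), 3))  # at most 0.25
print("tanh'(z):   ", np.round(dtanh(z), 3))     # up to 1.0, larger gradient near 0
```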

ReLU would only be used in a hidden layer.

Whether its performance is “better” depends on how you define “better”. The gradients of ReLU are very easy to compute: they’re either 0 or 1. However, since ReLU units don’t learn anything for negative inputs, you need a lot more of them (some will learn negative weights, in order to give inverted outputs).

So a hidden layer with ReLU units will need a lot more of them than if the hidden layer had sigmoid or tanh.
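As a quick illustration (my own sketch, not course code), ReLU and its gradient reduce to a max and a comparison, and the gradient really is just 0 or 1:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def drelu(z):
    # 1 for positive inputs, 0 otherwise (undefined at exactly 0;
    # in practice it is simply set to 0 or 1 there)
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("relu(z): ", relu(z))   # [0.  0.  0.  0.5 2. ]
print("relu'(z):", drelu(z))  # [0. 0. 0. 1. 1.] -- no learning signal for negative z
```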

Scaling the features helps keep the magnitudes of the gradients in the same range, which does two things (a short sketch follows the list):

  • Allows for selecting a higher learning rate, without risking the cost diverging to +Inf.
  • Keeps the activations away from the regions where they start to reach their limiting values.
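To make the second point concrete, here is a small sketch (my own, with made-up layer sizes) of why scaling the initial weights by 0.01 matters: small weights keep the pre-activations z = W·x near zero, where tanh (or sigmoid) still has a large gradient, while larger weights push most units into the flat, saturated regions. The near-linear start is only the initialization; training moves the weights, and with them the activations, away from that region, so the non-linearity is still available to the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 100, 50            # made-up layer sizes for illustration
x = rng.standard_normal(n_in)       # one input example

for scale in (0.01, 1.0):
    W = rng.standard_normal((n_hidden, n_in)) * scale
    z = W @ x                        # pre-activations of the hidden layer
    grad = 1.0 - np.tanh(z) ** 2     # tanh'(z): near 1 for small z, near 0 when saturated
    print(f"scale={scale}: mean |z| = {np.abs(z).mean():.2f}, "
          f"mean tanh'(z) = {grad.mean():.2f}")
```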

Note that there are quite a few other activation functions for fine tuning or eliciting better behavior in this or that case. This is mostly decided empirically.
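For example, two common variants (written out here as my own sketch, not course code) keep a small or smooth response for negative inputs instead of zeroing it out:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Like ReLU, but with a small slope alpha for negative inputs,
    # so those units still receive a (small) gradient.
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # Exponential Linear Unit: smooth for z < 0, saturating at -alpha.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print("leaky_relu:", leaky_relu(z))
print("elu:       ", np.round(elu(z), 3))
```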

The old “sign(x)” seems to have been dropped completely :face_with_tears_of_joy:
