Better Activation Functions: (tanh > sigmoid)

In this specialization, we were taught to use the sigmoid and ReLU activation functions. But today we are going to explore some better options than these two.

The tanh activation function is better than sigmoid:

Remember normalization? Why did we use feature scaling and normalization? It is preferred to normalize the inputs so that their average is close to zero. These are the inputs to the first layer of our neural network, so why don’t we maintain that close-to-zero average for the inputs of the next hidden layers as well?

This heuristic should be applied at every layer, which means we want the average of a node’s outputs to be close to zero, because these outputs are the inputs to the next layer.

[Plot: the tanh function]

The advantage of tanh is that its range is [-1, 1] rather than [0, 1] as in the sigmoid. However, for the output layer in a binary classification problem it is more conventional to use the sigmoid, since the labels are 0 and 1 and you need a function that predicts a number in that range.
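To make the zero-centering point above concrete, here is a quick NumPy sketch (the zero-mean inputs are made up purely for illustration): for zero-mean inputs, tanh outputs stay roughly zero-centered while sigmoid outputs center around 0.5.

```python
import numpy as np

# Zero-mean inputs (made-up data, purely for illustration).
rng = np.random.default_rng(0)
z = rng.normal(loc=0.0, scale=1.0, size=10_000)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

a_sigmoid = sigmoid(z)   # outputs in (0, 1)
a_tanh = np.tanh(z)      # outputs in (-1, 1)

print(a_sigmoid.mean())  # ~0.5: not zero-centered
print(a_tanh.mean())     # ~0.0: zero-centered, matching the heuristic above
```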

People frequently ask why we can’t use tanh as the activation at the output layer for binary classification, declaring yes if the output is >= 0 and no if it is < 0. Here’s a thread about that, which also shows that tanh and sigmoid are actually very closely related mathematically.

Resources:
lecun-98b.pdf (433.4 KB)

5 Likes

Thanks @Osama_Saad_Farouk!

On the subject of ‘Better Activation Functions’, I’d like to add that, besides the multiple activation functions currently available in the frameworks, anyone can actually develop a custom activation function for any given model as long as two conditions are met (see the sketch after this list):

  • The function is differentiable
  • The function is non-linear
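As a minimal sketch of what that can look like (assuming TensorFlow/Keras; the function x · sigmoid(x), sometimes called swish, is used here purely as an illustration of something differentiable and non-linear):

```python
import tensorflow as tf

# A custom activation only needs to be differentiable and non-linear.
# Here we use x * sigmoid(x) ("swish") purely as an illustration.
def my_activation(x):
    return x * tf.math.sigmoid(x)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation=my_activation),  # custom activation in a hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),       # conventional sigmoid output for binary labels
])
```

Backpropagation works without extra effort because the framework differentiates the custom function automatically.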
7 Likes

Thanks for the series of posts, Osama! It supplements the lessons very well for those who want to know more!

2 Likes

Thank you, Chris. I find this very motivating to keep posting more of these posts 🙂

That would be awesome! Sharing knowledge is vital to our community!

1 Like

I definitely love the initiative and hope that more of us do the same. I’ll try to come up with a topic for a next “Learning Post”.

3 Likes

Definitely! Thank you, Juan!

Thank you, Juan. I encourage you to do so, and I hope we inspire a lot of other mentors to do the same and share their knowledge with everyone.

Interesting. I am curious if there are additional reasons to use \mathrm{tanh} besides just its range and average value. For example if we let \sigma(z) denote the sigmoid function, then the function \gamma(z) = 2\sigma(z)-1 also has the range [-1,1] and average value 0 just like \mathrm{tanh}. Is there an advantage to using \textrm{tanh} over \gamma?

1 Like

Good question,

If you look at the graph below:

[Plot: tanh in red vs. the shifted sigmoid 2\sigma(z) - 1 in blue]

You will notice that tanh is steeper and saturates more quickly, which gives it a larger slope near the origin. At this point, you might wonder why we need a larger slope.

To dive deeper into this concept, we have to demonstrate the vanishing gradient problem:

For the nodes in a neural network with sigmoid activation functions, we know that the derivative of the sigmoid reaches a maximum value of 0.25. As the network gets deeper, the product of these derivatives keeps shrinking, until at some point the partial derivative of the loss with respect to the earlier layers approaches zero and the gradient effectively vanishes. We call this the vanishing gradient problem. Here is a plot of the sigmoid function and its derivative:
[Plot: the sigmoid function and its derivative]

As you can see, the maximum value of the first derivative is 0.25, which is not the case for the first derivative of the tanh function, as illustrated in the following graph:

[Plot: the tanh function and its derivative]

The derivative of tanh can reach 1, hence the advantage.
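A small NumPy check of these two maxima, plus a rough illustration of how the product of per-layer derivatives shrinks (the 10-layer chain evaluated at z = 0 is a made-up, best-case scenario, just to show the trend):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # maximum 0.25, attained at z = 0

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2   # maximum 1.0, attained at z = 0

print(d_sigmoid(0.0), d_tanh(0.0))  # 0.25 1.0

# Hypothetical 10-layer chain, each factor taken at its maximum:
print(d_sigmoid(0.0) ** 10)  # ~9.5e-07 -- the gradient has all but vanished
print(d_tanh(0.0) ** 10)     # 1.0
```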

1 Like

A natural follow-up: since we are looking at a transformation of the sigmoid function \gamma(x) = 2\sigma(x)-1, we could also allow ourselves to scale before transforming, so we could consider the family of functions \gamma_c(x) = 2\sigma(cx)-1 for any real number c. Experimenting on Desmos with values 1, 2, 3, 4, etc., you’ll see that you can find functions in this family which go to zero faster or slower than \mathrm{tanh}. In fact, you’ll find something special for c=2: the two functions visibly overlap everywhere. That is because of the following:

2\sigma(2x)-1= \frac{2}{1+e^{-2x}}-1 = \frac{1-e^{-2x}}{1+e^{-2x}} = \frac{e^{2x}-1}{e^{2x}+1} = \mathrm{tanh}(x)

but if we want a steeper slope near the origin, and that is our only concern, then \gamma_3(x) = 2\sigma(3x)-1 would be a better choice than \textrm{tanh}(x).
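A quick NumPy check of this identity and of the steeper slope of \gamma_3 near the origin (just a sketch):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.linspace(-5.0, 5.0, 1001)

# 2*sigmoid(2x) - 1 coincides with tanh(x), up to floating-point rounding.
gamma_2 = 2 * sigmoid(2 * x) - 1
print(np.max(np.abs(gamma_2 - np.tanh(x))))  # effectively zero

# gamma_3 is steeper near the origin: slope c/2 = 1.5 at x = 0, vs. 1 for tanh.
gamma_3 = 2 * sigmoid(3 * x) - 1
```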

1 Like

I think the only difference between tanh and sigmoid is the range, or rather the center of the range. They both suffer from the vanishing gradient problem because their gradients are, most of the time, smaller than 1. ReLU, however, does not have this issue, although it has its own problem: every negative input maps to the same output, zero (the dying ReLU problem), which can be addressed with the leaky ReLU.
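For reference, a minimal NumPy sketch of the two (alpha = 0.01 is just a common illustrative choice):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)             # gradient is exactly 1 for z > 0, 0 for z < 0

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope keeps "dead" units trainable

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # [0.    0.    0.    1.5 ]
print(leaky_relu(z))  # [-0.02  -0.005  0.     1.5 ]
```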

Also, I think sigmoid remains popular for its role in the output layer of logistic problems, which need outputs in the range from 0 to 1. Sigmoid and tanh are nowadays not popular as hidden-layer activations, but ReLU is.

Cheers,
Raymond

1 Like

The sigmoid and tanh functions are closely related: you can start from either one and, with some shifting and scaling, end up with the other.

But why shift and scale the sigmoid just to get a result that is close to what tanh already gives you?

In addition, if you would like to come up with a custom activation function that works better than both of them, then great.

In the end, the purpose of this discussion is to make students new to machine learning aware that there are alternatives that work better, and if you can do even better with a custom activation function, that is great too.

Thank you for your insightful comments.

1 Like

Isn’t the condition more general than being differentiable? Isn’t having a subgradient enough?

Isn’t there also the point that ReLU was adopted because it leads to faster computation in both forward and backward propagation compared to “more complex” activation functions?

@RR5555

Thank you for making a good point that computing ReLU is faster!

Raymond

@RR5555 @rmwkwok
As I said before, of course the ReLU function is preferred, and there is also the leaky ReLU, which is even better. But the general purpose of this discussion is not to choose the best activation function to always use in practice. The goal is to draw new learners’ attention to different activation functions and to the different aspects of what makes an activation function better. I’m actually planning to post a new topic about the different kinds of ReLU and whether it is always the better choice or not.

@rmwkwok @RR5555
Here is the thread I told you about:

From what I know and from what I’ve read, the two conditions that are always mentioned are those listed above. Maybe you could shed some light on this topic, as I find it interesting that there could be this other alternative, yet I have not been able to find documentation about it.

Why do you think that having a sub-gradient should be enough? Do you know of an activation function that relies on a sub-gradient? Any other insight that would support this proposed alternative?

Thanks!

Juan