Better Activation Functions (part 2)

In my last thread ([you can check it here](Better Activation Functions: tanh > sigmoid)), I talked about why tanh is better than sigmoid in the hidden layers of neural networks. Today I'm going to share with you the ReLU activation function and its variants. So, what is the ReLU activation function?

The Rectified Linear Unit (ReLU) is the most commonly used activation function in deep learning models. The function returns 0 if it receives any negative input, but for any positive value x it returns that value back. So it can be written as f(x) = max(0, x).
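That one-liner translates directly to code. Here's a minimal sketch in NumPy (the function name `relu` is my own):

```python
import numpy as np

def relu(x):
    # Return 0 for negative inputs, the input itself otherwise: f(x) = max(0, x)
    return np.maximum(0, x)

# A few sample points along the curve
xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu(xs))  # [0. 0. 0. 1. 3.]
```

Note that `np.maximum` is elementwise, so the same function works on scalars and whole arrays of activations.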

Graphically it looks like this:


Why It Works

Introducing Interactions and Non-linearities

Activation functions serve two primary purposes:

1. Help a model account for interaction effects. What is an interaction effect? It is when one variable A affects a prediction differently depending on the value of another variable B. For example, if my model wanted to know whether a certain body weight indicated an increased risk of diabetes, it would have to know an individual's height. Some body weights indicate elevated risk for short people, while indicating good health for tall people. So the effect of body weight on diabetes risk depends on height, and we would say that weight and height have an interaction effect.

2. Help a model account for non-linear effects. This just means that if I graph a variable on the horizontal axis and my predictions on the vertical axis, the result isn't a straight line. Said another way, the effect of increasing the predictor by one unit differs at different values of that predictor.

How ReLU captures Interactions and Non-Linearities

Interactions: Imagine a single node in a neural network model. For simplicity, assume it has two inputs, called A and B. The weights from A and B into our node are 2 and 3 respectively. So the node output is f(2A + 3B). We'll use the ReLU function for our f. So, if 2A + 3B is positive, the output value of our node is also 2A + 3B. If 2A + 3B is negative, the output value of our node is 0.

For concreteness, consider a case where A = 1 and B = 1. The output is 2A + 3B = 5, and if A increases, the output increases too. On the other hand, if B = -100 then the output is 0, and if A increases moderately, the output remains 0. So A might increase our output, or it might not. It just depends on what the value of B is.
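The A/B example above can be checked in a few lines of Python (the `node` helper is hypothetical, using the weights 2 and 3 from the text):

```python
def node(A, B):
    # Single node with weights 2 and 3 and a ReLU activation
    return max(0.0, 2 * A + 3 * B)

# With B = 1, increasing A increases the output...
assert node(1, 1) == 5   # 2*1 + 3*1
assert node(2, 1) == 7   # 2*2 + 3*1
# ...but with B = -100, a moderate increase in A changes nothing:
assert node(1, -100) == 0.0
assert node(2, -100) == 0.0
```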

Non-linearities: A function is non-linear if its slope isn't constant. The ReLU function is non-linear at 0, but its slope is always either 0 (for negative values) or 1 (for positive values). That's a very limited type of non-linearity.

But two facts about deep learning models allow us to create many different types of non-linearities from how we combine ReLU nodes.

First, most models include a bias term for each node. The bias term is just a constant number that is determined during model training. For simplicity, consider a node with a single input called A, and a bias. If the bias term takes a value of 7, then the node output is f(7+A). In this case, if A is less than -7, the output is 0 and the slope is 0. If A is greater than -7, then the node’s output is 7+A, and the slope is 1.

So the bias term allows us to move where the slope changes. So far, it still appears we can have only two different slopes.

However, real models have many nodes. Each node (even within a single layer) can have a different value for its bias, so each node can change slope at a different value of our input.
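To see how different biases give different slope-change points, here's a tiny sketch (the setup is my own invention for illustration): two ReLU nodes with biases +7 and -2, summed with weight 1 each, giving a piecewise-linear function with three distinct slopes.

```python
def relu(x):
    return max(0.0, x)

def layer(a):
    # Node 1 bends at a = -7, node 2 bends at a = 2;
    # their sum has slope 0, then 1, then 2.
    return relu(a + 7) + relu(a - 2)

# Measure the slope as the change over one unit step:
assert layer(-10) - layer(-11) == 0  # flat region, a < -7
assert layer(0) - layer(-1) == 1     # slope 1, between -7 and 2
assert layer(5) - layer(4) == 2      # slope 2, a > 2
```

Add more nodes with more biases and you get more bend points, which is how stacks of ReLU units approximate arbitrary curves.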

Advantages of ReLU

One major benefit is the reduced likelihood of the gradient vanishing. This arises when x > 0: in this regime the gradient has a constant value. In contrast, the gradients of sigmoid and tanh become increasingly small as the absolute value of x increases. The constant gradient of ReLUs results in faster learning.

Also, when training on a reasonably sized batch, there will usually be some data points giving positive values to any given node, so the average derivative is rarely close to 0. This allows gradient descent to keep progressing, and you can easily build bigger models with it.
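A quick way to see the vanishing-gradient contrast is to compare derivatives far from 0 (a small sketch using the standard closed-form sigmoid derivative s(x)(1 - s(x))):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1 - s)

def relu_grad(x):
    # Slope of max(0, x): 1 for positive inputs, 0 for negative
    return 1.0 if x > 0 else 0.0

# Far from 0, the sigmoid gradient has all but vanished,
# while the ReLU gradient is still exactly 1:
print(sigmoid_grad(10))  # roughly 4.5e-05
print(relu_grad(10))     # 1.0
```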

Why is the ReLU activation function so much better than a linear activation function, even though half of it is exactly the same?

As we discussed earlier in this thread, the whole point of an activation function is to be non-linear.

Let me explain why. If you have a network of multiple layers (a so-called "deep" network, as opposed to a "shallow" one), then your model can potentially learn to detect or handle much more sophisticated examples. Your network would use more interconnections between weights (and most likely more weights). A lot of the calculation boils down to multiplying and adding numbers together, like:

y = f(w1·x1 + w2·x2 + ⋯ + wn·xn + b)

with f(⋅) being some activation function.

When you "cascade" the layers, the output of each layer becomes an input to the next one. For 2 layers, for example:

y = f2(W2 · f1(W1 · x + b1) + b2)
However, if f(⋅) were a linear function f(x) = αx + β, then the whole network would "collapse" to just a one-layer network, simply because any composition of linear functions is itself a linear function.
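You can verify this collapse numerically: compose two affine layers with a linear activation f(z) = a·z + b, then reproduce the exact same map with a single affine layer (the weights, sizes, and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# Two "layers" with a linear activation f(z) = a*z + b
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)
a, b = 0.5, 1.0
f = lambda z: a * z + b

two_layer = f(W2 @ f(W1 @ x + b1) + b2)

# Expanding the algebra: y = (a^2 W2 W1) x + [a W2 (a b1 + b) + a b2 + b]
W = (a * a) * (W2 @ W1)
c = a * (W2 @ (a * b1 + b)) + a * b2 + b
one_layer = W @ x + c

assert np.allclose(two_layer, one_layer)  # the two-layer net is just one affine map
```

No matter how many linear layers you stack, the same expansion always reduces them to a single W and c, so the extra depth buys you nothing.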

But ReLU isn't perfect; it has some problems. I was going to discuss them now, but in order not to make this thread too long, I thought it might be better to make another thread explaining the disadvantages of ReLU and some solutions to them. So wait for the next part in an upcoming thread.

Thank you for reading.