Alternatives to sigmoid function

In the lecture, it was mentioned that if we use sigmoid, awareness is assumed to be binary, 0 or 1. But the hidden unit calculating awareness actually outputs a number n with 0 < n < 1, like n = 0.7, and n itself shows the level of awareness: 0.99 is very high awareness. We then feed n = 0.7 forward to calculate a2_1. I’m confused, am I missing something? Can anyone please explain? The sigmoid function already outputs values in the (0, 1) range, not just 0 or 1.

1 Like

Hello @karra1729, you are absolutely right that sigmoid outputs values between 0 and 1, and I think Professor Andrew Ng also mentioned in Course 2 Week 2, Video “Alternatives to the sigmoid activation”, at ~1 min 31 sec that

So whereas previously we had used this (sigmoid) equation to calculate the activation of that second hidden unit estimating awareness where g was the sigmoid function and just goes between 0 and 1…

Cheers!

But it seems like the degree to which possible buyers are aware of the t-shirt you’re selling may not be binary: they can be a little bit aware, somewhat aware, extremely aware, or it could have gone completely viral. So rather than modelling awareness as a tiny number 0, 1, where you try to estimate the probability of awareness, or rather than modelling awareness as just a number between 0 and 1, maybe awareness should be any non-negative number, because there can be any non-negative value of awareness going from 0 up to very, very large numbers.

My doubt:

Why can’t we model awareness as a number between 0 and 1? For example:
0.01 - completely unaware
0.2 - a little bit aware
0.5 - slightly aware
0.75 - extremely aware
0.99 - completely gone viral.

Why does it have to be a large non-negative number? Suppose 56000 means extremely aware: I can map 56000 to a number between 0 and 1 (somewhere around 0.8) and the weights will be calculated accordingly. Isn’t modelling awareness as a probability slightly better than as a large number? I understand both methods work; my main doubt is: is there a necessity for values to be > 1, and why is that better than (0, 1)?
I’m sorry if this is a stupid doubt. Thank you in advance.

Indeed it is a great question!

Since I believe you agree that the main issue here is the output range, and not anything specific to awareness or any other name, I think we can rephrase the question as: why ReLU over sigmoid (or the other way around) for hidden layers?

This topic isn’t discussed in depth within the scope of the currently released Courses 1 & 2. I am not sure about Course 3 because it is not released yet, but I did a quick scan of the Deep Learning Specialization (DLS) and found that its Course 1 Week 3 video “Activation Function” compares sigmoid and ReLU.

I highly recommend that you watch the video yourself, but in short, compared to ReLU: (1) sigmoid is computationally slower because it involves computing the exponential function, and (2) sigmoid gives a very small gradient value (close to 0) in the far positive (and far negative) regions, which gives some parameters only a very small update step and results in slow training.
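To make point (2) concrete, here is a minimal NumPy sketch (my own illustration, not from the course) comparing the two derivatives. The sigmoid derivative g'(z) = g(z)(1 − g(z)) collapses towards 0 for large |z|, while the ReLU derivative stays at 1 for any positive z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)      # g'(z) = g(z) * (1 - g(z)), at most 0.25

def relu_grad(z):
    return float(z > 0)       # 1 for z > 0, 0 otherwise

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}   sigmoid grad = {sigmoid_grad(z):.6f}   ReLU grad = {relu_grad(z):.1f}")
```

At z = 10 the sigmoid gradient is already around 4.5e-5, so the corresponding weights barely move in each gradient descent step.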

A topic relevant to the second point above is the “vanishing gradient problem”, which is introduced in DLS Course 2 Week 1, Video “Vanishing / Exploding gradient”. Again, please watch it if you want to know more, but in short: both the small gradient values and the fact that sigmoid squashes a wide range of inputs into a narrow range of outputs contribute to the problem, and it becomes more significant as you grow your neural network deeper.
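As a rough illustration of the vanishing gradient effect (again my own sketch, not course material): stack many 1-unit sigmoid “layers” and, by the chain rule, the gradient of the output with respect to the input is the product of w_l · g'(z_l) over all layers. Since g'(z) ≤ 0.25, that product tends to shrink very quickly:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = 1.0      # scalar input signal
grad = 1.0   # running product of per-layer factors (chain rule)

for layer in range(1, 31):
    w = rng.normal()             # random scalar weight for this layer
    z = w * a
    a = sigmoid(z)
    grad *= w * a * (1.0 - a)    # multiply by w_l * sigmoid'(z_l)
    if layer % 10 == 0:
        print(f"after {layer} layers, d(output)/d(input) ≈ {grad:.2e}")
```

With ReLU the per-layer derivative is exactly 1 whenever the unit is active, so the same product does not systematically shrink.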

If a DLS mentor happens to see this post, I hope they will share more insights or past discussions with us.

3 Likes

Thank you so much for making so much effort to clear my doubt.

1 Like

You are welcome @karra1729

1 Like

After being introduced to ReLU (and hearing that it is the most popular activation function), I couldn’t stop wondering: if some training case makes some hidden neuron’s ReLU output 0, that means the partial derivatives of the LOSS function (for this particular training case) with respect to this neuron’s w and b are also going to be 0. So the only way these w and b can get updated is if the remaining training cases influence the overall average (the derivatives of the COST function with respect to these w and b).

With randomly initialized w and b, it can happen that ReLU = 0 for all training examples, and in that case the w and b of that neuron will never get updated.
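Here is a small NumPy sketch of that scenario (my own toy example, the data and numbers are made up): a single ReLU neuron with a squared-error loss whose pre-activation is negative for every training case, so its gradients are exactly zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy single ReLU neuron: a = relu(x @ w + b), loss = mean((a - y)^2).
# If the pre-activation z = x @ w + b is negative for every training case,
# the ReLU derivative is 0 everywhere, so the gradients w.r.t. w and b are
# exactly zero and gradient descent can never revive this neuron.
x = rng.standard_normal((100, 3))   # 100 training cases, 3 inputs
y = rng.standard_normal(100)        # fake targets

w = rng.standard_normal(3)
b = -10.0                           # very negative bias -> z < 0 for all cases

z = x @ w + b
a = np.maximum(0.0, z)              # ReLU activation

relu_deriv = (z > 0).astype(float)  # 0 for every case if the neuron is dead
dL_da = 2.0 * (a - y) / len(y)
dL_dw = x.T @ (dL_da * relu_deriv)
dL_db = np.sum(dL_da * relu_deriv)

print("fraction of cases with z > 0:", np.mean(z > 0))
print("dL/dw:", dL_dw)              # all zeros -> w never gets updated
print("dL/db:", dL_db)              # zero -> b never gets updated
```

Since both gradients are zero for every training case, no amount of gradient descent on this data will move the neuron out of the flat region.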

On the other hand, with the sigmoid, for example, even though the derivatives can be really small, they are still non-zero.

Turns out, this is a known problem of ReLU:

For activations in the region (x<0) of ReLu, gradient will be 0 because of which the weights will not get adjusted during descent. That means, those neurons which go into that state will stop responding to variations in error/ input (simply because gradient is 0, nothing changes). This is called the dying ReLu problem.

– the quote is taken from here:
https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html.

This website lists the pros and cons of the different activation functions, so it may be interesting to others as well. It presents some alternatives to ReLU which, in my opinion, should solve the “dying ReLU problem”: ELU and Leaky ReLU (the latter was also mentioned by Prof. Andrew Ng).

My question:
Is this really not such a significant problem, even if we use ReLU for ALL the hidden neurons? If most people use ReLU for all the hidden neurons, how come this is not an issue for them?

With randomly initialized weights, I would expect this problem to happen very often.
So I’m surprised…

Does anyone have any comments about this?

P.S.
I can’t wait to take the Deep Learning Specialization to understand more :slightly_smiling_face:.

Hello @VladimirFokow, thanks for sharing with us about the dying ReLU problem. Let me quote from this interesting paper by L. Lu, Y. Shin, Y. Su & G. E. Karniadakis that analyzed the problem.

The paper says:

… a 10-layer NN of width 10 has a probability of dying less than 1% whereas a 10-layer NN of width 5 has a probability of dying greater than 10%; for width of three the probability is about 60% … Our results explain why deep and narrow networks have not been popularly employed in practice and reveal a limitation of training extremely deep networks. (page 8)

Their analysis suggests that the born-dead problem is significant in deep and narrow networks whose weights are initialized from a symmetric probability distribution, whereas popular deep networks are wide. Note also their meaning of a “dying” neural network and “born dead”:

In this paper, we focus on the worst case of dying ReLU, where the entire network dies, i.e., the network becomes a constant function. We refer this as the dying ReLU neural network. We then define two phases: (1) a network is dead before training, and (2) a network is dead after training. The phase 1 implies the phase 2, but not vice versa. When the phase 1 happens, we say the network is born dead (BD). (page 4)
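Out of curiosity, here is a rough Monte Carlo sketch (my own, not the paper’s analysis) of how often a randomly initialized deep ReLU network “looks born dead”. The check is only a proxy: it flags an initialization as born dead when some hidden layer outputs all zeros for every probe input, which makes everything downstream a constant function. The exact percentages will not match the paper’s figures (different initialization and an approximate test), but the trend that narrow networks die far more often should be visible:

```python
import numpy as np

rng = np.random.default_rng(0)

def looks_born_dead(depth, width, n_probe=500):
    """Proxy check: does some hidden layer output all zeros for every
    probe input at initialization? If so, the rest of the network is a
    constant function of the input."""
    a = rng.standard_normal((n_probe, width))          # probe inputs
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        a = np.maximum(0.0, a @ W)                     # ReLU layer, zero bias
        if np.all(a == 0.0):
            return True
    return False

trials = 200
for width in [3, 5, 10]:
    dead = sum(looks_born_dead(depth=10, width=width) for _ in range(trials))
    print(f"depth 10, width {width:2d}: ~{dead / trials:.0%} of random inits look born dead")
```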


The paper also discusses 3 common ways of attacking the problem:

  1. replace the activation function - as you also mentioned - e.g. leaky ReLU (see the small sketch after this list).
  2. use batch normalization - a technique to transform the output of a hidden layer to be zero mean and unit variance. (Batch normalization is also covered in the Deep Learning Specialization)
  3. replace the weight initialization method - the authors suggested the “randomized asymmetric initialization” (RAI) to overcome the dying ReLU problem.
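To illustrate remedy 1 (my own minimal sketch, not from the paper): leaky ReLU keeps a small non-zero slope for negative inputs, so a unit whose pre-activation is negative still receives a gradient and can recover:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # For z < 0 the output is alpha * z instead of 0.
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Gradient is alpha (not 0) for z < 0, so the unit can still learn.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-5.0, -1.0, 0.5, 3.0])
print("ReLU:            ", relu(z))
print("Leaky ReLU:      ", leaky_relu(z))
print("Leaky ReLU grad: ", leaky_relu_grad(z))
```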

In summary, the paper discusses how likely ReLU is to produce a born-dead neural network depending on the NN’s depth and width, and since the born-dead problem is closely tied to the weight initialization, the authors suggested their RAI method to overcome it.

Cheers!

2 Likes

Very interesting and informative!
Thank you very much!

You are welcome @VladimirFokow!