Course 2 - ML Specialization - Why do we need Activation Functions?

Hi All,

In Week 2 of Advanced Learning Algorithms, I didn't understand the basis of Mr. Ng’s lecture “Why do we need Activation Functions?”. In particular, I don't understand the math on the first slide, where a = wx + b. Can anyone please explain this to me? Thank you.

1 Like

Hi @shsuratwala,

Here's a simple example. You know, a price discount is a linear transformation, and a 20% off discount on a $100 product is just the transformation 100 * 0.8 = 80.

Now, if I give three discounts on the product - 20% off for a winter sale, an extra 25% off for a limited-time offer, and another 10% off for a membership offer - then the transformation will be 100 * 0.8 * 0.75 * 0.9 = 54.

In the above, there are three linear transformations, but in our hearts we know we can replace them with just one linear transformation - a total discount of 46% off. With that, the next time another customer buys a $270 product, we can do 270 * 0.54 = 145.8 right away.

So the idea here is that a linear transformation of another linear transformation of another linear transformation ( * 0.8 * 0.75 * 0.9 ) can be combined into one linear transformation ( * 0.54 ).
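
In case it is easier to follow in code, here is a tiny Python sketch of the same arithmetic (just a sketch of the example above, nothing from the lecture itself):

```python
# Three discounts (linear transformations) collapse into a single one.
discounts = [0.8, 0.75, 0.9]   # 20% off, 25% off, 10% off

combined = 1.0
for d in discounts:
    combined *= d              # compose the linear transformations

print(round(combined, 10))        # 0.54 -> a single "46% off" discount
print(round(100 * combined, 10))  # 54.0, same as 100 * 0.8 * 0.75 * 0.9
print(round(270 * combined, 10))  # 145.8, reusing the single combined discount
```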

The slide below is just showing the same idea - two linear transformations can be combined into one linear transformation. It will be clearer if you plug some numbers into the w's and the b's, which is easy because they are all just scalars.

The above is just some maths, but the key idea is this: no matter how many layers we build into our neural network, as long as only linear transformations exist in it, it is no smarter than a network of just one layer. In other words, we would be wasting our time training so many layers for the effect of just one layer.
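
You can also watch this collapse happen numerically. Below is a minimal NumPy sketch (mine, not from the course notebooks) of two one-unit linear "layers" being replaced by a single one:

```python
import numpy as np

rng = np.random.default_rng(0)
w1, b1 = rng.normal(), rng.normal()   # layer 1 weight and bias (scalars, as on the slide)
w2, b2 = rng.normal(), rng.normal()   # layer 2 weight and bias

x = np.linspace(-3, 3, 7)             # some example inputs

# Pass x through the two linear "layers", one after the other.
a1 = w1 * x + b1
a2 = w2 * a1 + b2

# The same result from a single linear layer with combined weights.
w = w2 * w1
b = w2 * b1 + b2
a_single = w * x + b

print(np.allclose(a2, a_single))      # True: two linear layers collapse into one
```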

The second slide of the video elaborates on the idea of the first slide in simple maths. Try to fully understand the second slide first, then go back to the first and see what you can and can't make sense of. After that, if you still have questions, try to ask a more specific one, such as: which sentence did you not get?

Raymond

2 Likes

Thank you so much for your explanation, Raymond. I just don't understand one line of the derivation, where Mr. Ng sets w2 * b1 + b2 = b. Can you please explain that? Thank you

1 Like

(w_2 w_1) \, x = w \, x

w_2 b_1 + b_2 = b

The above two equations, respectively, group terms that are associated with x and terms that are not associated with x.

The w's and b's are no different - they are all just trainable weights. The two groupings help us recognize the following form:

\text{some trainable weights} \times x + \text{some trainable weights}

and tuning those two groups of trainable weights is no different from tuning just two trainable weights, w and b, in wx + b.
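
Written out in one line with the same scalar notation (this is just the slide's algebra condensed):

w_2 (w_1 x + b_1) + b_2 \;=\; \underbrace{(w_2 w_1)}_{w} \, x \;+\; \underbrace{(w_2 b_1 + b_2)}_{b} \;=\; w x + b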

Cheers,
Raymond

1 Like

Hi @rmwkwok
Similarly, if we were to use only the sigmoid function on every unit of the neural network, how would the neural network be different from a one-layer logistic regression? Doesn't the composition of sigmoid functions also give a sigmoid function? Could you please clear up my doubt? Thank you.

2 Likes

Hi @VICTORIA_JOSE,

The lecture has provided us with a way to prove it -

You see, after the maths, the outcome is just

  • one linear equation of
  • the input variable x.

I split the sentence and put the two key points as bullet points.

Now you can check for yourself whether, after replacing all the linear activations with sigmoid activations, the outcome is also

  • one sigmoid-activated linear equation of
  • the input variable x.

If you can, then please show me the steps if you would like to continue this discussion with me :wink:

If you can't, then you have answered your own question: two sigmoid-activated layers are not the same as just one layer.
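
While you work through the algebra, here is a quick numerical sanity check (a sketch of mine, not a proof and not from the lecture). If the two-layer output could be written as a single sigmoid(wx + b), then applying the inverse sigmoid (the logit) to the output would have to give a straight line in x - but it does not:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # inverse of the sigmoid
    return np.log(p / (1.0 - p))

# arbitrary weights for two sigmoid-activated "layers" of one unit each
w1, b1 = 2.0, -1.0
w2, b2 = 3.0, 0.5

x = np.linspace(-2.0, 2.0, 5)         # equally spaced inputs
a1 = sigmoid(w1 * x + b1)
a2 = sigmoid(w2 * a1 + b2)            # output of the two-layer network

# If a2 == sigmoid(w * x + b) for some w and b, then logit(a2) would be linear
# in x, so its second differences would all be (close to) zero. They are not.
print(np.diff(logit(a2), n=2))        # clearly nonzero -> not one sigmoid layer
```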

Cheers,
Raymond

1 Like

Thank you so much for the explanation.

1 Like