Using tanh vs. sigmoid for output layer

Doron_Modan · October 19, 2022, 2:19pm

Hi, the reason given for using the sigmoid function for the output layer, where the required Yhat is either 0 or 1, was that the sigmoid output is between 0 and 1, whereas tanh is between -1 and 1.
What I can’t understand is: In the prediction function of the week 2 assignment, we set the prediction to 1 if the probability was greater than 0.5, and set the prediction to 0 otherwise.
Then why not choose tanh for the output layer, and simply set the prediction to 1 if the output is greater than 0, and to 0 otherwise?

Kic · October 19, 2022, 2:52pm

Hi @Doron_Modan ,

The threshold is important when working on a model that outputs a binary prediction: the predicted probability has to reach some minimum value to justify labeling an example as 1. Otherwise, the model will make many false predictions.

paulinpaloalto · October 19, 2022, 3:28pm

Yes, I guess you could think about doing it that way, but that’s not the only thing you have to deal with, right? How do you define your loss function in that case? With the sigmoid outputs looking like probabilities, that gives “cross entropy” as the natural loss function.

But another way to ask the question is: why do you think your method would be better? Also note that tanh and sigmoid turn out to be very closely related.
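To make the "closely related" point concrete, here is a minimal sketch (not from the course materials; the helper `sigmoid` and the sample values in `zs` are just assumptions for illustration). It checks the identity tanh(z) = 2·sigmoid(2z) − 1 numerically, and shows that thresholding tanh at 0 picks out the same examples as thresholding sigmoid at 0.5, since both rules reduce to "predict 1 when z > 0".

```python
import math

def sigmoid(z):
    """Logistic sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activation values z for a few examples.
zs = [-4.0, -1.0, -0.1, 0.1, 1.0, 4.0]

# tanh is a shifted and rescaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1.
identity_holds = all(
    math.isclose(math.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0, abs_tol=1e-12)
    for z in zs
)

# Both decision rules fire exactly when z > 0.
pred_sigmoid = [1 if sigmoid(z) > 0.5 else 0 for z in zs]
pred_tanh = [1 if math.tanh(z) > 0.0 else 0 for z in zs]

print(identity_holds)
print(pred_sigmoid == pred_tanh)
```

So the two thresholding rules give the same predictions for the same z; the difference shows up in how the loss is defined, not in the decision rule.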

Doron_Modan · October 19, 2022, 6:22pm

Brilliant explanation in that thread, thank you. Could you please explain the term “cross entropy” in simple words?

paulinpaloalto · October 19, 2022, 10:25pm

I don’t know if there are any “simple words” that will suffice here, but “cross entropy loss” (also sometimes called “log loss”) is a function that is derived from the concept of “maximum likelihood estimation” in statistics. This has been around at least since the days of Leonhard Euler, so it’s not something new that somebody invented just for machine learning. Prof Ng explains it in the Week 2 lectures and here’s a thread from Mentor Raymond that gives a really nice explanation. Sorry, but as warned above, neither of those probably qualifies as “simple words”.

Here’s another thread that discusses this and shows some graphs.
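If code reads more easily than words, here is a rough sketch of binary cross entropy (log loss) in plain Python. The function name `binary_cross_entropy`, the `eps` clipping value, and the example labels and probabilities are all made up for illustration; this is not the course's implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip p away from 0 and 1 so that log() never sees 0.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]  # high probability on the true label
confident_wrong = [0.1, 0.9, 0.2, 0.8]  # high probability on the wrong label

loss_good = binary_cross_entropy(labels, confident_right)
loss_bad = binary_cross_entropy(labels, confident_wrong)
print(loss_good, loss_bad)
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong. Note that the formula takes the log of p and of 1 - p, so it needs p in (0, 1): a raw tanh output can be negative and would have to be rescaled before it could be used this way.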

Doron_Modan · October 20, 2022, 10:00am

Thank you! I am looking forward to reading the two threads. I understand the concept of measuring the average distance between the probabilities and the real values. I am curious, though, about the term entropy.

paulinpaloalto · October 20, 2022, 3:37pm

Entropy has a definition in Physics (Thermodynamics), but in information theory they have their own (conceptually related) definition.
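For reference, here are the standard information-theory definitions, written for discrete distributions p (the true distribution) and q (the model's predicted distribution). The symbols y and ŷ in the last line are my notation for a 0/1 label and a predicted probability, chosen to match the Yhat used earlier in the thread.

```latex
% Shannon entropy of a distribution p
H(p) = -\sum_{x} p(x) \log p(x)

% Cross entropy of q relative to p
H(p, q) = -\sum_{x} p(x) \log q(x)

% Binary case: p puts all its mass on the label y \in \{0, 1\},
% and q assigns probability \hat{y} to class 1
H(p, q) = -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr]
```

The last line is the per-example "log loss"; averaging it over the training set gives the cross entropy cost.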
