Hi, the reason given for using the sigmoid function in the output layer, where the required Yhat is either 0 or 1, was that the sigmoid output lies between 0 and 1, whereas tanh lies between -1 and 1.

What I can’t understand is: In the prediction function of the week 2 assignment, we set the prediction to 1 if the probability was greater than **0.5**, and set the prediction to 0 otherwise.

Then, why not choose tanh for the output layer, and simply set the prediction to 1 if the tanh output is greater than **0**, and set the prediction to 0 otherwise?
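For what it’s worth, the two decision rules really are equivalent at the level of the pre-activation z: sigmoid(z) > 0.5 exactly when z > 0, which is also exactly when tanh(z) > 0. A quick sketch in plain Python (the function name `sigmoid` is just defined here for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# sigmoid(z) > 0.5 iff z > 0, and tanh(z) > 0 iff z > 0,
# so both thresholds yield the same 0/1 prediction for any z.
for z in [-3.0, -0.1, 0.0, 0.1, 3.0]:
    pred_sigmoid = 1 if sigmoid(z) > 0.5 else 0
    pred_tanh = 1 if math.tanh(z) > 0 else 0
    assert pred_sigmoid == pred_tanh
```

So the choice between the two activations isn’t about where the decision boundary sits.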

Hi @Doron_Modan ,

The threshold is important when working on a model that outputs a binary prediction, because we need at least a minimum probability to justify predicting 1; otherwise, the model will make many false predictions.

Yes, I guess you could think about doing it that way, but that’s not the only thing you have to deal with, right? How do you define your loss function in that case? With the sigmoid outputs looking like probabilities, that gives “cross entropy” as the natural loss function.

But another way to ask the question is: why do you think your method would be better? Also note that it turns out that tanh and sigmoid are very closely related.
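On that close relationship: tanh is just a rescaled, shifted sigmoid, via the identity tanh(z) = 2 · sigmoid(2z) − 1. A quick numerical check (again, `sigmoid` is defined here only for the sketch):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# tanh(z) = 2 * sigmoid(2z) - 1, i.e. tanh is sigmoid stretched
# vertically to (-1, 1) and compressed horizontally by a factor of 2.
for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)) < 1e-12
```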

Brilliant explanation in that thread, thank you. Could you please explain the term “cross entropy” in simple words?

I don’t know if there are any “simple words” that will suffice here, but “cross entropy loss” (also sometimes called “log loss”) is a function derived from the concept of “maximum likelihood estimation” in statistics. That idea has been around at least since the days of Leonhard Euler, so it’s not something new that was invented just for machine learning. Prof Ng explains it in the Week 2 lectures, and here’s a thread from Mentor Raymond that gives a really nice explanation. Sorry, but as warned above, neither of those probably qualifies as “simple words”.
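To make the formula concrete, here is the per-example cross entropy loss for a binary label, sketched in plain Python (the function name `cross_entropy` is just for illustration):

```python
import math

# Cross entropy loss for a single example: y is the true label (0 or 1),
# yhat is the predicted probability, strictly between 0 and 1.
def cross_entropy(y, yhat):
    return -(y * math.log(yhat) + (1 - y) * math.log(1 - yhat))

print(cross_entropy(1, 0.9))  # confident and correct: small loss (~0.105)
print(cross_entropy(1, 0.1))  # confident and wrong: large loss (~2.303)
```

The loss grows without bound as yhat approaches the wrong extreme, which is exactly the penalty structure that maximum likelihood estimation gives you.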

Here’s another thread that discusses this and shows some graphs.

Thank you! I am looking forward to reading the two threads. I understand the concept of measuring the average distance between the probabilities and the real values. I am curious, though, about the term *entropy*.

Entropy has a definition in Physics (Thermodynamics), but in information theory they have their own (conceptually related) definition.
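In information theory, the entropy of a discrete distribution is the average “surprise” of its outcomes: H(p) = −Σ pᵢ log₂ pᵢ. A tiny sketch (the function name `entropy` is just for illustration):

```python
import math

# Shannon entropy in bits; terms with pi == 0 contribute nothing
# (by the convention 0 * log 0 = 0).
def entropy(p):
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin is maximally uncertain
print(entropy([1.0]))       # 0.0: a certain outcome carries no information
```

Cross entropy then measures the average surprise when outcomes drawn from the true distribution are scored against the model’s predicted distribution, which is why it is the natural loss for probability outputs.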