The question is “You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer?”

The answer is “sigmoid”

‘tanh’ was also among the choices.

I am wondering why it is sigmoid, not tanh. I have heard that tanh is better than sigmoid in many cases, and I thought I could treat tanh outputs in (-1, 0) as the classified label 0 and outputs in (0, 1) as the classified label 1.

But the point is that the loss function (log loss) is tied to the *sigmoid* activation function in that it requires output values between 0 and 1. In other words, you can’t just arbitrarily change the output activation by itself: you need to adjust the loss function as well. So what loss function would you use if *tanh* is your output activation with a range of (-1,1)?
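To make that coupling concrete, here is a minimal sketch (function names are just illustrative) of why log loss needs the sigmoid's (0, 1) range: the prediction is interpreted as a probability, and `log` of a non-positive number is undefined.

```python
import math

def sigmoid(z):
    # squashes any real logit z into (0, 1), so it reads as P(y=1)
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, y_pred, eps=1e-12):
    # binary cross-entropy; only defined when y_pred is strictly in (0, 1)
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

p = sigmoid(2.0)        # a confident "cucumber" logit
loss = log_loss(1, p)   # small loss, since p is close to 1
```

A tanh output of, say, -0.5 would make `math.log(y_pred)` blow up, which is why swapping in tanh would also force a different loss.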

BTW it turns out you could scale and shift *tanh* so that it is the same as *sigmoid*. They are very closely related mathematically.
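Specifically, sigmoid(x) = (1 + tanh(x/2)) / 2, which you can verify numerically:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# sigmoid is tanh shifted/scaled from (-1, 1) into (0, 1),
# with the input compressed by a factor of 2
for z in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    assert abs(sigmoid(z) - (1 + math.tanh(z / 2)) / 2) < 1e-12
```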


Thank you for your answer.