The question is “You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer?”

The answer is “sigmoid”

‘tanh’ was also among the choices.

I am wondering why it is sigmoid, not tanh. I have heard that tanh is better than sigmoid in many cases, and I thought I could treat tanh outputs in (-1, 0) as the classified label 0 and outputs in (0, 1) as the classified label 1.

But the point is that the loss function (log loss) is tied to the *sigmoid* activation function in that it requires output values between 0 and 1. In other words, you can’t just arbitrarily change the output activation by itself: you need to adjust the loss function as well. So what loss function would you use if *tanh* is your output activation with a range of (-1,1)?
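To make that coupling concrete, here is a minimal sketch (function names are just illustrative) of why log loss needs the sigmoid's (0, 1) range: the prediction is interpreted as a probability, and `log` of a non-positive number is undefined.

```python
import math

def sigmoid(z):
    # squashes any real logit z into (0, 1), so it reads as P(y=1)
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, y_pred, eps=1e-12):
    # binary cross-entropy; only defined when y_pred is strictly in (0, 1)
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

p = sigmoid(2.0)        # a confident "cucumber" logit
loss = log_loss(1, p)   # small loss, since p is close to 1
```

A tanh output of, say, -0.5 would make `math.log(y_pred)` blow up, which is why swapping in tanh would also force a different loss.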

BTW it turns out you could scale and shift *tanh* so that it is the same as *sigmoid*. They are very closely related mathematically.
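Specifically, sigmoid(x) = (1 + tanh(x/2)) / 2, which you can verify numerically:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# sigmoid is tanh shifted/scaled from (-1, 1) into (0, 1),
# with the input compressed by a factor of 2
for z in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    assert abs(sigmoid(z) - (1 + math.tanh(z / 2)) / 2) < 1e-12
```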


Thank you for your answer.