I don’t understand why we have to use the sigmoid function for the output layer a^L.
In the lecture, the instructor says the reason is that sigmoid has the output range [0, 1] while tanh has [-1, 1]. However, I don’t think this is the root of the issue, since we could simply map outputs >= 0 to 1 and outputs < 0 to 0.
Hello Phan and welcome to the community! I hope you will find all the answers you need for your progress in the specialization.
First, I invite you to specify which lesson and exercise are giving you trouble, even though this looks like a general question.
As you said, sigmoid outputs a value in [0, 1], which is the format you want for the probabilistic interpretation of the result. You could do it differently: if I understood you correctly, you would apply a tanh function and then linearly rescale its output to [0, 1], and that would probably work.
If you do that (a tanh followed by a linear rescaling), then during training your model will learn different weights in the hidden layers. However, the sigmoid function does in one step what you would otherwise do in two steps with the tanh/rescaling process.
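To illustrate the relationship, here is a minimal NumPy sketch (not from the course materials, just for illustration). It relies on the identity sigmoid(z) = (tanh(z/2) + 1) / 2, so the rescaled tanh and the sigmoid compute the same function up to a factor of 2 on the input:

```python
import numpy as np

def sigmoid(z):
    """Standard logistic sigmoid, output in [0, 1]."""
    return 1 / (1 + np.exp(-z))

def tanh_rescaled(z):
    """tanh followed by a linear rescaling from [-1, 1] to [0, 1]."""
    return (np.tanh(z) + 1) / 2

# sigmoid(z) equals the rescaled tanh evaluated at z/2
z = np.linspace(-5, 5, 101)
print(np.allclose(sigmoid(z), tanh_rescaled(z / 2)))  # True
```

In other words, a network with a rescaled-tanh output could learn to halve its last-layer pre-activation and reproduce the sigmoid output exactly, which is why the two approaches behave so similarly; the sigmoid simply gives you the [0, 1] probability directly.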
I hope that helps, and please tell me if I misunderstood your question.