When doing the lab for week 3, I realized that the sigmoid function, 1/(1 + e^-x), is nearly the same as adding one to the hyperbolic tangent function and dividing it all by 2. I put both functions into Desmos and the curves look nearly the same.
tanh() is used in some situations, such as when the output is a real number (instead of a classification).
The gradients of sigmoid() are slightly easier to compute, mathematically.
Oh, okay!
Is that also why there is an e in the denominator of the sigmoid function rather than some other number, because it makes the derivative easier to get?
One characteristic of the sigmoid() function is that its partial derivative is very easy to compute.
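In case it helps to see what "easy to compute" means here, below is a minimal sketch. It assumes the standard identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) and compares it with a finite-difference estimate. The function names are just illustrative and are not taken from the lab.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative via the identity s'(x) = s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x), used only as a check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Print the closed-form derivative next to the numerical estimate.
for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_grad(x), numeric_grad(sigmoid, x))
```

The derivative reuses the value of sigmoid(x) that was already computed, so it costs almost nothing extra. With a base b other than e, the same derivation would pick up an extra factor of ln(b), which is one way to see why e is the convenient choice.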
Also note that tanh and sigmoid are very closely related mathematically. The primary reason to choose one over the other is what you need the range of the function to be: for the output of a binary classifier, you need (0,1), but for a hidden layer in a network, you may find the range (-1,1) gives better convergence. Or not. There is no “one size fits all” solution for hidden layer activations.
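To make "closely related" concrete, here is a rough sketch that checks the relationship numerically. It assumes the standard identity sigmoid(x) = (1 + tanh(x/2)) / 2, and the helper names are made up for illustration. It also shows that the version from the original post, (1 + tanh(x)) / 2, works out to sigmoid(2x): the same S-shape, just steeper.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^-x), with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_via_tanh(x):
    """sigmoid(x) rewritten with tanh: (1 + tanh(x / 2)) / 2."""
    return (1.0 + math.tanh(x / 2.0)) / 2.0

def shifted_tanh(x):
    """(1 + tanh(x)) / 2, the shifted tanh described in the original post."""
    return (1.0 + math.tanh(x)) / 2.0

# Columns: x, sigmoid(x), tanh form of sigmoid(x), shifted tanh, sigmoid(2x).
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, sigmoid(x), sigmoid_via_tanh(x), shifted_tanh(x), sigmoid(2.0 * x))
```

The second and third columns should agree up to floating-point rounding, and so should the last two. The only real differences between the two functions are a rescaling of the input and the output range, which is the point about (0,1) versus (-1,1) above.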
Thanks for the link