When I set the Learning Rate to 3.0, the network behaves chaotically with the Tanh activation function (loss explodes, connections oscillate wildly). However, surprisingly, with the Sigmoid activation function and the same high learning rate (3.0), the network manages to converge and achieve perfect classification.
Why would Sigmoid be more stable or ‘forgiving’ than Tanh in a scenario with an extremely high learning rate?
The derivatives of tanh() and sigmoid() are substantially different: sigmoid's derivative, σ(x)(1−σ(x)), never exceeds 0.25, while tanh's derivative, 1−tanh²(x), reaches 1.0. So for the same weights, a tanh network takes gradient steps up to roughly four times larger, which means the same learning rate of 3.0 is effectively a much smaller step with sigmoid. That, together with the magnitude of the feature values, determines whether the cost function converges for a given fixed learning rate.
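To make the difference concrete, here is a minimal NumPy sketch (the function names are mine, not from any particular framework) comparing the maximum derivative magnitudes of the two activations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # peaks at 0.25 when x = 0

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2   # peaks at 1.0 when x = 0

x = np.linspace(-4.0, 4.0, 1001)
print("max |sigmoid'(x)| =", d_sigmoid(x).max())  # ~0.25
print("max |tanh'(x)|    =", d_tanh(x).max())     # ~1.0

# A backprop weight update is roughly eta * derivative * upstream error,
# so with eta = 3.0 the tanh network can take steps up to ~4x larger
# than the sigmoid network for the same inputs and weights.
```

In other words, sigmoid's smaller derivative (and its smaller output range, (0, 1) versus (−1, 1)) acts as a built-in damping factor, which is why it can still converge at a learning rate that makes tanh overshoot and oscillate.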