So the downside of the sigmoid (and tanh) activation function is that the derivative gets close to zero out in the tails, and the reason that’s a downside is your step size is proportional to your derivative (BECAUSE you’re using steepest descent optimization) where the proportionality constant is the learning rate, so a small derivative means small step size so it takes a long time to get to the solution.
But if you were to use Gauss Newton iteration instead of steepest descent it chooses the pointwise best stepsize/learning rate for each step. Gauss Newton is a variation on Newton’s method applicable only when your minimizing the sum of squared errors across a vector of right hand sides/equations. The wonderful about GN is it only requires a Jacobian (matrix whose rows are the gradients of the equations, one row/gradient per equation). in the Coursera deep learning course we were already computing the the Jacobian but doing an (1/m)*np.sum(…) across it to flatten it into a single gradient, so computing the Jacobian is easy as pie to do. however you have to compute the PSEUDO inverse of the Jacobian, which if you tried to do as a single step would be huge and just kill your update, but…
I think you could do back propagation with Jacobians one layer at a time (you use the jacobian to compute a matrix of updates that you flatten into a vector update, rather than flattening the jacobian into a gradient that you use to compute a vector update) and because of the linear algebra step , followed by applying the activation function, at each layer you would need to do a single pinv (of the previous layer’s activations augmented with a row of ones, you merge w and b) and that scaled by constant (determined by the current activation function) can be used for the whole layer. moreover layer 1 pinv can be cached across GN iterations because X is constant.
the benefit of GN over steepest descent is it chooses a pointwise best learning rate (in a better than steepest descent direction) so you take a lot fewer steps. So because GN means you’re step size is no longer a fixed scalar multiple (learning rate) times your gradient, wouldn’t the choice of activation function matter a lot less if you were to use Gauss Newton iteration instead of steepest descent?