The problem of exploding/vanishing gradients

Hi,
Why would I have the problem of exploding/vanishing gradients in a deep network if I use sigmoid or tanh activation functions? No matter how large the input to these functions is, they squash it to a bounded range (0 to 1 for sigmoid, -1 to 1 for tanh), so it doesn't affect the next layer. I think the exploding/vanishing problem would occur with activation functions that don't squash the input to a bounded range, like ReLU. The only effect I can see is that, when we have many neurons per layer, the value of Z will be big.

The “exploding” or “vanishing” behavior is not referring to the actual output values of the activation functions: it is referring to the gradients of the cost with respect to the parameters. Of course (by the Chain Rule) those are the products of lots of gradients, including the derivatives of the activation functions. Notice that the “tails” of the sigmoid and tanh functions flatten out pretty aggressively as |z| increases. That means the derivatives there are close to 0. The product of lots of numbers << 1 gets smaller, right? E.g. 0.1 * 0.1 = 0.01. Whereas the product of lots of numbers > 1 gets larger.
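
To make that concrete, here is a minimal NumPy sketch (my own illustration, not from the course code) showing how small the sigmoid derivative gets in the tails, and how quickly a product of per-layer derivative factors shrinks even in the best case:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative of sigmoid; its maximum is 0.25 at z = 0

# In the "tails" (|z| large) the derivative is close to 0:
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}   sigmoid'(z) = {sigmoid_prime(z):.6f}")

# By the Chain Rule, the gradient reaching an early layer contains one
# activation-derivative factor per layer. Even in the best case of 0.25
# per layer, 20 layers give:
print(0.25 ** 20)  # ~9.1e-13 -> the gradient has effectively vanished
```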

If the gradients are close to 0 (“vanishing”), that means Gradient Descent is “stuck” and can’t learn, or at least can’t learn very fast. If the gradients are really large numbers, then you get divergence or oscillation instead of convergence.
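
Here is a tiny hypothetical sketch of that last point (again my own toy example, not course code): plain gradient descent on f(w) = w², with the gradient artificially scaled to mimic the vanishing and exploding cases.

```python
# Gradient descent on f(w) = w**2 (true gradient is 2w), with the gradient
# artificially scaled to mimic the vanishing / exploding cases.
def run(scale, lr=0.1, steps=5, w=1.0):
    path = [round(w, 4)]
    for _ in range(steps):
        grad = scale * 2 * w
        w = w - lr * grad
        path.append(round(w, 4))
    return path

print(run(scale=1e-6))  # "vanishing": w barely moves, learning is stuck
print(run(scale=1.0))   # healthy: w decays smoothly toward the minimum at 0
print(run(scale=50.0))  # "exploding": w oscillates and blows up (divergence)
```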


That is great. Thank you!