Understanding Weight Propagation in Deep Networks and Its Effect on Gradients

Q: When calculating weight propagation in deep neural networks, I found that weights get squared as they pass through layers. For example, if the weight matrix is 1.5 along the diagonal, then after two layers the activations involve a factor of 1.5^2. Is this correct in the context of vanishing and exploding gradients?
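
Here is a quick numerical check of what I mean, assuming a purely linear layer (no bias, no activation); the sizes are just illustrative:

```python
import numpy as np

# Hypothetical 3-unit layers with W = 1.5 * I, no bias, no activation.
W = np.diag([1.5, 1.5, 1.5])
x = np.array([1.0, 2.0, 3.0])

a1 = W @ x    # after one layer: 1.5 * x
a2 = W @ a1   # after two layers: 1.5^2 * x

print(a2)           # [2.25 4.5  6.75]
print(1.5**2 * x)   # same values, confirming the 1.5^2 factor
```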

Are you talking about the weights or the gradients of the weights? The two cases are different.

Note that weights are real numbers, meaning they can be both positive and negative. So just because their absolute values are > 1 does not mean the values will keep accumulating. At each layer we compute a linear combination followed by a non-linear activation function. The behavior also depends on the choice of activation function, of course: the outputs of tanh or sigmoid have absolute values < 1, but ReLU outputs do not necessarily.
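
To make that concrete, here is a minimal sketch (the network shape and inputs are just made up for illustration) comparing tanh and ReLU when every layer uses W = 1.5 * I with no bias:

```python
import numpy as np

# Hypothetical 4-unit network, 10 layers deep, W = 1.5 * I at every layer.
W = np.diag([1.5] * 4)
x = np.array([0.5, -0.5, 1.0, -1.0])

a_tanh, a_relu = x, x
for _ in range(10):
    a_tanh = np.tanh(W @ a_tanh)          # tanh squashes outputs into (-1, 1)
    a_relu = np.maximum(0.0, W @ a_relu)  # ReLU lets positive values keep growing

print(np.abs(a_tanh).max())   # stays below 1.0
print(np.abs(a_relu).max())   # roughly 1.0 * 1.5**10 ≈ 57.7 for the positive entries
```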

The gradients are whatever they need to be to push the weights in a direction that lowers the cost. But when we compute them, we apply the Chain Rule all the way back from the final cost J at the output layer, so we end up multiplying per-layer gradient factors together to get the gradients for the weight and bias values of the earlier layers in a deep network. That's where the problems with vanishing and exploding gradients can arise: multiplying numbers with absolute value > 1 makes the product grow in absolute value, and multiplying numbers with absolute value < 1 makes it shrink.
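
A toy illustration of that Chain Rule product, where each layer's contribution is reduced to a single scalar factor (a stand-in for that layer's Jacobian) and a 50-layer depth is assumed just for the example:

```python
depth = 50

for factor in (1.5, 0.5):      # per-layer factor with |factor| > 1 vs |factor| < 1
    grad = 1.0                 # gradient of J at the output layer
    for _ in range(depth):
        grad *= factor         # one multiplication per layer during backprop
    print(f"per-layer factor {factor}: gradient at the first layer ≈ {grad:.3e}")

# per-layer factor 1.5: ≈ 6.4e+08  (exploding)
# per-layer factor 0.5: ≈ 8.9e-16  (vanishing)
```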
