Explaining how gradients are propagated through deep networks

As I went through this course and experimented with training different networks, running into problems here and there, I started to wonder more about what factors go into the gradients of the weights at any given layer, and how those factors can help explain the problems. In particular, I wanted to understand this in the context of deep neural networks.

After a lot of work, and many failed attempts with the math, I’ve finally been able to write that up.

I wanted to share that here in case it is useful to anyone else:

To summarise, I found that the gradients of the weights at any layer are influenced by the following (the equations after this list show where each factor comes from):

  • the input data, X

  • the mean prediction error, (Ŷ − Y)/n

  • the weights of all layers except the target layer (the target layer's own weights do have some effect, but only an indirect one)

  • the pattern of unit activations at every layer including the target layer

  • the biases of all earlier layers, but not of the target layer or later layers
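For reference, here is where those come from. In my notation (matrices with one example per row, which may differ from the course's), with pre-activations Z_ℓ = A_{ℓ−1} W_ℓ + b_ℓ, activations A_ℓ = σ(Z_ℓ), prediction Ŷ = A_L, and loss L = ‖Ŷ − Y‖² / (2n), backprop gives:

$$
\frac{\partial L}{\partial W_\ell} = A_{\ell-1}^{\top} \Delta_\ell,
\qquad
\Delta_L = \sigma'(Z_L) \odot \frac{\hat{Y} - Y}{n},
\qquad
\Delta_\ell = \left( \Delta_{\ell+1} W_{\ell+1}^{\top} \right) \odot \sigma'(Z_\ell)
$$

Reading the factors off: A_{ℓ−1} carries X plus the earlier layers' weights and biases, the Δ recursion carries the mean error and the later layers' weights, and the σ′(Z) terms carry the activation pattern at every layer from ℓ onwards. W_ℓ itself never appears as a direct factor; it only enters through the activation patterns, which is the indirect effect mentioned above.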

Additionally, of those influences:

  • they each have (the potential for) equal effect relative to the others, though layer-to-layer differences in the various attenuation/vanishing/explosion effects can shift this

  • the weights contribute a linear component plus a non-linear component that attenuates the gradients (never amplifies them) in proportion to the percentage of inactive units across the network

  • the mean magnitude of the weights can have a strong vanishing or exploding effect on the gradients if it is far from 1.0 or if there are many layers, since the effect compounds per layer (see the sketch after this list)
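To make the last two points concrete, here is a minimal NumPy sketch (my own toy example, not code from the blog post; the layer sizes and the `scale` knob are made up). It runs manual backprop through a deep ReLU net whose He-initialised weights are multiplied by `scale`, so scale = 1.0 is the roughly stable baseline and moving away from it mimics a typical weight magnitude far from the stable point:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_grad_norms(scale, n_layers=10, width=64, n=32):
    """Per-layer weight-gradient norms for a toy deep ReLU net.

    He-initialised weights times `scale`: scale = 1.0 keeps both the
    forward activations and the backward signal roughly stable, so
    `scale` stands in for the mean weight magnitude being below or
    above the stable point.
    """
    Ws = [scale * rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
          for _ in range(n_layers)]
    X = rng.standard_normal((n, width))

    # Forward pass, caching activations and pre-activations.
    acts, pre = [X], []
    a = X
    for W in Ws:
        z = a @ W                     # no biases, to keep the sketch small
        pre.append(z)
        a = np.maximum(z, 0.0)        # ReLU
        acts.append(a)

    # Backward pass. With L = ||Ŷ - Y||² / (2n) and dummy targets Y = 0,
    # the output gradient is the mean prediction error (Ŷ - Y)/n.
    delta = (a - 0.0) / n
    norms = [0.0] * n_layers
    for l in reversed(range(n_layers)):
        delta = delta * (pre[l] > 0)                  # inactive units zero out the signal
        norms[l] = np.linalg.norm(acts[l].T @ delta)  # ||dL/dW_l||
        delta = delta @ Ws[l].T                       # propagate to the previous layer
    return norms

for scale in (0.7, 1.0, 1.3):
    norms = layer_grad_norms(scale)
    print(f"scale={scale}: grad norm at layer 0 = {norms[0]:.2e}, "
          f"at layer {len(norms) - 1} = {norms[-1]:.2e}")
```

Running it, you should see the first layer's gradient norm collapse for scale < 1.0 and blow up for scale > 1.0 while the last layer's barely moves, and the `(pre[l] > 0)` mask is exactly the attenuation-by-inactive-units effect: it zeroes part of the backward signal and never amplifies it.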

Lots more in the blog post. Let me know if I’ve messed anything up.

Thanks for your work on this.

Thanks for sharing!