Week 1: Weight Initialization - Effect on Activations vs Gradients

I have understood the catastrophic effect of poor weight initialization on the activation outputs of deep neural network layers: how weight values slightly above (or below) the right scale can cause the activations to explode (or vanish) across the deep layers in the forward pass.

But what I’m not able to conceptualize is how the activation outputs of the forward pass cause the backprop gradients to explode or vanish. I know that the A[l] matrices are used in computing these gradients, but I’m still not confident about it. An explanation would be really helpful.

Also, while explaining the effect of weight initialization on the activation outputs, we only considered a single (the first) forward pass. Isn’t it possible that after a few backpropagation weight updates, this problem solves itself as the network learns the distribution of the training data?

Thank you very much for your time.

Aman Kumar

Hi, @aman_kumar.

When computing \( dW^{[l]} = \frac{1}{m}\,\delta^{[l]} A^{[l-1]T} \), if you unroll \( \delta^{[l]} \) (dZ) you get:

\[ \delta^{[l]} = \left(W^{[l+1]}\right)^T \delta^{[l+1]} \circ g'^{[l]}\!\left(Z^{[l]}\right) = \left(W^{[l+1]}\right)^T \left[\left(W^{[l+2]}\right)^T \delta^{[l+2]} \circ g'^{[l+1]}\!\left(Z^{[l+1]}\right)\right] \circ g'^{[l]}\!\left(Z^{[l]}\right) = \cdots \]

i.e. a repeated product of the transposed weight matrices \( W^{[l+1]}, \dots, W^{[L]} \), interleaved with the element-wise activation derivatives. (source)

Now the reasoning is analogous to the one shown in the lecture for the activations.
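To make this concrete, here is a minimal NumPy sketch (my own toy example, not the course code): a 30-layer ReLU network whose weights are drawn with standard deviation scale/sqrt(n). Because the backward pass repeatedly multiplies by the transposed weight matrices, the first-layer gradient vanishes when the scale is a bit too small and explodes when it is a bit too large, while He initialization (scale = sqrt(2)) keeps it roughly stable.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, n, m = 30, 100, 64  # layers, units per layer, batch size (illustrative choices)

def first_layer_grad_norm(scale):
    """Deep ReLU net with weights ~ N(0, (scale/sqrt(n))^2); returns ||dW^[1]||."""
    Ws = [scale / np.sqrt(n) * rng.standard_normal((n, n)) for _ in range(depth)]
    A = rng.standard_normal((n, m))
    As = [A]
    for W in Ws:                       # forward pass
        A = np.maximum(0.0, W @ A)
        As.append(A)
    dA = np.ones((n, m))               # stand-in for dJ/dA at the output
    for l in reversed(range(depth)):   # backward pass
        dZ = dA * (As[l + 1] > 0)      # ReLU'(Z) is 1 where the activation is positive
        dW = dZ @ As[l].T / m          # dW^[l] = (1/m) dZ^[l] A^[l-1]T
        dA = Ws[l].T @ dZ              # the repeated W^T product drives explode/vanish
    return np.linalg.norm(dW)

small, he, big = (first_layer_grad_norm(s) for s in (0.5, np.sqrt(2.0), 3.0))
print(f"scale 0.5      : {small:.3e}")  # gradients vanish
print(f"scale sqrt(2)  : {he:.3e}")     # He init: roughly stable
print(f"scale 3.0      : {big:.3e}")    # gradients explode
```

The exact numbers depend on the seed and dimensions, but the orders of magnitude between the three scales differ dramatically.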

If the gradients explode and you get NaN or inf values, you may have to restart training. If they vanish, the parameter updates will be very small and learning will stall. Generally speaking, I don’t think the problem tends to solve itself: the updates that would correct the weights are themselves computed from the exploding or vanishing gradients.
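For example, a small guard like this (a generic pattern, not something from the course) makes both failure modes visible: a non-finite gradient forces a restart, while a vanishing gradient leaves the parameters essentially unchanged.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """One gradient-descent update, guarding against exploded gradients."""
    if not np.all(np.isfinite(grad)):  # NaN/inf means training has diverged
        raise FloatingPointError("non-finite gradient: re-initialize and restart")
    return w - lr * grad

w = np.ones(3)
w = sgd_step(w, np.full(3, 1e-12))   # vanishing gradient: w barely moves
print(w)                             # still essentially [1. 1. 1.]

caught = None
try:
    sgd_step(w, np.array([np.nan, 1.0, 1.0]))  # exploded gradient
except FloatingPointError as exc:
    caught = exc
print("caught:", caught)
```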

This was implemented in course 1, in case you want to run your own experiments :slight_smile:

You may also find this interesting.