Is the formula for scaled weight initialization the same with dropout?

Hi Everyone,

When learning about scaled weight initialization to prevent vanishing/exploding gradients, we see that each layer's weights are set as W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(1/n[l-1]).
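Concretely, here's how I understand that initialization in numpy (the layer sizes below are just made-up examples):

```python
import numpy as np

# Toy layer sizes, purely for illustration: 4 units in layer l-1, 3 units in layer l
n_prev, n_curr = 4, 3

# Scaled initialization: each weight has variance 1/n_prev
W = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)
b = np.zeros((n_curr, 1))
```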

Does this initialization still work with dropout regularization, seeing as n[l-1], the number of nodes in layer l-1 connected to each neuron in layer l, effectively changes on every iteration when units are dropped?

Hi, @jeffreywang.

Excellent question.

It does seem like the noise introduced by dropout could affect how the activation variance propagates through the network, and there are initialization strategies that take this into account, but I wouldn’t be able to recommend a specific one.
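To make the variance point concrete, here's a small numpy sketch of my own (arbitrary layer sizes and keep_prob = 0.8, not anything from the course) comparing the variance of the pre-activations with and without inverted dropout, using the np.sqrt(1/n[l-1]) scaling from your question:

```python
import numpy as np

np.random.seed(0)
n_prev, n_curr, batch = 512, 512, 1000   # arbitrary sizes for the experiment
keep_prob = 0.8                          # arbitrary dropout keep probability

# Scaled initialization as in the question: Var(W[i, j]) = 1/n_prev
W = np.random.randn(n_curr, n_prev) * np.sqrt(1.0 / n_prev)
A_prev = np.random.randn(n_prev, batch)  # stand-in activations from layer l-1

# Inverted dropout on layer l-1: zero out units, rescale survivors by 1/keep_prob
mask = np.random.rand(*A_prev.shape) < keep_prob
A_dropped = A_prev * mask / keep_prob

Z_plain = W @ A_prev      # pre-activations of layer l without dropout
Z_drop = W @ A_dropped    # pre-activations of layer l with dropout

print("Var(Z) without dropout:", Z_plain.var())  # ~1.0
print("Var(Z) with dropout:   ", Z_drop.var())   # ~1/keep_prob, i.e. noticeably larger
```

The 1/keep_prob rescaling keeps the expected activations the same, but the variance of what reaches layer l still grows, which is the effect I meant by dropout noise interacting with variance propagation.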

Hopefully someone else can shed more light on this :slight_smile: