Confusion about dimension of Dropout matrix

It is a good question and an interesting point. I forget exactly what Prof Ng says in the lectures, but the instructions for the assignment make it quite clear that he expects the mask values to be matrices, not vectors. The effect is that we treat each sample in a given minibatch differently with respect to dropout: we are still removing nodes, but different ones for each sample. So, as far as the way dropout is applied is concerned, it is effectively as if we were doing Stochastic Gradient Descent. My intuition is that this makes the regularization effect weaker for a given keep_prob value. There have been some interesting past discussions of this point, e.g. this one. Please have a look at that discussion and see if it sheds any further light.
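To make the shape point concrete, here is a minimal sketch of inverted dropout in NumPy (the shapes and keep_prob value are just illustrative, not taken from the assignment). Because the mask D has the same shape as the activation matrix A, each column (i.e. each sample in the minibatch) gets its own pattern of dropped units:

```python
import numpy as np

np.random.seed(1)

keep_prob = 0.8
A = np.random.randn(4, 5)  # activations: 4 hidden units, 5 samples in the minibatch

# The mask D is a matrix with the same shape as A, not a (4, 1) vector,
# so each column (sample) has its own independently chosen dropped units.
D = (np.random.rand(*A.shape) < keep_prob).astype(float)

# Inverted dropout: zero out units, then rescale so the expected
# activation value is unchanged at test time.
A_dropped = (A * D) / keep_prob

print(D)  # different columns generally zero out different units
```

If the mask were instead a (4, 1) vector broadcast across the columns, every sample in the minibatch would have the same units removed on that iteration, which is the alternative interpretation the question is asking about.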