I had quite a bit of trouble with the forward_propagation_with_dropout function, until I realized I was providing the wrong dimensions for D1 and D2. My initial assumption was that D1 and D2 would be single-dimensional vectors, e.g. of shape (A2.shape[0], 1), not 2-dimensional arrays.
However, given the supporting materials, I’m confused why this is the expectation in the assignment. At numerous points in the lecture and in the notebook, dropout was described as “removing hidden units”, and all illustrations pointed to dropout networks being dense. But, if D1 and D2 are 2-dimensional arrays matching the dimension of the weight matrices, isn’t Dropout actually removing edges and not nodes from the network?
If my question is unclear, please let me know, and I will try to provide a visual diagram.
It is a good question and an interesting point. I forget exactly what Prof Ng says in the lectures, but it is quite clear in the instructions for the assignment that he expects the mask values to be matrices, not vectors. So the effect is that we treat each sample in a given minibatch differently w.r.t. dropout. We are still removing nodes, but different ones for each sample. So it’s effectively as if we were doing Stochastic Gradient Descent w.r.t. the way dropout is applied. My intuition is that this makes the effect weaker for a given keep_prob value. There have been some interesting past discussions of this point, e.g. this one. Please have a look at that discussion and see if it sheds any further light.
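To make the matrix-versus-vector distinction concrete, here is a minimal NumPy sketch. It is not the assignment's code: the sizes, the keep_prob value, the random generator, and the names A1_dropped, d1 and A1_same_units are all made up for illustration. The only thing assumed about the assignment is what is said above, namely that the mask has the same shape as the activations.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: any seeded generator works for the illustration
keep_prob = 0.8                 # hypothetical value

# Hypothetical activations: 4 hidden units (rows) for 5 samples (columns).
# The +0.1 keeps every value strictly positive, so zeros can only come from a mask.
A1 = rng.random((4, 5)) + 0.1

# Matrix mask: an independent keep/drop decision per unit AND per sample.
D1 = rng.random(A1.shape) < keep_prob
A1_dropped = A1 * D1 / keep_prob  # rescale the kept values by 1 / keep_prob

# Vector mask, for comparison: one decision per unit, broadcast across all samples.
d1 = rng.random((A1.shape[0], 1)) < keep_prob
A1_same_units = A1 * d1 / keep_prob
```

With the matrix mask, each column (sample) loses its own set of units. With the vector mask, a dropped unit is zeroed out for every sample in the batch at once.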
From reading that topic, it seems to me that the poster is confused about what the dimensions of the weight matrices correspond to. He’s associating the columns of the weight matrix with training examples, when in actuality the columns are the weights applied to the node outputs of the previous layer. I think he’s mixing together the notions of batches and matrix multiplication. My impression is that his intuition concurs with mine, but he’s mis-attributing the problem that he sees.
Allow me to amend my earlier comment. I might be confused about something.
I thought that columns in the Weight matrix correspond to the node output values from the previous layer, not to different samples.
Yes, I think what you are confused about is that this has nothing to do with the weight matrices: the masks are applied to the output of the layer, after the activation function has been applied. It is A1 and A2 that are multiplied elementwise by (in effect ANDed with) D1 and D2, right? So you really are “zapping” (different) individual neuron outputs in individual samples. The columns of the activation matrices do represent the output neuron values for individual input samples, right?
The columns of a weight matrix are essentially meaningless. It is the rows of the weight matrices that represent the coefficients w.r.t. the inputs of that layer that give one particular neuron output value to the next layer.
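A quick shape check may help here. This is only a sketch with made-up layer sizes (n_x, n_h, m), a ReLU activation and a keep probability of 0.8, none of which come from the assignment. It shows that row i of W1 produces neuron i's value for every sample, and that the mask D1 takes the shape of A1 (units by samples), not the shape of W1.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 5  # hypothetical sizes: inputs, hidden units, samples

X = rng.standard_normal((n_x, m))     # one column per sample
W1 = rng.standard_normal((n_h, n_x))  # one row per hidden unit
b1 = np.zeros((n_h, 1))

Z1 = W1 @ X + b1        # shape (n_h, m)
A1 = np.maximum(0, Z1)  # ReLU activations, same shape as Z1

# Row 2 of W1 holds the coefficients that give neuron 2's value for every sample.
row_check = np.allclose(Z1[2], W1[2] @ X + b1[2])

# The dropout mask matches A1 (units x samples); W1 is never masked.
D1 = rng.random(A1.shape) < 0.8
A1_dropped = A1 * D1 / 0.8
```

Here D1 has shape (4, 5) while W1 has shape (4, 3), so the mask cannot be removing individual weights (edges). Each zero in D1 removes one neuron's output for one sample.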
Thank you for helping me find my way through this. It totally makes sense now.