I wonder if someone could explain why D3 is np.random.rand(a3.shape[0], a3.shape[1]) followed by D3 times A3, rather than D3 = np.random.rand(a3.shape[0], 1) and then D3 times A3, please. In other words, if the first neuron in layer 3 drops out, I assume the first row of the resulting A3 should be all 0s, rather than a mix of 1s and 0s with some probability. Could someone point out where I went wrong, please? Thank you in advance.
Hey @X0450,
First of all, I would like to point out that in the assignment, dropout needs to be added in the first and second layers only. Please review the note in Exercise 3. Now, let's work through your query using the example provided in the notebook itself. I will consider adding dropout to the first layer. Consider the shapes of the variables as follows (a quick NumPy check of these shapes is sketched just after the list):
X → (3, m); where m denotes the batch size
W1 → (2, 3)
b1 → (2, 1)
Z1 → (2, m)
A1 → (2, m)
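For concreteness, here is a minimal NumPy sketch of that forward step (the batch size m = 5 and the random initialization are purely illustrative, not taken from the assignment):

```python
import numpy as np

m = 5                          # illustrative batch size (any m works)
X  = np.random.randn(3, m)     # X  -> (3, m)
W1 = np.random.randn(2, 3)     # W1 -> (2, 3)
b1 = np.random.randn(2, 1)     # b1 -> (2, 1)

Z1 = np.dot(W1, X) + b1        # (2, 3) @ (3, m) + (2, 1) -> (2, m)
A1 = np.maximum(0, Z1)         # ReLU activation, A1 -> (2, m)

print(Z1.shape, A1.shape)      # (2, 5) (2, 5)
```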
Now, the key point to note here is that a single column of A1 corresponds to a single example, and what we want is to apply a unique dropout mask to each example, i.e., even for examples within a single batch, we want the neurons to be turned off differently.
When we take D1 = np.random.rand(A1.shape[0], A1.shape[1]), we get exactly that: different neurons are turned off for different examples, even within a single batch. But if we take D1 = np.random.rand(A1.shape[0], 1), then D1 has shape (2, 1) and is broadcast to (2, m) so that D1 and A1 can be multiplied. In that case, the same neurons are turned off for every example in the batch, which is different from the desired outcome. I hope this helps.
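To make the difference concrete, here is a small sketch (the seed, keep_prob = 0.8, and the stand-in A1 are illustrative only; D1_full corresponds to the assignment's full-shape mask and D1_col to the column mask from the question):

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.8                          # illustrative keep probability
A1 = np.random.randn(2, 5)               # stand-in activations, shape (2, m)

# Full-shape mask: an independent keep/drop decision for every neuron
# in every example, so the zeros fall in different places in each column.
D1_full = (np.random.rand(A1.shape[0], A1.shape[1]) < keep_prob).astype(int)
A1_full = (A1 * D1_full) / keep_prob     # inverted-dropout scaling

# Column mask: one decision per neuron, broadcast across the batch,
# so the same neurons are zeroed for every example in the minibatch.
D1_col = (np.random.rand(A1.shape[0], 1) < keep_prob).astype(int)
A1_col = (A1 * D1_col) / keep_prob

print(D1_full)   # zeros scattered differently from column to column
print(D1_col)    # shape (2, 1); a 0 here zeroes an entire row of A1_col
```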
This is an interesting question. Elemento has given a great answer, but there are some earlier threads that also discuss this point. The idea is that either solution is actually reasonable: using the same mask for every sample in the minibatch, or using a unique pattern for each sample. Here's a thread where a fellow student runs some experiments comparing the two methods and shows the results.