Dropout is a regularization technique. On each iteration, we randomly shut down some units in each layer and don’t use those units in either forward propagation or back-propagation. Since the units to be dropped on each iteration are chosen at random, the learning algorithm has no idea which units will be shut down on any given iteration.

Take this unit, which I’m circling in purple. It can’t rely on any one feature, because any one of its inputs could go away at random. So it would be reluctant to put too much weight on any one input, because that input could go away.

So this unit will be more motivated to spread out its weights and give a little bit of weight to each of its four inputs.
(force the learning algorithm to spread out the weights and not focus on some specific input units)
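The random masking described above can be sketched in a few lines of NumPy. This is a hedged illustration, not the course's exact implementation: the layer shape, seed, and `keep_prob` value here are arbitrary choices made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations of a hidden layer: 4 units, batch of 5 examples.
A1 = rng.standard_normal((4, 5))
keep_prob = 0.8  # probability that any given unit is kept this iteration

# A fresh random mask is drawn on every iteration, so the network can never
# know in advance which units will be shut down.
D1 = rng.random(A1.shape) < keep_prob   # boolean mask: True = keep
A1_dropped = A1 * D1                    # zero out the dropped units
A1_dropped = A1_dropped / keep_prob     # "inverted" scaling keeps the expected value unchanged
```

Because a new mask `D1` is drawn each iteration, a unit that is zeroed this time will usually be active the next time, which is what prevents any downstream unit from leaning on it.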

Question 1:
For example, during iteration 1, the first and third inputs are eliminated, so the values of the input activations are a[1,1] = 0 and a[3,1] = 0.

During iteration 1, one step of gradient descent is performed, and the weights w[1,1], w[2,1], w[3,1], and w[4,1] are updated.

How does the learning algorithm spread out and “give a little bit” of weight to each of the four inputs to this unit?

Question 2:
If I wanted to do so, how could I put more weight on any one input, given that all the weights are computed by gradient descent?

The key point is the one made in the explanatory paragraph above: the whole point is that we drop different neurons in each layer on each iteration. So in the particular iteration you show, back-propagation will not update the weights for the two “zapped” neurons. But on the next iteration, those weights probably will get updated. So dropout has the general effect of weakening the dependence of the circled purple neuron on each of its possible inputs. That’s what Prof Ng means by “spreading out the weights”.
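You can see the "no update for zapped inputs" effect directly with a toy single-unit example. Everything here is made up for illustration (the weights, the mask, and the quadratic loss L = ½z² are arbitrary); the point is only that the gradient with respect to a weight is proportional to its input activation, so a dropped (zeroed) input leaves its weight untouched for that iteration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A single unit with four inputs (a toy stand-in for the "purple" unit).
w = rng.standard_normal(4)      # w[0]..w[3] play the role of w[1,1]..w[4,1]
a = rng.standard_normal(4)      # activations feeding into this unit

# Suppose the first and third inputs were dropped this iteration.
mask = np.array([0.0, 1.0, 0.0, 1.0])
a_dropped = a * mask

# Forward pass: z = w . a_dropped, with a toy loss L = 0.5 * z**2.
z = w @ a_dropped
dz = z                          # dL/dz for this toy loss
dw = dz * a_dropped             # dL/dw: zero wherever the input was dropped

w_new = w - 0.1 * dw            # one gradient-descent step

# The weights for the two dropped inputs are unchanged this iteration;
# only the kept inputs' weights move.
```

On the next iteration a different mask is drawn, so over many iterations every weight gets updated, just never in a way that lets the unit depend too heavily on one input.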

Not sure I understand the question here, but the point is you don’t do anything to emphasize or de-emphasize any particular weight. Your only role is to choose all the hyperparameters (number of layers, number of neurons in each layer, activation functions and the dropout “keep probability” for each layer on which you apply dropout and so forth). Then you just run the training and see what happens. If you don’t get good enough results, then you need to analyze the nature of the problem and decide which of the various possible hyperparameters to tweak. E.g. if you still have overfitting, but only slightly less than without dropout, then maybe you need a lower “keep probability”. But if instead you get underfitting on the training data (high bias), then that means you “overdid it” with too low a keep probability. And so on …

Sorry, there are no easy “one size fits all” answers here, which is basically the high level theme of Course 2 and Course 3.

BTW there have been lots of threads about dropout over time.

Here’s one that discusses the point that the way we implement it, each sample is handled differently in each minibatch.

Here’s one that explains the point of the “inverted” dropout. You have to read all the way to the end of the thread to see references to the fact that the “inverted” algorithm is actually a more sophisticated one that actually wasn’t in the original dropout paper from Hinton’s group.
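A quick numerical sketch of why the “inverted” version matters (the constant activations and `keep_prob = 0.5` here are arbitrary choices to make the expectation obvious): with plain masking, the average activation shrinks by a factor of `keep_prob`, which would have to be compensated at test time; dividing by `keep_prob` during training removes that mismatch.

```python
import numpy as np

rng = np.random.default_rng(2)
keep_prob = 0.5
a = np.ones((1000, 200))  # constant activations make the expectation easy to see

# Plain dropout (as in the original paper): mask only. Test-time activations
# would then need to be scaled by keep_prob to match training.
mask = rng.random(a.shape) < keep_prob
plain = a * mask

# Inverted dropout: divide by keep_prob during training, so the expected
# activation already matches the no-dropout network and test time needs no change.
inverted = a * mask / keep_prob

print(plain.mean())     # ≈ keep_prob
print(inverted.mean())  # ≈ 1.0
```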

Here’s one that discusses the question “if dropout works, then doesn’t that mean there’s a smaller network I could define that would have the same effect”.