I have a question about dropout. In the lecture, it is mentioned that dropout can be shown to be similar/equivalent to adaptive L2 regularization. However, with dropout the loss function is no longer an invariant quantity from epoch to epoch. How can we ensure that the error will eventually converge? It’s not clear to me.

There is never any guarantee that gradient descent will converge with any particular choice of hyperparameters, and that includes your choice of regularization method and the values associated with it (λ or the dropout probabilities). So you try. And when/if it fails, you adjust the hyperparameters and try again. The general method is described by Prof Ng in Week 3 of Course 2.

The other point worth making here is something that Prof Ng discusses in the dropout lectures:

The purpose of regularization is to address overfitting problems. That means you have already come up with a set of hyperparameter choices such that training converges to a solution; the problem is just that the solution overfits. So if you already have convergence, adding regularization is probably not going to disturb that aspect too much. But the solution surfaces here are pretty complex, so anything is still possible and you may need to adjust other hyperparameters in the process. No guarantees, but you probably want to start with relatively “mild” dropout (i.e. a keep probability close to 1) and tune from there.
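To make the “keep probability close to 1” point concrete, here is a minimal sketch of the inverted dropout scheme taught in the course, applied to one layer’s activations. The names (`a`, `keep_prob`, `apply_dropout`) are illustrative, not from the course assignments:

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_dropout(a, keep_prob=0.9):
    # Keep each unit with probability keep_prob, zero out the rest.
    mask = rng.random(a.shape) < keep_prob
    a = a * mask
    # Scale up by 1/keep_prob ("inverted" dropout) so the expected
    # value of the activations is unchanged and later layers see
    # the same scale at train and test time.
    return a / keep_prob

a = np.ones((4, 3))          # stand-in for a hidden layer's activations
a_drop = apply_dropout(a, keep_prob=0.9)
```

With `keep_prob = 0.9` only about 10% of units are zeroed on any pass, which is the kind of “mild” starting point suggested above; lowering `keep_prob` strengthens the regularization effect.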