In this picture, it says that in reality the training error will go up as the depth increases. I wonder when the training process will stop (i.e., does it stop when the parameter updates become small?). And how can we ensure that the result is not a local minimum of the loss function (since a convolution layer does not seem to give a convex function)? Thanks

None of the cost functions for the networks here are convex. You need to use cost curves like the ones Prof Ng is explaining here to judge when you are no longer making progress, or are perhaps diverging rather than converging. With the complex solution surfaces here, there is never any guarantee of smooth monotonic convergence: you may need to adjust hyperparameters like the learning rate or the number of iterations, or even adjust the architecture of your network or apply regularization, in order to get things to work.
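As a rough illustration of what "judging from the cost curve" can look like in code, here is a minimal sketch. It is not something from the lectures: the function name `diagnose_cost_curve` and the `window` and `tol` parameters are hypothetical choices made for this example, and the thresholds would need tuning for any real problem.

```python
def diagnose_cost_curve(costs, window=5, tol=1e-4):
    """Classify recent training progress from a list of recorded costs.

    costs  : cost values recorded during training, oldest first
    window : how many steps back to compare against (hypothetical knob)
    tol    : relative change below which we call it a plateau (hypothetical knob)

    Returns "not_enough_data", "diverging", "plateau", or "improving".
    """
    if len(costs) < window + 1:
        return "not_enough_data"

    old, new = costs[-(window + 1)], costs[-1]
    # Relative decrease over the window; positive means the cost went down.
    rel_decrease = (old - new) / max(abs(old), 1e-12)

    if rel_decrease < -tol:
        return "diverging"   # cost is going up: try a smaller learning rate
    if abs(rel_decrease) <= tol:
        return "plateau"     # no meaningful progress: consider stopping
    return "improving"       # still making progress: keep training


if __name__ == "__main__":
    print(diagnose_cost_curve([1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]))
    print(diagnose_cost_curve([0.3] * 10))
    print(diagnose_cost_curve([0.1, 0.2, 0.4, 0.8, 1.6, 3.2]))
```

A check like this only looks at the two ends of the window, so a noisy curve (for example with minibatches) can fool it. Looking at the plotted curve is still the more reliable test.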

Also note that there is never any guarantee that a given solution is not a local minimum either, but that turns out not to be a big problem in general. Prof Ng makes this comment in a couple of places in the lectures, but doesn’t go into the details. The math is pretty deep here, but here’s a thread that points to a well-known paper from Yann LeCun’s group on this question of whether there are good (achievable) solutions for this kind of optimization problem.

Got it. Thanks very much for your reply

One more question. A neural network (NN) with more layers definitely contains the NN with fewer layers as a special case, so more layers should give us a training loss that is at least as low. For example, if we set the extra layers to be the identity function without bias, the network just becomes the NN with fewer layers. So why does the training loss rise as the number of layers increases? Is it just because the deeper network is harder to train?

I mean, if we train for more steps until the cost function is stable, will more layers still give us a better training loss? Thanks
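The identity-layer argument in the question can be checked numerically. The sketch below is only an illustration under assumptions that are not from the thread: ReLU hidden layers, a linear output layer, and arbitrary layer sizes. With ReLU the trick works because the activations entering the extra layers are already non-negative, so applying ReLU after an identity weight matrix changes nothing.

```python
import numpy as np


def relu(z):
    return np.maximum(0, z)


def forward(x, layers):
    """Forward pass: ReLU on every hidden layer, linear output layer."""
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)
    W, b = layers[-1]
    return W @ a + b


rng = np.random.default_rng(0)
n_x, n_h, n_y, m = 4, 5, 1, 8  # illustrative sizes, chosen arbitrarily

# A shallow network: one hidden layer plus the output layer.
shallow = [
    (rng.standard_normal((n_h, n_x)), rng.standard_normal((n_h, 1))),
    (rng.standard_normal((n_y, n_h)), rng.standard_normal((n_y, 1))),
]

# Extra hidden layers with identity weights and zero bias.
identity_layer = (np.eye(n_h), np.zeros((n_h, 1)))
deep = [shallow[0], identity_layer, identity_layer, shallow[1]]

x = rng.standard_normal((n_x, m))
same = np.allclose(forward(x, shallow), forward(x, deep))
print(same)  # True: the deeper network computes the same function
```

This only shows that such weights exist for the deeper network. It says nothing about whether gradient descent from a random initialization will actually find them, which is the "harder to train" part of the question.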

Yes, a larger network (more layers and/or more neurons per layer) can represent a more complex function, so you would expect to eventually be able to get to a lower error and better accuracy. But it is more expensive to train (requiring more iterations and larger compute costs per iteration), because you have more parameters that need to be learned. The other important point that Prof Ng makes in the lectures is that in addition to the added training cost, a more complex network may also just give you overfitting on the training data. Meaning that there is such a thing as “overkill” and it is a balancing act and there is no “cut and dried” magic answer: you have to try some experimentation to figure out how big a network you need. Here in Course 1 there isn’t time to cover everything, but how to choose the size of your networks and how to tune other hyperparameters will be a major topic of DLS Course 2 and Course 3, so please stay tuned to learn more about all this.