"Training a deeper network (for example, adding additional layers to the network) allows the network to fit more complex functions and thus almost always results in lower training error. For this question, assume we’re referring to “plain” networks. "
The question expects the answer discussed in the video: False, since very deep plain networks are harder to optimize and can end up with higher training error than shallower ones. I would, however, like to believe that we simply lack a sufficient level of advancement in optimization theory, and that given a better optimization method the answer to this question should be TRUE.
The math behind backpropagation is well understood (the chain rule), and optimization methods are also well studied. Both in theory and in practice, we know that the vanishing gradient problem plagues deep plain networks. Rather than waiting for better optimizers, given the math we already have we can change how our models map inputs to outputs. Residual connections are one way to improve how gradients flow (and thus how the model learns), and attention plays a similar architectural role for sequence-to-sequence models. Prof. Andrew Ng's recent focus also hints that data-centric AI is important for improving our models.
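To make the vanishing-gradient point concrete, here is a minimal NumPy sketch (the layer width, weight scale, and depth are illustrative assumptions, not taken from the course). It backpropagates a gradient through 50 sigmoid layers, once through the plain Jacobians and once with an identity skip connection added to each layer's Jacobian, and compares the surviving gradient norms:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_norm_through_depth(depth, residual, dim=8):
    # Backpropagate a unit gradient through `depth` sigmoid layers and
    # return its norm. Each layer computes sigmoid(W @ h), whose Jacobian
    # is diag(s * (1 - s)) @ W; since s * (1 - s) <= 0.25, gradients shrink
    # multiplicatively. With a residual (skip) connection, the layer is
    # h + sigmoid(W @ h), so the Jacobian gains an identity term and the
    # gradient can no longer vanish.
    grad = np.ones(dim)
    for _ in range(depth):
        W = rng.normal(scale=0.5, size=(dim, dim))
        h = rng.normal(size=dim)
        s = 1.0 / (1.0 + np.exp(-W @ h))        # layer activations
        J = (s * (1.0 - s))[:, None] * W        # Jacobian of sigmoid(W @ h)
        if residual:
            J = np.eye(dim) + J                 # skip connection adds identity
        grad = J.T @ grad                       # one backprop step
    return np.linalg.norm(grad)

plain = grad_norm_through_depth(50, residual=False)
res = grad_norm_through_depth(50, residual=True)
print(f"plain: {plain:.2e}, residual: {res:.2e}")
```

Running this, the plain network's gradient norm collapses toward zero after 50 layers, while the residual version stays at a usable magnitude, which is exactly why changing the input-to-output mapping helps where better optimizers alone have not.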