High variance

I understand that a deep neural network overfits the data, that is, it works perfectly on the training set.
Can you explain why that becomes a problem for the dev set?
Basically, I think I am asking why high variance exists.

You only know that you have an overfitting problem by comparing the prediction accuracy on the training set to the prediction accuracy on the dev set. Overfitting means that the accuracy on the training set is higher than on the dev set, and it can happen to varying degrees: maybe you have only a mild overfitting problem, or maybe a really severe one.

Overfitting is a problem because what you really care about is the performance of your model on general input data, meaning data that it was not trained on. Or to put it in slightly different words: excellent performance on the training set isn’t really any use to you when you apply the model to “real” input data. It’s a necessary, but not sufficient condition that the model perform well on the training data. It is performance on the “test” data that is the better metric for success.
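To make that train/dev comparison concrete, here is a minimal sketch (all the labels and predictions below are made up purely for illustration):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(y_true == y_pred))

# Hypothetical predictions from an overfit binary classifier
train_labels = np.array([1, 0, 1, 1, 0, 1, 0, 0])
train_preds  = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # perfect on training data
dev_labels   = np.array([1, 0, 1, 1, 0, 1, 0, 0])
dev_preds    = np.array([1, 1, 0, 1, 0, 0, 0, 1])   # many mistakes on unseen data

train_acc = accuracy(train_labels, train_preds)   # 1.0
dev_acc   = accuracy(dev_labels, dev_preds)       # 0.5
gap = train_acc - dev_acc
# A large train/dev gap like this is the signature of high variance
```

The absolute dev accuracy tells you how useful the model is; the gap between train and dev accuracy tells you whether variance is the problem.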

Why overfitting happens and what to do when it does are more complicated questions and are basically the subject of Week 1 of DLS Course 2. The high level issue is typically hyperparameter choices: e.g. you’ve used a network that is too complex, but it can also be that you don’t have a big enough training set to express the full generality of the problem you’re trying to solve. Prof Ng discusses all this in a lot of detail in the course materials for DLS C2 Week 1 and Week 2 and then again in DLS C3.

My suggestion is to just listen to all that Prof Ng says with what I said above in mind.


I have got some understanding, but I am not completely clear on this concept.
Can you explain this in terms of the cost and loss functions?
The number of layers and the number of hidden units increases or decreases the number of parameters the model has to learn, and those parameters affect the cost function.
I get that gradient descent minimises the value of the cost function. Does that mean a deep neural network gets the cost function closer to 0 than a shallow neural network does?

The purpose of the cost function is just to give you a precise measure of how right or how wrong a given prediction of the network actually is. Then we use that to generate the gradients that will push the parameter values in the direction of better quality answers. But the actual L or J values by themselves aren’t useful for judging overfitting or much of anything else really. They are only useful as a proxy for whether your convergence is working or not. All you can really say is that “lower is better”, but note that because accuracy is quantized a lower loss value may not correspond to better accuracy.

The metric you really care about is prediction accuracy. That’s the actual end goal, right? The cost is just a means to get to better accuracy.
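The point that a lower loss need not mean better accuracy can be seen in a tiny sketch (the probabilities below are invented for illustration): two models that classify every example correctly can still have very different cross-entropy losses, because loss also rewards confidence while accuracy only checks which side of the threshold you land on.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Mean cross-entropy loss for binary labels y and predicted probabilities p."""
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def accuracy(y, p):
    """Accuracy after thresholding the probabilities at 0.5."""
    return float(np.mean((p > 0.5).astype(int) == y))

y = np.array([1, 1, 0, 0])

# Model B makes the same calls as model A, but with more confidence
p_a = np.array([0.6, 0.6, 0.4, 0.4])
p_b = np.array([0.9, 0.9, 0.1, 0.1])

loss_a = binary_cross_entropy(y, p_a)   # ~0.51
loss_b = binary_cross_entropy(y, p_b)   # ~0.11, much lower loss
acc_a = accuracy(y, p_a)                # 1.0
acc_b = accuracy(y, p_b)                # 1.0, identical accuracy
```

So the loss moved a lot while the accuracy didn’t move at all, which is exactly why the raw J value is only a proxy for convergence rather than a success metric.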


Hi @Kamal_Nayan, thanks for the question.

Well, the concept of high variance arises because of the variety in your data. Let’s say you have divided your dataset into two sets: train and test. You now train the NN on the train data. DNNs are quite good at learning complex patterns, so the model fits your training data very well. Your test data contains points that are completely new to the model (it hasn’t seen any such examples before). If your data has a lot of variety, the model will treat these new examples the same way it learned to treat the training data, but there is a high possibility that the test data does not completely follow the patterns learned from training. Since your model has overfit (it holds strong beliefs about the patterns in the training data), it treats the test data the same way and produces incorrect results, thus decreasing the accuracy of the model.
You can also understand this better by analysing learning curves for overfitting, where the training error is extremely small while the test error is quite high.
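One way to see that gap concretely is with a model that deliberately memorizes its training data. The sketch below uses 1-nearest-neighbour on a made-up noisy 1-D problem (the data, noise level, and true rule are all invented for illustration): training accuracy is perfect because every training point is its own nearest neighbour, but the memorized label noise does not generalize.

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest_neighbor_predict(x_train, y_train, x_query):
    """1-nearest-neighbour: copy the label of the closest training point."""
    idx = np.argmin(np.abs(x_train[:, None] - x_query[None, :]), axis=0)
    return y_train[idx]

# Noisy 1-D binary problem: the true rule is "label = 1 iff x > 0"
x_train = rng.standard_normal(200)
y_train = (x_train > 0).astype(int)
flip = rng.random(200) < 0.2           # 20% of training labels are flipped (noise)
y_train[flip] = 1 - y_train[flip]

x_test = rng.standard_normal(1000)
y_test = (x_test > 0).astype(int)      # clean test labels

train_acc = float(np.mean(nearest_neighbor_predict(x_train, y_train, x_train) == y_train))
test_acc  = float(np.mean(nearest_neighbor_predict(x_train, y_train, x_test) == y_test))
# train_acc is exactly 1.0 (the model memorizes its own points, noise included);
# test_acc is noticeably lower because the memorized noise doesn't generalize
```

The same qualitative picture (near-zero training error, higher test error) is what the learning curves of an overfit DNN show.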
I hope I have answered your query. Keep up your learning spirit!
Best Wishes,

Thanks for your explanation, this really helped.
OK, a last question: basically, if we have an exceptional training example (I think that’s what you mean by variety of data, which contains exceptions as well), the DNN also learns that exception, and that causes overfitting and makes the model less general?

@Kamal_Nayan,
Generalisation means that your model has seen a lot of variety in the data, so by the end of training its parameters do not stick to one particular pattern. Training on variety implies better generalisation. This is also affected by the number of data points in your training set: if you do not have enough data points, then even with plenty of variety your model will fail to generalise, because it will directly fit those specific points instead of deriving a general pattern.
A better analogy: let’s say you have been given some items to memorize. If there are only 5 items, it’s easy to remember them and you can directly mug them up. Now suppose the number of items is increased to 100. It’s no longer possible to memorize each one, so the only option is to derive some pattern that lets you remember as many items as possible. The same goes for machine learning models. I hope I have answered your question.
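The memorize-vs-pattern idea can be sketched numerically with toy 1-D curve fitting (all the numbers here are made up): a degree-9 polynomial given only 10 noisy points can pass through every one of them, i.e. memorize the noise, while the same polynomial given 500 points is forced to track the underlying curve.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_and_test(n_train, degree=9):
    """Fit a polynomial to noisy samples of sin(x) and measure test error on the clean curve."""
    x_train = rng.uniform(0, 3, n_train)
    y_train = np.sin(x_train) + 0.1 * rng.standard_normal(n_train)
    coeffs = np.polyfit(x_train, y_train, degree)
    x_test = np.linspace(0, 3, 200)
    return float(np.mean((np.polyval(coeffs, x_test) - np.sin(x_test)) ** 2))

err_small = fit_and_test(n_train=10)    # few points: the polynomial "memorizes" the noise
err_large = fit_and_test(n_train=500)   # many points: forced to find the general pattern
# Expect err_small to be much larger than err_large
```

With 10 points and 10 coefficients the fit passes (nearly) exactly through the noisy samples and oscillates wildly in between, while with 500 points the noise averages out, which is the curve-fitting version of "too few items, so you just mug them up".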