Dropout and Cross Validation


So I know dropout and CV are obviously not the same concept, at least insofar as with dropout you are, ‘in a sense’, sampling different nodes each time rather than segments of the data set as with traditional machine learning CV.

However, I am struck by how similar they seem. Is there any relation?

Further, looking through the contents of the upcoming weeks I don’t see any mention of CV. I wondered if this process is not traditionally used with Neural Networks, and, if not, why not?

Just to be clear, what exactly are you referring to by “CV”?

Cross validation.

The course doesn’t discuss the traditional “leave-N-out” cross validation.

My experience only:
Leave-one-out is typically used to stretch a set of data that is otherwise too small, by only using one example at a time as the validation set.

Deep learning usually involves large data sets, so stretching the validation set isn’t necessary.

Yes, dropout is a form of regularization. Regularization is part of training, not directly part of Cross Validation. There are three phases of training and testing your model:

  1. Training
  2. Test on the cross validation set. Based on the results, you may need to change hyperparameters, which include the type of regularization used in training and its parameters (e.g. the dropout rate if you are using dropout). Of course, hyperparameters also cover lots of other things (number of layers, number of neurons, activation functions …). Repeat 1) and 2) until you are satisfied.
  3. Then finally test your model produced by steps 1 and 2 on your “test dataset” to get the final evaluation of the performance.
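
The three phases above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, the model is a one-parameter ridge regression rather than a neural network, and the names `fit_ridge_1d`, `mse`, and the list of candidate penalties are all made up for the example. The penalty `lam` stands in for any hyperparameter you would tune on the dev set.

```python
import random

def fit_ridge_1d(points, lam):
    """Fit y ~ w * x with L2 penalty `lam` (closed form, no intercept)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def mse(points, w):
    """Mean squared error of the predictions w * x against y (no penalty term)."""
    return sum((w * x - y) ** 2 for x, y in points) / len(points)

# Synthetic data (an assumption for illustration): y = 2x + noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))

rng.shuffle(data)
train, dev, test = data[:120], data[120:160], data[160:]

# Steps 1 and 2: train once per candidate hyperparameter, score on the dev set.
candidates = [0.0, 0.1, 1.0, 10.0]
dev_scores = {lam: mse(dev, fit_ridge_1d(train, lam)) for lam in candidates}
best_lam = min(dev_scores, key=dev_scores.get)

# Step 3: touch the test set only once, with the chosen setting.
test_mse = mse(test, fit_ridge_1d(train, best_lam))
print(best_lam, round(test_mse, 3))
```

Note that the penalty only appears inside `fit_ridge_1d`; the dev and test scores are plain MSE.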

Note that regularization is never used anywhere but during training. It causes the model you get as a result of training to be different, in a way that you hope is “better” according to your metrics. But in any “prediction” mode, which means using the model (as opposed to training it), there is no regularization active.
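
To make the training-versus-prediction distinction concrete, here is a minimal sketch of inverted dropout (one common variant) in plain Python. The function name `dropout` and its arguments are hypothetical, not taken from any library; real frameworks expose the same idea through a training flag on the layer.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout.

    While training, zero each unit with probability `rate` and scale the
    survivors by 1 / (1 - rate). In prediction mode, return the input unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
predict_out = dropout(hidden, rate=0.5, training=False, rng=rng)

print(train_out)    # some units zeroed, the rest scaled up
print(predict_out)  # identical to `hidden`
```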

Oh, I see. Looking back at the course notes, Prof. Ng uses slightly different terminology than what I studied: what I know as the CV set he calls the ‘Dev set’.

Though, in the terminology I learned (under ‘traditional CV’), you might not have just one Dev set but many: you keep breaking your train set down into multiple randomly selected train/dev sets and continue refining your model.

As said, in part I think it was the difference in terminology I missed.

I also learned (under ‘traditional CV’), you might not just have one Dev set, but many

What you are referring to is called K-Fold Cross Validation.

I found a good article that explains it pretty well.



Prof Ng calls the data set used in cross validation either the “dev set” or the “c/v set” interchangeably, but he calls the process “cross validation”. This was covered in some detail in DLS C2 W1 and W2. He also gets into more subtleties in DLS C3. My understanding is that he means something more straightforward than “K-fold CV”.

So can MSE then still be calculated in the same way (with cross validation techniques – and here, yes, K-Fold) as it is with traditional ML models ? I.e:

\hat{\mbox{MSE}}_b(\lambda) = \frac{1}{M}\sum_{i=1}^M \left(\hat{y}_i^b(\lambda) - y_i^b\right)^2

\hat{\mbox{MSE}}(\lambda) = \frac{1}{K} \sum_{b=1}^K \hat{\mbox{MSE}}_b(\lambda)

Also am only on course 2 so far so was just trying to think ahead to make sure I understand.

Hello @Nevermnd

If you are asking whether MSE is calculated the same way in K-fold CV as in cross validation with a single dev set, then the answer is no.

In K-fold CV, the dataset is divided into k subsets, or folds. The model is trained and evaluated k times, using a different fold as the validation set each time, and the performance metrics from the k folds are averaged to estimate the model’s generalization performance (so every sample is used for training at some point). In cross validation with a single dev set, by contrast, the MSE is computed once on that dev set, and it quantifies how close the predicted response for each observation is to the true response for that observation.


Thanks @Deepti_Prasad.

I mean for reference this is where I learned how to do CV (and K-Fold CV):

So I guess part of my remaining questions here are:

  1. Can you do K-Fold CV on Neural Nets (and is it practical or is this seen as not needed because the architectures are quite different) ?

  2. What then is the means/formula for finding MSE on the Dev set ? It isn’t just the same as the first ‘single case’ formula I posted above ?


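Yes, you can, and the approach is practical too.

If we are doing a K-fold CV data split:

\mbox{MSE}_j = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2

\hat{\mbox{MSE}} = \frac{1}{k}\sum_{j=1}^{k}\mbox{MSE}_j

Here k is the number of folds (subsets), n is the number of examples in the held-out fold, and MSE_j is the MSE measured on the j-th fold.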
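
As a rough sketch of that calculation: the helper names `k_fold_mse` and `fit_mean` below are hypothetical, and the "model" is deliberately trivial (it always predicts the mean of the training targets) so the arithmetic is easy to follow. Since the procedure does not depend on the model, any `fit` function that returns a predictor f, a neural network included, could be passed in instead. The folds are taken by striding through the data, which assumes the examples are already in random order.

```python
def k_fold_mse(xs, ys, k, fit):
    """Return (average MSE over k folds, list of per-fold MSEs).

    `fit(train_x, train_y)` must return a function f with f(x) -> prediction.
    """
    n = len(xs)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of examples")
    fold_mses = []
    for j in range(k):
        held_out = range(j, n, k)          # fold j: indices j, j + k, j + 2k, ...
        held = set(held_out)
        train_x = [xs[i] for i in range(n) if i not in held]
        train_y = [ys[i] for i in range(n) if i not in held]
        f = fit(train_x, train_y)          # train on the other k - 1 folds
        fold_mses.append(
            sum((ys[i] - f(xs[i])) ** 2 for i in held_out) / len(held_out)
        )
    return sum(fold_mses) / k, fold_mses

def fit_mean(train_x, train_y):
    """Toy model: ignore x and always predict the mean training target."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y

xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
avg_mse, per_fold = k_fold_mse(xs, ys, k=3, fit=fit_mean)
print(per_fold, avg_mse)  # [4.5, 2.25, 4.5] 3.75
```

Setting k equal to the number of examples turns this into leave-one-out cross validation.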

Refer to the link below for more information on the k-fold data split.



Hello @Nevermnd!

In addition to @Deepti_Prasad’s response,

You can. CV does not care what kind of model it is.

However, let’s take a closer look at the case when you have a very large dataset (for training a very large model).

First, I suggest you watch these two videos in Course 3 (they don’t really require the previous lectures)

Then you will have seen that we may use as little as 1% of the data for validation when the dataset is very large. So we should think about how that would work with K-Fold. How different would the training sets be from one fold to another? If that difference is small, how much difference could we expect between the models from one fold to another? If we do not expect much difference, using CV might be overkill.

I am raising these as questions because I am not advocating against using CV for very large datasets. But since you are considering practicality, these questions should shed some light, and might explain why, in the future, you see others not using CV with very large datasets.
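
To put a number on "how different would the training sets be": with k equal-sized folds, each training set leaves out one fold, so two training sets share everything except two folds. That is a fraction (k - 2) / (k - 1) of either training set. The short sketch below computes this and double-checks it by brute force; the function names are made up for the example, and equal fold sizes are assumed.

```python
from fractions import Fraction

def shared_training_fraction(k):
    """Fraction of one fold's training set that another fold's training set also contains."""
    if k < 2:
        raise ValueError("need at least 2 folds")
    return Fraction(k - 2, k - 1)

def shared_training_fraction_brute(n, k):
    """Same quantity, counted directly on n example indices (n divisible by k)."""
    folds = [set(range(j, n, k)) for j in range(k)]
    everything = set(range(n))
    train_0 = everything - folds[0]
    train_1 = everything - folds[1]
    return Fraction(len(train_0 & train_1), len(train_0))

for k in (5, 10, 100):
    frac = shared_training_fraction(k)
    print(k, frac, round(float(frac), 4))
```

With a 1% validation fold (k = 100), any two training sets are about 99% identical, which is the reason to expect very similar models from fold to fold.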




Yes, I would agree. I ran into this myself in an earlier course when trying to train other algorithms (GLM, KNN, etc.) on the 10M MovieLens data set.

I tried upgrading my RAM 4 times-- from 16 GB, to 32 to 64, finally to 128-- And still any attempt would just saturate my system, and after a day or two, just crash with no results.

I ended up attempting something similar to K-Fold by running a number of randomly sampled data sets (I tried both with and without replacement). Even then I could only do perhaps up to 250,000 samples per set, and that still took 2 - 3 hours each.
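
For readers who want to try the same workaround, here is a small Python sketch of drawing repeated random subsets of row indices, with or without replacement. It is not the R code from that project; the function name `random_subsets` and the sizes are placeholders, and in a real run each subset would be used to fit and score a model before averaging the scores.

```python
import random

def random_subsets(n_rows, subset_size, n_subsets, replace, seed=0):
    """Draw `n_subsets` lists of row indices, each of length `subset_size`."""
    rng = random.Random(seed)
    rows = range(n_rows)
    subsets = []
    for _ in range(n_subsets):
        if replace:
            # With replacement: the same row may appear more than once.
            subsets.append(rng.choices(rows, k=subset_size))
        else:
            # Without replacement: every row in a subset is distinct.
            subsets.append(rng.sample(rows, subset_size))
    return subsets

without = random_subsets(n_rows=10_000, subset_size=250, n_subsets=4, replace=False)
with_repl = random_subsets(n_rows=10_000, subset_size=250, n_subsets=4, replace=True)

print([len(set(s)) for s in without])    # always 250 distinct rows each
print([len(set(s)) for s in with_repl])  # may be fewer than 250 distinct rows
```

Unlike K-fold, subsets drawn this way can overlap with each other, so some rows may never be sampled at all.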

In the end I went with SVD, which was way more efficient for the task and could even handle the whole data set at once.

So I totally get what you are saying and agree.


It’s nice to have 128GB, @Nevermnd! I have only 16GB, so sometimes I need to find workarounds when RAM is not enough, such as using 32-bit floats instead of 64-bit and, if it is TensorFlow, even using mixed precision of 16-bit and 32-bit. This is a quick way to reduce memory use. There might be other ways, but it is going to be case-by-case.
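
A quick way to see the 64-bit versus 32-bit saving without installing anything is the standard library `array` module, used here purely as an illustration. NumPy's `float64` and `float32` dtypes have the same per-element sizes, and in TensorFlow I believe mixed precision is switched on through the Keras `mixed_precision` policy, but treat that last part as a pointer to look up rather than a recipe.

```python
from array import array

values = [0.1] * 100_000

as_float64 = array("d", values)   # C double: 8 bytes per element
as_float32 = array("f", values)   # C float: 4 bytes per element

bytes64 = len(as_float64) * as_float64.itemsize
bytes32 = len(as_float32) * as_float32.itemsize
print(bytes64, bytes32)

# The tradeoff: the 32-bit copy no longer stores exactly the same value.
rounding_error = abs(as_float32[0] - as_float64[0])
print(rounding_error)
```

Halving the element size halves the memory for the data itself, at the cost of a small rounding error per value.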


PS: You were tagging someone else. I am @rmwkwok

@rmwkwok if so, ooops (!). I know who you are Raymond, but was a bit in a rush posting this morning so sorry if there was a mistake.

With said project I was actually working in R (I’m looking forward to the upcoming TensorFlow classes in this series soon).

At least with R, one interesting thing I noticed, especially after enabling all the multiprocessing libraries/options (I’m running an 8-core AMD Ryzen; I also have a GTX I bought specifically for data science, not gaming, but I haven’t been able to use it because R is not good for CUDA support), is that it was always my memory that was the bottleneck, and not the actual calculations.

So in said assignment, at times, I’d hit, at most 30% utilization on all cores-- but it would eat all 128 GB RAM for breakfast and never let up.

Personally I think this is one of the drawbacks of R-- Your general math operations/procedures, including for-loops, are pretty fast so you don’t have to worry about that.

However, it really does not (at least as far as I am aware) have great ‘paging’ support, that is, being able to save chunks of the problem to the HD as it goes along.

Instead it tries to hold everything in memory, which of course is the fastest way, yet if your dataset/problem is really big, IMHO, you will run into issues.


Yes, it is definitely going to be problematic, but memory is also a common pain point, so it is likely that workarounds exist. Unfortunately I haven’t used R so I can’t say anything about that, but if you will be using Python, depending on the package/modeling approach, there may be ways other than expanding RAM. There may be tradeoffs (precision, data dimensions, batch sizes…), but the key is that it can work and lead us to a proven, viable model with whatever computational resources are available.