Clarification on Cost Discrepancy Between L2 Regularization and Dropout

Results Observed:

I noticed an interesting pattern while comparing L2 regularization and dropout in a neural network. Despite their cost functions behaving differently, both methods achieve similar accuracy.

| Method | Cost (Final) | Training Accuracy | Test Accuracy |
| --- | --- | --- | --- |
| 3-layer NN with L2 Regularization | ~0.2678 | 94% | 93% |
| 3-layer NN with Dropout | ~0.0605 | 93% | 95% |

Observation:

The L2-regularized model has a much higher final cost (~0.2678) compared to the dropout model (~0.0605), yet their test accuracy is quite similar.

Why does the L2-regularized model have a significantly higher cost despite achieving accuracy comparable to the dropout model?


Isn’t it simply because, in the case of L2 regularization, COST includes the (squared) Frobenius norms of all the weight matrices? That is, the value of COST everywhere in the space of weights is higher by a large-ish sum of all the w^{2}. One is adding a high-dimensional paraboloid around the “all weights 0” point.

This is not the case with the NN that is used with dropout, so COST is correspondingly lower.
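For concreteness, assuming the usual DLS C2 binary cross-entropy cost, the two quantities being compared are roughly:

$$
COST_{\text{dropout}} = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log a^{(i)} + \left(1-y^{(i)}\right)\log\left(1-a^{(i)}\right)\right]
$$

$$
COST_{\text{L2}} = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log a^{(i)} + \left(1-y^{(i)}\right)\log\left(1-a^{(i)}\right)\right] \;+\; \frac{\lambda}{2m}\sum_{l}\left\|W^{[l]}\right\|_F^2
$$

The extra term is never negative, so COST_{\text{L2}} sits above the plain cross-entropy everywhere in weight space.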

Note that this doesn’t really matter as COST is just an arbitrary value that tells us how good we are currently doing (relative to other solutions), which is why its formula can be chosen rather freely.


As David points out, it’s a different function. When you do dropout, there is no additional regularization term added to the J value. If you want a more accurate evaluation of the results in both cases, try computing the J value of both models with regularization disabled: after you complete the training in both cases, remove the L2 term from the L2-regularized model’s cost and set keep_prob = 1 in the dropout model. Now compute the base cost again for each and compare.
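As a minimal sketch of that comparison (the helper names forward_propagation, forward_propagation_with_dropout, parameters_l2 and parameters_drop are placeholders for whatever your notebook actually uses):

```python
import numpy as np

def base_cross_entropy_cost(AL, Y):
    """Plain cross-entropy cost: no L2 term, no dropout scaling."""
    m = Y.shape[1]
    eps = 1e-8  # guard against log(0)
    return float(-np.sum(Y * np.log(AL + eps) + (1 - Y) * np.log(1 - AL + eps)) / m)

# After training both models, evaluate them on the same footing
# (hypothetical helper names -- substitute your own functions):
# AL_l2, _   = forward_propagation(X, parameters_l2)          # L2 term simply not added
# AL_drop, _ = forward_propagation_with_dropout(X, parameters_drop, keep_prob=1.0)
# print(base_cross_entropy_cost(AL_l2, Y), base_cross_entropy_cost(AL_drop, Y))
```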

The accuracy is what we really use to evaluate the results and that tells the real story.

As David also points out, the J value itself is not really that meaningful. E.g. it’s certainly not portable, meaning comparing the J value from two different models tells you nothing. Also note that lower J doesn’t necessarily mean better accuracy for any given model, because accuracy is quantized.

Just to make sure I’m not giving the wrong impression here by saying J isn’t that meaningful, the loss/cost function itself is absolutely critical to everything we do here: the derivatives of that function control the training. So the function L matters hugely, but just knowing the scalar value of J doesn’t really tell you that much. Lower is better, but you need accuracy as the real metric to know how much better.


As you mentioned, the value J doesn’t really provide any information since it’s just an arbitrary number. But before reading your explanation, I was wondering why don’t we use J as a metric to see after how many epochs the loss decreases, without actually training the model? This would save a lot of computation. We could also initialize the weights using the same technique as in the network (e.g., He for ReLU or Xavier for Tanh/Sigmoid) just for the cost function and then run it for thousands of epochs.

You also mentioned that the L value matters more than J. Why is that? Why are we more interested in the loss of a single example rather than understanding the overall loss J of the model?

@Mushi As already answered by @dtonhofer and @paulinpaloalto, the key reason behind this difference in final cost values despite similar accuracy is how L2 regularization and dropout affect the cost function and network optimization.
If you are interested in how these two perform, you can follow the research article An Analysis of the Regularization Between L2 and Dropout in Single Hidden Layer Neural Network | IEEE Conference Publication | IEEE Xplore, in which Table 1 and Table 2 show how the two methods compare.
Note, though, that the paper focuses on a single-hidden-layer neural network.

I was wondering why don’t we use J as a metric to see after how many epochs the loss decreases, without actually training the model?

Well, if you don’t train the model, all the weights stay at the value they had after initialization, so processing the examples will never change anything, and the accuracy of an epoch and J over an epoch will stay constant.

The point is exactly in “training the network” so that J may decrease.

You also mentioned that the L value matters more than J. Why is that? Why are we more interested in the loss of a single example rather than understanding the overall loss J of the model?

It “matters more” because it is rather arbitrary how one computes the COST (cost over all examples in the batch) from the array LOSS (loss of each example in the batch) during feed-forward batch processing. In the course, we use COST = mean(LOSS), which is simple, but one could choose a function other than mean(), as long as it increases with increasing LOSS and is easy to differentiate (the latter so that back-propagation stays computationally easy).

If one is doing stochastic gradient descent (that is, one adjusts the weights based on the loss of each example rather than after having collected LOSS for the whole batch and computed COST), there isn’t even a COST, just the per-example loss. From that loss one computes a gradient for the example at hand and adjusts the weights accordingly. A cost can still be defined as a running average over the per-example loss to monitor progress, but it has no influence on backpropagation.
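A tiny sketch of the two modes of operation (made-up numbers, plain NumPy):

```python
import numpy as np

def per_example_loss(a, y, eps=1e-8):
    # Binary cross-entropy loss of one example (works element-wise on arrays too)
    return -(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps))

# Batch processing: COST is one convenient reduction of the LOSS array -- here the mean.
A = np.array([0.9, 0.2, 0.7])   # activations of the output layer
Y = np.array([1, 0, 1])         # true labels
LOSS = per_example_loss(A, Y)
COST = LOSS.mean()

# Stochastic gradient descent: no COST is ever formed; each example's loss drives
# one weight update, and a running average is kept only to monitor progress.
running_avg, beta = 0.0, 0.99
for a_i, y_i in zip(A, Y):
    loss_i = per_example_loss(a_i, y_i)
    # ... compute d(loss_i)/d(weights) and update the weights here ...
    running_avg = beta * running_avg + (1 - beta) * loss_i  # monitoring only
```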


The relationship between L and J is straightforward: J is the mean of the L values over the samples in the full epoch. Maybe you can consider that “arbitrary”, but that is how it is done in every case that I’ve seen so far.

Sorry, what I said about L being more important than J needs to be a bit more specific: what I should have said is that it is the functions L and J that matter, not the particular J values. The functions drive back propagation through their derivatives, and of course the straightforward relationship between J and L means that the relationship between the derivatives is also straightforward: the derivative of the average is the average of the derivatives. Think about it for a second and that should make sense.

The thing that is not that important is the actual scalar value of J at any particular point. If you tell me that it’s 42, there isn’t really much that I can conclude from that other than that 41 would probably be better. The values of J are not “portable” in that you can’t compare them between different models, so they have no value beyond being a cheap proxy for how convergence is working.

And as mentioned above, the relationship between accuracy and cost is not as simple as you’d expect because accuracy is quantized. Accuracy is what we really care about, but that’s a bit more expensive to compute on every iteration. We can compute accuracy every 100 or 1000 or 5000 iterations and get a more realistic picture of where we are.
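In symbols, taking J as the plain mean of the per-example losses (no regularization term), for any weight \theta:

$$
\frac{\partial J}{\partial \theta} = \frac{\partial}{\partial \theta}\left(\frac{1}{m}\sum_{i=1}^{m} L^{(i)}\right) = \frac{1}{m}\sum_{i=1}^{m}\frac{\partial L^{(i)}}{\partial \theta}
$$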


Actually, now that I think \epsilon harder about what David is saying there, there can be an arbitrary part of the relationship between L and J in some cases, e.g. the L2 term when we are doing L2 regularization. That is added to the mean of the L values to get the final J value that is used for computing gradients in that case. There are other forms of regularization that add different additional terms, e.g. L1 or “Lasso” regularization which adds a term based on the sums of the absolute values of the weights. But the “base” unregularized cost J is just the mean of L over the samples.
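Sketching the added terms (the exact scaling convention varies between sources; \frac{\lambda}{2m} is assumed here for both):

$$
\text{L2:}\quad \frac{\lambda}{2m}\sum_{l}\left\|W^{[l]}\right\|_F^2
\qquad\qquad
\text{L1 / Lasso:}\quad \frac{\lambda}{2m}\sum_{l}\sum_{j,k}\left|w^{[l]}_{jk}\right|
$$

Either one gets added to the mean of the L values to form the J whose derivatives drive the updates.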

One other side question that is worth mentioning is that people are frequently curious about why the L2 regularization term is scaled by \frac {1}{m}. That makes it look a bit like an average, but the sum is not over the samples of course. I don’t know the answer and Prof Ng does not discuss this in the DLS C2 lectures (at least that I can recall), but one theory would be that the purpose is to make the value of the hyperparameter \lambda orthogonal to the dataset size. Here’s a thread which discusses that a bit more. And actually here’s a thread in which @conscell points out that Prof Ng does say more about this in the MLS lectures and does confirm the “hyperparameter orthogonality” motivation for doing the scaling that way.


Thanks for the clarifications, Paul

Indeed it is.

But also:

If we consider AL, the output of the last layer, as the estimates of the probabilities that the input X belongs to class 1, i.e. that the true label Y = 1, then, if we bet on the true label using these probabilities, the COST is the negative log of the probability of “betting correctly on every example in the batch”, scaled by \frac{1}{m}.

It’s hard to express, but actually very simple.

I.e. the ANN, configured with the weight and bias vector \Gamma, and seeing input X^{\{i\}} tells us:

I estimate that:

  • p(Y^{\{i\}} = 1 | X^{\{i\}}, \Gamma ) = AL^{\{i\}} and (a fortiori)
  • p(Y^{\{i\}} = 0 | X^{\{i\}}, \Gamma ) = 1-AL^{\{i\}}

what say thee?

And then we say:

Okay, give me a “1” with probability AL^{\{i\}} and let’s check against the actual Y^{\{i\}} to see whether we won this round \{i\}

until we have gone through the whole batch.

So, e^{(-m * COST)} is the probability of winning this game perfectly.
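Spelled out: if p^{\{i\}} denotes the probability the network assigns to the correct label of example \{i\}, then

$$
COST = -\frac{1}{m}\sum_{i=1}^{m}\log p^{\{i\}}
\qquad\Longrightarrow\qquad
e^{-m \cdot COST} = \prod_{i=1}^{m} p^{\{i\}},
$$

which is exactly the probability of betting correctly on every round, assuming the rounds are independent.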