C2W3 quiz - understand answer to Question 8

Hi,

I got the correct answer by trial and error; it seems that the 4th option below has to be marked to get a full score.

I don’t understand why the 4th one has to be correct. Please help me with this. I will delete the screenshot once the question is resolved. Thank you very much!

Here is a recap of Andrew’s talk.

There are two steps for Batch Normalization.

  1. normalize input (output from a previous layer) to have mean=0, variance=1
  2. shift/scale normalized data with \gamma and \beta. (\gamma and \beta are trainable.)

Sometimes, normalized data with mean = 0 and variance = 1 may not be appropriate. For such cases, we have the option to adjust it slightly with \gamma and \beta.

Now, let’s look at the equations of Batch Normalization.
First, we calculate the mean and variance of the output from the previous layer:

\mu = \frac{1}{m}\sum_{i=1}^{m}z^{(i)} \\ \sigma^2 = \frac{1}{m}\sum_{i=1}^m(z^{(i)} - \mu)^2 \\

Then, we normalize the whole batch using the above \mu and \sigma:

z_{norm}^{(i)} = \frac{z^{(i)}- \mu}{\sqrt{\sigma^2+\epsilon}} \\ \tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta
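To make these two steps concrete, here is a minimal NumPy sketch for a single unit over one mini-batch (the function and variable names are my own, just for illustration, not code from the course):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-8):
    """Minimal sketch of the two Batch Norm steps for one mini-batch.

    z     : shape (m,), outputs of the previous layer for m examples
    gamma : trainable scale
    beta  : trainable shift
    """
    mu = z.mean()                           # mean of the mini-batch
    var = ((z - mu) ** 2).mean()            # variance of the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)  # step 1: mean 0, variance 1
    z_tilde = gamma * z_norm + beta         # step 2: trainable shift/scale
    return z_tilde, mu, var
```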

If \gamma and \beta take the following values:

\gamma = \sqrt{\sigma^2 + \epsilon} \ , \ \ \beta = \mu

Then,

\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta = \sqrt{\sigma^2 + \epsilon} \cdot \frac{z^{(i)}- \mu}{\sqrt{\sigma^2+\epsilon}} + \mu = z^{(i)}

As you can see, there is no change to the data (the output from the previous layer).
\gamma and \beta are trainable variables, so the above may be the result of training, or the values could be set intentionally. In either case, it means we do not need to apply any transformation.
In general, any transformation, including normalization, may cause some loss of important characteristics. In that sense, these can be said to be optimal values, I think.
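To make the identity above concrete, here is a quick numeric check, reusing the batch_norm_forward sketch from earlier (the values are purely illustrative):

```python
z = np.array([1.0, 2.0, 4.0, 7.0])
mu, var = z.mean(), ((z - z.mean()) ** 2).mean()
eps = 1e-8

gamma = np.sqrt(var + eps)  # gamma = sqrt(sigma^2 + eps)
beta = mu                   # beta  = mu

z_tilde, _, _ = batch_norm_forward(z, gamma, beta, eps)
print(np.allclose(z_tilde, z))  # True: the transformation is the identity
```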

I did get from the lecture that this transformation reduces to the identity if the trainable parameters end up being \gamma = \sqrt{\sigma^2 + \epsilon} \ , \beta = \mu. But here it’s suggested that those are the optimal values. It’s presented as if that’s always the case, and if that were true, we would be wasting computation by using batch norm, since we would not be transforming the inputs to a layer at all.

Practically, they will not end up as such values. :slight_smile:
With Batch Normalization, we can make our network stable. The key point is, of course, the normalization. But normalized data may lose something; in that case, we can slightly shift (bias) and scale it. That’s the whole idea.
If, and that is a big IF, training drives \gamma and \beta to those values, then we may be able to remove Batch Normalization, as you said. But I don’t think that will happen.
The behavior of Batch Normalization differs between training time and inference time. In the training phase, it uses the mean and standard deviation of the current batch. At inference time, it uses a moving average of the mean and standard deviation. So I suppose it is quite unlikely to end up with exactly \gamma = \sqrt{\sigma^2+\epsilon} and \beta=\mu. So, that is “theoretically”…
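As a rough illustration of that difference between training and inference (the function names, the momentum value, and the single-unit treatment are my own assumptions, not the course’s or any framework’s implementation):

```python
import numpy as np

def batch_norm_train(z, gamma, beta, running_mu, running_var,
                     momentum=0.9, eps=1e-8):
    # Training: normalize with the statistics of the current mini-batch ...
    mu = z.mean()
    var = ((z - mu) ** 2).mean()
    # ... and keep exponentially weighted averages for use at inference.
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta, running_mu, running_var

def batch_norm_inference(z, gamma, beta, running_mu, running_var, eps=1e-8):
    # Inference: use the stored moving averages, not the current batch.
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta
```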

So, per my understanding, what’s being shown is that batch norm can compute the identity function: the point is that it will not make things any worse, because the network can learn the identity in case batch norm was not needed at all. But the option in the question indicates that the optimal values are always the ones that reverse the batch norm. So in my opinion, that shouldn’t be one of the correct options.

I would mark it as correct if it were written as

“The optimal values to use for \gamma and \beta can be \gamma = \sqrt{\sigma^2 + \epsilon} and \beta = \mu.”

but not as it’s given currently

“The optimal values to use for \gamma and \beta are \gamma = \sqrt{\sigma^2 + \epsilon} and \beta = \mu.”

It may depend on “optimal for what?”

The original paper says the following:

Indeed, by setting \gamma^{(k)} = \sqrt{Var[x^{(k)}]} and \beta^{(k)} = E[x^{(k)}], we could recover the original activations, if that were the optimal thing to do.

It uses slightly different notation, and it does not include \epsilon, which is basically there for numerical stability.
But in that sense, it may not be appropriate to say in general that those are the optimal values. They are optimal only from that one aspect.

That’s exactly my point. I will wait for a course mentor to reply if they have a different interpretation. Thank you very much!

I think Nobu has given you excellent answers, and certainly better on the details of Batch Norm than I could do. You can tell the quality of the answers independent of the “mentor” badge.

There is one other issue here: there has been a problem with the mechanics of the quizzes in these courses for a while, which is some kind of bug in the way either the platform (Coursera) or the course itself is set up. The problem is that in some cases the answers don’t match the question that was asked. I’ve asked the course staff to confirm whether that general problem with the quizzes has been fixed by this point or whether it is still with us, but have not heard back yet.

Yes, I asked for a mentor because Nobu might not have access to the actual answers, or a way to flag if something is not correct. My contention is that we can’t always say:

The optimal values to use for \gamma and \beta are \gamma = \sqrt{\sigma^2 + \epsilon} and \beta = \mu.

and Nobu also seems to concur with that in the last comment.

This is a line from the Batch Norm paper:

To accomplish this, we introduce, for each activation x^{(k)}, a pair of parameters \gamma^{(k)}, \beta^{(k)}, which scale and shift the normalized value:

y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} +\beta^{(k)}.

These parameters are learned along with the original model parameters, and restore the representation power of the network. Indeed, by setting \gamma^{(k)} =\sqrt{Var[x^{(k)}]} and \beta^{(k)} =E[x^{(k)}] , we could recover the original activations, if that were the optimal thing to do.

This also seems to suggest that if \sqrt{Var[x^{(k)}]} and E[x^{(k)}] were the optimal values, then they would be learned, and the network would not lose representation power because of the use of batch norm. But that doesn’t mean they are always the optimal values.

I only pursued this because it’s either an error in the quiz or a chance to correct my own misunderstanding; I didn’t mean to waste anyone’s time. Please feel free to close the thread. Thank you!