Can you explain this part to me a bit? Why is that?

The Frobenius Norm is the square root of the sum of the squares of the elements, right? Have you tried taking the derivative of that?

To reduce it to the univariate case, just to make the conceptual point, consider the following two functions:

f(z) = z^2

g(z) = \sqrt{z}

What are the derivatives of those two functions?

**2z and (1/2) z^{-1/2}**
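Those two results are easy to check numerically with a central difference (a quick sketch; the helper name and step size `h` are my own illustrative choices):

```python
# Numerical check of the two derivatives via a central difference.
# The helper name and the step size h are illustrative choices.
def derivative(f, z, h=1e-6):
    """Central-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

f = lambda z: z ** 2        # f'(z) = 2z
g = lambda z: z ** 0.5      # g'(z) = 1 / (2 * sqrt(z))

z = 4.0
print(derivative(f, z))  # ~8.0, matching 2z
print(derivative(g, z))  # ~0.25, matching 1/(2*sqrt(4))
```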

I still can’t follow your thought. And why is the derivative of the root important?

Well, look at those two expressions: which one would you rather deal with? We’re talking about running a lot of iterations of gradient descent, right? Just to be clear, let’s write the second expression in normal notation:

g'(z) = \displaystyle \frac{1}{2\sqrt{z}}

Note that computing square roots is expensive, because you need to run an approximation algorithm for every single square root. The computational expense matters here, because we’re going to be running potentially tens of thousands of iterations. 2z, on the other hand, is very cheap to compute.

And the overall point, which they also made on that slide, is that \sqrt{} is monotonically increasing, so whatever minimizes z also minimizes \sqrt{z}, right? So why incur a bunch of extra compute cost that doesn’t do you any good?
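You can check that monotonicity point directly: the z that minimizes a cost also minimizes the square root of that cost. A small grid-search sketch (the quadratic cost and the grid here are made up for illustration):

```python
# Sketch: sqrt is monotonically increasing, so it preserves the argmin.
# The cost function and the search grid are made-up illustrations.
zs = [i / 100 for i in range(-300, 301)]    # grid from -3.0 to 3.0
cost = [(z - 1.5) ** 2 + 1 for z in zs]     # minimum at z = 1.5

argmin_plain = zs[min(range(len(zs)), key=lambda i: cost[i])]
argmin_sqrt = zs[min(range(len(zs)), key=lambda i: cost[i] ** 0.5)]

print(argmin_plain, argmin_sqrt)  # same point: 1.5 1.5
```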

Oh, do you mean that in this case we would do three times more operations than with **2z** when we compute the gradients?

So, for this reason it’s better to compute the average of squares instead?

The point is that the gradient computations are determined by which function you use for the cost, since the gradients are derivatives of the cost. Either function gives you the same solution when you train, but computing square roots is much more expensive. So why not use the cheaper function if it gives the same result? Why would you want your training to be more costly in computational resources than it needs to be? That matters, right?
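The same chain-rule bookkeeping as in the univariate sketch shows where the extra square roots would come from in training. Writing C(w) for the sum-of-squares cost (the notation here is mine, not the slide’s):

\nabla_w \sqrt{C(w)} = \displaystyle \frac{\nabla_w C(w)}{2\sqrt{C(w)}}

So every gradient step on \sqrt{C} costs everything a step on C costs, plus an extra square root and a division, while the minimizing w is unchanged.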

That is literally what I said in my previous response: