Why do we do L2 normalization before the dot product, and why do we min-max scale the output Y? In earlier classes, Dr. Ng did not go through the need to scale output label values. What's the harm of not doing either?

It's good to normalize before the dot product so the multiplications do not produce large numbers and gradient descent does not oscillate wildly; it probably also reduces memory usage!
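To illustrate the point, here is a small NumPy sketch (not the lab's actual code; the vector names `vu` and `vm` just follow the thread's notation, and the scale factor is made up). After L2 normalization, the dot product is a cosine similarity, so it stays in [-1, 1] no matter how large the raw vectors were:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical user and movie embedding vectors, deliberately made large.
vu = rng.normal(size=(4, 32)) * 10.0
vm = rng.normal(size=(4, 32)) * 10.0

raw = np.sum(vu * vm, axis=1)          # dot products of raw vectors: can be huge

# L2-normalize each row vector to unit length first.
vu_n = vu / np.linalg.norm(vu, axis=1, keepdims=True)
vm_n = vm / np.linalg.norm(vm, axis=1, keepdims=True)
normed = np.sum(vu_n * vm_n, axis=1)   # cosine similarity: always in [-1, 1]

print(np.abs(raw).max())     # large, scales with the vectors' magnitudes
print(np.abs(normed).max())  # never exceeds 1
```

In the TensorFlow lab this same normalization step would be done with a normalization layer on each tower's output before the dot product.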

Min-max scaling here, I think, converts the output to a categorical or discrete form; it's better suited to processing by computer systems in general, as well as to machine learning equations when regression is involved!

Thanks. So on the L2 norm of vu and vm: even though they are supposed to converge to relatively small values, before convergence, without the L2 norm, it's possible that during the iterations we get large values. So the L2 norm ensures that at each iteration we have consistently small values to deal with?

And on the output Y, I'm not sure I understand you. Here the values are not categorical; they are user ratings of movies (0.5 to 5.0 in steps of 0.5), so the Y_hats are between 0.5 and 5.0, and the Y_hats are not supposed to be discrete. Do you mean that the MinMaxScaler confines the Y_hats so that they never go beyond the upper and lower bounds of 5.0 and 0.5?

We specifically need the L2 norm there because it makes sure that, after the dot product, the result is between -1 and 1, which matches the range of the min-max-scaled y. The two transforms go together: the targets are scaled into (-1, 1), and the L2 normalization guarantees the model's dot-product output lives on that same range.
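A quick sketch of that pairing, done by hand in NumPy (this is the same affine transform that sklearn's `MinMaxScaler(feature_range=(-1, 1))` applies after `fit(y)`; the prediction value 0.5 is just a made-up example):

```python
import numpy as np

# Ratings from 0.5 to 5.0 in steps of 0.5, as described in the thread.
y = np.arange(0.5, 5.01, 0.5).reshape(-1, 1)

# Min-max scale y to (-1, 1): first map to [0, 1], then stretch to [-1, 1].
y_min, y_max = y.min(), y.max()
y_scaled = (y - y_min) / (y_max - y_min) * 2.0 - 1.0   # now in [-1, 1]

# A model prediction y_hat lives on this scaled range; to report a rating
# we invert the transform (MinMaxScaler's inverse_transform does the same).
y_hat_scaled = 0.5                                      # hypothetical prediction
y_hat = (y_hat_scaled + 1.0) / 2.0 * (y_max - y_min) + y_min

print(y_scaled.min(), y_scaled.max())   # -1.0 1.0
print(y_hat)                            # 3.875
```

So a dot product bounded to [-1, 1] by the L2 norm can be compared directly against the scaled targets, and `inverse_transform` maps predictions back to the 0.5–5.0 rating scale.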