Feature Scaling Part 1: optimizing number of elements in dw

Greetings!

While watching the lecture, I had a question: if, in the case of linear regression, we managed to make the cost function's contours circular by using feature re-scaling techniques, is it still worth using different values in the dw vector?

Does it make sense to equate all the gradients so that gradient descent follows a straight-line trajectory (see the image below)?

No, it does not.

Regardless of how the features are scaled, each one may have a different significance in predicting the outputs, so each needs its own weight. Scaling only equalizes the ranges of the inputs; it does not change how strongly each feature relates to the target, and the gradient components dw_j will still generally differ.
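As a quick illustration (a hypothetical sketch, not from the lecture): below, both features are already standardized to the same scale, yet the target depends much more strongly on the first one, so gradient descent must drive the two weights to different values. The data-generating coefficients (3.0 and 0.5) and the learning-rate/iteration settings are assumptions chosen for the example.

```python
import numpy as np

# Two standardized features with different influence on the target.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 2))           # features already on the same scale
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]     # feature 0 matters far more than feature 1

w = np.zeros(2)
b = 0.0
alpha = 0.1
for _ in range(200):
    err = X @ w + b - y
    dw = X.T @ err / n                # gradient w.r.t. each weight: components differ
    db = err.mean()
    w -= alpha * dw
    b -= alpha * db

print(w)  # converges near [3.0, 0.5]: equal feature scales, unequal weights
```

If all the dw components were forced to be equal, both weights would move in lockstep and could never land on (3.0, 0.5) simultaneously; the circular contours only guarantee that the *un-constrained* gradient points roughly toward the minimum.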