The gradient is calculated as: “g = d/dR Loss = (2/m)(X^T(XR-Y))”
There is no clear explanation given for why g = (2/m)(X^T(XR-Y)).
Specifically, why is ‘XR-Y’ multiplied by the transpose term?
Can someone provide a more elaborate explanation?
Hi Tejas_Joshi.
There is an optional reading towards the end of Week 4 that shows how to arrive at this formula.
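The transpose comes from the vectorized version. Assuming the loss is the mean squared error, Loss = (1/m) * sum_i ((XR-Y)_i)^2, the derivative with respect to a single parameter R_j is (2/m) * sum_i X_ij * (XR-Y)_i. That sum is exactly row j of X^T multiplied by the vector XR-Y (the difference between predicted and actual outcomes). So to get the whole gradient you don’t have to loop through every parameter, multiply and sum; one matrix product X^T(XR-Y) gives all of the components at once.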
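To make this concrete, here is a small NumPy sketch that computes the gradient both ways and checks that they agree. It assumes a mean squared error loss, which is what the 2/m factor suggests; the function names `gradient_loop` and `gradient_vectorized` and the random test data are made up for illustration and are not from the course material.

```python
import numpy as np

def gradient_loop(X, R, Y):
    """Gradient of (1/m) * sum((X @ R - Y)**2), one parameter at a time."""
    m, n = X.shape
    g = np.zeros(n)
    for j in range(n):                      # loop over parameters
        total = 0.0
        for i in range(m):                  # loop over training examples
            residual = X[i] @ R - Y[i]      # prediction minus target for example i
            total += X[i, j] * residual     # multiply by feature j and accumulate
        g[j] = (2.0 / m) * total
    return g

def gradient_vectorized(X, R, Y):
    """Same gradient written as (2/m) * X^T (XR - Y)."""
    m = X.shape[0]
    return (2.0 / m) * (X.T @ (X @ R - Y))

# Illustrative data: 5 examples, 3 features (assumed shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
R = rng.normal(size=3)
Y = rng.normal(size=5)

print(np.allclose(gradient_loop(X, R, Y), gradient_vectorized(X, R, Y)))  # prints True
```

The inner loop over `i` is the sum that row j of X^T performs in the matrix product, which is why the transpose shows up in the formula.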