In the above screenshot, does anyone know why we are dividing by m in the last two steps?
Since the dj terms are summed over the whole training set, dividing by m normalizes them. This means the gradients will be about the same magnitude regardless of the number of examples.
It is the same idea as computing an average: sum the values, then divide by the number of elements.
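As an illustration, here is a minimal sketch of a gradient computation for a squared-error cost (the function and variable names like compute_gradient, dj_dw, and dj_db are assumptions standing in for whatever appears in your screenshot, not the course's exact code):

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Gradient of the squared-error cost for linear regression."""
    m = X.shape[0]                       # number of training examples
    dj_dw = np.zeros_like(w)
    dj_db = 0.0
    for i in range(m):                   # sum contributions from every example
        err = np.dot(X[i], w) + b - y[i]
        dj_dw += err * X[i]
        dj_db += err
    # Dividing by m turns the sums into per-example averages,
    # so the gradient magnitude does not grow with the dataset size.
    return dj_dw / m, dj_db / m
```

Without that final division, the same learning rate would produce much larger update steps on a larger training set.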
There is also a mathematical reason.
Since the cost equation has a division by m, and m is a constant with respect to theta, the 1/m factor carries over into the gradients when we take the partial derivative of the cost equation to get the gradient equation.
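For example, assuming the standard squared-error cost from the lectures (with prediction $f_\theta(x) = \theta^T x$), the factor passes straight through differentiation:

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(f_\theta(x^{(i)}) - y^{(i)}\bigr)^2
\quad\Longrightarrow\quad
\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}
$$

Because $\frac{1}{m}$ does not depend on $\theta$, it stays in front of the sum when you differentiate; the $\frac{1}{2}$ cancels against the 2 brought down by the chain rule.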