In the above screenshot, does anyone know why we are dividing by m in the last two steps?
Since the dj terms are summed over the whole training set, dividing by m normalizes them. This means the gradients will be about the same magnitude regardless of the number of examples.
It is the same idea as computing an average: sum the values, then divide by the number of elements.
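As an illustration, here is a minimal sketch of a gradient computation for a squared-error cost (the function and variable names like compute_gradient, dj_dw, and dj_db are assumptions standing in for whatever appears in your screenshot, not the course's exact code):

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Gradient of the squared-error cost for linear regression."""
    m = X.shape[0]                       # number of training examples
    dj_dw = np.zeros_like(w)
    dj_db = 0.0
    for i in range(m):                   # sum contributions from every example
        err = np.dot(X[i], w) + b - y[i]
        dj_dw += err * X[i]
        dj_db += err
    # Dividing by m turns the sums into per-example averages,
    # so the gradient magnitude does not grow with the dataset size.
    return dj_dw / m, dj_db / m
```

Without that final division, the same learning rate would produce much larger update steps on a larger training set.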
There is also a mathematical reason.
Since the cost equation has a division by m, and m is a constant with respect to theta, the 1/m factor carries over into the gradients when we take the partial derivative of the cost equation to get the gradient equation.
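For example, assuming the standard squared-error cost from the lectures (with prediction $f_\theta(x) = \theta^T x$), the factor passes straight through differentiation:

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(f_\theta(x^{(i)}) - y^{(i)}\bigr)^2
\quad\Longrightarrow\quad
\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\bigl(f_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)}
$$

Because $\frac{1}{m}$ does not depend on $\theta$, it stays in front of the sum when you differentiate; the $\frac{1}{2}$ cancels against the 2 brought down by the chain rule.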