Can anyone explain why we take the average of the cost function in the advertising and sales linear regression problem in the video “optimization using gradient descent least squares with multiple observations”? Wouldn't the cost function just be the sum of all the costs?

Can you provide a time mark within that video?

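The sum and the average differ only by the constant factor 1/m (m being the number of observations), so both are minimized by the same parameters. Taking the average makes the cost easier to interpret, since it is the error per observation, and it keeps the gradient descent update step consistent regardless of the data size. With a plain sum, the gradient grows in proportion to the number of observations, so a learning rate that works on a small dataset can overshoot on a large one. With the average, the learning rate, which is a key hyperparameter in gradient descent, has a similar effect whatever the size of the dataset. This makes it easier to set and fine-tune hyperparameters during training.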
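Here is a small sketch of that scaling effect. It is not taken from the video: the toy data, the function name `gradients`, and the starting point w = 0, b = 0 are all made up for illustration, and it assumes a squared-error cost with the usual factor of 1/2 for the line y_hat = w*x + b.

```python
def gradients(xs, ys, w, b, average):
    """Gradient of the squared-error cost (with a 1/2 factor) for y_hat = w*x + b.

    If average is True the gradient is divided by the number of observations,
    which corresponds to using the mean cost instead of the summed cost.
    """
    n = len(xs)
    dw = sum((w * x + b - y) * x for x, y in zip(xs, ys))
    db = sum((w * x + b - y) for x, y in zip(xs, ys))
    if average:
        dw, db = dw / n, db / n
    return dw, db


# Hypothetical toy data: 3 observations, then the same data repeated 100 times.
small_x, small_y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
big_x, big_y = small_x * 100, small_y * 100

# Summed cost: the gradient grows with the number of observations.
sum_small = gradients(small_x, small_y, 0.0, 0.0, average=False)
sum_big = gradients(big_x, big_y, 0.0, 0.0, average=False)

# Averaged cost: the gradient stays on the same scale.
mean_small = gradients(small_x, small_y, 0.0, 0.0, average=True)
mean_big = gradients(big_x, big_y, 0.0, 0.0, average=True)

print("sum, small data: ", sum_small)
print("sum, big data:   ", sum_big)
print("mean, small data:", mean_small)
print("mean, big data:  ", mean_big)
```

With the summed cost the gradient on the larger dataset is 100 times bigger, so the same learning rate would take a step 100 times longer. With the averaged cost the two gradients match, so the same learning rate behaves the same way on both.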

4:00