Hi, I have finished the course. It was very helpful. However, I have a few questions about the course content.

1. The gradient descent covered in class is batch gradient descent rather than stochastic gradient descent, correct? Please advise how to choose between them in different applications.
2. MSE normally uses 1/m to average all the squared residuals, so why does the cost function use 1/2m? What does the extra 1/2 represent?
3. The regularization term lambda/2m * sum(w[j]**2) looks a bit like lasso regularization, but with a 1/2m factor. Why is that? Is it for computational simplicity, or is there a special meaning?

Batch GD and stochastic GD are two extremes in terms of how many samples are used in each update. In practice we usually take the middle road, mini-batch GD: you choose how many samples to use at each update based on your available memory and your observation of the training process.
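To make this concrete, here is a minimal NumPy sketch of mini-batch GD for linear regression (the function name `minibatch_gd` and the hyperparameter values are my own choices, not from the course):

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent for linear regression with MSE cost."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        idx = rng.permutation(m)                  # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            err = Xb @ w + b - yb                 # residuals on this mini-batch
            w -= lr * (Xb.T @ err) / len(batch)
            b -= lr * err.mean()
    return w, b

# batch_size = m recovers batch GD; batch_size = 1 recovers stochastic GD
X = np.random.default_rng(1).normal(size=(200, 1))
y = 3.0 * X[:, 0] + 2.0
w, b = minibatch_gd(X, y)
```

Setting `batch_size` is exactly the knob described above: larger batches need more memory per update but give smoother gradients.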

Without the 1/2, a coefficient of 2 appears when we differentiate the squared term in the MSE cost. Including the 1/2 cancels it, so the gradient comes out cleaner. It is purely for computational convenience and has no special meaning.

No, it isn't lasso. Lasso (L1) penalizes sum(|w[j]|), whereas this term penalizes sum(w[j]**2), which is L2 (ridge) regularization. The 1/2 is there for the same reason as above: to cancel out the factor of 2 in the gradients.
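A tiny example of the difference between the two penalties (the example values of `w`, `lam`, and `m` are arbitrary):

```python
import numpy as np

w = np.array([0.5, -2.0, 1.5])  # example weight vector
lam, m = 0.1, 100               # regularization strength and sample count

l1_penalty = lam / m * np.sum(np.abs(w))      # lasso (L1): sum of |w_j|
l2_penalty = lam / (2 * m) * np.sum(w ** 2)   # ridge (L2): sum of w_j^2, with the 1/2
```

Note the L1 penalty is not differentiable at w_j = 0, which is why the 1/2 trick only applies to the squared (L2) form.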

Gradient descent is an optimization algorithm used to minimize a cost function when training a machine learning model. Batch gradient descent and stochastic gradient descent are two variations of this algorithm.

Batch gradient descent uses the entire dataset to update the parameters of the model in each iteration. It is computationally expensive and requires a lot of memory, but each step follows the exact gradient of the cost, so the updates are less noisy.

Stochastic gradient descent, on the other hand, uses only one sample at a time to update the parameters in each iteration. It is computationally efficient and requires less memory, but each update is only a noisy estimate of the true gradient.

When choosing between the two, consider the size of your dataset, the computational resources you have available, and how much gradient noise you can tolerate; in practice, mini-batch gradient descent is the usual compromise between them.
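The contrast can be seen in a single update step (synthetic data; variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.0, -2.0]) + 0.5 * rng.normal(size=500)
w, lr = np.zeros(2), 0.1

# Batch GD: one update averages the gradient over all m samples.
grad_batch = X.T @ (X @ w - y) / len(X)
w_batch = w - lr * grad_batch

# SGD: one update uses the gradient from a single random sample.
i = rng.integers(len(X))
grad_sgd = X[i] * (X[i] @ w - y[i])
w_sgd = w - lr * grad_sgd
```

One batch step touches all 500 samples; one SGD step touches a single sample, so it is 500x cheaper but points in a noisier direction.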

The cost function for linear regression is typically defined as the mean squared error (MSE) between the predicted outputs and the true outputs. The MSE is computed as the average of the squared residuals, where the residual is the difference between the predicted output and the true output for a given sample.

Using 1/m to average the squared residuals is standard practice. However, in some cases the cost function is defined with a 1/2m factor instead. This is done for computational simplicity: the 2 produced by differentiating the square cancels with the 1/2, giving a cleaner expression for the gradient. Since scaling a cost function by a positive constant does not change where its minimum is, this has no effect on the learned parameters.
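The cancellation is easy to see in the derivation (standard notation, with model f_{w,b}(x) = wx + b):

```latex
J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2
```

```latex
\frac{\partial J}{\partial w}
= \frac{1}{2m}\sum_{i=1}^{m} 2\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}
= \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}
```

The 2 from the chain rule cancels the 1/2, leaving a plain 1/m average in the gradient.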

The regularization term lambda/2m * sum(w[j]**2) is a form of L2 regularization, which helps prevent overfitting by adding a penalty to the cost function proportional to the squared magnitude of the model parameters. The 1/2 is used for the same reason as in the MSE term: it cancels the factor of 2 from differentiation and simplifies the gradient, without changing the overall effect of the regularization term on the cost function.
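Putting both pieces together, here is a hedged sketch of the regularized cost and its gradient (the function name `ridge_cost_and_grad` is my own):

```python
import numpy as np

def ridge_cost_and_grad(X, y, w, b, lam):
    """MSE cost with L2 penalty lambda/(2m) * sum(w_j^2), plus its gradient."""
    m = len(y)
    err = X @ w + b - y
    cost = (err @ err) / (2 * m) + lam / (2 * m) * (w @ w)
    grad_w = (X.T @ err) / m + (lam / m) * w   # the 1/2 cancels the 2 here too
    grad_b = err.mean()                         # bias is conventionally not regularized
    return cost, grad_w, grad_b
```

Note that both 1/2 factors disappear in `grad_w`: the derivative of the penalty is simply (lambda/m) * w.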

Another aspect of the 1/m in the regularization term is that it reduces the effective strength of regularization for large datasets. This is reasonable: with more training data, the fitted model's estimates are more stable, so the risk of overfitting is smaller and less regularization is needed.