Question about Gradient Descent: Modifying Update Rules and Using Derivatives

In gradient descent, when updating the parameters, we subtract the derivative of the cost function from the current parameter values (the negative sign is sometimes folded into the learning rate), and this subtraction moves us toward minimizing the cost. My question: would it be valid instead to add the gradient, or to use the cost function's value itself (rather than its derivative) in the update rule, with a condition to multiply by -1 when needed? I'm trying to understand the implications of modifying the update rule this way and whether it aligns with standard practice in optimization.

Additionally, what is the rationale behind using the derivative of the cost function in the update rule instead of the gradient itself? I’m curious about the reasoning behind using the derivative during the optimization process.



First of all, the negative sign aligns the direction of movement with the direction of steepest decrease in the cost function.

Secondly, the derivative represents the slope (rate of change) of the cost function, so it provides local information that allows precise adjustments to the parameters.

Thirdly, the derivative acts as a guide, indicating how much and in what direction the parameters should be adjusted to reach a lower cost. The cost value itself cannot do this: it tells you how bad the current parameters are, but not which way to move them.
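The points above can be illustrated with a minimal sketch (my own toy example, not from the course): minimizing f(w) = (w - 3)^2 by repeatedly subtracting the learning rate times the derivative.

```python
# Minimize f(w) = (w - 3)**2, whose derivative is f'(w) = 2 * (w - 3).

def grad(w):
    return 2 * (w - 3)  # slope of the cost at w

w = 0.0    # initial parameter guess
lr = 0.1   # learning rate
for _ in range(100):
    w = w - lr * grad(w)  # subtract the gradient: step downhill

print(round(w, 4))  # -> 3.0, the minimizer
```

Note that the sign of the derivative automatically handles "which way to move": left of the minimum the derivative is negative, so subtracting it increases w; right of the minimum it is positive, so subtracting it decreases w. No extra multiply-by-minus-one condition is needed.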

You can verify this mathematically if you like; these are notes I collected from the calculus course of M4ML.

Thanks to Luis Serrano (Instructor).


These are exactly the same thing. By definition, the gradient is the vector of partial derivatives of the cost function; in the one-parameter case it is simply the derivative.
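A quick numerical check of that definition (my own illustrative example, with an arbitrary two-parameter cost): the analytic gradient matches a finite-difference approximation of the partial derivatives.

```python
import numpy as np

# Cost J(w1, w2) = w1**2 + 3*w2.
# Its gradient is the vector of partial derivatives: [2*w1, 3].
J = lambda w: w[0] ** 2 + 3 * w[1]

def analytic_grad(w):
    return np.array([2 * w[0], 3.0])

def numeric_grad(f, w, eps=1e-6):
    # Central finite differences: approximate each partial derivative.
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

w = np.array([1.5, -2.0])
print(analytic_grad(w))      # [3. 3.]
print(numeric_grad(J, w))    # approximately the same
```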

Thanks 🙂