In gradient descent, the parameter update subtracts the gradient of the cost function (scaled by the learning rate) from the current parameter values; the negative sign, sometimes folded into the step itself, is what keeps the update moving toward a minimum of the cost. Would it be valid to instead add an update term based on the value of the cost function itself (rather than its derivative), multiplying by -1 when needed so the parameters still move downhill? I'm trying to understand the implications of modifying the update rule in this way and whether it is consistent with standard practice in optimization. A rough sketch of what I mean is below.
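To make the comparison concrete, here is a toy Python sketch on a one-dimensional quadratic cost. The names cost_fn, grad_fn, and lr are placeholders I made up purely for illustration, and the "variant" branch is just my reading of the conditional sign flip described above, not an established method.

```python
# Toy cost J(theta) = (theta - 3)^2, so the minimum is at theta = 3.
def cost_fn(theta):
    return (theta - 3.0) ** 2

def grad_fn(theta):
    return 2.0 * (theta - 3.0)

lr = 0.1
theta_standard = 0.0
theta_variant = 0.0

for _ in range(50):
    # Standard gradient descent: step against the derivative of the cost.
    theta_standard -= lr * grad_fn(theta_standard)

    # The variant I'm asking about: add the cost *value* itself, flipping the
    # sign "when needed" (here, based on the sign of the derivative) so the
    # step still points downhill.
    step = cost_fn(theta_variant)
    if grad_fn(theta_variant) > 0:
        step = -step
    theta_variant += lr * step

print("standard update:", theta_standard)    # converges toward 3
print("cost-value update:", theta_variant)   # step size depends on the cost's scale, not its slope
```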
Additionally, what is the rationale for using the derivative (i.e., the gradient) of the cost function in the update rule rather than the value of the cost function itself? I'm curious about the reasoning behind relying on the derivative during optimization.