Regarding the lecture on TensorFlow: we open by defining the cost as w**2 - 10*w + 25, noting that this can be factored into (w - 5)**2, and then noting that the solution when this is set equal to 0 is w = 5. But since this is about minimizing the cost function as part of back prop, shouldn't we instead have taken the derivative first, obtained 2*w - 10, set that equal to zero, and, in this case, obtained w = 5? The two solutions happen to coincide for this particular function, but they wouldn't in the general case.
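(As a quick symbolic sanity check of that reasoning, using SymPy rather than anything from the lecture, both the factoring and the derivative-based solution come out as expected:)

    import sympy as sp

    w = sp.symbols("w")
    cost = w**2 - 10*w + 25

    print(sp.factor(cost))                            # (w - 5)**2
    print(sp.solve(sp.Eq(sp.diff(cost, w), 0), w))    # [5]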
Yes, you are correct.
The example in the lecture is rather artificial just to make certain points.
Yes, it's a good point that normally you would set the derivative to zero and solve in a case like this (minimizing a function of one variable), but it turns out that is not what we actually do when minimizing the cost for neural networks. This example is too simplistic, in that you can see what the minimum is just by examining the cost function itself. In real cases it's not that simple. But we also don't just "set the derivative to zero and solve", because for the real cost functions of real networks that just gives us another equation that we have to solve by an approximation method, so we'd end up doing something like Newton-Raphson on the first derivative.
That turns out to make the problem more complicated, so we end up doing Gradient Descent directly on the cost function. It still uses the derivative (the gradient), but as a stepwise approximation toward a minimum value, which will probably end up being a local minimum.
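To make that contrast concrete, here is a rough sketch in plain Python (not code from the lecture; the helper names, starting point, learning rate, and iteration counts are just illustrative assumptions) applying both ideas to the lecture's cost w**2 - 10*w + 25. Newton-Raphson on the first derivative lands on w = 5 in a single step for a quadratic, while Gradient Descent walks toward the same value:

    def cost_grad(w):
        # First derivative of the cost w**2 - 10*w + 25
        return 2 * w - 10

    def cost_grad2(w):
        # Second derivative (constant for a quadratic)
        return 2.0

    # Newton-Raphson applied to the first derivative, i.e. solving cost_grad(w) = 0
    w = 0.0
    for _ in range(5):
        w = w - cost_grad(w) / cost_grad2(w)
    print("Newton-Raphson on the derivative:", w)   # 5.0

    # Gradient Descent applied directly to the cost
    w = 0.0
    learning_rate = 0.1
    for _ in range(1000):
        w = w - learning_rate * cost_grad(w)
    print("Gradient Descent on the cost:", w)       # ~5.0

For a one-variable quadratic both land in the same place; the difference only starts to matter for the messier cost functions of real networks.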
Even so, that sounds a lot closer to setting the derivative to zero and solving than to what was illustrated, which was setting the original function to zero and solving. Indeed, I did an experiment replicating the lecture example with a different degree-2 polynomial and ran the gradient tape training loop; the result matched the root of the derivative, not either of the two roots of the original polynomial:
import numpy as np
import tensorflow as tf

w = tf.Variable(0, dtype=tf.float32)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

def train_step():
    # Record the forward computation so the tape can differentiate the cost w.r.t. w
    with tf.GradientTape() as tape:
        cost = w ** 2 - 15 * w + 50
    trainable_variables = [w]
    grads = tape.gradient(cost, trainable_variables)
    optimizer.apply_gradients(zip(grads, trainable_variables))

for i in range(1000):
    train_step()

print(w)
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=7.5>
where: diff(w**2 - 15*w + 50, w) = 2*w - 15
and 2*w - 15 = 0 -> w = 15/2 = 7.5
which matches the result above. Setting the original to 0 would give:
w**2 - 15*w + 50 = 0 -> (w - 10) * (w - 5) = 0 -> w = 10 or w = 5
neither of which matches the experimentally derived value.
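That arithmetic can also be double-checked symbolically; here is a small SymPy sketch (SymPy isn't part of the lecture code, this is just a sanity check) confirming both the derivative's root and the polynomial's roots:

    import sympy as sp

    w = sp.symbols("w")
    cost = w**2 - 15*w + 50

    print(sp.solve(sp.Eq(sp.diff(cost, w), 0), w))    # [15/2], where the optimizer converges
    print(sp.solve(sp.Eq(cost, 0), w))                # [5, 10], the roots of the cost itself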
So I still think that the example could have been illustrated better.
I agree.