Regarding the lecture on TensorFlow: we open by defining the cost as w**2 - 10*w + 25, noting that this can be factored into (w - 5)**2, and then noting that the solution when this is set equal to 0 is w = 5. But since this is about minimizing the cost function as part of back prop, shouldn't we instead have taken the derivative first, obtained 2*w - 10, set that equal to zero, and, in this case, obtained w = 5? The two solutions happen to coincide for this particular function, but they wouldn't in the general case.
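(As a quick symbolic sanity check of that reasoning, using SymPy rather than anything from the lecture, both the factoring and the derivative-based solution come out as expected:)

    import sympy as sp

    w = sp.symbols("w")
    cost = w**2 - 10*w + 25

    print(sp.factor(cost))                            # (w - 5)**2
    print(sp.solve(sp.Eq(sp.diff(cost, w), 0), w))    # [5]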
Yes, you are correct.
The example in the lecture is rather artificial just to make certain points.
Yes, it's a good point that normally you would set the derivative to zero and solve in a case like this (minimizing a function of one variable), but it turns out that is not what we actually do when minimizing the cost for neural networks. This example is too simplistic, in that you can see what the minimum is just by examining the cost function itself. In real cases it's not that simple. But we also don't just "set the derivative to zero and solve", because for the real cost functions of real networks that just gives us another equation that we have to solve by an approximation method, so we'd end up doing something like Newton-Raphson on the first derivative.
That turns out to make the problem more complicated, so we end up doing Gradient Descent directly on the cost function. It still uses the derivative (the gradient), but as a stepwise approximation toward a minimum value, which will probably end up being a local minimum.
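To make that contrast concrete, here is a rough sketch in plain Python (not code from the lecture; the helper names, starting point, learning rate, and iteration counts are just illustrative assumptions) applying both ideas to the lecture's cost w**2 - 10*w + 25. Newton-Raphson on the first derivative lands on w = 5 in a single step for a quadratic, while Gradient Descent walks toward the same value:

    def cost_grad(w):
        # First derivative of the cost w**2 - 10*w + 25
        return 2 * w - 10

    def cost_grad2(w):
        # Second derivative (constant for a quadratic)
        return 2.0

    # Newton-Raphson applied to the first derivative, i.e. solving cost_grad(w) = 0
    w = 0.0
    for _ in range(5):
        w = w - cost_grad(w) / cost_grad2(w)
    print("Newton-Raphson on the derivative:", w)   # 5.0

    # Gradient Descent applied directly to the cost
    w = 0.0
    learning_rate = 0.1
    for _ in range(1000):
        w = w - learning_rate * cost_grad(w)
    print("Gradient Descent on the cost:", w)       # ~5.0

For a one-variable quadratic both land in the same place; the difference only starts to matter for the messier cost functions of real networks.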
Even so, that sounds a lot closer to setting the derivative to zero and solving than to what was illustrated, which was setting the original function to zero and solving. Indeed, I did an experiment replicating the lecture example with a different degree-2 polynomial and ran the gradient tape training loop; the result matched the root of the derivative, not either of the two roots of the original polynomial:
import numpy as np
import tensorflow as tf

w = tf.Variable(0, dtype=tf.float32)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

def train_step():
    # Record the forward computation so the tape can differentiate the cost w.r.t. w
    with tf.GradientTape() as tape:
        cost = w ** 2 - 15 * w + 50
    trainable_variables = [w]
    grads = tape.gradient(cost, trainable_variables)
    optimizer.apply_gradients(zip(grads, trainable_variables))

for i in range(1000):
    train_step()

print(w)
<tf.Variable 'Variable:0' shape=() dtype=float32, numpy=7.5>
where: diff(w**2 - 15*w + 50, w) = 2*w - 15
and 2*w - 15 = 0 -> w = 15/2 = 7.5
which matches the result above. Setting the original to 0 would give:
w**2 - 15*w + 50 = 0 -> (w - 10) * (w - 5) = 0 -> w = 10 or w = 5
neither of which matches the experimentally derived value.
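That arithmetic can also be double-checked symbolically; here is a small SymPy sketch (SymPy isn't part of the lecture code, this is just a sanity check) confirming both the derivative's root and the polynomial's roots:

    import sympy as sp

    w = sp.symbols("w")
    cost = w**2 - 15*w + 50

    print(sp.solve(sp.Eq(sp.diff(cost, w), 0), w))    # [15/2], where the optimizer converges
    print(sp.solve(sp.Eq(cost, 0), w))                # [5, 10], the roots of the cost itself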
So I still think that the example could have been illustrated better.
I agree.