My intuition is that GD would converge faster if you update the parameters one by one, because the second update would be more “accurate” if it took the first parameter update into account: the first update slightly changes the shape of the cost curve with respect to the second parameter.
Of course, it’s computationally efficient to calculate the whole gradient at once and update all parameters simultaneously, but there’s no fundamental reason for doing so, am I right?
Hey @xl0,
Welcome to the community. Computational efficiency is indeed one of the major reasons why simultaneous updates are preferred over non-simultaneous updates. But if we keep that aside, it’s hard to judge which one is better and why.
I guess I agree with your point of view and think the same. However, I don’t have any conclusive proof of this intuition. Even the Internet seems to be divided on this topic. Have a look at some of these threads.
Most of them raise the point of computational efficiency, but some of them make other interesting points as well. For instance:
Both simultaneous and non-simultaneous updates are common for various iterative solvers. For example, look up Jacobi and Gauss-Seidel on wikipedia; the former is simultaneous and the latter is not. They exhibit both different convergence properties (i.e., there are cases where one converges and the other does not) and different rates of convergence.
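To make that contrast concrete, here is a tiny illustrative sketch (my own toy example, not from those threads): solving a small linear system Ax = b, where Jacobi updates every component from the previous iterate (the “simultaneous” flavour), while Gauss-Seidel overwrites components in place so that later components immediately see the fresh values (the “non-simultaneous” flavour):

```python
import numpy as np

# Toy diagonally dominant system Ax = b (both methods converge for such A)
A = np.array([[4.0, 1.0],
              [2.0, 5.0]])
b = np.array([1.0, 2.0])

def jacobi_step(x):
    # "Simultaneous": every component is computed from the previous iterate x
    x_new = np.empty_like(x)
    for i in range(len(b)):
        s = A[i] @ x - A[i, i] * x[i]      # sum of off-diagonal terms
        x_new[i] = (b[i] - s) / A[i, i]
    return x_new

def gauss_seidel_step(x):
    # "Non-simultaneous": components are overwritten in place, so later
    # components already use the freshly updated earlier ones
    x = x.copy()
    for i in range(len(b)):
        s = A[i] @ x - A[i, i] * x[i]
        x[i] = (b[i] - s) / A[i, i]
    return x

x_j = np.zeros(2)
x_gs = np.zeros(2)
for _ in range(20):
    x_j = jacobi_step(x_j)
    x_gs = gauss_seidel_step(x_gs)

print("Jacobi:      ", x_j)
print("Gauss-Seidel:", x_gs)
print("Exact:       ", np.linalg.solve(A, b))
```

For this particular A both iterations converge, with Gauss-Seidel typically needing fewer steps, but as the quote says, there are matrices where one converges and the other does not.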
Let me see if I can come up with a quick experiment to validate this hypothesis. Till then, happy learning.
Hello @xl0 and @Elemento, this is a very interesting discussion. My two cents, and definitely not an answer but merely some inputs from me:
In short, I think updating all parameters at the same time is the default when we have no knowledge about the parameter space except for the first order gradients.
Let’s look at the following example of a complicated parameter space:
Let’s say we are at the initial position (blue circle). If we knew every detail about the whole parameter space, we would probably want to follow the blue dashed-line path: update only w1 first, then update only w2 until we get into valley 3; after that it doesn’t matter whether we update both w1 and w2 or only w2.
If we update both weights at the same time, we might end up in valley 1 instead.
(I should point out that if we had known the full picture of the space in the first place, we wouldn’t need gradient descent at all, because we would already know where the minimum is.)
So I believe the parameter-updating strategy depends on how well we know the parameter space. For example, I would ask myself: do we also calculate higher-order gradients to get more information about the nearby curvature, and is there a way to make use of those higher-order gradients to drive the strategy?
For the gradient descent we have learned (and that is popularly used), we only calculate the first-order gradient, which only tells us about the space right next to our current location, not far away.
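For instance, Newton’s method is one classic way to use that second-order (curvature) information. A tiny toy sketch, just for illustration, of how a single update differs on a quadratic cost whose curvature is very uneven across the two parameters:

```python
import numpy as np

# Toy quadratic cost J(w) = 0.5 * w^T H w with very uneven curvature
H = np.array([[10.0, 0.0],
              [0.0, 1.0]])   # curvature is 10x stronger along w1 than along w2

def grad(w):
    return H @ w             # first-order information only

w0 = np.array([1.0, 1.0])
lr = 0.09                    # has to stay small because of the steep w1 direction

w_gd = w0 - lr * grad(w0)                      # plain gradient descent step
w_newton = w0 - np.linalg.solve(H, grad(w0))   # Newton step: gradient rescaled by curvature

print("GD step:    ", w_gd)      # [0.1, 0.91] -> very uneven progress along the two axes
print("Newton step:", w_newton)  # [0.0, 0.0]  -> lands on the minimum in one step here
```

Of course, computing and inverting a Hessian is exactly the kind of extra cost that makes this impractical for models with many parameters.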
P.S. I have never read a research paper on this topic, so my knowledge here is really very limited.
Hey @rmwkwok,
The following point that you have made seems interesting
I am assuming this means that if we have some knowledge of the parameter space (which we generally don’t have), we can give non-simultaneous updates a try by somehow exploiting the knowledge of higher-order derivatives (provided we don’t care about the sequential computation).
This raises another query in my mind. Do you think that the knowledge of the parameter space would affect the sequential order in which we should update the parameters, for instance, updating w2, then b, then w3 or some order of similar sorts?
By the way @xl0, I did try an experiment, which you can check out in this kernel (Version 4). Based on @rmwkwok's and @shanup's feedback, I have updated this kernel.
Let's go back to what Gradient Descent does:
At every point on the J vs. (w, b) curve, we are trying to find the steepest descent. So, wherever we are CURRENTLY on the cost curve, there is a SPECIFIC VALUE of (w, b), and from that EXACT POINT we are in the quest for the steepest descent path. This process is repeated till we reach a minimum.
If the intention is not to violate the mathematical intent of what we are doing, then both parameters have to be updated simultaneously. Why? Because we started from a specific point and found the steepest descent FROM THAT POINT. Hence, both values have to be updated from the specific point at which the gradient was taken. This preserves the mathematical integrity of what we are doing.
Now comes the 2nd part: what happens if we don't update together? My answer to that is: we really can't be sure, because there is a certain blind side to our method. However, our saving grace is that, even in the non-simultaneous case, we update the value of w and then go back and find the path of steepest descent again before updating the value of b. So even in this case we are still looking for an optimal path, just with a little more recklessness (for lack of a better word) in how we go about it. As @Elemento has shown in an example, this kind of recklessness might even help us converge faster, but in certain other cases, depending on the shape of the contours, we might get knocked around as well.
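To put the two schemes side by side in code, here is a minimal sketch (dJ_dw and dJ_db are placeholders for whatever routine computes the partial derivatives of the cost):

```python
def simultaneous_step(w, b, lr, dJ_dw, dJ_db):
    # Both gradients are evaluated at the SAME point (w, b),
    # then both parameters are moved away from that point.
    gw = dJ_dw(w, b)
    gb = dJ_db(w, b)
    return w - lr * gw, b - lr * gb

def sequential_step(w, b, lr, dJ_dw, dJ_db):
    # w is moved first; the gradient for b is then re-evaluated at the
    # already-updated w (the "non-simultaneous" scheme discussed above).
    w = w - lr * dJ_dw(w, b)
    b = b - lr * dJ_db(w, b)
    return w, b

# Toy cost J(w, b) = w**2 + 4*b**2 + w*b; the cross term makes the two schemes differ
dJ_dw = lambda w, b: 2 * w + b
dJ_db = lambda w, b: w + 8 * b
print(simultaneous_step(1.0, 1.0, 0.1, dJ_dw, dJ_db))  # ~(0.7, 0.1)
print(sequential_step(1.0, 1.0, 0.1, dJ_dw, dJ_db))    # ~(0.7, 0.13): b sees the updated w
```

Whether that difference helps or hurts depends on the shape of the contours, as noted above.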
```python
import numpy as np

# X is the (m, 2) feature matrix defined earlier in the kernel

# Number of examples
m = X.shape[0]

# Initial values
init_w = np.random.rand(2, 1)
init_b = [0]

# Number of iterations
num_iters = 30000

# Learning rate
lr = 1e-2

# Feature scaling (standardization)
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X = (X - X_mean) / X_std
```
Add the following line as the last line of both code cell #6 and code cell #8:
```python
print((np.squeeze(w) / X_std), b - (np.squeeze(w) * X_mean / X_std).sum())
```
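For reference, this rescaling just comes from writing the model fitted on standardized features back in terms of the original features (a quick sketch, with $\mu$ = X_mean and $\sigma$ = X_std):

$$
\hat{y} = \sum_j w_j \frac{x_j - \mu_j}{\sigma_j} + b
        = \sum_j \frac{w_j}{\sigma_j}\, x_j + \Big(b - \sum_j \frac{w_j \mu_j}{\sigma_j}\Big)
$$

so the weights in the original feature space are `w / X_std` and the intercept is `b - (w * X_mean / X_std).sum()`, which is exactly what the print line reports.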
In code cell #7, modify the following 2 lines into:
```python
dj_dw_0 = (1/m) * np.sum(X[:, [0]] * err)
dj_dw_1 = (1/m) * np.sum(X[:, [1]] * err)
```
After the change, both the simultaneous and the non-simultaneous approach reach good parameter values in basically the same number of iterations. Neither appears faster, but the simultaneous approach is still better because it requires less computation.
I think the graph I shared in my first reply speaks to this: if we had known beforehand that the space looked like that, we would have wanted to update only w1 at first, to avoid being trapped in valley #1 (which is where the red path of updating both weights simultaneously ends up).
However, this is easier said than done algorithmically, and getting that full picture of the space could be very computationally expensive.
Hey @rmwkwok,
Thanks a lot for pointing out the errors that I made in this kernel.
These indeed escaped my attention completely. Leaving out scaling was intentional, though: performing scaling alters the parameters from their ideal values, so I thought I would leave it out in order to see how far the optimized values end up from their ideal values.
It never occurred to me that we could simply rescale them back to recover the true estimates. A simple yet brilliant thing. And while we are on that note, do you mind telling me why the expression below works? The rescaling of w is pretty trivial, but I can't get my head around the rescaling of b.
Really great discussion! I didn't read the Stack Exchange and Reddit links, but I think the high-level point is the cost of computation. Keep in mind that in real ML/DL applications we're frequently talking about millions of parameters and millions of training samples, so if you add what is effectively another "inner loop" over individual parameters for the updates, that's just going to take forever.

The one very important technique worth mentioning in this context is "minibatch gradient descent". Does Prof Ng discuss that concept here in MLS? In DLS, he doesn't get to it until DLS Course 2. The idea is that, with millions of training samples, plain Gradient Descent averages the gradients over all the samples. Minibatch GD speeds up the parameter updates by adding an inner loop over relatively small subsets of the training samples (say 32 or 64 samples), computing gradients averaged over that much smaller subset, and applying the updates. Rinse and repeat, marching through the whole training set one "minibatch" at a time.
The normal terminology here is that one “epoch” of training consists of one complete pass through the entire training set. With Minibatch GD, each such “epoch” does many updates of all the parameters: one for each minibatch subset of training samples. In most cases that results in needing fewer total “epochs” of training in order to get good convergence.
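A rough sketch of that loop structure (purely illustrative; compute_gradients and apply_updates stand in for the model-specific pieces):

```python
import numpy as np

def minibatch_gd(X, y, params, compute_gradients, apply_updates,
                 lr=0.01, batch_size=32, num_epochs=10):
    """Illustrative minibatch loop: one epoch is one full pass over the data,
    with one parameter update per minibatch rather than one per epoch."""
    m = X.shape[0]
    for epoch in range(num_epochs):
        perm = np.random.permutation(m)            # shuffle the samples each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]   # indices of the next minibatch
            grads = compute_gradients(X[idx], y[idx], params)  # averaged over the minibatch
            params = apply_updates(params, grads, lr)          # simultaneous update of all params
    return params
```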
Thank you for bringing up minibatch, and that certainly is a great technique for training faster! So far only Course 1 and Course 2 have been released and they do not cover minibatch gradient descent, so we will wait and see what Course 3 will talk about!
P.S. I have also referred a learner here to DLS for more interesting topics!
Thanks a lot @rmwkwok for such an amazing explanation. In fact, I read the complete explanation on your other post as well.
As for mini-batch GD, Prof Andrew hasn’t mentioned it in any of the lecture videos of either C1 or C2, but a note about batches and epochs has been mentioned in one of the C2 assignments.
This is quite an interesting discussion. I would like to add some points from a multi-task perspective. In multi-task learning, you try to learn different tasks at the same time.
Suppose that you have two different tasks. You can choose to update the parameters for each task separately, or try to update them for both tasks at the same time.
If you try to update for both tasks at the same time, you may suffer from something called conflicting gradients (which is why you need to group your tasks carefully).
But if you update them separately, then for the first task the gradients might flow in a direction that helps the performance of the first task but degrades the performance of the others. This is sometimes also called negative transfer.
Drawing an analogy, I would say that updating the weights with respect to one parameter at a time might be helpful for that parameter, but the overall performance might degrade.
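As a toy illustration of what "conflicting gradients" means (made-up numbers): two task gradients conflict when their dot product is negative, so a joint step that helps one task tends to hurt the other:

```python
import numpy as np

# Made-up per-task gradients with respect to the same shared parameters
g_task1 = np.array([ 1.0,  2.0, -0.5])
g_task2 = np.array([-1.2, -0.5,  0.9])

dot = g_task1 @ g_task2
cos_sim = dot / (np.linalg.norm(g_task1) * np.linalg.norm(g_task2))
print("dot product:", dot, " cosine similarity:", round(cos_sim, 2))

if dot < 0:
    print("Conflicting gradients: a simultaneous step helps one task and hurts the other.")
else:
    print("Gradients roughly agree: a simultaneous update can help both tasks.")
```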