Gradient descent in NN: Order of updating weights in different layers

During gradient descent for a single layer, all weights W and biases B are updated in parallel.
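
For concreteness, by "in parallel" I mean a simultaneous, vectorized update like this (a NumPy-style sketch with made-up toy values; `alpha`, `dW`, `dB` are just my placeholder names):

```python
import numpy as np

rng = np.random.default_rng(0)
W, B = rng.normal(size=(3, 2)), np.zeros(2)           # toy layer parameters
dW, dB = rng.normal(size=(3, 2)), rng.normal(size=2)  # pretend gradients of J
alpha = 0.01                                          # learning rate

# "In parallel": every entry of W and B moves at once, based on gradients
# evaluated at the current parameter values.
W -= alpha * dW
B -= alpha * dB
```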

Now, if there are 2 (or more) layers, I am wondering whether the same happens for the weights of the different layers as well:

If I update W[2] and B[2], but update W[1] and B[1] at the same time (i.e., based on the state before W[2] and B[2] have been updated), I'm wondering whether the update of W[1] and B[1] couldn't actually increase J, as it optimizes J based on the old values of W[2] and B[2].
After all, the optimal adjustment of the weights of layer 1 depends on the current values of the weights of layer 2.

=> Are the weights W[N-1] and B[N-1] actually updated after W[N] and B[N] have already been updated, rather than in parallel with them?

Hi @Daniel_Breyer, Thanks for your post.

I will try to answer your question based on what I understood from it.

First, you need to know that in a multi-layer neural network, the process of updating the weights W and biases B is sequential. And I guess you already know how the gradient descent process works.

In case you can't remember: in gradient descent, the network optimizes its performance by iteratively adjusting the weights and biases to minimize the chosen loss function.

So when updating the weights and biases of different layers, the updates occur in a specific order, typically starting from the output layer and moving backward through the hidden layers to the input layer. This order ensures that each layer’s updates take into account the most recent adjustments made in the subsequent layers.

In essence, the weights W[N-1] and biases B[N-1] are indeed updated after the weights W[N] and biases B[N] have been updated.

I hope you got it now; feel free to ask for further clarification.
Best Regards,
Jamal

Hi @Jamal022,

Thank you so much for your reply! Just to make sure that I understand your answer:
The weights and biases of different layers are NOT updated in parallel (as the weights within the same layer are), but only after recalculating => so “that each layer’s updates take into account the most recent adjustments made in the subsequent layers.”

If not, I would still wonder about the following:
Imagine that in the very last layer you update the weights via gradient descent, so you have moved J towards a local minimum. But you optimized it based on the values of the weights of the previous layer(s). If you now updated the weights of the previous layer in parallel (as you do with weights within the same layer), J could, in my opinion, theoretically increase. Although I guess that in practice it would still converge most of the time.

By the way, I feel that this topic is not properly covered in the classes (i.e., how exactly the different layers are updated, not just a single one). I get that the underlying math might be a bit too much for some of the target audience, but an optional video would really be awesome.

Hello @Daniel_Breyer!

The backpropagation of a deep neural network is worth two weeks of course material, and it is covered in the Deep Learning Specialization's Course 1, Weeks 3 to 4, but before you consider those, I can give you a quick summary.

You know, a training step includes two smaller steps: forward and backward propagations.

In the forward pass, all layers' outputs and weights are remembered, and of course, the computation goes through the first layer, then the second, and so on.

In the backward pass, all layers' weight gradients are computed in order from the last layer back to the first. Note that, during this whole process, no weights are ever updated. Let me repeat: when we compute the gradients, all of them are based on the layers' outputs and weights that were remembered during the forward pass. I hope this point is clear.

After computing all the gradients, we then apply them to all of the weights, and here the order does not matter; we can update whichever weight we like first. After the update, the network is ready for the next training step!
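
If it helps, here is a minimal NumPy sketch of one full training step for a toy 2-layer network (my own toy example with a squared-error cost; none of the names come from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=(8, 1))   # toy batch
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)             # layer 1 parameters
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)             # layer 2 parameters
lr = 0.1

# --- forward: compute and remember every layer's output ---
a1 = np.tanh(X @ W1 + b1)          # layer 1 output (cached)
a2 = a1 @ W2 + b2                  # layer 2 output, linear (cached)
J = 0.5 * np.mean((a2 - y) ** 2)   # squared-error cost

# --- backward: gradients only, last layer first; NO weight changes here ---
dz2 = (a2 - y) / len(X)                    # dJ/dz2 from the cached a2
dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)     # layer 2 gradients
dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)         # chain rule; uses the ORIGINAL W2
dW1, db1 = X.T @ dz1, dz1.sum(axis=0)      # layer 1 gradients

# --- update: apply all gradients at once; the order among layers is irrelevant ---
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```

Note how the backward section only reads the cached `a1`, `a2` and the original `W2`; nothing is written to the weights until the last two lines, which could be swapped without changing anything.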

Cheers,
Raymond

Hey @Daniel_Breyer, Thanks for your reply.

I hope you got it now after the clarification from @rmwkwok; feel free to ask for more clarification if needed.

Regards,
Jamal

Thank you so much for the explanation!
I guess I should have chosen the Deep Learning Specialization then (I am doing the ML Specialization).

By the way, according to ChatGPT (which of course might be wrong), there are 2 different approaches:

  • The one that I expected / that seemed intuitive to me
  • The one that you explained (which I need to take a closer look at, as it's somewhat less intuitive to me why it's done like that)
  1. Sequential Weight Updates:
    When updating weights sequentially from the output layer to the input layer, the optimization process considers the most recent changes in weights as it moves through the layers. This approach aligns with the natural flow of information in the network, ensuring that updates take into account the outputs of previous layers. This is the more common practice, especially in deep neural network training.

  2. Simultaneous Weight Updates (All at Once):
    Updating all the weights at once after computing gradients can indeed work, and certain automatic differentiation techniques support this method. However, while the order of updates may be flexible, it’s important to ensure that the updates are performed using the most recent information from the forward pass.

Does that make sense, or is ChatGPT just completely off here (I re-asked a couple of times in different ways, but it insisted on this one)?

I have not heard of the first approach. I don't want to just say it is wrong because it is different from the usual approach, but the way I explained it makes sense: in each training step, the output (and consequently the computed loss) is based on the weights remembered during the forward phase; therefore, the gradient should also be based on the same set of weights remembered during the forward phase.

Perhaps ChatGPT should reveal how its first approach works down to the last bit of detail - for example, does it just consider the most recently updated weights, or does it work in the following flow: forward pass → compute a loss → update the weights in the last layer → recompute a new loss based on the new weights → update the weights in the second-to-last layer → recompute a new loss → update the third-to-last layer → … It would be even better if ChatGPT could give some names, so that you can google for papers, explanations, rationales, or anything that helps explain why the approach is useful and in which cases. If you are interested and find anything, please share; I also want to take a look.
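
In code, the flow I just described would have to look roughly like this on a toy 2-layer net (purely my own sketch of the hypothetical sequential approach, to pin down what I mean; it is not how any library I know of trains):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=(8, 1))
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.1

# Step 1: forward pass, then gradient for the LAST layer only; update it right away.
a1 = np.tanh(X @ W1 + b1)
dz2 = ((a1 @ W2 + b2) - y) / len(X)
W2 -= lr * (a1.T @ dz2)
b2 -= lr * dz2.sum(axis=0)

# Step 2: FRESH loss computation, because the predictions (and hence the point
# we stand on) have moved now that W2, b2 changed.
dz2 = ((a1 @ W2 + b2) - y) / len(X)   # a1 is unchanged, since W1, b1 have not moved
dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)    # this now flows through the UPDATED W2
W1 -= lr * (X.T @ dz1)
b1 -= lr * dz1.sum(axis=0)
```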

Cheers,
Raymond


Hmm, it still feels strange to me to update all the weights in parallel. I don't tend to think about which computation is more efficient, as I have a math/theoretical physics background.
But it feels counterintuitive to me to remember all weights and then do all the calculations and updates (no matter in which direction) at once. Why? Because the 'ideal' updates of the weights (to get to a minimum) in different layers depend on each other.
One extreme example:
The loss function could be at a local minimum after updating just the last layer, so if you recalculated everything, the algorithm should not do any further updates at all. BUT if you instead update all the previous layers based on the gradients of the non-updated loss function in the same step, you move away from that minimum again.
=> I'm still not sure about this, as I have to work it out in more detail. The ML Specialization did not really cover that.

Anyway, here is what ChatGPT told me about how this “Sequential Weight Updates” approach (which it could have made up) works in detail:

  1. Forward Pass:
  • Input data is fed into the network, passing through the layers one by one.
  • Each layer computes its output based on the current weights and biases and passes this output to the next layer.
  • The final layer generates predictions or outputs.
  2. Loss Calculation:
  • The predictions from the final layer are compared to the actual target values.
  • A loss function quantifies the difference between the predictions and the actual targets.
  3. Backward Pass and Gradient Calculation:
  • Starting from the last layer and moving towards the input layer, the gradient of the loss function with respect to the layer’s outputs is computed.
  • Using the chain rule of calculus, the gradients are backpropagated through the layers to calculate the gradients of the loss function with respect to the weights and biases of each layer.
  • These gradients represent how much the loss changes when the weights and biases of each layer are adjusted.
  4. Weight Updates:
  • After the gradients are computed for a specific layer, the weights and biases of that layer are updated.
  • An optimization algorithm, such as gradient descent or one of its variants, is used to update the weights and biases based on the calculated gradients.
  • The size of the update is determined by a learning rate or other hyperparameters.
  5. Sequential Update Process:
  • The process repeats for each layer, moving from the output layer to the input layer.
  • The updates for each layer are performed based on the most recent information from the forward pass and the gradients calculated in the backward pass.
  • The sequential update process ensures that the adjustments made to the weights of each layer are informed by the outputs and gradients of the layers that follow it.
  6. Benefits and Considerations:
  • This approach aligns with the natural flow of information through the network.
  • It takes into account the most recent adjustments in later layers when updating earlier layers, which helps in achieving effective convergence.
  • Each layer’s update is influenced by the outputs of subsequent layers, contributing to a coherent optimization process.
  7. Potential Challenges:
  • The sequential update process may require more time to compute and apply updates compared to methods that update all weights simultaneously.
  • There might be additional complexities when dealing with specific neural network architectures or optimization techniques.

It claims the following are the names of some individuals, companies, and entities associated with the sequential weight update approach in neural network training:

Individuals:

  • Geoffrey Hinton
  • Yann LeCun
  • Yoshua Bengio

Companies and Organizations:

  • Google (DeepMind)
  • Facebook AI Research (FAIR)
  • OpenAI

Frameworks and Libraries:

  • TensorFlow
  • PyTorch

And these are supposed to be research papers that discuss the process of training neural networks, including the sequential weight update approach:

  1. “ImageNet Classification with Deep Convolutional Neural Networks” by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.
  • This paper presents the groundbreaking AlexNet architecture, which helped popularize deep convolutional neural networks for image classification.
  2. “Playing Atari with Deep Reinforcement Learning” by Volodymyr Mnih et al.
  • This paper introduces the concept of training deep Q-networks (DQN) using reinforcement learning, showcasing the sequential weight update approach in action.
  3. “Sequence to Sequence Learning with Neural Networks” by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.
  • This paper introduces sequence-to-sequence models using recurrent neural networks (RNNs), demonstrating how sequential weight updates are crucial for handling sequences in tasks like machine translation.
  4. “Attention Is All You Need” by Ashish Vaswani et al.
  • This influential paper introduces the Transformer architecture, which employs an attention mechanism to process sequences in parallel while still benefiting from sequential weight updates.
  5. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” by Jacob Devlin et al.
  • BERT (Bidirectional Encoder Representations from Transformers) demonstrates the importance of sequential weight updates in pre-training transformer-based models for various natural language understanding tasks.
  6. “Deep Residual Learning for Image Recognition” by Kaiming He et al.
  • This paper presents the ResNet architecture, which introduces residual connections to alleviate the vanishing gradient problem during sequential weight updates in very deep networks.
  7. “Generative Adversarial Nets” by Ian Goodfellow et al.
  • This seminal paper introduces Generative Adversarial Networks (GANs), showcasing how sequential weight updates are used in adversarial training to improve the quality of generated data.
  8. “Attention U-Net: Learning Where to Look for the Pancreas” by Ozan Oktay et al.
  • This paper demonstrates the application of attention mechanisms in medical image segmentation, highlighting the combination of sequential and parallel information processing.

I somehow doubt that GPT is completely right here, but I need to get back to the basics first before I can read and understand those papers, I guess :)

Hey @Daniel_Breyer,

In my opinion, it has nothing to do with computation.

The idea of gradient descent is, if you imagine a cost surface, to determine for each dimension which direction to walk in next (and how large a step to take). There are only two directions in each dimension - positive or negative - and the chosen direction will bring us to a lower cost with respect to that dimension.

Do we agree on the above?

If there are only two trainable weights, the cost surface will be 3-dimensional (one dimension for each weight and one for the cost). In a multi-layer neural network with 20,000 trainable weights, the cost surface will have 20,001 dimensions. And the idea of gradient descent does not change - it is still to find, for each dimension, which way to go.

Therefore we are interested in $\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_{19999}}$, because they tell us which way to go for all 20,000 weights.

Now, we focus on $J$. What does $J$ depend on? The label and the prediction, right? So we can write $J(y_{pred}, y_{true})$.

And what do the predictions $y_{pred}$ depend on? The weights and the input $X$, right? So we can write $J(w, X, y_{true})$, where $w$ represents all 20,000 weights.

Therefore, for weight $w_j$, we are interested in $\frac{\partial J(w, X, y_{true})}{\partial w_j}$. It is important to note that I have not said which layer $w_j$ comes from, because that information is totally irrelevant. The maths does not care whether this is a neural network or just an equation (in fact, we can write a neural network down as one very long equation). The concept of layers does not matter here.

Now, I hope you have seen that the gradient we care about is $\frac{\partial J(w, X, y_{true})}{\partial w_j}$, where $w$ is the set of weights before the update. This is also the same $w$ that is remembered in the forward phase.

Our current approach is to determine all gradients before taking any move. This has nothing to do with the concept of layers. We are standing at a point on the cost surface; we look around in all dimensions, determine which way to go in each dimension, and then take the step in all dimensions. That's it!
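
As a toy illustration of "look around in all dimensions, then step in all of them at once" (a 2-weight cost I made up for this purpose):

```python
import numpy as np

def J(w):                        # a made-up cost over two weights
    return (w[0] * w[1] - 1.0) ** 2

def grad_J(w):                   # the two partials dJ/dw0 and dJ/dw1
    r = 2.0 * (w[0] * w[1] - 1.0)
    return np.array([r * w[1], r * w[0]])

w, lr = np.array([2.0, 0.2]), 0.1
g = grad_J(w)                    # look around in ALL dimensions at the current point
w = w - lr * g                   # then take the step in all dimensions at once
print(J(w))                      # lower cost than at the starting point
```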

If you have followed me until now, I hope you have seen that it has nothing to do with layers. Yes, people back-propagate because they can reuse the gradients computed in later layers for earlier layers, but you could give up that advantage. You could compute the gradients from the first layer to the last. As long as you keep the weights unchanged, computing from front to back or from back to front won't make any difference in the resulting values of the gradients.

I hope, by now, you see what we are doing with the gradient descent approach (the parallel approach).

Continuing to the sequential approach, it would be equivalent to this: I stand on the cost surface, I first “look” at the $w_0$ dimension and take a step in just that dimension; then I “look” at the $w_1$ dimension and take another step in just that dimension, and so on and so forth. In this way, we would take 20,000 steps to complete one cycle of updating all the weights. (I know you are going to say no - that's not what you mean, because you would update all the weights of one layer at a time, so it would not be 20,000 steps but L steps, where L is the number of layers. But this is not important, because the idea is pretty similar, and we are not even worrying about computational efficiency - we can assume all computations take no time for now.)

Would the above approach work? It might; I dare not say no, because I have never looked into it. What I mean is, I don't know the good or bad consequences of such an approach. For example, why should I pick $w_0$ to always go first? Would such a choice introduce any bad bias? Is it an informed choice? Moreover, I put the two “look”s above in quotation marks: the mathematics of each act of looking in the $w_j$ dimension is to do a complete forward pass to compute the cost, and then to compute the gradient $\frac{\partial J(w, X, y_{true})}{\partial w_j}$.

Up to now, I have not been analyzing the pros and cons of the two approaches; I am rather laying out the maths steps we would need to take to implement each. Obviously, the parallel approach needs far fewer mathematical/computational steps than the sequential approach to update all of the weights once.
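
Reusing `J`, `grad_J`, and `lr` from the sketch above, the sequential mode would look like this:

```python
# Sequential mode on the same made-up cost: each "look" is a FRESH gradient
# evaluation at the partially-updated point.
w = np.array([2.0, 0.2])
for j in range(len(w)):
    g = grad_J(w)                # full re-evaluation just to step in one dimension
    w[j] -= lr * g[j]            # move in dimension j only, then look again
# Two evaluations to update two weights, versus one evaluation for the
# parallel step; with 20,000 weights that would be 20,000 evaluations.
```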

The two approaches are obviously different, but I have only seen the parallel approach.

Cheers,
Raymond


As for the ChatGPT responses: although I have not read all of the papers it listed in detail, just glancing at the names, I am pretty sure many of them have nothing to do with sequential weight updates at all. ChatGPT is just generating. The point of asking for references here is for us to check them, so you might have a look. Judging from the names of the papers, I don't think any of them is about sequential weight updates itself, but it may not be fair to judge by the names…

ChatGPT is very generative, so I don’t take its words for granted…

:grimacing: I think things are starting to get a bit confused here.

Let's go back to the basics of @Daniel_Breyer's original question. When we calculate gradients, we use backprop starting at the output layer, so that we chain back through the layers. This is because the gradient at any given layer depends on the gradients of the subsequent layers. By starting at the output end, we can chain our calculations backwards, so that we have everything we need at each layer as we work our way back through the layers. I think everyone is in agreement on this. At least @rmwkwok, in his first post, seems to be in agreement when he wrote:

In the backward pass, all layers' weight gradients are computed in order from the last layer back to the first.

I think this is the heart of @Daniel_Breyer's original question - so, yes, at least when it comes to gradients, we calculate the gradients for later layers before those of the layers that precede them.

Next, though, comes the step of updating the weights based on these gradients. This could be done at each layer as an additional step as we work our way backwards through backprop, OR we could do it separately after computing all the gradients.

To me, the choice here is all about computational efficiency. By recognizing that this step can be separated out and done independently, we get the opportunity to take advantage of multiple processors, or of hardware optimizations for matrix math. Of course, the “us” in this case is really the developers of the TensorFlow (or comparable) library that we use, since this part is handled by them under the covers. But it is still helpful to have a basic sense of what's going on.
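
To illustrate, here is a toy sketch (my own made-up example, not TensorFlow's actual internals) showing that the two orderings produce identical results, provided the fused version propagates the error through the original weights before overwriting them:

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.normal(size=(8, 3)), rng.normal(size=(8, 1))
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))   # biases omitted
lr = 0.1

a1 = np.tanh(X @ W1)                     # cached forward values
dz2 = ((a1 @ W2) - y) / len(X)

# (a) deferred: compute every gradient first, then update
dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)
W2_a = W2 - lr * (a1.T @ dz2)
W1_a = W1 - lr * (X.T @ dz1)

# (b) fused into the backward sweep: update W2 as soon as its gradient is
# ready, but only AFTER propagating the error through the ORIGINAL W2
dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)       # propagate first, with the old W2
W2_b = W2 - lr * (a1.T @ dz2)            # now W2 may be updated safely
W1_b = W1 - lr * (X.T @ dz1)

assert np.allclose(W1_a, W1_b) and np.allclose(W2_a, W2_b)   # identical results
```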

Some of @rmwkwok's latest explanation of the parallel approach seems inconsistent with this to me, which could just mean that I'm misunderstanding what he's explaining. I think he's in agreement with what I've written here, based on his earlier post. @rmwkwok?


Hello @Wendy, I agree with you that “backprop” itself is about computational efficiency, because it avoids recomputing already-computed values. However, I think the core of @Daniel_Breyer's latest question is NOT about “why backprop”, but about the difference between their “Sequential Weight Updates” and “Simultaneous Weight Updates”, because the learner said:

But it feels counterintuitive to me to remember all weights and then do all the calculations and updates (no matter in which direction) at once.

I believe the learner is questioning why doing it in the “Simultaneous Weight Updates” mode should be the intuitive choice.

And I think the most significant difference between the two modes is not about computational efficiency.

Let me give a quick summary of what I believe the two modes are:

  1. “Simultaneous Weight Updates”:
    a. Compute all the gradients first. During the computation of the gradients, no weights are updated.
    b. After the gradients are ready, update all of the weights with the gradients.

  2. “Sequential Weight Updates”:
    a. Compute the gradients in the last layer first, then update the weights of the last layer.
    b. Compute the gradients in the second-to-last layer, and when that computation needs weights from the last layer (because, by the chain rule, an earlier layer's gradients involve the later layer's weights), use the updated version of those weights. Then update the weights of the second-to-last layer.
    c. Repeat for the rest of the layers, such that each layer's gradients are computed with the updated versions of the later layers' weights.

Here, the main difference is whether or not we update any weights during the computation of the gradient values.

In the gradient descent that we have learnt, the whole story is “Simultaneous Weight Updates” + “backprop” + “computational efficiency”. Together they form our strategy.

Now, I believe @Daniel_Breyer is questioning the first part of the story, and the learner does not question the fact that remembering computed values to avoid re-computing them will increase computational efficiency, because the learner said:

I don't tend to think about which computation is more efficient, as I have a math/theoretical physics background.

The learner felt it was counterintuitive to go for the “Simultaneous Weight Updates” mode. The learner was looking for a way to see why we want to do it “at once” (the “Simultaneous Weight Updates” mode) instead of letting the updates depend on each other (the “Sequential Weight Updates” mode).

Therefore, I also don't want to focus on the second and third parts of the story, which emphasize the benefit of computational efficiency.

We could beat down the “Sequential Weight Updates” mode with computational efficiency. However, this is not how I want to do it. The learner doesn't think it is right to do it with “Simultaneous Weight Updates” in the first place, so the focus should be on the two modes themselves.

That's why I delineated it into three parts, “Simultaneous Weight Updates” + “backprop” + “computational efficiency”, and chose to focus on the first part.

Of course, computational efficiency must be considered, and the second and third parts of the story are important, but I wanted to go into those parts after the learner accepts the first part: “Simultaneous Weight Updates”.

However, @Daniel_Breyer can tell whether my focus was their focus :wink: :wink:

@Daniel_Breyer, since you don't tend to think about the efficiency of computation (as you said in the first paragraph of your last reply), I jumped in there with you too, and hoped to show you two pictures (through the analogy of the cost surface) of how we could do the two approaches, without commenting on which is better. My intention is that, at the least, you find both approaches to make sense, not that one is always better than the other. After you find that they both make sense, then I think we cannot avoid thinking about computational efficiency :wink:

If you think my focus was your focus but you couldn't follow my response, please let me know. If you could follow it and accept both the “parallel” and the “sequential” modes, please also let me know, and then we can talk about computational efficiency.

Cheers,
Raymond


Hello @Wendy,

This is from the learner's first post:

=> Are the weights W[N-1] and B[N-1] actually updated after W[N] and B[N] have already been updated, rather than in parallel with them?

Backprop calculates the gradients from the last layer back to the first layer, but it ONLY computes the gradients during that sweep. ONLY after all the gradients are computed do we finally update the weights. This is the learner's “Simultaneous Weight Updates”, which is also how we do it.

Gradients are computed in order from the last layer to the first. The weight update doesn't care about the order.

The learner didn't reject the idea of computing the gradients starting from the last layer. From the question, we can see that the learner wanted to compute the gradients in the last layer first, BUT then wanted to update the weights in the last layer before computing the gradients in the second-to-last layer and updating the weights there. This is the learner's “Sequential Weight Updates”.

That's why I think that both the learner and we agree with backprop, and that the learner and I chose not to think about “computational efficiency” for now; we focus on Simultaneous Weight Updates vs. Sequential Weight Updates, whose main difference is the timing of updating the weights.

I was trying to jump into the way the learner was thinking, hoping I could walk out from there together with the learner. :wink: @Daniel_Breyer, I hope I didn't misunderstand you, but I hope to walk out from there with you, and we will reach this, as you said from the beginning:

To me, the choice here is all about computational efficiency.


Although " Sequential Weight Updates" isn’t what we do everyday, it was not completely non-sense. It is just that the learner’s description to it has missed some critical steps, and those are the steps that will, perhaps, persuade the learner that Simultaneous Weight Updates is better.


Thank you so much! This is very, very helpful. I was confused about potential interdependencies of the weights in different layers (which led me to believe the weights shouldn't be updated in parallel), and your detailed explanation helped me realize my misconception. Now that I am looking at what you laid out, I'm not even sure anymore where my confusion came from… In any case, this helped me understand the concept much better!
Thx a lot and cheers from Berlin

You are welcome, @Daniel_Breyer. We could have cleared things up more efficiently if we had been sharing a blackboard :writing_hand::writing_hand:. Thank you for reading my long replies.

Cheers,
Raymond
