Hey @Daniel_Breyer,
In my opinion, it has nothing to do with computation.
The idea of gradient descent is, if you imagine a cost surface, to determine, for each dimension, which direction to walk next (and how large a step to take). There are only two directions in each dimension - positive or negative - and the chosen direction is the one that brings us to a lower cost with respect to that dimension.
Do we agree on the above?
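To put that in symbols (with \alpha standing for some learning rate, which I am only adding here for illustration), the move we make in a single dimension w is \Delta w = -\alpha \frac{\partial{J}}{\partial{w}}. The sign of \frac{\partial{J}}{\partial{w}} picks one of the two directions, and its magnitude (scaled by \alpha) decides how large the step is.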
If there are only two trainable weights, the cost surface will be 3-dimensional (one dimension for each weight and one dimension for the cost). In a multi-layer neural network with 20,000 trainable weights, the cost surface will be 20,001-dimensional. And the idea of gradient descent does not change - it is still to find, for each dimension, which way to go.
Therefore we are interested in \frac{\partial{J}}{\partial{w_0}}, \frac{\partial{J}}{\partial{w_1}}, ... , \frac{\partial{J}}{\partial{w_{19999}}}, because they tell us which way to go for all 20,000 weights.
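Once we have all of those partial derivatives, one cycle of gradient descent is simply w_j \leftarrow w_j - \alpha \frac{\partial{J}}{\partial{w_j}} for every j from 0 to 19999, applied at the same time (again, \alpha is just some learning rate I am adding for illustration).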
Now, we focus on J. What does J depend on? The prediction and the label. Right? So, we can write J(y_{pred}, y_{true}).
And what do the predictions y_{pred} depend on? The weights and the input X. Right? So, we can write J(w, X, y_{true}), where w represents all 20,000 weights.
Therefore, for weight w_j, we are interested in knowing \frac{\partial{J(w, X, y_{true})}}{\partial{w_j}}. It is important to note that I have not said which layer w_j comes from, because that information is totally irrelevant. The maths does not care whether this is a neural network or just an equation (in fact, we can write a neural network down as one very long equation). The concept of layers does not matter here.
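To make the "one very long equation" point concrete, here is a tiny toy sketch (my own made-up example and names, nothing from any assignment; I also estimate the gradient numerically with finite differences just to keep it short - back-propagation would give essentially the same values). A 2-layer network is written as a single expression, and \frac{\partial{J}}{\partial{w_j}} is computed for any chosen w_j without ever asking which layer it belongs to:

```python
import numpy as np

# toy data: 4 samples, 3 features, binary labels (made-up numbers)
X = np.random.randn(4, 3)
y_true = np.array([0., 1., 1., 0.])

def forward_and_cost(w, X, y_true):
    """The whole network as 'one long equation': J(w, X, y_true).
    w is one flat vector holding ALL 8 trainable weights - layers play no role here."""
    W1 = w[:6].reshape(3, 2)              # weights that happen to sit in layer 1
    W2 = w[6:].reshape(2, 1)              # weights that happen to sit in layer 2
    z = np.tanh(X @ W1) @ W2              # forward pass in one line
    y_pred = 1.0 / (1.0 + np.exp(-z.ravel()))
    return np.mean((y_pred - y_true) ** 2)   # a simple squared-error cost

def grad_wrt_wj(w, j, eps=1e-6):
    """Finite-difference estimate of dJ/dw_j - works for ANY j, layer is irrelevant."""
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    return (forward_and_cost(w_plus, X, y_true)
            - forward_and_cost(w_minus, X, y_true)) / (2 * eps)

w = np.random.randn(8)        # all 8 weights in one vector
print(grad_wrt_wj(w, 0))      # a weight from the first layer
print(grad_wrt_wj(w, 7))      # a weight from the second layer - same treatment
```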
Now, I hope you have seen that the gradient we care about is \frac{\partial{J(w, X, y_{true})}}{\partial{w_j}}, where w is the set of weights before the update. This is also the same w that is remembered from the forward pass.
Our current approach is to determine all the gradients before making any move. This has nothing to do with the concept of layers. We stand on a point of the cost surface, look around in all dimensions, determine which way to go for each dimension, and then take the step in all dimensions at once. That’s it!
If you have followed me up to this point, I hope you can see that it has nothing to do with layers. Yes, people back-propagate because they can reuse the gradients computed in a later layer for an earlier layer, but you can give up that advantage. You can compute the gradients from the first layer towards the back. As long as you keep the weights unchanged, computing from front to back or back to front won’t make any difference to the resulting values of the gradients.
I hope, by now, you see what we are doing with the gradient descent approach (the parallel approach).
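In code, the parallel approach is just this kind of sketch (reusing the toy grad_wrt_wj from above, with lr as some made-up learning rate): all gradients are computed while w stays frozen, in whatever order we like, and only then do we take the step in every dimension.

```python
def one_parallel_update(w, lr=0.1):
    """Determine ALL gradients with w unchanged, then move in every dimension at once."""
    grads = np.zeros(len(w))
    for j in range(len(w)):           # front to back; reversed(range(len(w))) gives the same values
        grads[j] = grad_wrt_wj(w, j)  # w is never modified inside this loop
    return w - lr * grads             # the single step, taken in all dimensions together
```

(Nobody would really estimate the gradients numerically like this - back-propagation gives the same numbers far more cheaply - but it shows that the order in which the gradients are computed does not matter.)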
Moving on to the sequential approach, it is equivalent to this: I stand on the cost surface, I first **look** at the w_0 dimension and take a step in just that dimension, then I **look** at the w_1 dimension and take another step in just that dimension, and so on and so forth. In this way, we are going to take 20,000 steps to complete one cycle of updating all the weights. (I know you are going to say, no - I don’t mean that, because I am updating all the weights in one layer at a time, so it won’t be 20,000 steps but L steps, where L is the number of layers. This is not important, though, because the idea is pretty similar, and we are not even worrying about computational efficiency - we can assume all computations take no time for now.)
Would the above approach work? It might - I dare not say no, because I have never looked into it. What I mean is, I don’t know what the good or bad consequences of such an approach would be. For example, why should w_0 always go first? Would such a choice introduce any bad bias? Is it an informed choice? Moreover, I used bold type for the two **look**s above. The mathematics of each act of looking in the w_j dimension is to do a complete forward pass to compute the cost, and then compute the gradient \frac{\partial{J(w, X, y_{true})}}{\partial{w_j}}.
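In the same toy setting (continuing my own sketch, reusing grad_wrt_wj and the made-up lr from above), the sequential approach would be this: each **look** is a fresh run of forward_and_cost with the weights as they currently are, and the step in that one dimension is taken immediately, before looking at the next one.

```python
def one_sequential_cycle(w, lr=0.1):
    """Update one dimension at a time; each later gradient sees the already-updated earlier weights."""
    w = w.copy()
    for j in range(len(w)):      # why should w_0 always go first? an arbitrary choice, as noted above
        g_j = grad_wrt_wj(w, j)  # 'looking' in the w_j dimension: re-runs the forward computation with the current w
        w[j] -= lr * g_j         # step in just that dimension, then move on to w_{j+1}
    return w
```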
Up to now, I have not been analyzing the pros and cons of the two approaches. I am more laying out the maths steps we would need to take to implement both. Obviously, the parallel approach needs far fewer maths/computational steps than the sequential approach to update all of the weights once: the sequential approach re-runs the forward pass for every single weight, so one full cycle costs on the order of 20,000 forward passes, instead of one forward pass and one backward pass.
The two approaches are obviously different, but I have only seen the parallel approach.
Cheers,
Raymond