Interestingly, there is a recent paper by Geoffrey Hinton, one of the godfathers of Machine Learning, called "The Forward-Forward Algorithm", in which he gets rid of backpropagation entirely. @rmwkwok, one of our mentors, ventured into this new model and shared his experience in this thread in a very clear and generous way. It's important to clarify that this material is not covered in any of the specializations, at least not yet.

Now, regarding your idea of doing backprop from left to right "as we go", here is my understanding of why this would probably not be possible:

Backprop is based on the derivatives of each mathematical operation that happens inside each layer. To calculate the derivative of the loss with respect to any intermediate variable, you need the "local derivative" at that variable times the derivative of the loss with respect to the next variable in the computational graph. This is the chain rule.

For example, we have this computational graph:

```
c = a * b
e = c + d
h = e / f
```

Say the goal is to compute dh/da. For this we need at a minimum the following "local" derivatives: dh/de, de/dc, dc/da.

Fast-forwarding: after applying the chain rule once you get dh/dc = dh/de * de/dc, and one final application of the chain rule, dh/da = dh/dc * dc/da, gives the result.

But as shown above, to get to dh/dc you went through a series of intermediate operations and chain-rule applications: you first had to find dh/de and de/dc.
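As a quick illustration (a minimal sketch with made-up input values, just to make the chain rule concrete), here is the graph above worked out numerically in Python:

```python
# Tiny computational graph: c = a*b, e = c + d, h = e / f
a, b, d, f = 2.0, 3.0, 4.0, 5.0

# Forward pass
c = a * b      # 6.0
e = c + d      # 10.0
h = e / f      # 2.0

# Local derivative of each operation
dh_de = 1.0 / f    # h = e / f  ->  dh/de = 1/f
de_dc = 1.0        # e = c + d  ->  de/dc = 1
dc_da = b          # c = a * b  ->  dc/da = b

# Chain rule, applied from right to left
dh_dc = dh_de * de_dc      # dh/dc
dh_da = dh_dc * dc_da      # dh/da = b/f = 0.6
```

Notice that `dh_dc` must exist before `dh_da` can be computed, which is exactly the right-to-left dependency described above.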

Now let's see this in a neural network.

Say for instance you have this architecture:

Input

Layer 1

Batch Norm

TanH Activation

Layer 2

Loss function (say Cross-Entropy loss)

Each one of these layers is composed of several mathematical expressions that have to be considered in the derivatives used to update the parameters.

To calculate the derivatives of the operations in Layer 1, we will need information from BatchNorm.

To calculate the derivatives of the operations in BatchNorm, we will need information from TanH activation.

To calculate the derivatives of TanH activation, we will need information from Layer 2.

To calculate the derivatives of Layer 2, we will need information from the Loss function.

Based on this, we need to wait until the last operation (the loss) before we can start going backwards: only then can we calculate each local derivative and chain it with the gradient that has been accumulated from the loss down to the operation just after the current one.
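To make the ordering concrete, here is a minimal NumPy sketch of an architecture like the one above (my own simplified version: I dropped Batch Norm and used a sigmoid with binary cross-entropy to keep it short; all shapes and values are made up). Every backward step consumes a gradient produced by the step to its right, so none of them can run until the loss has been evaluated:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 examples, 3 features
y = rng.integers(0, 2, size=4)       # binary targets

W1 = rng.normal(size=(3, 5))         # Layer 1 weights
W2 = rng.normal(size=(5, 1))         # Layer 2 weights

# Forward pass (left to right)
z1 = x @ W1                          # Layer 1
a1 = np.tanh(z1)                     # TanH activation
z2 = a1 @ W2                         # Layer 2
p = 1 / (1 + np.exp(-z2))            # sigmoid, for cross-entropy loss

# Backward pass (right to left): each line needs a quantity
# computed by the line before it, starting from the loss.
dz2 = (p - y.reshape(-1, 1)) / len(y)   # dLoss/dz2: the loss must come first
dW2 = a1.T @ dz2                        # needs dz2
da1 = dz2 @ W2.T                        # needs dz2
dz1 = da1 * (1 - a1**2)                 # tanh derivative: needs da1
dW1 = x.T @ dz1                         # needs dz1
```

Reading the backward pass top to bottom, there is no point at which `dW1` could have been computed "as we go" during the forward pass, because it depends on `dz1`, which depends on `da1`, which depends on the loss gradient `dz2`.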

What do you think?