Why only use backprop to adjust parameters?

So backpropagation runs in the opposite direction of the forward pass, and the reason I found is the chain rule.

But couldn't we also do it from left to right, like the forward pass? We could calculate the gradients of the parameters in the first hidden layer, then substitute that value into the chain rule for the second layer, and so forth.

Why only backprop?

Nice question @tbhaxor

I am not an MLS mentor and I don’t know the exact example that was used, but with backpropagation it’s much easier to spread the error (which we have already calculated at the output) to each of the weights, leading to a subsequent update of those weights; the feedforward and backpropagation operations then alternate on every training step…


Could you explain? It is not clear to me.

Interestingly, there is a recent paper by Geoffrey Hinton, one of the godfathers of Machine Learning, called “The Forward-Forward Algorithm”, in which he gets rid of backpropagation entirely. @rmwkwok, one of our mentors, ventured into this new model and shared his experience in this thread in a very clear and generous way. It is important to clarify that this material is not covered in any of the specializations, at least not yet.

Now, regarding your idea of doing backprop from left to right “as we go”, this is my understanding of why this would probably not be possible:

Backprop is based on the derivatives of each mathematical operation that happens inside each layer. To calculate the derivative of the loss with respect to any intermediate variable, you need the ‘local derivative’ of that variable times the derivative of the loss with respect to the next variable in the computational graph; this is the chain rule.

For example, we have this computational graph:

c = a * b
e = c + d
h = e / f

Say the goal is to compute dh/da. For this we need at a minimum the following “local” derivatives: dh/de, de/dc, dc/da.

Fast-forwarding, after some intermediate operations and applications of the chain rule, you can compute dh/dc * dc/da, which is the final application of the chain rule that yields dh/da.

But as shown above, to get to dh/dc you went through a series of intermediate operations and chain-rule applications: you first had to find dh/de and de/dc.
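
Here is a minimal sketch of that chaining in Python, using made-up values for a, b, d, f:

```python
# Toy graph: c = a * b, e = c + d, h = e / f, with made-up inputs.
a, b, d, f = 2.0, 3.0, 4.0, 5.0

# forward pass
c = a * b          # 6.0
e = c + d          # 10.0
h = e / f          # 2.0

# local derivatives of each operation
dh_de = 1.0 / f    # h = e / f  ->  dh/de = 1/f
de_dc = 1.0        # e = c + d  ->  de/dc = 1
dc_da = b          # c = a * b  ->  dc/da = b

# chain rule, right to left: dh/dc must exist before dh/da can be formed
dh_dc = dh_de * de_dc
dh_da = dh_dc * dc_da
print(dh_da)       # 0.6, i.e. b / f
```

Note how dh_da is the last thing computed, even though a sits at the very start of the graph.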

Now let's see this in a neural network.

Say for instance you have this architecture:

Input
Layer 1
Batch Norm
TanH Activation
Layer 2
Loss function (say Cross-Entropy loss)

Each of these layers is composed of several mathematical expressions that have to be accounted for in the derivatives used to update the parameters.

To calculate the derivatives of the operations in Layer 1, we will need information from Batch Norm.

To calculate the derivatives of the operations in Batch Norm, we will need information from the TanH activation.

To calculate the derivatives of the TanH activation, we will need information from Layer 2.

To calculate the derivatives of Layer 2, we will need information from the Loss function.

Based on this, we have to wait until the last operation before we can start going backwards: only then can we calculate the local derivatives and apply the chain rule, combining each local result with the derivatives already accumulated from the loss down to the variable just after the current operation.
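
Here is a runnable toy version of that ordering constraint, with some simplifications of my own (I drop Batch Norm and use a squared-error loss instead of cross-entropy; the argument is the same):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # input
W1 = rng.normal(size=(4, 3))           # Layer 1 weights
W2 = rng.normal(size=(1, 4))           # Layer 2 weights
y = np.array([1.0])                    # target

# forward pass, left to right
z1 = W1 @ x
a1 = np.tanh(z1)                       # TanH activation
z2 = W2 @ a1
loss = 0.5 * np.sum((z2 - y) ** 2)     # the loss exists only HERE, at the end

# backward pass, right to left: each line needs the one before it
dz2 = z2 - y                           # needs the loss/target
dW2 = np.outer(dz2, a1)                # gradient for Layer 2
da1 = W2.T @ dz2                       # needs dz2
dz1 = da1 * (1 - a1 ** 2)              # back through tanh, needs da1
dW1 = np.outer(dz1, x)                 # gradient for Layer 1 comes LAST
```

The gradient for Layer 1 is the very last quantity we can compute, which is exactly the point above.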

What do you think?


It is still not clear to me. Batch norm takes its input from Layer 1, just as in your example h takes its input from e and f. So h depends on e and f; similarly, batch norm should depend on Layer 1, not vice versa. We could calculate the derivatives of the parameters of Layer 1 and plug them into the batch norm layer :confused:

@tbhaxor ,

I was investigating your question further, and it turns out that you can actually calculate the derivatives as you go. In fact, this has a name: “Backpropagation Through Structure”, or “Backpropagation Through Time” (BPTT).

According to Wikipedia and other sources, this type of backprop works, but it requires more computation, so it is inefficient compared to traditional backprop.

Additionally, BPTT is used in certain types of networks, such as Recurrent Neural Networks (RNNs).
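
For what it's worth, here is a rough toy sketch of BPTT on a tiny RNN (my own made-up setup, not from any course material): the step h_t = tanh(W h_{t-1} + x_t) is unrolled for T steps, and the gradient of the shared weights W is accumulated by walking back through each unrolled copy:

```python
import numpy as np

rng = np.random.default_rng(0)
T, hidden = 5, 3
W = rng.normal(size=(hidden, hidden)) * 0.1   # shared recurrent weights
xs = [rng.normal(size=hidden) for _ in range(T)]

# forward: every hidden state is stored, because backward will need them
hs = [np.zeros(hidden)]
for x in xs:
    hs.append(np.tanh(W @ hs[-1] + x))

# backward through time
dW = np.zeros_like(W)
dh = hs[-1].copy()                    # stand-in for dLoss/dh_T (made-up loss)
for t in reversed(range(T)):
    dz = dh * (1 - hs[t + 1] ** 2)    # back through the tanh at step t
    dW += np.outer(dz, hs[t])         # this unrolled copy's share of dW
    dh = W.T @ dz                     # pass the gradient to the previous step
```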

Thank you for pushing the case - I learned something new :slight_smile:

Juan

Hi @tbhaxor

In addition to what mentors @Juan_Olano and @Isaak_Kamau said:

We do backpropagation from the back because many of the models used today are already pretrained; we just take the parameters from the trained models, which is called transfer learning. In that case we don't want to train every layer in the model, only the last few layers, to adjust the (complex) parameters that detect high-level features, unlike the first layers, which detect only edges (low-level features). Those last few layers are the ones that drive the model's final decisions, so computing gradients all the way from the first layer just to update the parameters of the last few layers would be a waste of time.
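
A small illustration of that workflow, assuming a Keras-style API (the choice of base model here is just an example):

```python
import tensorflow as tf

# take a pretrained model and freeze it: no gradients reach these layers
base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg")
base.trainable = False

# only the new head is trained; backprop stops almost immediately
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```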

Cheers,
Abdelrahman


hello @tbhaxor

What you say could be done; however, there is a missing link. You mentioned that we could take the gradient of the first layer and work forward, rather than take the gradient of the last layer and work backwards. Let's dissect this.

Our primary aim is to find the weights of every neuron such that the cost J is minimized. The cost J is defined ONLY at the output layer and depends on two things: y_{actual} and y_{predicted}. y_{actual} is available in the training data. y_{predicted} is what the model predicts at the “Output Layer” (not any other layer). The learning algorithm then finds the derivative of J with respect to the weights of each neuron, starting from the output layer and, using the chain rule, working its way backwards to the first layer.

Now, if the cost J were defined at the first layer, we could have started from the first layer. But since J is defined only at the output layer, we have no choice but to do this backwards.

To summarize: to be able to take the gradient in any layer, we need the derivative of “something” w.r.t. the weights and biases of the neurons in that layer. We have chosen J to be that “something”, but it is defined only at the output layer… hence the backward process.
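
In symbols (my notation, for a two-layer network with first-layer activation a^{[1]} and prediction \hat{y}):

```latex
% J depends on W^{[1]} only through everything computed after it,
% so every factor except the last comes from layers to the right:
\frac{\partial J}{\partial W^{[2]}}
  = \frac{\partial J}{\partial \hat{y}}\,
    \frac{\partial \hat{y}}{\partial W^{[2]}},
\qquad
\frac{\partial J}{\partial W^{[1]}}
  = \frac{\partial J}{\partial \hat{y}}\,
    \frac{\partial \hat{y}}{\partial a^{[1]}}\,
    \frac{\partial a^{[1]}}{\partial W^{[1]}}
```

The leading factor \partial J / \partial \hat{y} only exists once the output layer has run, which is why the chain is evaluated from the right.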


Yeah, but then it shouldn't have been named backprop :smile: Also, I have seen on the Wikipedia page that this is problematic if we have a wiggly loss function; chances are it will get stuck in a local minimum. I can't see how, but I will research it more.

But if the previous layers are frozen (trainable=False), then their adjustments should be ignored. Rather, the computation should skip straight to the first trainable layer from the left (usually that is the top layer, the output).

Or is it a case of “leave it as it is” because the libraries are too established and changing things now would break existing applications?

Your point also makes sense, @AbdElRhaman_Fakhry. Thanks for pointing it out.

To summarize:

1. We’ve discussed the new option presented by G. Hinton called “Forward-Forward”, which removes backprop altogether.

2. We’ve also seen an option called “Backpropagation Through Time” or “Backpropagation Through Structure”, which is mainly used on sequential models like RNNs.

3. We’ve also seen that to update the weights of a ‘static’ model you need to know the loss at the end of the forward propagation in order to properly update your weights ( @AbdElRhaman_Fakhry and @shanup ). This is basically the traditional backpropagation, which has so far proven to be very efficient and does the job.

4. I’d add a fourth one: maybe you can pre-calculate local derivatives as you go, and once you reach the loss at the end of the forward prop, run a backprop using the chain rule and the pre-calculated local derivatives (see the sketch after this list). I have not tested it nor seen it anywhere, and it looks very inefficient, but it may work.
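
Here is a quick sketch of that fourth idea on the toy graph from earlier in the thread (c = a*b, e = c+d, h = e/f), caching each local derivative during the forward pass and closing the chain only at the end:

```python
# made-up inputs; each local derivative is cached "as we go"
a, b, d, f = 2.0, 3.0, 4.0, 5.0
local = []

c = a * b;  local.append(b)        # dc/da
e = c + d;  local.append(1.0)      # de/dc
h = e / f;  local.append(1.0 / f)  # dh/de

# only now, once h (standing in for the loss) exists, can we close the chain
grad = 1.0
for deriv in reversed(local):
    grad *= deriv
print(grad)                        # dh/da = b / f = 0.6
```

This is, in effect, what standard reverse-mode frameworks already do by caching intermediate values during the forward pass.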

Any other option to answer your question? :slight_smile:
