Backpropagation step

Kerem_Kezer · February 29, 2024, 9:21pm

In the backpropagation step, we take derivatives with respect to the weights. But some weights may already fit well. For example, dJ/dw might be 0.00000001, so my intuition is that we could stop updating those neurons and train only the others. Why don’t we do that? Where is my thinking wrong?

TMosh · February 29, 2024, 9:33pm

There are methods that do what you suggest. But they are mathematically more complicated, so they’re generally not covered in these courses.

Having a big / fast enough computer is not nearly the issue that it used to be. So efficiency isn’t quite as important as being able to handle large data sets and complex models.

paulinpaloalto · February 29, 2024, 9:39pm

Back propagation generates an individual correction for every weight (parameter). If a weight is already very close to its optimal value, then its gradient (and hence its update) will be very small. For the weights that still need to change, the gradients will tell you that. So what do you gain by trying to separate the weights that still matter from the ones that don’t? The point is that the algorithm should take care of that for you.
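As a minimal sketch of that point, assuming a plain gradient descent update `w - learning_rate * grad` (the helper name `gradient_step` and all the numbers are made up for illustration), a weight whose gradient is around 1e-8 barely moves, while the other weights still get meaningful updates:

```python
import numpy as np

def gradient_step(w, grad, learning_rate=0.1):
    """One plain gradient descent update (hypothetical helper, not course code)."""
    return w - learning_rate * grad

# Illustrative values only: the first weight has essentially converged.
w = np.array([1.0, -2.0, 0.5])
grad = np.array([1e-8, 0.3, -0.7])

w_new = gradient_step(w, grad)
change = np.abs(w_new - w)  # size of the update applied to each weight
print(change)
```

The update is already proportional to the gradient, so no extra bookkeeping is needed to leave the converged weight essentially untouched.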

Of course, good convergence is not guaranteed. You may need more sophisticated versions of gradient descent (e.g. adaptive learning rates), and even then it may just be that your model is not expressive enough for your data. What I’m describing above is the behavior in the good case, where you do get reasonable convergence.

Or am I missing some more subtle point that you are making here?

Kerem_Kezer · February 29, 2024, 9:51pm

What I mean is, for example, if we have 1000 neurons in a layer and 300 of them have converged, those 300 neurons have very small derivatives. Since they are already optimized, we can stop working with them. In the following steps of the algorithm, we can train 700 neurons instead of 1000 to reduce the computational cost.
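A rough sketch of what this proposal could look like, assuming a weight matrix with one column per neuron and a hypothetical helper `masked_update` with an arbitrary threshold `tol` (neither comes from the course material):

```python
import numpy as np

def masked_update(W, dW, learning_rate=0.1, tol=1e-6):
    """Update only the units whose gradient norm is at least `tol`.

    Hypothetical helper for illustration. W and dW have shape
    (n_inputs, n_units); column j holds the weights of unit j.
    Returns the updated weights and a boolean mask of updated units.
    """
    grad_norm = np.linalg.norm(dW, axis=0)       # one norm per unit
    active = grad_norm >= tol                    # units still worth updating
    W_new = W - learning_rate * dW * active      # mask broadcasts across rows
    return W_new, active

# Illustrative values: units 0 and 2 have near-zero gradients.
W = np.ones((3, 4))
dW = np.array([[1e-9,  0.2, 1e-9, -0.5],
               [1e-9,  0.1, 1e-9,  0.4],
               [1e-9, -0.3, 1e-9,  0.1]])

W_new, active = masked_update(W, dW)
print(active)
```

Note that this sketch still computes the full gradient `dW` before deciding what to skip, so it saves nothing on the backward pass. A small gradient at one step also does not guarantee a small gradient later, because the other weights keep changing.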

TMosh · February 29, 2024, 10:04pm

Yes, this is true. But in practice this is a slow process, so it isn’t used very much.

paulinpaloalto · February 29, 2024, 10:36pm

But the point is that doing that just makes the code a lot more complicated and what does it really buy you? As I explained above, the algorithm just takes care of that for you. It’s all vectorized, so why does it matter that some of the gradient values are close to zero?
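To make the vectorization argument concrete, here is a small sketch assuming the usual dense-layer gradient `dW = A_prev.T @ dZ / m` with examples stored in rows. The shapes, the random data, and the choice of which 300 units count as "converged" are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_units = 32, 5, 1000            # batch size, inputs, neurons (made up)
A_prev = rng.standard_normal((m, n_in))   # activations from the previous layer
dZ = rng.standard_normal((m, n_units))    # upstream gradient for this layer

# Full version: one matrix product gives the gradient for all 1000 units.
dW_full = A_prev.T @ dZ / m

# Subset version: pretend the first 300 units are "converged" and skip them.
active = np.arange(n_units) >= 300
dW_subset = A_prev.T @ dZ[:, active] / m  # boolean indexing copies 700 columns

# Both approaches agree on the units that are still being trained.
same = np.allclose(dW_full[:, active], dW_subset)
print(dW_full.shape, dW_subset.shape, same)
```

The subset version needs an extra mask and an extra copy of the selected columns at every step, plus logic to decide which units are active. Whether that ever pays off would have to be measured; the sketch only shows the added bookkeeping.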

If my argument doesn’t convince you, then I suggest you try writing the code to implement your idea and see how it goes.

Kerem_Kezer · March 1, 2024, 8:31am

I understand the key point. Thanks for the explanations.
