Backpropagation step

In the backpropagation step, we take derivatives with respect to the weights. But maybe some weights already fit well. For example, dJ/dw might be 0.00000001, so my intuition is that we can stop working with those neurons and train the other neurons instead. Why don't we do that? Where is my thinking going wrong?

There are methods that do what you suggest. But they are mathematically more complicated, so they’re generally not covered in these courses.

Having a big / fast enough computer is not nearly the issue that it used to be. So efficiency isn’t quite as important as being able to handle large data sets and complex models.

Back propagation generates an individual correction for every weight (parameter). If a weight is already close to its optimal value, then its gradient (and hence its update) will be very small. For the weights that still need to change, the gradients will tell you that. So what value does it provide to try to separate out the ones that matter from the ones that don't? The point is that the algorithm takes care of that for you.
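
To make that concrete, here is a minimal NumPy sketch (the layer size, the 300/700 split, and the 1e-8 gradient scale are just made up for illustration). A single plain vectorized update already leaves the near-zero-gradient weights essentially untouched:

```python
import numpy as np

# Toy example: 1000 weights, pretend 300 of them already have tiny gradients.
rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
grad = rng.standard_normal(1000)
grad[:300] *= 1e-8            # "converged" weights: dJ/dw on the order of 1e-8

lr = 0.01
w_new = w - lr * grad         # one ordinary vectorized gradient-descent step

print(np.abs(w_new - w)[:300].max())   # ~1e-10: converged weights barely move
print(np.abs(w_new - w)[300:].max())   # ~1e-2:  the rest get normal-sized updates
```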

Of course good convergence is not guaranteed. You may need more sophisticated versions of gradient descent (e.g. adaptive learning rates), and even then it may simply be that your model is not expressive enough for your data. What I'm describing above is the behavior in the good case where you do get reasonable convergence.
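
For reference, Adam is one widely used adaptive-learning-rate method (the thread doesn't specify which optimizer the course uses). A rough single-step sketch, with the usual default hyperparameters and a function name chosen just for illustration:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: each parameter gets its own effective step size."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # element-wise adaptive step
    return w, m, v
```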

Or am I missing some more subtle point that you are making here?

What I mean is, for example, if we have 1000 neurons in a layer and 300 of them have converged, those 300 neurons have very small derivatives. Since they are already optimized, we can stop working with them. In the next steps of the algorithm we could train 700 neurons instead of 1000 to reduce the computational cost.

Yes, this is true. But in practice this is a slow process, so it isn’t used very much.


But the point is that doing that just makes the code a lot more complicated, and what does it really buy you? As I explained above, the algorithm already takes care of that for you. It's all vectorized, so why does it matter that some of the gradient values are close to zero?

If my argument doesn’t convince you, then I suggest you try writing the code to implement your idea and see how it goes.
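
If you do want to experiment, a rough sketch of the idea might look like this (the tolerance, function names, and learning rate are hypothetical, not from the course):

```python
import numpy as np

def masked_update(w, grad, lr=0.01, tol=1e-7):
    """Update only the weights whose gradient is still 'large enough'."""
    active = np.abs(grad) > tol       # mask of weights that have not converged yet
    w = w.copy()
    w[active] -= lr * grad[active]    # skip the "frozen" weights
    return w, active

def plain_update(w, grad, lr=0.01):
    """The ordinary fully vectorized step, for comparison."""
    return w - lr * grad
```

Notice that the masked version still needs the full gradient in order to build the mask, and on vectorized hardware the masking and fancy indexing are usually no cheaper than the plain subtraction, which is the point being made above.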

I understand the key point. Thanks for the explanations