Please let me ask a question on “vanishing gradient problem”. I read that it occurs during the training of deep neural networks, where the gradients that are used to update the network become extremely small during backpropagation from the output layers to the earlier layers.
But is it a problem of backpropagation itself? Backpropagation just calculates the gradient of the cost function in an efficient way. If some components of the gradient are indeed small, that is not a flaw of the backpropagation approach.
Will be grateful for the answer.
My best regards,
Back-propagation is how neural networks learn; it is the key to their success and rise. In a large network, the contribution to the output is spread across many neurons, and there comes a point where not much further improvement can be achieved, so the gradients start diminishing.
Is this a problem of back-propagation? Well, maybe, but we don't have any other equally effective method to use in neural networks at the moment.
If we consider the "vanishing gradient" a problem, then the cause of the problem is indeed that some gradients are too small for the other gradients that depend on them to be expressive.
I believe this is your idea too, so if I may add to some of your statements, I would say:
Backprop doesn’t cause the problem, small gradient values cause it.
Backprop doesn’t give us any wrong gradient values, including those small ones. But what can give us small gradient values? You know that, we know that: some activation functions can, and a deep network might.
Why is it a problem at all? We can’t train the network anymore.
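To make the "some activation functions can" point concrete, here is a minimal sketch (my own illustration, not from the course materials) of the sigmoid's derivative. It peaks at 0.25 and decays toward 0 as the input saturates, so every sigmoid layer scales the backpropagated gradient by at most 0.25:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)); its maximum is 0.25, at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

# The factor each sigmoid layer contributes to the gradient product:
for z in [0.0, 2.0, 5.0]:
    print(f"z = {z}: derivative = {sigmoid_derivative(z):.5f}")
```

At z = 0 the factor is exactly 0.25, and in the saturated regions it is close to 0, so a stack of such layers shrinks the gradient very quickly.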
Thank you, Raymond!
So, it is not a problem of BP itself.
Not to me. I agree with you that BP is just an efficient way to compute the gradient. If we don’t blame gradient descent, why should we blame BP? The methods that address the vanishing gradient problem do not banish BP - at least as far as I know.
Right! The other point worth making here is that (as mentioned in the original post on this thread) this problem is more likely to happen when training deep networks. The reason is simple: gradients are computed using the Chain Rule, because the gradients at each layer of the network are derivatives of J, the final cost value, w.r.t. the parameters at that layer. So think about what that means: the Chain Rule gives you the product of all the per-layer gradients between the current layer and the output layer's cost function, right? This can cause two kinds of problems when networks get deep. Think about what happens when you compute the product of a bunch of factors:
- If the factors are all < 1, then the more factors you have, the smaller the absolute value of the result.
0.1 * 0.1 = 0.01 and so forth, right?
- If the factors are > 1, then the more factors you have, the larger the absolute value of the result.
In the vanishing gradient case, it’s the first problem that we are concerned about, and that makes it clearer why the deeper the network is, the more likely we are to run into it.
Fortunately for us, researchers like Prof Ng and his colleagues have figured out ways to mitigate these problems, by using ideas like those in the Residual Network architecture so that back propagation can still be effective even in a very deep network.
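Here is a toy numerical sketch (my own illustrative example, not code from the course) of why the skip connection in a residual block helps. With the identity path, each layer's local derivative is 1 plus something, so the Chain Rule product can no longer shrink geometrically to zero:

```python
import math

def plain_forward(x, weights):
    # A deep stack of tanh layers: each layer's local derivative is
    # w * (1 - tanh(w*x)**2), which is at most |w|, so with |w| < 1 the
    # end-to-end gradient shrinks geometrically with depth.
    for w in weights:
        x = math.tanh(w * x)
    return x

def residual_forward(x, weights):
    # The same layers with a skip connection: output = input + tanh(w * input).
    # The local derivative is now 1 + w * (1 - tanh(w*x)**2) >= 1 for w > 0,
    # so the identity path keeps the gradient from vanishing.
    for w in weights:
        x = x + math.tanh(w * x)
    return x

def numerical_grad(f, x, eps=1e-6):
    # Central finite difference, just to estimate df/dx without autodiff.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

weights = [0.5] * 30  # 30 layers with perfectly reasonable weights
print(numerical_grad(lambda x: plain_forward(x, weights), 1.0))     # tiny
print(numerical_grad(lambda x: residual_forward(x, weights), 1.0))  # order 1
```

The plain stack's input gradient is vanishingly small after 30 layers, while the residual version's stays of order 1, which is exactly what back propagation needs to keep working in a very deep network.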
Of course the second case above applies in the “exploding gradient” scenario. Even if the individual gradients per layer are not that large but still > 1, if you multiply a lot of them together, things can spiral out of control.
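Both cases can be seen with plain arithmetic. This little sketch (my own, with the single number `per_layer_factor` standing in for the per-layer gradient factors that the Chain Rule multiplies together) shows how fast the product moves in either direction:

```python
def chain_rule_product(per_layer_factor, num_layers):
    # Stand-in for the product of per-layer gradient factors that the
    # Chain Rule accumulates across a deep network.
    result = 1.0
    for _ in range(num_layers):
        result *= per_layer_factor
    return result

print(chain_rule_product(0.5, 50))  # vanishes: about 8.9e-16
print(chain_rule_product(1.5, 50))  # explodes: about 6.4e+08
```

Fifty layers is not an exotic depth, yet factors only moderately below or above 1 already drive the product to numerically useless extremes.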
Thank you for such a nice answer!
As I understand it, vanishing or exploding gradients are not caused by BP but would be present for any method that computes the gradient of the cost function. But thanks to BP we can easily see why they happen.
By the way, please let me ask one more question.
For logistic regression the binary cross-entropy cost function is convex and so has a single minimum. Will this also be true for an arbitrary NN that uses sigmoid as an activation function?
No, as soon as you graduate from Logistic Regression to real Neural Networks, the cost surfaces are no longer convex and there are lots of local optima. Prof Ng will discuss this just briefly here in Course 1, but it turns out that this does not really cause that much of a problem. Here’s a thread which discusses this in a bit more detail and gives some links to relevant papers.
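One way to see the non-convexity concretely (a toy example of my own, not from the linked thread): hidden units in a layer are interchangeable, so swapping a pair of them gives a different point in weight space with exactly the same cost. Since the cost surface has multiple separated points of equal value with higher cost between them, it cannot be convex:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tiny_net(x, w1, w2, v1, v2):
    # One hidden layer with two sigmoid units and a linear output.
    return v1 * sigmoid(w1 * x) + v2 * sigmoid(w2 * x)

# Swapping the two hidden units (and their output weights) gives a
# different point in parameter space, yet the same function and cost:
original = tiny_net(0.7, w1=1.0, w2=-2.0, v1=0.5, v2=3.0)
swapped = tiny_net(0.7, w1=-2.0, w2=1.0, v1=3.0, v2=0.5)
print(original, swapped)  # identical outputs
```

This permutation symmetry alone already produces many equally good optima in any real NN, quite apart from the genuine local optima and saddle points of the surface.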
Thank you!! So interesting material about multiple local minima.