I have recently completed the Deep Learning Specialization (DLS). According to the DLS course, the correct sequence of steps in each iteration of gradient descent is as follows:
- Initialize parameters
- Classify or predict
- Calculate loss
- Calculate gradients
- Update parameters
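In code, that sequence looks roughly like this. This is only a minimal logistic-regression sketch to make the ordering concrete; the data, shapes, learning rate, and iteration count are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy, linearly separable data (illustrative only): 2 features, 100 examples
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, 100)

# 1. Initialize parameters
w = np.zeros((2, 1))
b = 0.0
alpha = 0.1            # learning rate (illustrative)
m = X.shape[1]

for i in range(1000):
    # 2. Classify or predict (forward pass)
    A = sigmoid(w.T @ X + b)
    # 3. Calculate loss (cross-entropy cost J)
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    # 4. Calculate gradients
    dZ = A - Y
    dw = (X @ dZ.T) / m
    db = np.mean(dZ)
    # 5. Update parameters
    w -= alpha * dw
    b -= alpha * db
```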
However, I am confused by this slide. Am I missing something here? Could someone please explain this to me? Thank you!
Yes, the key point there is that the actual loss value is not used in computing the gradients. The only purpose of the J value itself is as an inexpensive proxy for whether you are getting convergence or not. So it doesn't matter whether you calculate it before or after you compute the gradients. The gradients are functions derived from the loss, but they don't actually depend on the scalar J value: you just evaluate those functions according to Prof Ng's formulas.
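You can see this directly in code. In the sketch below (the data and parameter values are made up for illustration), the gradients for logistic regression are evaluated purely from the formulas, and the scalar J is computed separately, only as a diagnostic:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up values: 2 features, 3 examples
X = np.array([[1.0, -2.0, 3.0],
              [0.5,  1.5, -1.0]])
Y = np.array([[1.0, 0.0, 1.0]])
w = np.array([[0.1], [-0.2]])
b = 0.0
m = X.shape[1]

A = sigmoid(w.T @ X + b)      # predictions

# Gradient formulas derived from the cross-entropy loss --
# note that the scalar J never appears anywhere here:
dZ = A - Y
dw = (X @ dZ.T) / m
db = np.mean(dZ)

# J is computed separately, purely as a convergence diagnostic:
J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
```

The gradient formulas are what you get by differentiating J analytically, which is why the numeric value of J itself is never needed to evaluate them.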
A better measure of convergence is the actual prediction accuracy on the training data (and optionally the validation data), since that is what your goals are actually stated in terms of. And it turns out that there is not a monotonic relationship between cost and accuracy, because accuracy is quantized. The other thing to realize about J is that it is essentially meaningless by itself; e.g. it is not comparable between two different models. For convergence you just want to look at the graph of the cost versus iterations to get a picture of what is happening. Accuracy is more expensive to compute, so the common practice is to evaluate it only every 100 or 500 or 1000 iterations and to use the cost graph to get a picture of how convergence is going.
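That monitoring pattern can be sketched like this: record the cheap cost J on every iteration, but evaluate accuracy only periodically. The model, data, and the interval of 100 are all illustrative, not from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (illustrative): 2 features, 200 examples
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 200))
Y = (X[0] - X[1] > 0).astype(float).reshape(1, 200)
w, b, alpha, m = np.zeros((2, 1)), 0.0, 0.1, X.shape[1]

costs, accuracies = [], []
for i in range(500):
    A = sigmoid(w.T @ X + b)
    # Cheap: cost recorded every iteration
    costs.append(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))
    # More expensive in general: accuracy only every 100 iterations
    if i % 100 == 0:
        accuracies.append(np.mean((A > 0.5) == Y))
    dZ = A - Y
    w -= alpha * (X @ dZ.T) / m
    b -= alpha * np.mean(dZ)
```

Plotting `costs` against the iteration number then gives you the usual "cost vs. iterations" convergence graph.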
Very well explained! Thank you for taking the time to explain this in detail!