DLS Course 1 Week 2 warning when running the logistic regression model

When I run the logistic regression code in another environment with a different dataset of images, I get the following warnings when running:
logistic_regression_model = model(x_train, y_train, x_test, y_test, num_iterations=200, learning_rate=0.005, print_cost=True)

“RuntimeWarning: divide by zero encountered in log
RuntimeWarning: invalid value encountered in multiply”

The shape of x_train and y_train are (48387, 5216) and (1, 5216) respectively.
I also have the following output:
train accuracy: 93.05981595092024 test accuracy: 80.28846153846155

When training the model with learning rate 0.01, I see breaks in the cost curve on the graph. Is this the vanishing gradient problem that I've heard of?
Training the model with learning rate 0.001 or 0.0001, I see neither the warning nor the breaks in the cost graph.


Those warnings most likely mean that you have "saturated" sigmoid for some training samples, meaning that the output of sigmoid rounds to exactly 0 or 1. The cost function then takes log(0), which is -inf (the "divide by zero" warning), and multiplying that by a zero coefficient from the label term gives NaN (the "invalid value in multiply" warning), so the cost comes out as NaN. Note that this normally isn't really that much of a problem, because the gradients still make sense without the scalar cost value. Believe it or not, the actual J output value isn't really used for anything other than judging how your convergence is working.
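Here is a small standalone NumPy sketch (not the course code) of that mechanism: once sigmoid saturates to exactly 1.0 in 64-bit floats, the cross-entropy term for a positive sample produces exactly these two warnings and a NaN:

```python
import numpy as np

def sigmoid(z):
    # Standard logistic function
    return 1.0 / (1.0 + np.exp(-z))

a = sigmoid(37.0)
print(a == 1.0)  # True: exp(-37) is below float64 epsilon, so a rounds to exactly 1.0

with np.errstate(divide="warn", invalid="warn"):
    # Cross-entropy term (1 - y) * log(1 - a) for a positive sample (y = 1)
    # with saturated a = 1: log(1 - a) is log(0) -> -inf
    # ("divide by zero encountered in log"), and 0 * -inf is NaN
    # ("invalid value encountered in multiply").
    term = (1 - 1.0) * np.log(1 - a)

print(term)  # nan
```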

For positive values of z, it only takes z > 37 to saturate sigmoid in 64-bit floating point. This theory fits what you observe: a smaller learning rate takes smaller steps, so it is less likely to push z into the saturated region. You could say that the problem is that your convergence is working too well. :scream_cat: There are several ways to investigate and proceed:

You can add a check to see if you are actually getting any values of sigmoid(z) which exactly equal 1. And if so, you could substitute a slightly smaller value to avoid the problem with the cost.
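One way to do that substitution (a sketch, not the official assignment code; `A` and `Y` here stand for the activation matrix and labels, as in a `propagate`-style function) is to clip the activations away from exactly 0 and 1 before taking the log:

```python
import numpy as np

def safe_cost(A, Y, eps=1e-15):
    """Cross-entropy cost with activations clipped away from 0 and 1.

    A: sigmoid activations, shape (1, m); Y: labels, shape (1, m).
    """
    m = Y.shape[1]
    A = np.clip(A, eps, 1 - eps)  # avoid log(0) on saturated samples
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

# Even with a fully saturated activation, the cost stays finite:
A = np.array([[1.0, 0.2]])
Y = np.array([[1.0, 0.0]])
print(np.isfinite(safe_cost(A, Y)))  # True
```

The clip only changes the reported cost on the handful of saturated samples; the gradients are computed from `A - Y` and are unaffected.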

Or you could switch from using the J value to using training accuracy to judge whether your convergence is working or not. E.g., every 500 or 1000 iterations, compute the training accuracy and use that to judge whether convergence is working well or not instead of the J value.
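A sketch of that monitoring idea, assuming weights `w`, bias `b`, and a gradient-descent loop shaped like the assignment's (the helper names here are illustrative, not the course's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_accuracy(w, b, X, Y):
    # Threshold the activations at 0.5 and compare with the labels
    preds = (sigmoid(w.T @ X + b) > 0.5).astype(float)
    return 100.0 * np.mean(preds == Y)

# Inside the optimization loop you would report accuracy instead of cost:
# for i in range(num_iterations):
#     ... gradient descent step updating w, b ...
#     if i % 500 == 0:
#         print(f"train accuracy after iteration {i}: {train_accuracy(w, b, X, Y):.2f}%")

# Tiny deterministic check: two samples that a unit-weight model separates
X = np.array([[1.0, -1.0], [2.0, -2.0]])
Y = np.array([[1.0, 0.0]])
w = np.ones((2, 1))
print(train_accuracy(w, 0.0, X, Y))  # 100.0
```

Accuracy never produces -inf or NaN, so it stays meaningful even after sigmoid saturates.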


Thank you for responding. When you say "the problem is that your convergence is working too well", is this the overfitting issue? When I use new images for prediction, the model does not perform very well, so I'm assuming it's overfitting to the training and test data?


If you are seeing 93% training accuracy and 80% test accuracy (as you showed in the one example), that would qualify as overfitting. And it would also not be too surprising if the model did not do very well on images not in either the training or test set.

But before we can draw too many conclusions about what is going on, it would help to have a bit more information. What is the total size of your training and test datasets? Note that 200 iterations is not very many in the grand scheme of things. Of course everything is situation dependent, but typically in the assignments here the iteration counts are at least in the range of 10^3 and as high as 10^4 in some cases.

Also note that Logistic Regression is not very powerful for general purpose image recognition tasks. Before investing a lot more effort in getting LR to work well with your example, it might be better to wait until you’ve seen the material in Week 4 of DLS Course 1 and try comparing the performance of LR to that of a 3 or 4 layer network like the ones Prof Ng shows us in Week 4.

Also note that the cat recognition task that we have here is really difficult with the small datasets that we have. 209 training samples and 50 test samples is unrealistically small for a task this complex. It’s kind of amazing that we get as good results as we do, but I suspect that the datasets were carefully “curated” to make that possible within the severe memory limitations of the online notebook environment. Here’s another thread that shows some experiments with rebalancing the 209/50 datasets that seems to support the “careful curation” theory.