Hello everyone, I am a newbie in deep learning. I have been following Prof. Andrew’s YouTube broadcasts and am in the 4th week of the Deep Learning course on Coursera.

Please, I need help with a problem I encountered while working on a project on image recognition of geophysical well logs.

I decided to try out logistic regression first, but the algorithm fails to converge on every trial. I always get a NaN value for the cost function just after the first iteration. When I computed each step one after the other… I found that the algorithm keeps getting an overflow in the exp of the sigmoid function immediately after the first iteration.

My image data is normalized, and I have tried 64×64, 150×150 and even 1000×1000 resolutions of the images, but the result is still the same…

I really want to find out the problem … any help?

It’s great that you are trying to apply the ideas from the courses to a new problem!

The first question would be how you did the data normalization. What are the pixel values in your raw input data and what algorithm did you use for “normalization”? Also I assume you are starting with zero as the initial values for your weights and bias, right? In the case of Logistic Regression (unlike real Neural Networks), you don’t need “symmetry breaking” in the initialization.

Note that the usual cause of NaN is “saturating” the output of *sigmoid* so that it rounds to exactly 0 or 1, which then causes the loss function to compute *log(0)* and then do arithmetic with the result. It turns out to be pretty easy to saturate *sigmoid* on the positive side: *sigmoid(37.0)* will do it in *float64*.
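A quick way to see both effects (a minimal numpy sketch; the particular z values here are just illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# In float64, exp(-37.0) is smaller than half the machine epsilon at 1.0,
# so 1.0 + exp(-37.0) rounds to exactly 1.0 and sigmoid saturates:
print(sigmoid(np.float64(37.0)) == 1.0)   # True

# Once A saturates to exactly 1.0, the cross-entropy loss for a sample
# with label y = 1 computes (1 - y) * log(1 - A) = 0 * (-inf), which is NaN:
A = sigmoid(np.float64(40.0))             # exactly 1.0
y = 1.0
with np.errstate(divide="ignore", invalid="ignore"):
    loss = -(y * np.log(A) + (1 - y) * np.log(1 - A))
print(loss)                               # nan
```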

The next thing to check would be to make sure your gradient calculations are correct. Did you write the code from scratch or are you using an implementation that (say) passes the tests in the Logistic Regression Assignment in DLS C1 W2?

Thank you for the reply!

The maximum and minimum pixel values of the raw data are 255 and 0 respectively. The normalization was just dividing all pixel values by 255… to give values in the range 0 to 1.
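In code it is just something like this (the array here is a random stand-in for my images, and the shape is only illustrative):

```python
import numpy as np

# Stand-in for a flattened batch of images: uint8 pixels in [0, 255],
# shape (n_features, m) as in the course notebooks.
X_raw = np.random.randint(0, 256, size=(64 * 64 * 3, 37), dtype=np.uint8)

X = X_raw.astype(np.float64) / 255.0       # scale every pixel to [0, 1]
print(X.min() >= 0.0 and X.max() <= 1.0)   # True
```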

Yes, I initialized the weights and bias to 0, since it’s just logistic regression.

The rounding to 0 and 1… Yes, that is exactly what happens! Recently, before I decided to ask, I did the computation of each step to find the problem. I noticed that the z values after the first iteration become very large or very small (even beyond ±800!). This causes overflow in the exp of the sigmoid, and most of the values are approximated to 0 while the rest are very close to zero. By the next iteration the z values become even larger or smaller than the previous ones, and the cycle kept on… but I still couldn’t figure out why it kept happening for that particular data.

Yes, I wrote the code from scratch; in fact I have rewritten it 3 times, corresponding to the 3 different resolutions I used (I initially thought the number of features might be the issue). My gradient calculations are the same as in the course: dz = A - Y, dw = np.dot(X, dz.T)/m, db = np.sum(dz).

In addition to this… (I had passed my tests for both logistic regression and the shallow NN in the course.) I downloaded the notebook and files of my logistic regression assignment and ran them on my PC to confirm there were no errors due to the transfer. It came out with the same result as it did during the test. I then adjusted the “load data” function to import the files of my work instead. That was successful, but the same problem still happened when I tried running the model. At that point I concluded that the code might not be the problem… but the data itself. But I still can’t figure out what is wrong with the data and how to fix it.

My data set is very small too: 37 images in total. I don’t know if the problem comes from that?

*Of course db was divided by m too.
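Assembled, my forward/backward pass looks something like this (a sketch built from the formulas above, with db divided by m, following the course’s shape conventions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    """One forward/backward pass with the DLS C1 W2 conventions:
    X is (n_features, m), Y is (1, m), w is (n_features, 1)."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                               # predictions, (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m   # cross-entropy
    dz = A - Y
    dw = np.dot(X, dz.T) / m
    db = np.sum(dz) / m
    return dw, db, cost
```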

Interesting. Everything you say sounds like you are doing the right thing. Dividing the pixel values by 255 should be exactly the right approach. But something is really wrong if you hit z values of 800 on the very first iteration starting with all zero weights and biases. That just couldn’t happen with zero coefficients, right? The output Z will be zero because all the w_i and b are zero. So then you take *sigmoid(0)* and get 0.5 for A. So the gradient of w will be:

dw = \displaystyle \frac {1}{m} X \cdot (0.5 - Y)^T

where all the Y values are 0 or 1, right? So with m = 37, the values that multiply the elements of X are going to be either 0.5 or -0.5 times \displaystyle \frac {1}{37}. If you start with numbers < 1, then the gradients are not going to be very big. Then what are you using for the learning rate?
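The claim above is easy to check numerically (the sizes here are just illustrative, using m = 37 from your description):

```python
import numpy as np

# With zero weights and bias, the first forward pass gives Z = 0 and A = 0.5,
# so every entry of dw is at most 0.5 in magnitude when the inputs lie in [0, 1].
rng = np.random.default_rng(1)
n, m = 12288, 37                       # e.g. 64*64*3 features, 37 examples
X = rng.random((n, m))                 # stand-in for normalized pixels in [0, 1]
Y = (rng.random((1, m)) > 0.5).astype(float)
w, b = np.zeros((n, 1)), 0.0

Z = np.dot(w.T, X) + b                 # all zeros on the first pass
A = 1.0 / (1.0 + np.exp(-Z))           # all 0.5
dw = np.dot(X, (A - Y).T) / m

print(np.abs(Z).max())                 # 0.0
print(np.abs(dw).max() <= 0.5)         # True
```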

Or do you mean that it’s the second iteration that produces z values in the range of 800? If that’s the case, are you sure you are really using the normalized images and not the raw images?

The first iteration comes out well… the problem starts after the first iteration, on the second, and then through all the other iterations.

Yes, I use the normalized images; I always double-check that.

I have tried several learning rates: 0.1, 0.01, 0.05, 0.5 and so many others I can’t remember… the only impact I noticed is that the lower learning rates like 0.01 pushed the problem over to the third iteration. So the first and second iterations come out well, but the third has the same problem. But I didn’t give much thought to this, because the cost it produces on the second iteration is usually larger than the one on the first (about 8.87 and 0.63 respectively).

But then gradient descent follows the path of the gradient… so those cost values can’t be right.

I don’t know if it is possible, and also OK with you, for me to send my data and my notebook.

As part of my practice, I had converted the data to an h5py file of numpy arrays.

It is odd that the cost moves in the wrong direction. Are you sure you checked your “update parameters” logic? E.g. did you perhaps do

W = W + \alpha * dW

instead of

W = W - \alpha * dW

If that’s not it, then I guess we’ll need to start actually examining code, but we don’t want to do that in a public setting. We can use DMs.
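For reference, the whole descent loop with the correct minus sign looks something like this (a sketch, again with the course’s shape conventions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def optimize(w, b, X, Y, num_iterations, learning_rate):
    """Plain batch gradient descent on the logistic-regression cost."""
    m = X.shape[1]
    costs = []
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)
        costs.append(float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m))
        dz = A - Y
        dw = np.dot(X, dz.T) / m
        db = np.sum(dz) / m
        w = w - learning_rate * dw      # minus: step *against* the gradient
        b = b - learning_rate * db      # (a plus sign here makes the cost climb)
    return w, b, costs
```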


I tried both; both had the same result.

I think I found the problem just now!!… I decided to do a quick run of the algorithm again before sending them… so I tried a low learning rate of 0.0001, and it ran without a problem, ending with a cost of 0.16 after 1000 iterations!

I think I get the idea now!… In one of Prof. Andrew’s broadcasts on YouTube, he mentioned gradient descent overshooting the minimum if too high a learning rate is chosen for particular data… I think that was the problem… a hyperparameter!…

Before now I believed any learning rate could work on any data… and that a learning rate of 0.01 was low… so I was always choosing rates greater than or equal to 0.01.

Now that I see it… it seems that my data, or at least some of it, had an abundance of high pixel values and as such had values close to 1 after normalization… by choosing a high learning rate, the update steps were so large that they overshot the minimum… that would also explain the extremely high and low z values I got…
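A small experiment that reproduces the effect (the data, label split, and sizes here are all made up for illustration; with many features in [0, 1], thousands of small per-feature updates add up in z = w·x):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run(learning_rate, num_iterations=5):
    """Gradient descent on synthetic 'image-like' data with many features."""
    rng = np.random.default_rng(0)
    n, m = 12288, 37                        # 64*64*3 features, 37 examples
    X = rng.random((n, m))                  # normalized pixels in [0, 1]
    Y = np.zeros((1, m)); Y[0, :20] = 1.0   # a made-up, slightly unbalanced split
    w, b = np.zeros((n, 1)), 0.0
    costs = []
    for _ in range(num_iterations):
        # silence the overflow/log(0) RuntimeWarnings so we can watch the numbers
        with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
            A = sigmoid(np.dot(w.T, X) + b)
            costs.append(float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m))
        dz = A - Y
        dw = np.dot(X, dz.T) / m
        db = np.sum(dz) / m
        w -= learning_rate * dw
        b -= learning_rate * db
    return costs

high = run(0.5)      # the cost blows up right after the first step
low = run(0.0001)    # the cost decreases smoothly
```

With the high rate, the very first update already pushes |z| into the tens or hundreds, the sigmoid saturates, and the cost turns huge or NaN, exactly the symptom from the start of this thread; with 0.0001 it converges.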

This was definitely enlightening…

That is great news that you figured out the learning rate issue. I know that Prof Ng said that there is never a guaranteed choice of value for \alpha that works in all cases. It’s surprising that you have to set it so low with your data, but that is a valuable learning experience! And we hope that other students will see this and also gain knowledge from your experience.

Yes. Thank you very much for the help sir!