NaN when trying different learning rates

Hi, first of all thanks so much for the help provided!

I’m doing the additional exercise of the programming assignment “Logistic Regression with a Neural Network Mindset”, and when I try larger learning rates (0.1 and above), the costs start to return NaN values.

I’m wondering: why NaN values?

Here is an output example:

{'0.03': {'costs': [0.6931471805599453,
   0.6047106658463652,
   0.6728389276846456,
   1.491366534790886,
   5.979899890585064],
  'Y_prediction_test': array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0.]]),
  'Y_prediction_train': array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0.]]),
  'w': array([[ 0.02593027],
         [-0.06679975],
         [-0.03315306],
         ...,
         [-0.03863072],
         [-0.08253615],
         [ 0.04794216]]),
  'b': 0.017047387946632686,
  'learning_rate': 0.03,
  'num_iterations': 500,
  'train_accuracy': 69.377990430622,
  'test accuracy': 32.0},
 '0.1': {'costs': [0.6931471805599453,
   23.940823460057516,
   21.36734132511805,
   nan,
   nan],
  'Y_prediction_test': array([[1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 0., 1.,
          0., 1., 0., 0., 1., 0., 0., 1., 1., 0., 1., 0., 0., 1., 0., 1.,
          1., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 1.,
          1., 0.]]),
  'Y_prediction_train': array([[0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0.,
          0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0.,
          0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1.,
          0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0.,
          0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 1.,
          0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
          1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0.,
          1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0.,
          0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1.,
          0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
          0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
          0.]]),
  'w': array([[ 0.08106772],
         [-0.19724363],
         [-0.09063479],
         ...,
         [-0.08314097],
         [-0.21857226],
         [ 0.17897008]]),
  'b': 0.07480303954616058,
  'learning_rate': 0.1,
  'num_iterations': 500,
  'train_accuracy': 93.77990430622009,
  'test accuracy': 70.0},
 '0.3': {'costs': [0.6931471805599453, 76.68188461592068, nan, nan, nan],
  'Y_prediction_test': array([[1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 0., 1.,
          0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 1., 0., 1.,
          1., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 1.,
          1., 0.]]),
  'Y_prediction_train': array([[0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0.,
          0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0.,
          0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1.,
          0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1., 0., 0.,
          0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
          0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0.,
          0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1., 1., 1.,
          0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
          1., 1., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 0.,
          1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0.,
          0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0., 0., 1., 1.,
          0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0.,
          0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
          0.]]),
  'w': array([[ 0.25782703],
         [-0.59304956],
         [-0.27471   ],
         ...,
         [-0.27457602],
         [-0.68860534],
         [ 0.52808877]]),
  'b': 0.2245474856824589,
  'learning_rate': 0.3,
  'num_iterations': 500,
  'train_accuracy': 95.2153110047847,
  'test accuracy': 70.0},
 '1': {'costs': [0.6931471805599453, nan, nan, nan, nan],
  'Y_prediction_test': array([[1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1.,
          0., 1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 1.,
          1., 0., 1., 0., 0., 1., 0., 0., 1., 1., 1., 0., 1., 1., 0., 1.,
          1., 0.]]),
  'Y_prediction_train': array([[0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0.,
          0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0.,
          0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 1.,
          0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 0.,
          0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
          0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0.,
          0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1.,
          0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0.,
          1., 1., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 0.,
          1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0.,
          0., 0., 1., 0., 1., 0., 1., 0., 0., 1., 1., 1., 0., 0., 1., 1.,
          0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0.,
          1., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
          0.]]),
  'w': array([[ 0.87359725],
         [-1.98889914],
         [-0.90930925],
         ...,
         [-0.93824708],
         [-2.30327515],
         [ 1.78433091]]),
  'b': 0.8406488773588461,
  'learning_rate': 1,
  'num_iterations': 500,
  'train_accuracy': 95.2153110047847,
  'test accuracy': 72.0}}

Maybe take a peek at the video on vanishing and exploding gradients in the next course to get a sense of what is happening. Given the increasing costs, you could also search for “deep learning divergence”. Basically, your building starts swaying, and instead of a damped oscillation it swings further from the center each time until it falls apart.

ps: +1 for running the experiments. Only way to learn this stuff IMO

Hi, @Joseph_Arcila. The short answer is that with a higher learning rate you are forcing the model’s gradient, and by implication the weights and the cost, into “explosive” behavior. The numbers eventually become so large that they exceed the computer’s ability to represent them, and that produces the nan values, a kind of machine infinity.

As @ai_curious points out with his skyscraper metaphor, gradient descent is a dynamic process; generically, similar sorts of equations are used to model dynamic phenomena. To continue the dynamic interpretation, the current iteration’s value of the parameters, w and b, equals the previous iteration’s value plus/minus some forcing. Here, the “force” is the gradient multiplied by the learning rate, with the gradient evaluated at the previous iteration’s values, so the next iteration’s values are under the control of the current iteration’s. If that force becomes too big, the solution will explode. It might do so smoothly, or it may exhibit oscillations (back to the skyscraper). Here, the explosive behavior was instigated by a learning rate that is too large.
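
To make that concrete, here is a minimal sketch of the dynamic, using a toy one-dimensional problem rather than the assignment’s model: plain gradient descent on f(w) = w^2, whose gradient is 2w. The function and numbers are purely illustrative, but they show how a too-large learning rate turns each update into a bigger overshoot than the last:

def gradient_descent_1d(learning_rate, num_iterations=10, w=1.0):
    # Each update multiplies w by (1 - 2 * learning_rate), since the gradient
    # of w**2 is 2*w. For learning_rate > 1.0 that factor has magnitude > 1,
    # so the iterates oscillate with growing amplitude instead of settling.
    history = [w]
    for _ in range(num_iterations):
        grad = 2.0 * w
        w = w - learning_rate * grad
        history.append(w)
    return history

print(gradient_descent_1d(0.1))   # shrinks smoothly toward the minimum at 0
print(gradient_descent_1d(0.9))   # overshoots, but the oscillation is damped
print(gradient_descent_1d(1.5))   # every swing is bigger than the last: divergence

Run the last case for enough iterations and the iterate overflows to inf, after which the very next update computes inf - inf and the value becomes nan.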

In this plain vanilla flavor of gradient descent, there is no mechanism to dampen the explosive behavior “on the fly.” Those techniques you will learn in Course 2, which I am guessing, you will like a lot! :nerd_face:

current iteration’s ?

Duly corrected! Thanks, @ai_curious.

Actually, the NaNs are not because the numbers exceed the absolute value of what can be represented in 64-bit floating point: it’s that we “saturate” the output values of sigmoid so that they round to exactly 0 or exactly 1. Once that happens, the cost ends up taking log(0), which is -\infty, and when you then try to do arithmetic with that, you end up with NaN. It turns out that it’s actually pretty easy to saturate sigmoid on the positive side: sigmoid(37.0) will do it. On the negative side, you have to go quite a bit further; I don’t have the numbers handy, but I think it’s something like z < -760 or thereabouts to get exactly 0. Of course, when we are dealing with the abstract beauty of \mathbb{R}, the output of sigmoid never exactly equals 0 or 1.
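
Here is a small sketch of that failure mode, assuming the usual numpy implementations of sigmoid and the cross-entropy cost from the assignment (the snippet itself is just an illustration, not the assignment code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# In float64, sigmoid(37.0) rounds to exactly 1.0, so log(1 - A) is log(0) = -inf
A = sigmoid(np.array([37.0]))
print(A)                      # [1.]
print(np.log(1 - A))          # [-inf], with a divide-by-zero RuntimeWarning

# In the cross-entropy cost, that -inf gets multiplied by (1 - Y) = 0,
# and 0 * -inf is nan, which then propagates into the overall cost
Y = np.array([1.0])
cost = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
print(cost)                   # [nan]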

@paulinpaloalto, are we all in agreement with that statement?

My admittedly limited understanding is that vanilla Python, numpy, and TensorFlow each behave slightly differently when attempting arithmetic involving infinity or otherwise problematic domain values, i.e., whether they throw a particular exception or return nan. So it may matter a tiny bit exactly which activation and cost functions a network is composed of, as well as how they are implemented. Nonetheless, the general takeaway is that a learning rate that is too large can lead to instabilities, sometimes very quickly.

Yes, sorry, I didn’t mean to make it sound like I was disagreeing with the fundamental analysis here. We are all in agreement with @kenb’s statement:

Here, the explosive behavior was instigated by a learning rate that is too large.

I was just trying to give one more layer of detail about how the NaN values actually occur. I think the IEEE 754 standard is very explicit about the meaning of NaN and Inf values and how they propagate through computations, but it is also probably the case that not all implementations of the spec get it exactly right.
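
For instance, plain Python floats are IEEE 754 binary64, and once an Inf exists (here created directly via math.inf, purely for illustration) it propagates just as the spec describes:

import math

inf = math.inf
print(inf + 1.0)               # inf
print(-1.0 * inf)              # -inf
print(0.0 * inf)               # nan
print(inf - inf)               # nan
print(math.isnan(inf - inf))   # True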

I agree that the details of which activation and cost functions you use also determine exactly how the “exploding gradient” behavior will manifest. If you’re using ReLU or Leaky ReLU with MSE, for example, then you’ll probably just get crazy divergence, manifested by rapidly increasing cost values, rather than NaN. You may eventually hit Inf, since the largest number you can represent with binary64 IEEE 754 floats is roughly 10^{+308}.
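
A quick numpy sketch of that ceiling (the particular expressions are just illustrative):

import numpy as np

print(np.finfo(np.float64).max)     # 1.7976931348623157e+308
x = np.float64(1e308)
print(x * 100.0)                    # inf, past the ceiling (with an overflow RuntimeWarning)
print(np.exp(np.float64(1000.0)))   # inf as well, which is how an "exploded" cost shows up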

On my 2020 M1 MacBook running the macOS 12.1 Monterey beta with Apple’s version of TF installed, here is the result of a division by zero in vanilla Python 3.9.6:

num = 0
denom = 0
py_ratio = num / denom   # raises ZeroDivisionError: division by zero
print(py_ratio)          # never reached

Here it is using numpy 1.19.5 arrays:

import numpy as np

num = np.zeros((4, 4), int)
denom = np.zeros((4, 4), int)
numpy_ratio = num / denom    # emits a RuntimeWarning (invalid value encountered) and yields nan
print(numpy_ratio)           # 4x4 array of nan

Finally, tf.Variables in TensorFlow 2.5.0:

import tensorflow as tf

num = tf.Variable(initial_value=0.0)
denom = tf.Variable(initial_value=0.0)
tf_ratio = num / denom       # no exception, no warning
print(tf_ratio)              # tf.Tensor(nan, shape=(), dtype=float32)

Python throws an exception and doesn’t assign any value,
numpy produces the nan but generates a RuntimeWarning,
TensorFlow creates the nan with no whining.
:man_shrugging:

@Joseph_Arcila hope you don’t mind us geeking out a little bit on your thread!

@ai_curious: Hey, I resemble that remark! :nerd_face:

0/0 is actually the hard case. If the numerator is non-zero and finite, then you would expect to get \infty or -\infty, and IEEE 754 supports that. Here are some more examples of how divide by zero works and how Inf values propagate in computations according to the IEEE 754 rules:

This is using one of our notebooks with:

np.__version__
'1.18.4'
tf.__version__
'2.3.0'

Here we go with numpy:

import numpy as np

v = 42. * np.ones((1,4), dtype = 'float64')
print(f"type(v) = {type(v)}")
print(f"v = {v}")
w = np.zeros((1,4), dtype = 'float64')
print(f"type(w) = {type(w)}")
print(f"w = {w}")
z = v / w
print(f"z = {z}")
a = -1. * z
print(f"a = {a}")
b = z + 42.
print(f"b = {b}")
c = z - 42.
print(f"c = {c}")
d = z + z
print(f"d = {d}")
e = z - z
print(f"e = {e}")
f = z / z
print(f"f = {f}")

Running that gives this result:

type(v) = <class 'numpy.ndarray'>
v = [[42. 42. 42. 42.]]
type(w) = <class 'numpy.ndarray'>
w = [[0. 0. 0. 0.]]
z = [[inf inf inf inf]]
a = [[-inf -inf -inf -inf]]
b = [[inf inf inf inf]]
c = [[inf inf inf inf]]
d = [[inf inf inf inf]]
e = [[nan nan nan nan]]
f = [[nan nan nan nan]]

So you can see that arithmetic with Inf values propagates them in ways that make sense, until you try to subtract one from another or divide one by another.

You get the same results with TF:

import tensorflow as tf

tfv = 42. * tf.ones((1,4), dtype = tf.dtypes.float64)
tfw = tf.zeros((1,4), dtype = tf.dtypes.float64)
print(f"type(tfv) {type(tfv)}")
print(f"tfv = {tfv}")
print(f"type(tfw) {type(tfw)}")
print(f"tfw = {tfw}")
z = tf.divide(tfv, tfw)
print(f"type(z) {type(z)}")
print(f"z = {z}")
a = -1. * z
print(f"type(a) {type(a)}")
print(f"a = {a}")
b = z + 42.
print(f"b = {b}")
c = z - 42.
print(f"c = {c}")
d = tf.add(z, z)
print(f"d = {d}")
e = tf.subtract(z, z)
print(f"e = {e}")
f = tf.divide(z, z)
print(f"f = {f}")

Which yields this:

type(tfv) <class 'tensorflow.python.framework.ops.EagerTensor'>
tfv = [[42. 42. 42. 42.]]
type(tfw) <class 'tensorflow.python.framework.ops.EagerTensor'>
tfw = [[0. 0. 0. 0.]]
type(z) <class 'tensorflow.python.framework.ops.EagerTensor'>
z = [[inf inf inf inf]]
type(a) <class 'tensorflow.python.framework.ops.EagerTensor'>
a = [[-inf -inf -inf -inf]]
b = [[inf inf inf inf]]
c = [[inf inf inf inf]]
d = [[inf inf inf inf]]
e = [[nan nan nan nan]]
f = [[nan nan nan nan]]

Thanks for answering!
I’ll definitely complete the next course.
The increasing costs are because I ran just 500 iterations, to keep the NaN demonstration here simple; when running for more than 1500 iterations, the costs do start to drop.

ps: I’m also running the experiments so I can personalize this notebook and use it as part of my portfolio. Do you think that’s ok?

Thanks so much for that extra layer of explanation.

Wow, I loved that my question triggered such an interesting discussion!

I’m not an intellectual property lawyer, but here’s my take. Publicly sharing the course notebooks, with or without your solutions included, violates the terms of use. Publicly sharing the course notebooks even with a few cells added for trying different learning rates also violates the terms of use. Taking what you learned from these exercises, building a simple network, running it with different learning rates, and providing your own narrative about why you did it and what you observed is exactly what I would hope to see from a candidate with an online portfolio of their learning journey. Good luck.
