Size of the expected output does not make sense

Hi!
I have an issue in this function:

def optimize(w, b, X, Y, num_iterations=100, learning_rate=0.009, print_cost=False):
    for i in range(num_iterations):
.
.
.
        # Record the costs
        if i % 100 == 0:
            costs.append(cost)
.
.
.
    return params, grads, costs

Logically, the costs array should be of size 1 because the parameter i goes from 0 to 99.
But the error I get implies otherwise:

AssertionError: Wrong values for costs. [array(5.80154532), array(nan)] != [5.80154532, 0.31057104]

Please take a look at optimize_test, which invokes the optimize function with 101 iterations like this:

params, grads, costs = target(w, b, X, Y, num_iterations=101, learning_rate=0.1, print_cost=False)
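
In other words, with num_iterations=101 the loop index i runs from 0 to 100, so the condition i % 100 == 0 is true twice (at i = 0 and at i = 100) and two costs get recorded. Here is a tiny sketch of just that counting logic (nothing from the assignment itself):

# With 101 iterations, i takes the values 0..100, so i % 100 == 0 holds
# at i = 0 and at i = 100, and two costs end up in the list. That is why
# the test expects a list of length 2.
recorded = [i for i in range(101) if i % 100 == 0]
print(recorded)       # [0, 100]
print(len(recorded))  # 2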

That error message is telling you that you got the correct cost on the first iteration, but you got NaN for the cost after 100 iterations. Most likely that means that your logic for updating the parameters is incorrect somehow. E.g. did you add the gradient values, instead of subtracting them?
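
If it helps, here is a toy sketch of why the sign matters (my own made-up example, not the assignment’s code): gradient descent on f(w) = w^2 converges when you subtract the gradient and diverges when you add it.

# Toy example: minimize f(w) = w**2, whose gradient is dw = 2 * w.
def step(w, lr, sign):
    dw = 2 * w
    return w + sign * lr * dw   # sign = -1 is descent, sign = +1 is the "adding" bug

w_good = w_bad = 5.0
for _ in range(50):
    w_good = step(w_good, 0.1, -1)   # w = w - lr * dw
    w_bad = step(w_bad, 0.1, +1)     # w = w + lr * dw
print(w_good)   # close to 0: converges toward the minimum
print(w_bad)    # huge: diverges, and in a real model the values eventually overflow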


Thank you everybody, you’ve been very helpful!
(Yes, I added them. How did you know? :joy:)

Do you know why adding them results in a NaN, instead of just resulting in an incorrect number?

The NaN is caused by getting 0 * log(0) from either the y * log(\hat{y}) term or the (1 - y) * log(1 - \hat{y}) term. log(0) is -\infty, but 0 * -\infty gives you NaN (not a number) in floating point.
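
Here is a tiny illustration with made-up numbers (not your data) of how a saturated activation turns the cross-entropy cost into NaN:

import numpy as np

y_hat = np.array([1.0, 0.0])   # activations that have saturated to exactly 1 and 0
y = np.array([1.0, 0.0])       # matching labels
cost = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(cost)   # [nan nan], because each element hits 0 * log(0) = 0 * (-inf) = nan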

Now that I think a bit harder about it, I’m not sure why adding will cause that, but I’ve seen this syndrome a number of times before and that’s usually the cause. :nerd_face:

Let me run the experiment myself and look at the actual numbers I get. Stay tuned!

@chay_guez I’m starting to wonder what an ‘incorrect number’ would mean :flushed:

And that presumes, though it is not stated, that one understands ‘NaN’ stands for ‘Not a number’.

Reminds me of my Harvard Biology Professor, who once cited this case:

How NASA Lost Its Mars Climate Orbiter From a Metric Error!

Personally, I’d rather have the computer tell me ‘you have no idea what you are doing!’ rather than getting an answer that is wrong :wink:

Before I dig into the details of what actually happens in this case, here is my general experiment to show the behavior of things like divide by zero and how Inf and NaN propagate in floating point computations:

import numpy as np

v = 42. * np.ones((1,4), dtype = 'float64')
print(f"type(v) = {type(v)}")
print(f"v = {v}")
w = np.zeros((1,4), dtype = 'float64')
print(f"type(w) = {type(w)}")
print(f"w = {w}")
z = v / w                 # finite / 0 -> inf
print(f"z = {z}")
a = -1. * z               # -1 * inf -> -inf
print(f"a = {a}")
b = z + 42.               # inf + finite -> inf
print(f"b = {b}")
c = z - 42.               # inf - finite -> inf
print(f"c = {c}")
d = z + z                 # inf + inf -> inf
print(f"d = {d}")
e = z - z                 # inf - inf -> nan
print(f"e = {e}")
f = z / z                 # inf / inf -> nan
print(f"f = {f}")
g = np.log(w)             # log(0) -> -inf
print(f"g = {g}")
h = np.log(-1 * v)        # log of a negative number -> nan
print(f"h = {h}")

Running the above gives this:

type(v) = <class 'numpy.ndarray'>
v = [[42. 42. 42. 42.]]
type(w) = <class 'numpy.ndarray'>
w = [[0. 0. 0. 0.]]
z = [[inf inf inf inf]]
a = [[-inf -inf -inf -inf]]
b = [[inf inf inf inf]]
c = [[inf inf inf inf]]
d = [[inf inf inf inf]]
e = [[nan nan nan nan]]
f = [[nan nan nan nan]]
g = [[-inf -inf -inf -inf]]
h = [[nan nan nan nan]]

Note that this behavior is specified by the IEEE 754 spec, so it is not specific to numpy. It is implemented in the lower-level math libraries and in the FP hardware. I have another cell that runs the same experiments in TF instead and the results are exactly the same.


@paulinpaloalto okay, Paul, looking at this I have a weird question I don’t know the answer to…

How exactly (or even mathematically) does negative infinity differ from positive infinity?

*And, at least someone here has been reading Douglas Adams.

It’s negative instead of positive. :laughing: But seriously folks, it’s a good question. From a functional standpoint, it probably doesn’t make a lot of difference, since (as demonstrated above) you can’t do arithmetic with it.

But to be slightly more serious, look at the graph of log(z) for 0 < z \leq 1:

So you can see that as z \rightarrow 0 the value of log(z) gets larger in magnitude, but the sign is negative for the entire domain 0 < z < 1. The closer z gets to 0, the bigger a negative number we get for log(z).

So from a mathematical standpoint, you’d express that using limits:

\displaystyle \lim_{z \rightarrow 0^+}log(z) = -\infty

By contrast, if you look at the graph of e^z, then it’s clear that:

\displaystyle \lim_{z \rightarrow \infty}e^z = +\infty
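
And here is a quick numerical check of the first limit (plain numpy, nothing course-specific):

import numpy as np

for z in [1e-1, 1e-5, 1e-20, 1e-300, 0.0]:
    print(z, np.log(z))
# 0.1     ->  -2.302585...
# 1e-05   ->  -11.512925...
# 1e-20   ->  -46.051701...
# 1e-300  ->  -690.775527...
# 0.0     ->  -inf   (numpy also emits a RuntimeWarning here)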

:grin: Okay, I get that part. But recently I have been thinking about ‘race conditions’, so if the equation ‘heads south’ (or to ‘true north’), is this telling us something?

Or does the direction (i.e. the sign) not matter?

I’m not sure what you are really asking. In my experience the term “race conditions” means you are doing parallelization and you have done things in a way that is non-deterministic. That’s generally ok, although “race conditions” can also be bugs in your code, meaning you need to have locking to protect the update of certain shared “state” variables that are used by all the parallel threads.

Not sure how that applies to race conditions, but the point of my previous post about the difference between -\infty and +\infty was that the direction does matter, right?

I think I must just be missing your point.

Paul, I’ve been playing around with this CUDA stuff and it gets hard once you start to implement custom streams. At least the default stream is ‘always blocking’, so you can rely on that. Beyond that it is ‘not so easy’ and you can have all sorts of values floating in at any time.

At first I kind of scoffed when they included an algorithm checker at the end of every routine; I get it now though. It is not that hard to have the GPU produce results for you… but they could also be horribly wrong.

I guess what I was wondering is, from a ‘practical’ point of view, is -Inf = Inf? Or, no…

When you implement parallel algorithms, correctness matters, as it always does. But the point is it’s a lot harder to get right. E.g. consider the classic “single queue, multi server” model for parallelizing a task. You break the work to be done up into chunks and create a queue of those chunks. Then you unleash your parallel worker threads and they each do:

1. Fetch next item from queue
2. Complete task
3. Record results
4. Repeat

There are (at least) two points in that process where all the threads are accessing a shared data structure: the work queue and the results. If you don’t provide proper locking, so that the updates to the queue and to the results are effectively single-threaded, then your results are garbage.
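
If it helps to see it concretely, here is a minimal sketch of that pattern in plain Python (illustrative names only, nothing from CUDA or any course): queue.Queue is internally locked, and an explicit Lock keeps the updates to the shared results list effectively single-threaded.

import queue
import threading

work = queue.Queue()
for chunk in range(8):              # break the work into chunks and queue them
    work.put(chunk)

results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            item = work.get_nowait()    # fetch the next item from the queue
        except queue.Empty:
            return
        value = item * item             # stand-in for "complete task"
        with results_lock:              # record the result under the lock
            results.append(value)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))    # [0, 1, 4, 9, 16, 25, 36, 49]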

Of course that’s just one simple example. So maybe you can argue that it doesn’t matter whether your wrong results are too negative or too positive. Wrong is wrong, right? :smile:

I’m about to get busy with “real life” for the next 4 or 5 hours, so I won’t be able to make any further progress for a while on the original question here.

So you know… I am not afraid to ask for help when I feel I need it:

And no, we replied at the same time, but that is just chance. Paul, go out and enjoy yourself :grin:.

Best,
-A

Sorry, I won’t be able to answer that question from the CUDA thread without basically taking that course myself, since you are using a bunch of technical terms that are not generic but are very specific to how their APIs work.

Yes, they’re on opposite ends of the number line.

@TMosh Damn it. For a short time everyone benefits (but I did not include the solution you need to pass).

And I get the positive/negative part. Perhaps I should be more blunt: why does it slide one way or the other? I think that has to be saying ‘something’… But I don’t know what it is…

Paul’s examples provide results for both +Inf and -Inf. You can study those to see what causes each.