def optimize(w, b, X, Y, num_iterations=100, learning_rate=0.009, print_cost=False):
    for i in range(num_iterations):
        ...
        # Record the costs
        if i % 100 == 0:
            costs.append(cost)
        ...
    return params, grads, costs

Logically, the costs array should have size 1, because the loop index i goes from 0 to 99 and i % 100 == 0 only holds at i = 0. But the error I get implies otherwise:

AssertionError: Wrong values for costs. [array(5.80154532), array(nan)] != [5.80154532, 0.31057104]

That error message is telling you that you got the correct cost on the first iteration, but you got NaN for the cost after 100 iterations. Most likely that means your logic for updating the parameters is incorrect somehow. E.g. did you add the gradient values instead of subtracting them?
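To see why the sign of the update matters, here is a toy sketch (my own snippet, not the assignment's actual optimize logic) that minimizes a simple quadratic cost J(w) = w^2. Subtracting the gradient converges toward the minimum; adding it runs away from it:

```python
# Toy example: J(w) = w**2, so the gradient is dw = 2*w.
def step(w, learning_rate, sign):
    dw = 2 * w                            # gradient of J(w) = w**2
    return w + sign * learning_rate * dw  # sign = -1 descends, +1 ascends

w_descend, w_ascend = 5.0, 5.0
for _ in range(50):
    w_descend = step(w_descend, 0.1, -1)  # subtract the gradient: converges
    w_ascend = step(w_ascend, 0.1, +1)    # add the gradient: diverges

print(w_descend)  # close to 0
print(w_ascend)   # very large
```

With the wrong sign, the parameters grow without bound, which pushes the sigmoid output to exactly 0 or 1 in floating point and sets up the NaN described below.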

The NaN is caused by getting 0 * log(0) from either the y * log(\hat{y}) term or the (1 - y) * log(1 - \hat{y}) term. log(0) is -\infty, but 0 * -\infty gives you NaN (not a number) in floating point.
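Here is a small numpy demonstration of that (my own snippet, not the course code): when the prediction saturates to exactly 0 or 1, one of the cross entropy terms becomes 0 * log(0), which is NaN.

```python
import numpy as np

y = np.array([0.0, 1.0])
y_hat = np.array([0.0, 1.0])  # a saturated prediction: exactly 0 and 1

# Suppress numpy's warnings so we can see the NaN values directly.
with np.errstate(divide='ignore', invalid='ignore'):
    loss_terms = y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)
print(loss_terms)  # [nan nan]

# A common workaround is to clip y_hat away from exactly 0 and 1:
eps = 1e-15
y_hat_safe = np.clip(y_hat, eps, 1 - eps)
loss_terms_safe = y * np.log(y_hat_safe) + (1 - y) * np.log(1 - y_hat_safe)
print(loss_terms_safe)  # finite values near 0
```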

Now that I think a bit harder about it, I'm not sure why adding will cause that, but I've seen this syndrome a number of times before and that's usually the cause.

Let me run the experiment myself and look at the actual numbers I get. Stay tuned!

Before I dig into the details of what actually happens in this case, here is my general experiment to show the behavior of things like divide by zero and how Inf and NaN propagate in floating point computations:

import numpy as np

# Tell numpy not to warn on divide-by-zero and invalid operations,
# so we can watch the Inf and NaN values propagate.
np.seterr(divide='ignore', invalid='ignore')

v = 42. * np.ones((1, 4), dtype='float64')
print(f"type(v) = {type(v)}")
print(f"v = {v}")
w = np.zeros((1, 4), dtype='float64')
print(f"type(w) = {type(w)}")
print(f"w = {w}")
z = v / w            # finite / 0 gives Inf
print(f"z = {z}")
a = -1. * z          # the sign propagates: -Inf
print(f"a = {a}")
b = z + 42.          # Inf + finite = Inf
print(f"b = {b}")
c = z - 42.          # Inf - finite = Inf
print(f"c = {c}")
d = z + z            # Inf + Inf = Inf
print(f"d = {d}")
e = z - z            # Inf - Inf = NaN
print(f"e = {e}")
f = z / z            # Inf / Inf = NaN
print(f"f = {f}")
g = np.log(w)        # log(0) = -Inf
print(f"g = {g}")
h = np.log(-1 * v)   # log of a negative number = NaN
print(f"h = {h}")

Running the above gives this:

type(v) = <class 'numpy.ndarray'>
v = [[42. 42. 42. 42.]]
type(w) = <class 'numpy.ndarray'>
w = [[0. 0. 0. 0.]]
z = [[inf inf inf inf]]
a = [[-inf -inf -inf -inf]]
b = [[inf inf inf inf]]
c = [[inf inf inf inf]]
d = [[inf inf inf inf]]
e = [[nan nan nan nan]]
f = [[nan nan nan nan]]
g = [[-inf -inf -inf -inf]]
h = [[nan nan nan nan]]

Note that this behavior is specified by the IEEE 754 spec, so it is not specific to numpy. It is implemented in the lower level math libraries and in the FP hardware. I have another cell that runs the same experiments in TF instead and the results are exactly the same.
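The same propagation rules show up with plain Python floats, no numpy required. (One caveat: dividing a plain Python float by 0.0 raises ZeroDivisionError rather than returning Inf, which is a Python language choice, not an IEEE 754 one, so this sketch starts from math.inf directly.)

```python
import math

inf = math.inf
print(inf + 42.0)   # inf
print(inf - 42.0)   # inf
print(inf + inf)    # inf
print(inf - inf)    # nan
print(inf / inf)    # nan
print(0.0 * inf)    # nan
```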

It's negative instead of positive. But seriously folks, it's a good question. From a functional standpoint, it probably doesn't make a lot of difference, since (as demonstrated above) you can't do arithmetic with it.

But to be slightly more serious, look at the graph of log(z) for 0 < z \leq 1:

So you can see that as z \rightarrow 0 the value of log(z) gets larger in magnitude, but the sign is negative for the entire domain 0 < z < 1. The closer z gets to 0, the bigger a negative number we get for log(z).
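A quick numeric check of that claim (my own snippet):

```python
import numpy as np

# log(z) is negative on (0, 1) and its magnitude grows as z approaches 0.
for z in [0.5, 0.1, 1e-5, 1e-10, 1e-100]:
    print(z, np.log(z))
```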

So from a mathematical standpoint, you'd express that using limits:

\lim_{z \rightarrow 0^+} \log(z) = -\infty

Okay, I get that part. But recently I have been thinking about "race conditions" -- so if the equation "heads south" (or to "true north"), is this telling us something?

I'm not sure what you are really asking. In my experience the term "race conditions" means you are doing parallelization and you have done things in a way that is non-deterministic. That's generally ok, although "race conditions" can also be bugs in your code, meaning you need to have locking to protect the update of certain shared "state" variables that are used by all the parallel threads.

Not sure how that applies to race conditions, but the point of my previous post about the difference between -\infty and +\infty was that the direction does matter, right?

Paul, I've been playing around with this CUDA stuff and it gets hard once you start to implement custom streams. At least the default stream is "always blocking", so you can rely on that. Beyond that point it is "not so easy" and you can have all sorts of values floating in at any time.

At first I kind of scoffed when they included an algorithm checker at the end of every routine; I get it now, though: it is not that hard to have the GPU produce results for you… but they could also be horribly wrong.

I guess what I was wondering, from a "practical" point of view: is -Inf = Inf? Or, no…

When you implement parallel algorithms, correctness matters, as it always does. But the point is it's a lot harder to get right. E.g. consider the classic "single queue, multi server" model for parallelizing a task. You break the work to be done up into chunks and create a queue of those chunks. Then you unleash your parallel worker threads and they each do:

Fetch next item from queue
Complete task
Record results
Repeat

There are (at least) two points in that process where all the threads are accessing a shared data structure: the work queue and the results. If you don't provide proper locking so that the updating of the queue and the results are actually single threaded, then your results are garbage.
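Here is a hypothetical Python sketch of that pattern (the names worker and results_lock are my own). queue.Queue already does its own internal locking for the fetch step, and an explicit Lock protects the shared results list:

```python
import threading
from queue import Queue, Empty

# Build the queue of work chunks.
work = Queue()
for chunk in range(100):
    work.put(chunk)

results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            item = work.get_nowait()  # fetch next item (Queue locks internally)
        except Empty:
            return                    # queue drained: this worker is done
        value = item * item           # stand-in for "complete task"
        with results_lock:            # record results under the lock
            results.append(value)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 100
```

Drop the results_lock and the append step can race, which is exactly the "garbage results" failure mode described above.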

Of course that's just one simple example. So maybe you can argue that it doesn't matter whether your wrong results are too negative or too positive. Wrong is wrong, right?

I'm about to get busy with "real life" for the next 4 or 5 hours, so I won't be able to make any further progress for a while on the original question here.

Sorry, I won't be able to answer that question from the CUDA thread without basically taking that course myself, since you are using a bunch of technical terms that are not generic but are very specific to how their APIs work.

@TMosh Damn it. For a short time everyone benefits (but I did not include the solution you need to pass);

And I get the positive/negative part. Perhaps I should be more blunt: why does it slide one way or another? I think that has to be saying "something"… but I don't know what it is…