It’s an interesting point and a good experiment to run! We are operating in floating point here, so there are at most 2^{32} or 2^{64} different values we can represent between -\infty and +\infty, depending on whether we use 32 bit or 64 bit floats (a bit fewer in practice, since some bit patterns are reserved for NaN). That’s pretty pathetic compared to the abstract beauty of \mathbb{R}. When we operate in a finite space like that, we have to deal with the issue of “numerical stability”. There can be different ways to express the same computation that are equivalent mathematically, but behave differently w.r.t. the propagation of rounding errors in a finite representation like any type of floating point.

The reason the `from_logits = True` mode is used is that it is more numerically stable. That means it gives results that are closer to the correct answers we would get if we could use \mathbb{R}. It’s also less code to write, so that’s the way Prof Ng will always do it when we’re using TF loss functions: the output layer omits the activation and the loss function computes both the activation (`sigmoid` or `softmax`) and the cross entropy loss as a unified computation.
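To make the stability point concrete, here is a small sketch in plain Python (64 bit floats, no TensorFlow needed). The function names are mine, and I'm assuming binary 0/1 labels. The "from logits" version uses the rearrangement max(z, 0) - z*y + log(1 + e^{-|z|}), which is the form described in the TF docs for `tf.nn.sigmoid_cross_entropy_with_logits`; treat this as an illustration of the idea, not a copy of what Keras does internally.

```python
import math

def sigmoid(z):
    # Plain logistic function; fine for the moderate z values used here.
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(y, a):
    # Two-step version: the activation a = sigmoid(z) was computed separately.
    # p is the probability assigned to the true class (y is assumed to be 0 or 1).
    p = a if y == 1 else 1.0 - a
    # If rounding pushed p all the way to 0, the loss blows up.
    return -math.log(p) if p > 0.0 else math.inf

def bce_from_logits(y, z):
    # Unified version: mathematically the same loss, but it never forms
    # sigmoid(z) explicitly, so nothing saturates to exactly 0 or 1.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Moderate logit: both versions agree to many decimal places.
print(bce_from_probability(1, sigmoid(2.0)), bce_from_logits(1, 2.0))

# Large logit with the "wrong" label: sigmoid(40) rounds to exactly 1.0,
# so the two-step version loses all information, the unified one does not.
print(bce_from_probability(0, sigmoid(40.0)), bce_from_logits(0, 40.0))
```

With 32 bit floats the saturation happens at much smaller logits, and well before that point the two versions already start to disagree in the low-order digits.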

BTW numerical stability may sound like a bunch of hand-waving, but it’s actually not. In the subfield of math called Numerical Analysis, there is a way to reason precisely about the error propagation properties of different computations.

They only show the expected value to 6 decimal places and your answer rounds to the same value, but notice that they use 10^{-7} as the error threshold in the test. Try it again with the `from_logits = True` mode; I’d expect that answer to differ from the `False` answer at around the 7th decimal place. You can print your loss value with a higher resolution than the default 6 decimal places to confirm this theory:

`print("total_loss = {:0.10f}".format(total_loss))`
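Here is a quick sketch of why the extra digits matter. The two loss values below are made up purely for illustration (they are not the assignment's numbers), and `threshold` just mirrors the 10^{-7} tolerance mentioned above.

```python
# Hypothetical values: two losses that look identical at 6 decimal places.
loss_a = 0.81028710
loss_b = 0.81028735
threshold = 1e-7  # same size as the tolerance used by the test

# At the default-looking 6 decimal places the two are indistinguishable.
print("{:0.6f} vs {:0.6f}".format(loss_a, loss_b))

# At 10 decimal places the difference shows up.
print("{:0.10f} vs {:0.10f}".format(loss_a, loss_b))

# And that difference is large enough to fail a 1e-7 comparison.
print("fails threshold:", abs(loss_a - loss_b) > threshold)
```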