Size of the expected output does not make sense

I think we must just be missing your point here. If you have a real number that you are trying to approximate, there are two ways your approximation can be wrong: it can be too low or too high, right? And if you are getting real divergence instead of convergence (meaning that your learning rate is way too high or, as in Chay’s case here, the algorithm is just broken), it can go in either direction, right?

He was adding the gradients instead of subtracting them, so the updates were moving in the opposite direction from the one that would make the answer better. But that new, more wrong answer can end up being either more of an underestimate or more of an overestimate.
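If it helps to see that concretely, here is a minimal sketch (my own toy example, not Chay’s code) of minimizing f(w) = (w - 3)^2, where flipping the sign of the update turns descent into ascent and the iterate runs away from the target instead of toward it:

```python
# Toy illustration: minimize f(w) = (w - 3)**2 with gradient steps.
# The `sign` argument is hypothetical; it just models the
# "adding instead of subtracting the gradient" bug.
def run(sign, w0=0.0, lr=0.1, steps=20):
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)        # derivative of (w - 3)**2
        w = w + sign * lr * grad  # sign=-1 is correct descent, sign=+1 is the bug
    return w

print(run(-1))  # approaches 3 (converges)
print(run(+1))  # runs away from 3; whether it overshoots up or down depends on w0
```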

@paulinpaloalto I get that part, but whether we head north or south must be signalling different things, like ‘gradient explosion’ vs. ‘vanishing gradients’, no? I mean, going to infinity is an extreme, extreme case. But the direction itself is an apparent signal.

The other thing to remember here is that the real cause of the NaN when dealing with cross entropy loss is floating point “saturation”. If you look at the cost function, there are two terms: the y = 1 term and the y = 0 term. In the first you have log(\hat{y}) and in the second you have log(1 - \hat{y}), so if you ever hit a case in which \hat{y} = 1 or \hat{y} = 0 exactly, then you end up with log(0) which is -Inf. And as I demonstrated earlier, if you get unlucky and that is multiplied by 0 in the form of either y or (1 - y), then you end up with NaN. That must be what is happening in Chay’s case, which I hope to demonstrate once I find the time.
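To make that concrete, here is a minimal NumPy sketch (toy values, not Chay’s actual data) of the per-example cross entropy terms when \hat{y} has saturated to exactly 1:

```python
import numpy as np

# Toy values only: the second prediction has "saturated" to exactly 1.0
# while its label is also 1; the third is a saturated wrong answer.
y = np.array([1.0, 1.0, 0.0])
y_hat = np.array([0.9, 1.0, 1.0])

# Per-example loss: -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
# NumPy will emit RuntimeWarnings for the log(0) and 0 * -Inf cases.
losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(losses)  # [0.105...  nan  inf] -> 0 * log(0) gives NaN, 1 * log(0) gives Inf
```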

Well, I think you mean ‘overflow’, but we can talk about this another time; we are ‘talking shop’ about what is really probably ‘not the student’s problem’.

In any case – respect

For anyone here who doesn’t know me-- you are lucky to have a bunch of very intelligent volunteer mentors, but when I am unsure about something… man, like a shark, I bite.

This, however, is not representative of any of my colleagues :grin:

Overflow means the number you need to represent is too large to be represented in the type of floating point you are using. In binary64 that would be 1.8 * 10^{308} or thereabouts. I’m describing a different phenomenon: the precision available in floating point is too limited to represent the difference you need. The value of \hat{y} is the output of sigmoid, so mathematically it can never be exactly 0 or 1, right? But in floating point it can “saturate” and round to exactly 0 or 1. Or maybe you could call that a form of “underflow”, but “saturation” is the term I’ve seen used.
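Here is a quick way to see the saturation in binary64 (assuming a plain NumPy sigmoid; the exact cutoffs depend on the implementation):

```python
import numpy as np

# Mathematically sigmoid(z) is strictly between 0 and 1, but in binary64
# it rounds to exactly 1.0 (or 0.0) once |z| is large enough.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

print(sigmoid(30.0))           # ~0.9999999999999, still strictly less than 1
print(sigmoid(40.0) == 1.0)    # True  -> rounded ("saturated") to exactly 1
print(sigmoid(-800.0) == 0.0)  # True  -> exp(800) overflows to Inf, so 1/Inf = 0
```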

I cannot think how else you get to NaN, unless it is overflow, or you are doing something ‘non-computational’. Perhaps you can educate me.

(Or perhaps, as you suggest, it is ‘underflow’… But that would be a really strange thing, exclusive to Neural Nets-- or at least I have never had to deal with that problem previously, but now you have me considering…)

I demonstrated the phenomenon earlier in this thread. If you try to compute 0 * Inf, that gives you NaN. 42 * Inf is Inf, but 0 * Inf just doesn’t make any sense. It is “Not a Number”.
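A tiny illustration of those rules (not the original examples from earlier in the thread, just the same idea in plain Python, whose floats are IEEE 754 binary64):

```python
import math

inf = float("inf")
print(42.0 * inf)            # inf   -> finite nonzero times Inf is still Inf
print(0.0 * inf)             # nan   -> 0 * Inf is undefined under IEEE 754, so NaN
print(math.nan == math.nan)  # False -> NaN compares unequal to everything, even itself
```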

Please go back and actually study those examples I gave. I demonstrated a number of things about how Inf and NaN behave in IEEE 754 floating point.

This really isn’t worth spending this much energy on. If you get either Inf or NaN it means something is wrong and you need to figure out what it was that went wrong. There are cases in which you know it is a risk because of “saturation” and you can defend yourself against it. As an example, here’s a thread which talks about that.
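For what it’s worth, one common way to defend against the saturation case (a sketch, not necessarily the approach taken in the linked thread) is to nudge \hat{y} away from exactly 0 and 1 before taking the log:

```python
import numpy as np

# Sketch of a "clipped" cross-entropy cost. The eps value is an arbitrary
# small constant chosen for illustration.
def safe_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep y_hat strictly inside (0, 1)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0])
y_hat = np.array([1.0, 0.0])         # fully saturated (but correct) predictions
print(safe_cross_entropy(y, y_hat))  # tiny finite number instead of NaN
```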

It is not, Paul, because I meant ‘no offense’ to you or anyone.

As I am ever so slightly unsure this morning, I just wished to clarify: I am lucky to have gotten to know Paul a bit-- so sometimes I joust with/tease him a little.

Otherwise, it is only ‘intellectually’ that I am a shark, or that I take the bait.

Sorry if metaphors are ‘verboten’.