Hi,
What do they mean by the word “Saturation” in the AlexNet Paper? It has been used a few times. Does it mean unnormalized?
The term “saturation” in that context refers to the way the “tails” of the activation functions tanh and sigmoid flatten out and become asymptotic to horizontal lines as |z| \rightarrow \infty. Of course mathematically sigmoid(z) is never exactly equal to 0 or 1, but in floating point it can round to 0 or 1. In 64-bit floating point, it only takes z > 36 (or somewhere close to that) to “saturate”, meaning to round to 1; on the negative side you have to go quite a bit further before it rounds to 0. There are two problems with saturation. The simple one is that if \hat{y} rounds to exactly 0 or 1, the cross-entropy loss ends up taking \log(0), which gives you -\infty or NaN as the output. You can fix that by checking for the saturation case and clipping \hat{y} away from 0 and 1 by a very small \epsilon. The harder problem to fix is that the gradients are so close to zero in those regions that convergence takes forever.
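Here is a minimal NumPy sketch of both issues, just as an illustration (the threshold values and the choice of \epsilon are mine, not anything from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Positive side: in 64-bit floats, 1 - sigmoid(z) drops below machine epsilon
# once z is in the high 30s, so the result rounds to exactly 1.0.
print(sigmoid(30.0))          # 0.9999999999999064 (not yet saturated)
print(sigmoid(37.0) == 1.0)   # True -> "saturated"

# A saturated prediction breaks cross-entropy: log(1 - y_hat) = log(0) = -inf.
y_true, y_hat = 0.0, sigmoid(37.0)
# loss = -(y_true*np.log(y_hat) + (1 - y_true)*np.log(1 - y_hat))  # -> inf

# One common fix: clip y_hat away from the endpoints by a small epsilon.
eps = 1e-12
y_hat = np.clip(y_hat, eps, 1.0 - eps)
loss = -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))
print(loss)   # ~27.6, finite
```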
There is a connection with normalization in the sense that (in general) you’re more likely to have saturation problems with non-normalized inputs (e.g. images with uint8 pixel values instead of pixel values “standardized” to the range [0, 1] or [-1, 1]). Normalization can help you get away from those issues. The other important thing they discuss in the paper is using ReLU for the hidden-layer activations. Of course you’re still stuck at the output layer with softmax (which is just the higher-dimensional equivalent of sigmoid), but not taking the product of lots of small gradients in the hidden layers makes the problem much easier to cope with.
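For concreteness, here is a short sketch of the kind of input standardization I mean, on a made-up batch of uint8 images (the shapes and the per-channel standardization choice are just assumptions for illustration):

```python
import numpy as np

# Fake batch of uint8 images, values in [0, 255].
raw = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

x01 = raw.astype(np.float32) / 255.0   # scale to [0, 1]
x11 = x01 * 2.0 - 1.0                  # or shift/scale to [-1, 1]

# Per-channel standardization (zero mean, unit variance) is another common choice.
mean = x01.mean(axis=(0, 1, 2), keepdims=True)
std = x01.std(axis=(0, 1, 2), keepdims=True) + 1e-8
x_std = (x01 - mean) / std

print(x01.min(), x01.max())   # within [0, 1]
print(x11.min(), x11.max())   # within [-1, 1]
```

With inputs in a small, centered range like this, the pre-activation values z are much less likely to land way out in the flat tails of tanh/sigmoid on the first forward passes.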