Deep Learning specialization course softmax implementation has `z - np.max(z)`. Why?

I am working on the RNN character generation assignment, and I just noticed that in the utils.py used in the Deep Learning Specialization, the softmax implementation subtracts the maximum value of the vector from the input z before exponentiating in the numerator. Why is that? How does it differ from the simple / “original” implementation without this “normalization”? Is this normalization the recommended way of implementing softmax?

Yes, that is a common way to implement softmax. The short answer for why is that it makes the computation numerically reliable when you are executing it in the finite representation space of floating point, as opposed to the abstract beauty of \mathbb{R}.

You can easily prove that softmax(x) = softmax(x - a) for any constant a. (I leave that as an exercise for the reader.) The reason they subtract max(x) is that it prevents floating point overflow. Computing softmax obviously involves evaluating e^x for various values of x. Even in 64 bit floating point, it doesn’t take a very large value of x for e^x to overflow and generate Inf and cause the overall computation to return NaN. Subtracting max(x) gives values that are \leq 0, which fixes that problem. The values of e^x are tractable for x \leq 0.
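
To make that concrete, here is a minimal toy sketch (my own example, not the course code) comparing the plain version with the max-subtracted version on a float32 input that contains one large value:

import numpy as np

def softmax_plain(x):
    e_x = np.exp(x)                      # exp(100) overflows to inf in float32
    return e_x / e_x.sum(axis=0)

def softmax_shifted(x):
    e_x = np.exp(x - np.max(x))          # every exponent is <= 0, so no overflow
    return e_x / e_x.sum(axis=0)

x = np.array([1.0, 50.0, 100.0], dtype=np.float32)
print(softmax_plain(x))     # [ 0.  0. nan] -- inf / inf gives NaN (plus a RuntimeWarning)
print(softmax_shifted(x))   # roughly [0, 2e-22, 1], a valid probability distribution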

1 Like

Just out of curiosity, I went back and reran the experiments and the numbers are actually bigger than I remembered:

import numpy as np

z = np.array([-0.5, 0, 0.5, 1, 10, 38, 50, 75, 90]).astype(np.float32)
print(f"z = {z}")
print(f"np.exp(z) = {np.exp(z)}")
z = np.array([-0.5, 0, 0.5, 1, 10, 38, 90, 120, 150, 500, 750]).astype(np.float64)
print(f"z = {z}")
print(f"np.exp(z) = {np.exp(z)}")

z = [-0.5  0.   0.5  1.  10.  38.  50.  75.  90. ]
np.exp(z) = [6.0653067e-01 1.0000000e+00 1.6487212e+00 2.7182820e+00 2.2026467e+04
 3.1855931e+16 5.1847055e+21 3.7332418e+32           inf]
z = [-5.0e-01  0.0e+00  5.0e-01  1.0e+00  1.0e+01  3.8e+01  9.0e+01  1.2e+02
  1.5e+02  5.0e+02  7.5e+02]
np.exp(z) = [6.06530660e-001 1.00000000e+000 1.64872127e+000 2.71828183e+000
 2.20264658e+004 3.18559318e+016 1.22040329e+039 1.30418088e+052
 1.39370958e+065 1.40359222e+217             inf]

As we see, in 32 bit floating point, you get Inf somewhere between 75 and 90.

For 64 bit, you have to go quite a bit further.

What I was remembering is that the cross entropy loss function goes off the rails at z > 36 even in 64 bit FP.

import numpy as np

# standard logistic sigmoid, included here so the snippet is self-contained
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# probe the limits of saturation
v = np.array([35.0, 36.0, 37.0, 38.0]).astype('float64')
print(f"sigmoid(v) = {sigmoid(v)}")
print(f"np.log(1 - sigmoid(v)) = {np.log(1 - sigmoid(v))}")
v = np.array([15.0, 16.0, 17.0, 18.0]).astype('float32')
print(f"sigmoid(v) = {sigmoid(v)}")
print(f"np.log(1 - sigmoid(v)) = {np.log(1 - sigmoid(v))}")
sigmoid(v) = [1. 1. 1. 1.]
np.log(1 - sigmoid(v)) = [-34.9450411  -36.04365339         -inf         -inf]
sigmoid(v) = [0.99999964 0.9999999  1.         1.        ]
np.log(1 - sigmoid(v)) = [-14.843773 -15.942385       -inf       -inf]
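
As an aside, if you ever need log(1 - sigmoid(v)) without the saturation, one standard trick (not part of the course code) is to use 1 - sigmoid(v) = sigmoid(-v) = 1 / (1 + e^v), so that log(1 - sigmoid(v)) = -log(1 + e^v), which np.logaddexp evaluates stably. A quick sketch with the same probe values:

import numpy as np

v = np.array([35.0, 36.0, 37.0, 38.0]).astype('float64')
# log(1 - sigmoid(v)) = log(sigmoid(-v)) = -log(1 + e^v) = -logaddexp(0, v)
print(-np.logaddexp(0.0, v))   # approximately [-35. -36. -37. -38.], no -inf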
1 Like

There is a misunderstanding here. I was referring to e_x = np.exp(x - np.max(x)), not softmax(x - max(x)).

I created the following test snippet:

def SoftmaxTests():
    v = numpy.array([35.0, 36.0, 37.0, 38.0]).astype('float64')
    print(f"softmax(v): {softmax(v)}")
    print(f"softmax1(v): {softmax1(v)}")
    # assert (softmax(v) == softmax1(v)).all()  # raises AssertionError

    print(f"softmax(v): {softmax(v)}")
    print(f"softmax(v - max(v)): {softmax(v - numpy.max(v))}")
    # assert (softmax(v) == softmax(v - numpy.max(v))).all()  # raises AssertionError

    print(f"softmax1(v): {softmax1(v)}")
    print(f"softmax1(v - max(v)): {softmax1(v - numpy.max(v))}")
    assert (softmax1(v) == softmax1(v - numpy.max(v))).all()

softmax: the plain version, without the - max(z) normalization
softmax1: the version with the - max(z) normalization

The console output shows that the result values are the “same”:

softmax(v): [0.0320586  0.08714432 0.23688282 0.64391426]
softmax1(v): [0.0320586  0.08714432 0.23688282 0.64391426]

softmax(v): [0.0320586  0.08714432 0.23688282 0.64391426]
softmax(v - max(v)): [0.0320586  0.08714432 0.23688282 0.64391426]

softmax1(v): [0.0320586  0.08714432 0.23688282 0.64391426]
softmax1(v - max(v)): [0.0320586  0.08714432 0.23688282 0.64391426]

Notice that all assertions which involve softmax(v) raise an AssertionError, although the console shows the values match element-wise. My guess is that the comparison uses the full precision of the float64 values, which the console output does not show, and the mismatch happens somewhere down in the low-order digits due to overflow.

In contrast, the last assertion never fails, which could be evidence that there is no overflow.
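
One way to check this is to compare the two results with a tolerance instead of exact equality:

import numpy

v = numpy.array([35.0, 36.0, 37.0, 38.0]).astype('float64')

e1 = numpy.exp(v)                   # plain version
a = e1 / e1.sum()
e2 = numpy.exp(v - numpy.max(v))    # max-subtracted version
b = e2 / e2.sum()

print((a == b).all())               # may be False: the two paths round differently in the last bits
print(numpy.allclose(a, b))         # True: they agree to within floating point tolerance
print(numpy.abs(a - b).max())       # tiny, a few ULPs (around 1e-16 or less)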

However, what catches my attention and is worth investigating here is that I put a try/except block to catch any OverflowError, but it never throws that error at all, and I don't understand why:

    try:
        ez = numpy.exp(z)
        return ez / numpy.sum(ez)
    except OverflowError as e:
        print(f"Overflow! {e}")
    except Exception as e:
        print(e)

Try that test with float32 instead of float64. If you followed the tests that I showed, it requires x = 750 in order to get overflow in float64, which is what you’re using. So it’s not surprising that your test does not get overflow.

Also note that in the results I showed above, in the float32 case it takes x = 90 to get overflow and your values max out at 38.

Here is the code we are discussing from utils.py:

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

That is mathematically equivalent to softmax(x - np.max(x)), if we started with the plain vanilla version of softmax:

def softmax(x):
    e_x = np.exp(x)
    return e_x / e_x.sum(axis=0)

That was my point.

Have you taken the trouble to prove the theorem that I stated in my original response? It’s just a simple application of the laws of exponents:

e^{x - a} = e^x \cdot e^{-a} = \displaystyle \frac {e^x}{e^a}

Now carry that through the full formula for softmax and you’ll see that you get a factor of \displaystyle \frac {1}{e^a} in both numerator and denominator, which cancel out.

1 Like

Following up on your experiment, here’s my version:

z = np.array([-0.5, 0, 0.5, 1, 10, 38, 50, 75, 90]).astype(np.float32)
print(f"z = {z}")
print(f"np.exp(z) = {np.exp(z)}")

try:
    ez = np.exp(z)
    smax = ez / np.sum(ez)
    print(f"smax {smax}")
except OverflowError as e:
    print(f"Overflow! {e}")
except Exception as e:
    print(f"Exception {e}")

And here is what that generates:

z = [-0.5  0.   0.5  1.  10.  38.  50.  75.  90. ]
np.exp(z) = [6.0653067e-01 1.0000000e+00 1.6487212e+00 2.7182820e+00 2.2026467e+04
 3.1855931e+16 5.1847055e+21 3.7332418e+32           inf]
smax [ 0.  0.  0.  0.  0.  0.  0.  0. nan]

So what we conclude is that floating point operations producing Inf or NaN do not generate a Python exception, and that x = 90 is enough to get +\infty from np.exp in float32 mode, but x = 38 is not.

Let a = \max(x). Then by the laws of exponents:

e^{x - a} = e^x \cdot e^{-a} = \displaystyle \frac {e^x}{e^a}

softmax(x - a) = \displaystyle \frac {e^x / e^a}{\sum (e^x / e^a)} = \frac {e^x / e^a}{(1 / e^a) \sum e^x} = \frac {e^x}{\sum e^x} = softmax(x)

So softmax(x) = softmax(x - a) for any constant a, and in particular for a = \max(x).

Is that a Python bug? When is OverflowError thrown?

You can read up on OverflowError on the Python documentation site. It sounds like it applies only to native Python arithmetic. Numpy is a separate package and has its own rules: it obeys the IEEE 754 standard, which specifies when Inf and NaN are generated. Exceptions are not thrown when that happens. You can check your results with numpy functions like isinf and isnan.
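
For example, here is a sketch of how you could detect those values, and (if you really want an exception) how to ask numpy to raise on overflow via np.errstate — that last part goes beyond what we discussed above, but it is standard numpy:

import numpy as np

z = np.array([1.0, 50.0, 100.0], dtype=np.float32)

ez = np.exp(z)                        # produces inf for the last entry, with only a RuntimeWarning
print(np.isinf(ez))                   # [False False  True]
print(np.isnan(ez / ez.sum()))        # [False False  True] -- inf / inf is NaN

# Opt in to an exception instead of a silent inf:
with np.errstate(over='raise'):
    try:
        np.exp(z)
    except FloatingPointError as e:
        print(f"Overflow! {e}")       # e.g. "overflow encountered in exp"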

3 Likes