Deep Learning specialization course softmax implementation has `z - np.max(z)`. Why?

I am working on the RNN character generation assignment, and I just noticed that in the utils.py used in the Deep Learning Specialization, the softmax implementation subtracts the maximum value of the vector from the input z before exponentiating in the numerator. Why is that? How does it differ from the simple / “original” implementation without this “normalization”? Is this normalization the recommended way of implementing softmax?

Yes, that is a common way to implement softmax. The short answer for why is that it makes the computation numerically reliable when you are executing it in the finite representation space of floating point, as opposed to the abstract beauty of \mathbb{R}.

You can easily prove that softmax(x) = softmax(x - a) for any constant a. (I leave that as an exercise for the reader.) The reason they subtract max(x) is that it prevents floating point overflow. Computing softmax obviously involves evaluating e^x for various values of x. Even in 64 bit floating point, it doesn’t take a very large value of x for e^x to overflow and generate Inf and cause the overall computation to return NaN. Subtracting max(x) gives values that are \leq 0, which fixes that problem. The values of e^x are tractable for x \leq 0.
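
To make that concrete, here is a minimal toy sketch (my own example, not the course code) comparing the plain version with the max-subtracted version on a float32 input that contains one large value:

import numpy as np

def softmax_plain(x):
    e_x = np.exp(x)                      # exp(100) overflows to inf in float32
    return e_x / e_x.sum(axis=0)

def softmax_shifted(x):
    e_x = np.exp(x - np.max(x))          # every exponent is <= 0, so no overflow
    return e_x / e_x.sum(axis=0)

x = np.array([1.0, 50.0, 100.0], dtype=np.float32)
print(softmax_plain(x))     # [ 0.  0. nan] -- inf / inf gives NaN (plus a RuntimeWarning)
print(softmax_shifted(x))   # roughly [0, 2e-22, 1], a valid probability distribution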

1 Like

Just out of curiosity, I went back and reran the experiments and the numbers are actually bigger than I remembered:

import numpy as np

z = np.array([-0.5, 0, 0.5, 1, 10, 38, 50, 75, 90]).astype(np.float32)
print(f"z = {z}")
print(f"np.exp(z) = {np.exp(z)}")
z = np.array([-0.5, 0, 0.5, 1, 10, 38, 90, 120, 150, 500, 750]).astype(np.float64)
print(f"z = {z}")
print(f"np.exp(z) = {np.exp(z)}")

z = [-0.5  0.   0.5  1.  10.  38.  50.  75.  90. ]
np.exp(z) = [6.0653067e-01 1.0000000e+00 1.6487212e+00 2.7182820e+00 2.2026467e+04
 3.1855931e+16 5.1847055e+21 3.7332418e+32           inf]
z = [-5.0e-01  0.0e+00  5.0e-01  1.0e+00  1.0e+01  3.8e+01  9.0e+01  1.2e+02
  1.5e+02  5.0e+02  7.5e+02]
np.exp(z) = [6.06530660e-001 1.00000000e+000 1.64872127e+000 2.71828183e+000
 2.20264658e+004 3.18559318e+016 1.22040329e+039 1.30418088e+052
 1.39370958e+065 1.40359222e+217             inf]

As we see, in 32 bit floating point, you get Inf somewhere between 75 and 90.

For 64 bit, you have to go quite a bit further.

What I was remembering is that the cross entropy loss function goes off the rails at z > 36 even in 64 bit FP.

import numpy as np

# standard logistic sigmoid, included here so the snippet is self-contained
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# probe the limits of saturation
v = np.array([35.0, 36.0, 37.0, 38.0]).astype('float64')
print(f"sigmoid(v) = {sigmoid(v)}")
print(f"np.log(1 - sigmoid(v)) = {np.log(1 - sigmoid(v))}")
v = np.array([15.0, 16.0, 17.0, 18.0]).astype('float32')
print(f"sigmoid(v) = {sigmoid(v)}")
print(f"np.log(1 - sigmoid(v)) = {np.log(1 - sigmoid(v))}")
sigmoid(v) = [1. 1. 1. 1.]
np.log(1 - sigmoid(v)) = [-34.9450411  -36.04365339         -inf         -inf]
sigmoid(v) = [0.99999964 0.9999999  1.         1.        ]
np.log(1 - sigmoid(v)) = [-14.843773 -15.942385       -inf       -inf]
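
As an aside, if you ever need log(1 - sigmoid(v)) without the saturation, one standard trick (not part of the course code) is to use 1 - sigmoid(v) = sigmoid(-v) = 1 / (1 + e^v), so that log(1 - sigmoid(v)) = -log(1 + e^v), which np.logaddexp evaluates stably. A quick sketch with the same probe values:

import numpy as np

v = np.array([35.0, 36.0, 37.0, 38.0]).astype('float64')
# log(1 - sigmoid(v)) = log(sigmoid(-v)) = -log(1 + e^v) = -logaddexp(0, v)
print(-np.logaddexp(0.0, v))   # approximately [-35. -36. -37. -38.], no -inf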
1 Like

There is a misunderstanding here. I was referring to e_x = np.exp(x - np.max(x)), not softmax(x - max(x)).

I created the following test snippet:

def SoftmaxTests():
    v = numpy.array([35.0, 36.0, 37.0, 38.0]).astype('float64')
    print(f"softmax(v): {softmax(v)}")
    print(f"softmax1(v): {softmax1(v)}")
    # assert (softmax(v) == softmax1(v)).all()  # raises AssertionError

    print(f"softmax(v): {softmax(v)}")
    print(f"softmax(v - max(v)): {softmax(v - numpy.max(v))}")
    # assert (softmax(v) == softmax(v - numpy.max(v))).all()  # raises AssertionError

    print(f"softmax1(v): {softmax1(v)}")
    print(f"softmax1(v - max(v)): {softmax1(v - numpy.max(v))}")
    assert (softmax1(v) == softmax1(v - numpy.max(v))).all()

softmax: the plain version, without the - max(z) normalization
softmax1: the version with the - max(z) normalization

The console output shows that the result values are the “same”:

softmax(v): [0.0320586  0.08714432 0.23688282 0.64391426]
softmax1(v): [0.0320586  0.08714432 0.23688282 0.64391426]

softmax(v): [0.0320586  0.08714432 0.23688282 0.64391426]
softmax(v - max(v)): [0.0320586  0.08714432 0.23688282 0.64391426]

softmax1(v): [0.0320586  0.08714432 0.23688282 0.64391426]
softmax1(v - max(v)): [0.0320586  0.08714432 0.23688282 0.64391426]

Notice that all assertions which involve softmax(v) raise an AssertionError, although the console shows the values match element-wise. My guess is that the comparison uses the full precision of the float64 values, which the console output does not show, and the mismatch happens somewhere down in the low-order digits due to overflow.

In contrast, the last assertion never fails, which could be evidence that there is no overflow.
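
One way to check this is to compare the two results with a tolerance instead of exact equality:

import numpy

v = numpy.array([35.0, 36.0, 37.0, 38.0]).astype('float64')

e1 = numpy.exp(v)                   # plain version
a = e1 / e1.sum()
e2 = numpy.exp(v - numpy.max(v))    # max-subtracted version
b = e2 / e2.sum()

print((a == b).all())               # may be False: the two paths round differently in the last bits
print(numpy.allclose(a, b))         # True: they agree to within floating point tolerance
print(numpy.abs(a - b).max())       # tiny, a few ULPs (around 1e-16 or less)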

However, what catches my attention and is worth investigating here is that I put a try/except block to catch any OverflowError, but it never throws that error at all, and I don't understand why:

    try:
        ez = numpy.exp(z)
        return ez / numpy.sum(ez)
    except OverflowError as e:
        print(f"Overflow! {e}")
    except Exception as e:
        print(e)

Try that test with float32 instead of float64. If you followed the tests that I showed, it requires x = 750 in order to get overflow in float64, which is what you’re using. So it’s not surprising that your test does not get overflow.

Also note that in the results I showed above, in the float32 case it takes x = 90 to get overflow and your values max out at 38.

Here is the code we are discussing from utils.py:

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

That is mathematically equivalent to softmax(x - np.max(x)), if we started with the plain vanilla version of softmax:

def softmax(x):
    e_x = np.exp(x)
    return e_x / e_x.sum(axis=0)

That was my point.

Have you taken the trouble to prove the theorem that I stated in my original response? It’s just a simple application of the laws of exponents:

e^{x - a} = e^x \cdot e^{-a} = \displaystyle \frac {e^x}{e^a}

Now carry that through the full formula for softmax and you’ll see that you get a factor of \displaystyle \frac {1}{e^a} in both numerator and denominator, which cancel out.

1 Like

Following up on your experiment, here’s my version:

z = np.array([-0.5, 0, 0.5, 1, 10, 38, 50, 75, 90]).astype(np.float32)
print(f"z = {z}")
print(f"np.exp(z) = {np.exp(z)}")

try:
    ez = np.exp(z)
    smax = ez / np.sum(ez)
    print(f"smax {smax}")
except OverflowError as e:
    print(f"Overflow! {e}")
except Exception as e:
    print(f"Exception {e}")

And here is what that generates:

z = [-0.5  0.   0.5  1.  10.  38.  50.  75.  90. ]
np.exp(z) = [6.0653067e-01 1.0000000e+00 1.6487212e+00 2.7182820e+00 2.2026467e+04
 3.1855931e+16 5.1847055e+21 3.7332418e+32           inf]
smax [ 0.  0.  0.  0.  0.  0.  0.  0. nan]

So what we conclude is that floating point operations producing Inf or NaN do not generate a Python exception, and that x = 90 is enough to get +\infty from np.exp in float32 mode, but x = 38 is not.

Let a = \max(x). Then by the laws of exponents:

e^{x - a} = e^x \cdot e^{-a} = \displaystyle \frac {e^x}{e^a}

softmax(x - a) = \displaystyle \frac {e^x / e^a}{\sum (e^x / e^a)} = \frac {e^x / e^a}{(1 / e^a) \sum e^x} = \frac {e^x}{\sum e^x} = softmax(x)

So softmax(x) = softmax(x - a) for any constant a, and in particular for a = \max(x).

Is that a Python bug? When is OverflowError thrown?

You can read up on OverflowError on the Python documentation site. It sounds like it applies only to native Python arithmetic. Numpy is a separate package and has its own rules: it obeys the IEEE 754 standard, which specifies when Inf and NaN are generated. Exceptions are not thrown when that happens. You can check your results with numpy functions like isinf and isnan.
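
For example, here is a sketch of how you could detect those values, and (if you really want an exception) how to ask numpy to raise on overflow via np.errstate — that last part goes beyond what we discussed above, but it is standard numpy:

import numpy as np

z = np.array([1.0, 50.0, 100.0], dtype=np.float32)

ez = np.exp(z)                        # produces inf for the last entry, with only a RuntimeWarning
print(np.isinf(ez))                   # [False False  True]
print(np.isnan(ez / ez.sum()))        # [False False  True] -- inf / inf is NaN

# Opt in to an exception instead of a silent inf:
with np.errstate(over='raise'):
    try:
        np.exp(z)
    except FloatingPointError as e:
        print(f"Overflow! {e}")       # e.g. "overflow encountered in exp"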

3 Likes