`log_softmax` in Dense layer + `from_logits=True` in cross-entropy loss

I see the NLP course quite often uses the combination of `log_softmax` in the Dense layer with `from_logits=True` in the cross-entropy loss function in order to have a stable computation of the softmax. How does that compare with using a linear activation in the Dense layer with `from_logits=True` in the cross-entropy loss function? Isn't there a duplicate "softmax" in the first case, since the cross-entropy loss function will perform the softmax calculation itself if `from_logits=True`?

Note that log_softmax is not the same as softmax: it is the logarithm of the softmax values. Since softmax outputs lie between 0 and 1, their logs will all be negative, right?
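To make that concrete, here is a minimal numpy sketch (my own illustration, not course code) showing that log_softmax is just the log of the softmax values, computed in a numerically stable way, and that those values are all negative:

import numpy as np

x = np.array([9., 2., 5., 0., 0.])

# softmax with the usual max-subtraction for numerical stability
e = np.exp(x - np.max(x))
sm = e / np.sum(e)

# log_softmax computed directly via the log-sum-exp trick
log_sm = x - np.max(x) - np.log(np.sum(np.exp(x - np.max(x))))

print(sm)                               # values in (0, 1), summing to 1
print(log_sm)                           # all values <= 0
print(np.allclose(log_sm, np.log(sm)))  # True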

Now I have not actually taken NLP C3, and I don't remember any uses of log_softmax in DLS C5, so I'm not sure why they use it here. But since it is not the same as softmax, you still need the softmax to happen at the cost function level.
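For reference, with from_logits=True the cross-entropy loss conceptually applies a stable log_softmax to whatever the output layer produces and then takes the negative log-probability of the true class. Here is a rough numpy sketch of that idea (an illustration of the math, not TensorFlow's actual implementation):

import numpy as np

def sparse_ce_from_logits(logits, label):
    # Stable log_softmax: shift by the max, then subtract log-sum-exp
    z = logits - np.max(logits)
    log_probs = z - np.log(np.sum(np.exp(z)))
    # Cross-entropy for a single example is the negative
    # log-probability assigned to the true class
    return -log_probs[label]

logits = np.array([9., 2., 5., 0., 0.])
print(sparse_ce_from_logits(logits, label=0))  # small loss: class 0 dominates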

One other subtlety to note here is that softmax is a monotonic function. If you take the softmax of softmax outputs, the max value is still at the same index, although each subsequent application of softmax shrinks the difference between the max value and the rest of the values.

Here’s an example:

# Run some experiments
import numpy as np

def softmax(z):
    # Row-wise softmax with max-subtraction for numerical stability
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

t_x = np.array([[9, 2, 5, 0, 0],
                [7, 5, 0, 0, 0]])
sm_out = softmax(t_x)
print("softmax(x) = " + str(sm_out))
print(f"sum softmax = {np.sum(sm_out, axis = 1, keepdims = True)}")
sm2 = softmax(sm_out)
print(f"sm2 = {sm2}")
print(f"sum(sm2) = {np.sum(sm2, axis = 1, keepdims = True)}")
sm3 = softmax(sm2)
print(f"sm3 = {sm3}")
print(f"sum(sm3) = {np.sum(sm3, axis = 1, keepdims = True)}")

Output:
softmax(x) = [[9.80897665e-01 8.94462891e-04 1.79657674e-02 1.21052389e-04
  1.21052389e-04]
 [8.78679856e-01 1.18916387e-01 8.01252314e-04 8.01252314e-04
  8.01252314e-04]]
sum softmax = [[1.]
 [1.]]
sm2 = [[0.39886383 0.14969754 0.15227501 0.14958181 0.14958181]
 [0.36835555 0.17230828 0.15311206 0.15311206 0.15311206]]
sum(sm2) = [[1.]
 [1.]]
sm3 = [[0.24274009 0.18920385 0.18969215 0.18918195 0.18918195]
 [0.23579297 0.19381554 0.1901305  0.1901305  0.1901305 ]]
sum(sm3) = [[1.]
 [1.]]

Just to be 100% clear, the point of that last example is not directly about your question. But if you did make the mistake of including softmax in the output layer as the activation and still used from_logits=True, it would not harm your results: you would still get the same index as the predicted value. It would just be a waste of compute.
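As a quick sanity check of that point (again just an illustrative numpy sketch, not course code), applying softmax on top of the logits does not change which element is the largest, so the predicted index is unchanged:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

logits = np.array([[9., 2., 5., 0., 0.],
                   [7., 5., 0., 0., 0.]])

# Predicted class with a linear output layer vs. a (mistaken) softmax
# output layer: the argmax is the same either way
print(np.argmax(logits, axis=1))           # [0 0]
print(np.argmax(softmax(logits), axis=1))  # [0 0]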
