`log_softmax` in Dense layer + `from_logits=True` in cross-entropy loss

I see the NLP course quite often uses the combination of `log_softmax` in the Dense layer with `from_logits=True` in the cross-entropy loss function in order to have a stable computation of the softmax. How does that compare with using a linear activation in the Dense layer with `from_logits=True` in the cross-entropy loss function? Isn't there a duplicate "softmax" in the first case, since the cross-entropy loss function will perform the softmax calculation itself if `from_logits=True`?

Note that log_softmax is not the same as softmax: it is the logarithm of the softmax values. Since softmax outputs lie between 0 and 1, their logs will all be negative, right?
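To make that concrete, here is a minimal numpy sketch (my own illustration, not course code) showing that log_softmax is just the log of the softmax values, computed in a numerically stable way, and that those values are all negative:

import numpy as np

x = np.array([9., 2., 5., 0., 0.])

# softmax with the usual max-subtraction for numerical stability
e = np.exp(x - np.max(x))
sm = e / np.sum(e)

# log_softmax computed directly via the log-sum-exp trick
log_sm = x - np.max(x) - np.log(np.sum(np.exp(x - np.max(x))))

print(sm)                               # values in (0, 1), summing to 1
print(log_sm)                           # all values <= 0
print(np.allclose(log_sm, np.log(sm)))  # True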

Now I have not actually taken NLP C3, and I don't remember any uses of log_softmax in DLS C5, so I'm not sure why they use it here. But since it is not the same as softmax, you still need the softmax to happen at the cost function level.
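For reference, with from_logits=True the cross-entropy loss conceptually applies a stable log_softmax to whatever the output layer produces and then takes the negative log-probability of the true class. Here is a rough numpy sketch of that idea (an illustration of the math, not TensorFlow's actual implementation):

import numpy as np

def sparse_ce_from_logits(logits, label):
    # Stable log_softmax: shift by the max, then subtract log-sum-exp
    z = logits - np.max(logits)
    log_probs = z - np.log(np.sum(np.exp(z)))
    # Cross-entropy for a single example is the negative
    # log-probability assigned to the true class
    return -log_probs[label]

logits = np.array([9., 2., 5., 0., 0.])
print(sparse_ce_from_logits(logits, label=0))  # small loss: class 0 dominates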

One other subtlety to note here is that softmax is a monotonic function. If you take the softmax of softmax outputs, the max value is still at the same index, although each subsequent application of softmax shrinks the difference between the max value and the rest of the values.

Here’s an example:

# Run some experiments
import numpy as np

def softmax(z):
    # Row-wise softmax with max-subtraction for numerical stability
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

t_x = np.array([[9, 2, 5, 0, 0],
                [7, 5, 0, 0, 0]])
sm_out = softmax(t_x)
print("softmax(x) = " + str(sm_out))
print(f"sum softmax = {np.sum(sm_out, axis = 1, keepdims = True)}")
sm2 = softmax(sm_out)
print(f"sm2 = {sm2}")
print(f"sum(sm2) = {np.sum(sm2, axis = 1, keepdims = True)}")
sm3 = softmax(sm2)
print(f"sm3 = {sm3}")
print(f"sum(sm3) = {np.sum(sm3, axis = 1, keepdims = True)}")

Output:
softmax(x) = [[9.80897665e-01 8.94462891e-04 1.79657674e-02 1.21052389e-04
  1.21052389e-04]
 [8.78679856e-01 1.18916387e-01 8.01252314e-04 8.01252314e-04
  8.01252314e-04]]
sum softmax = [[1.]
 [1.]]
sm2 = [[0.39886383 0.14969754 0.15227501 0.14958181 0.14958181]
 [0.36835555 0.17230828 0.15311206 0.15311206 0.15311206]]
sum(sm2) = [[1.]
 [1.]]
sm3 = [[0.24274009 0.18920385 0.18969215 0.18918195 0.18918195]
 [0.23579297 0.19381554 0.1901305  0.1901305  0.1901305 ]]
sum(sm3) = [[1.]
 [1.]]

Just to be 100% clear, the point of that last example is not directly about your question. But if you did make the mistake of including softmax in the output layer as the activation and still used from_logits=True, it would not harm your results: you would still get the same index as the predicted value. It would just be a waste of compute.
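As a quick sanity check of that point (again just an illustrative numpy sketch, not course code), applying softmax on top of the logits does not change which element is the largest, so the predicted index is unchanged:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

logits = np.array([[9., 2., 5., 0., 0.],
                   [7., 5., 0., 0., 0.]])

# Predicted class with a linear output layer vs. a (mistaken) softmax
# output layer: the argmax is the same either way
print(np.argmax(logits, axis=1))           # [0 0]
print(np.argmax(softmax(logits), axis=1))  # [0 0]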
