When calculating the softmax, we use the formula
$$a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}$$
Here, the outputs $z$ from the linear model are transformed into probabilities. It seems that instead of using the exponential base $e$, we could use any positive number as the base, because the relative proportion contributed by each $z_j$ would still be preserved.
So my question is:
If any positive base would preserve the proportions, why do we specifically use the natural exponential function $e^x$ in the softmax?
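As a quick numerical check of that intuition (a minimal sketch using NumPy; `softmax_base` is a hypothetical helper written for this example, not a standard library function): any positive base $b$ still produces a valid probability distribution, and it comes out identical to the standard softmax applied to logits scaled by $\ln b$.

```python
import numpy as np

def softmax_base(z, base=np.e):
    """Softmax with an arbitrary positive base (illustrative helper)."""
    p = base ** (z - z.max())  # shift by the max for numerical stability
    return p / p.sum()

z = np.array([1.0, 2.0, 3.0])

# Base 10 softmax...
p10 = softmax_base(z, base=10.0)

# ...equals the standard (base-e) softmax on logits scaled by ln(10):
pe_scaled = softmax_base(z * np.log(10.0), base=np.e)

print(np.allclose(p10, pe_scaled))  # True
print(np.isclose(p10.sum(), 1.0))   # True: still a valid distribution
```

So changing the base is equivalent to rescaling the logits by a constant, which the network's weights could absorb during training anyway.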
Yes, it's an interesting point that you could use any positive constant as the base for the exponentiation there, but there is a very good reason why we use $e$: we need to do backpropagation in order to train our models, and backpropagation is driven by the derivatives of the activation functions, cost functions and all the other functions involved. Try taking the derivative of $f(x) = 10^x$ and watch what happens: you get $10^x \ln(10)$. The whole point of using $e$ is that the derivative of $e^x$ is just $e^x$, with no extra constant factor.
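You can confirm those derivatives numerically with a central finite difference (a small sketch, assuming NumPy):

```python
import numpy as np

x = 1.5
h = 1e-6  # step size for the central difference

# d/dx e^x = e^x : no extra constant appears
de = (np.exp(x + h) - np.exp(x - h)) / (2 * h)
print(np.isclose(de, np.exp(x)))  # True

# d/dx 10^x = 10^x * ln(10) : the constant factor ln(10) shows up
d10 = (10 ** (x + h) - 10 ** (x - h)) / (2 * h)
print(np.isclose(d10, 10 ** x * np.log(10)))  # True
```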
That's also why we use natural logarithms everywhere in ML instead of $\log_{10}$. You get the same fundamental behavior in terms of the shapes of the curves, but using $\log_{10}$ just makes a mess, strewing useless constant factors everywhere when you differentiate. So keep in mind that the notation in the ML world differs from the math world: here $\log$ always means the natural log, whereas in math texts $\ln$ typically denotes the natural log and $\log$ often means base 10.
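The same constant factor appears with logarithms; here is the analogous numerical check (again just a sketch with NumPy):

```python
import numpy as np

x = 2.0
h = 1e-6  # step size for the central difference

# d/dx ln(x) = 1/x : clean, no constant
dln = (np.log(x + h) - np.log(x - h)) / (2 * h)
print(np.isclose(dln, 1 / x))  # True

# d/dx log10(x) = 1 / (x * ln(10)) : the ln(10) factor again
dlog10 = (np.log10(x + h) - np.log10(x - h)) / (2 * h)
print(np.isclose(dlog10, 1 / (x * np.log(10))))  # True
```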
Yes, that makes perfect sense! Thank you so much for your thoughtful response, I really appreciate it.