Softmax video 1
Softmax video 2
Softmax video 3
Softmax Lab
Over the course of these videos and labs, I learned about the softmax activation function and the SparseCategoricalCrossentropy loss function.
Unsurprisingly, given the name, these functions looked familiar to me from my physics statistical mechanics courses. I should emphasize that my field was gravitational waves and black hole binary inspirals, and to a lesser extent particle physics, exoplanets, and cosmology, but never statistical mechanics (definitely not).
The softmax function provided in lecture is well explained in the following three slides. I hope it's clear which formulas I'm referencing.
The statistical mechanics partition function, which represents an ensemble of microstates of a system, is defined as follows:
Z=\sum_i e^{-\beta E_i}
Let's rename some of these variables: Z \to a_{denominator} and -\beta E_i \to z_i. We can also rename the index i to k if it helps us.
With that renaming, this sum reads,
a_{denominator} =\sum_k e^{z_k}
In other words, it is the same sum that appears in the denominator of softmax.
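To make this concrete, here's a tiny numerical sketch (my own, not from the lab; it assumes numpy and uses made-up energies and \beta) showing that, after the renaming, the partition function and the softmax denominator are literally the same number:

```python
import numpy as np

# Made-up energies and inverse temperature (arbitrary, just for illustration)
E = np.array([1.0, 0.5, -2.0, 0.3])
beta = 1.0

# Partition function: Z = sum_i exp(-beta * E_i)
Z = np.sum(np.exp(-beta * E))

# Rename z_i = -beta * E_i; the softmax denominator is sum_k exp(z_k)
z = -beta * E
a_denominator = np.sum(np.exp(z))

print(Z, a_denominator)  # identical, by construction
```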
Let's investigate this a bit further. How does it relate to the loss function? I swear I'm going somewhere with this and have a question.
In statistical mechanics, the probability of the $i$th state \rho_i is given by
\rho_i=\frac{1}{Z}e^{-\beta E_i}
It arises from varying a Lagrangian, which is ultimately an extremization of an entropy constructed from terms of the form probability times log of probability, subject to normalization and fixed-mean-energy constraints.
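For anyone who hasn't seen it, here's a quick sketch of that derivation (the standard maximum-entropy argument; \alpha and \beta are the Lagrange multipliers enforcing normalization and a fixed mean energy \bar{E}). Extremize

\mathcal{L} = -\sum_i \rho_i \ln \rho_i - \alpha\Big(\sum_i \rho_i - 1\Big) - \beta\Big(\sum_i \rho_i E_i - \bar{E}\Big)

Setting \partial\mathcal{L}/\partial\rho_i = 0 gives -\ln\rho_i - 1 - \alpha - \beta E_i = 0, so

\rho_i = e^{-1-\alpha}\, e^{-\beta E_i} = \frac{1}{Z} e^{-\beta E_i}

where normalization fixes e^{1+\alpha} = Z.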
But remember that all of this was renamed to compare to softmax, so that in the nomenclature of softmax it reads
p_j=a_j = \frac{1}{a_{denominator}}e^{z_j}
or
p_j=a_j=\frac{e^{z_j}}{\sum_k e^{z_k}}
So, that's the $j$th probability, p_j.
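In code, that formula is short (a minimal numpy sketch of the equation above, using the usual subtract-the-max trick for numerical stability; the lab's own implementation may differ):

```python
import numpy as np

def my_softmax(z):
    # exp(z - max(z)) avoids overflow; the common factor cancels between
    # numerator and denominator, so the probabilities are unchanged
    ez = np.exp(z - np.max(z))
    return ez / np.sum(ez)  # a_j = e^{z_j} / sum_k e^{z_k}
```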
The loss function for state j, then,
-\ln(a_j)
is very similar to the contribution of state j to the statistical mechanics entropy,
-p_j \ln(p_j)
but it is clearly not identical: there is that extra factor of p_j = a_j out front.
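Numerically, with a made-up value of a_j just to show the factor:

```python
import numpy as np

a_j = 0.7  # made-up softmax output for the true category j

loss_term = -np.log(a_j)           # the cross-entropy loss term
entropy_term = -a_j * np.log(a_j)  # state j's contribution to the Gibbs entropy

print(entropy_term / loss_term)    # exactly a_j = 0.7, the factor out front
```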
My question, then, I guess, is based on this line from the lab.
I'm not quite sure how to interpret this in light of the fact that this is similar to minimizing the entropy to find the ground state. For example, for the row
[-2.52, -1.81, 2.43, -0.8], category:2
I can see that category 2 is selected because 2.43 is the maximum value of the logit z associated with softmax (not Z from the partition function definition) and it has index 2. This is presumably because softmax is monotonically increasing so a greater logit value corresponds to a greater probability that that is the correct category.
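A quick numerical check of that (my own numpy snippet, not from the lab; it inlines the same stable softmax as the my_softmax sketch above):

```python
import numpy as np

z = np.array([-2.52, -1.81, 2.43, -0.8])  # logits from the lab row
p = np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))  # softmax

print(np.argmax(z), np.argmax(p))  # both 2: softmax preserves the ordering
print(p)  # category 2 carries nearly all of the probability mass (~0.94)
```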
I was wondering, then, why minimizing the cost leads one to choose the maximum value of the logit rather than the minimum, when statistical mechanics heads for the minimum of the energy. And I think I finally answered my own question. It's in the renaming -\beta E_i \to z_i. Since the logit carries a negative sign relative to the energy in statistical mechanics, minimizing the cost function tends to maximize the logit, which is the same as driving the energy down toward its ground state in statistical mechanics.
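That sign flip is easy to check in the renamed variables (a toy check of my own, with \beta = 1 chosen arbitrarily):

```python
import numpy as np

z = np.array([-2.52, -1.81, 2.43, -0.8])  # logits from the lab row
beta = 1.0
E = -z / beta  # invert the renaming z_i = -beta * E_i

# The maximum logit and the minimum energy pick out the same state
print(np.argmax(z), np.argmin(E))  # both 2: max logit <-> ground state
```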
Thanks for your patience while I wrote this out, and since I've written it, I think I'll post it.
Steven