How softmax relates to statistical mechanics

Softmax video 1
Softmax video 2
Softmax video 3
Softmax Lab

Over the course of these videos and labs, I learned about the softmax activation function and the SparseCategoricalCrossentropy loss function.

Unsurprisingly, given the name, these functions looked familiar to me from my physics statistical mechanics courses. I should emphasize that my field was gravitational waves and black hole binary inspirals, and to a lesser extent particle physics, exoplanets, and cosmology, but never statistical mechanics (definitely not).

The softmax function provided in lecture is well explained in the following three slides. It should be clear what formulas I'm referencing, I hope.
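In case the slide images don't come through here, the softmax activation I'm referring to is

a_j = \frac{e^{z_j}}{\sum_k e^{z_k}}

and the SparseCategoricalCrossentropy loss for a training example whose true category is j is -\ln(a_j).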



The statistical mechanics partition function, which represents an ensemble of microstates of a system, is defined as follows:

Z=\sum_i e^{-\beta E_i}

Let's rename some of these variables. Let's rename Z \to a_{denominator} and (-\beta E_i) \to z_i. We can also rename the index i to k if it helps us.

With that renaming, this sum reads,

a_{denominator} =\sum_k e^{z_k}

In other words, it is the same sum that appears in the denominator of softmax.

Let's investigate this a bit further. How does it relate to the loss function? I swear I'm going somewhere with this and have a question.

In statistical mechanics, the probability of the $i$th state \rho_i is given by

\rho_i=\frac{1}{Z}e^{-\beta E_i}

It arises from varying a Lagrangian, which is ultimately an extremization (a constrained maximization) of an entropy constructed from terms of the form probability times log of probability.

But remember how all of this was renamed to compare to softmax, so that in the nomenclature of softmax, it reads

p_j=a_j = \frac{1}{a_{denominator}}e^{z_j}

or

p_j=a_j=\frac{e^{z_j}}{\sum_k e^{z_k}}

So, that's the $j$th probability, p_j.
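As a sanity check on that renaming, here's a tiny NumPy sketch (the energies and \beta are made-up numbers, not anything from the lab) showing that the Boltzmann probabilities e^{-\beta E_i}/Z are exactly what softmax returns when fed z_i = -\beta E_i:

```python
import numpy as np

# Made-up energies (arbitrary units) and inverse temperature, just for illustration
E = np.array([0.3, 1.1, 2.7, 0.9])
beta = 1.5

# Statistical mechanics: partition function Z and Boltzmann probabilities rho_i
Z = np.sum(np.exp(-beta * E))
rho = np.exp(-beta * E) / Z

# Softmax with the renaming z_i = -beta * E_i
z = -beta * E
a = np.exp(z) / np.sum(np.exp(z))

print(np.allclose(rho, a))  # True: the Boltzmann distribution and softmax agree
```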

The loss function of state j, then,

-\ln(a_j)

is very similar to the contribution of state j to the statistical mechanics entropy,

-p_j ln(p_j)

But clearly not identical, because of that factor of p_j = a_j out front.
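To pin down where that extra factor comes from (this is me filling in a step, not something from the lab): the cross-entropy between a target distribution q and the softmax output a is

H(q, a) = -\sum_j q_j \ln(a_j)

and for a one-hot target (q_j = 1 for the true category, 0 otherwise) this collapses to just -\ln(a_j), whereas the Gibbs/Shannon entropy S = -\sum_j p_j \ln(p_j) keeps the p_j weight on every term.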

My question, then, I guess, is based on this line from the lab.

I'm not quite sure how to interpret this in light of the fact that this is similar to minimizing the entropy to find the ground state. For example, for the row

[-2.52, -1.81, 2.43, -0.8], category:2

I can see that category 2 is selected because 2.43 is the maximum value of the logit z associated with softmax (not Z from the partition function definition) and it has index 2. This is presumably because softmax is monotonically increasing, so a greater logit value corresponds to a greater probability of that being the correct category.
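Here's a quick numerical check on that row (assuming those four numbers are the raw logits the lab prints), confirming that the largest logit and the largest softmax probability sit at the same index, and showing the resulting loss for category 2:

```python
import numpy as np

logits = np.array([-2.52, -1.81, 2.43, -0.8])

# Softmax probabilities
a = np.exp(logits) / np.sum(np.exp(logits))

# Monotonicity: the largest logit gives the largest probability
print(np.argmax(logits), np.argmax(a))  # 2 2

# Loss for true category 2 is -ln(a_2)
print(-np.log(a[2]))  # roughly 0.06
```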

I was wondering, then, why minimization leads one to choose the maximum value of the logit rather than the minimum. And I think I finally answered my own question. It's in the renaming -\beta E_i \to z_i. Since the logit has a negative sign relative to the energy in statistical mechanics, minimizing the cost function tends to maximize the logit, while the corresponding minimization in statistical mechanics drives the energy down to its ground state.
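Stated compactly (my own restatement of the same point): since

z_i = -\beta E_i \quad \text{with} \quad \beta > 0,

maximizing z_i over categories is the same as minimizing E_i, so picking the largest logit is the softmax analogue of picking the ground state.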

Thanks for your patience while I wrote this out, and since I've written it, I think I'll post it.

Steven

As long as I'm asking, does anyone know if neural networks are capable of phase transitions??? (!!!)

I don't know anything about statistical mechanics, so I have no idea about any of that and didn't really read everything you typed.

The point of the statement you're asking about is that softmax is strictly monotonic increasing, as you mentioned. So if all you care about is which input class gives the maximum output probability value, you can just look at which input is the largest. Not in terms of absolute value, but in terms of actual position on the number line. The greatest input will produce the greatest output.

I'm not sure that was the point at all, but I do appreciate you reiterating the thing that resolved my (edit: initial) question. The point was that the partition function and entropy in statistical mechanics are mathematically identical or similar to softmax, and I wanted to know why statistical mechanics minimizes the energy eigenstate while softmax maximizes the logit. That's not a coincidence, and the reason involves the similarity of the softmax activation function to the statistical mechanics partition function and the process of minimization from calculus.

My follow-up question was: given the similarity, and the fact that statistical mechanics is the underlying principle behind thermodynamics, does that imply that a neural network that uses softmax can undergo phase transitions like a solid/fluid/gas system (because they are mathematically the same)? I don't mean that a computer literally becomes liquid; I mean that there are discontinuities in the behavior of the calculation, with sudden changes in how it behaves depending on the range of the parameters. Or at least I'm wondering if that's true in practice, since it seems like the math implies it should be true (not speculatively; these are very solid mathematical results going back to the 1800s). (edit: and extraordinarily well tested by physics in the lab, to the extent that the physical theory is relevant to the mathematical result that is exactly or nearly exactly applicable here.)

As mentioned above, I didn't get beyond freshman physics, so am not qualified to comment on the statistical mechanics questions. I was not even aware that there was a statistical mechanics explanation for solid/fluid/gas phase transitions. In any case, I don't think any of the statistical mechanics applies to neural networks: the question is what the inputs are. In the NN case, they are not the statistical properties of actual particles subject to the laws of physics, they are simply the outputs of an artificial neural network derived from whatever the inputs are (e.g. pixel values of some image), based on the learned weights. As I commented on one of your earlier threads, there is no "theory" for what those values are. If you're lucky, your cost function and back propagation can create learned weights for the various layers of your neural network such that the softmax results are actually a good approximation of the labels on your training data. That's the best you can hope for. And of course it depends on having made good choices for all the various hyperparameters you need to select (architecture of your network and so forth).

Hi @s-dorsher

One needs to understand the statistical mechanics of phase transitions: the transition you mention, solid ==> liquid ==> gas ==> solid, depends on the variable temperature and also on the environment. So if you choose the environment to be the neural network and temperature to be a feature, the softmax activation here will classify the object/particle based on the conditions the neural network finds at that current neuron.

Say your last layer has a softmax activation and the object or particle is passing through that last dense layer: the softmax activation allows the object to be detected in the specified category based on the temperature feature or the environment (an unknown feature). For this, one must have built the neural network model to identify which features other than temperature are responsible for the transition between its states, solid ==> liquid ==> gas ==> solid.

I have tried to explain this with the example you are having doubts about, i.e. whether a neural network has the ability to do statistical mechanics. So the answer is yes, but it depends on the person who creates the model (the touch factor :slightly_smiling_face:).

Say for the same model, a person who didn't know about statistics instead used a sigmoid activation. The model would still work and give a loss and an accuracy, but based on a binary classification: it would divide the phases into just two classes, giving a model that is not so good at feature detection when used on untested data, because the object or particle was modelled with two class features instead of its multi-feature specificity.

So yes, a neural network does mimic statistical mechanics.

I don't think it's necessary for temperature to be a feature.

The inverse temperature is defined as the derivative of the entropy with respect to the internal energy,

\frac{1}{T} = \frac{\partial S}{\partial U}

The internal energy is
U = \langle E \rangle = -\frac{1}{Z}\frac{\partial Z}{\partial\beta}

Translated to the variables used in machine learning, if loss = L, then

\frac{1}{T} = \frac{\partial L}{\partial U}

Where U is

U = -\frac{1}{a_{denominator}}\frac{\partial a_{denominator}}{\partial w}

This is because \beta is the coefficient on E_i in thermodynamics, while w is the coefficient inside z^{(i)}_j = \vec{w}_j \cdot \vec{x}^{(i)} + b_j in machine learning.

The distinction is that z depends on both neurons and features, but here the sum over neurons plays a role analogous to the sum over energy states in a thermodynamic ensemble (the canonical ensemble).

So I think a mathematical temperature emerges from the system describing the occupancy of the possible states of the system, whether or not a physical temperature (in the sense of something you touch) is actually a feature.
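For reference, the standard canonical-ensemble relations I'm leaning on here (setting k_B = 1) are

U = \langle E \rangle = -\frac{\partial \ln Z}{\partial \beta}, \qquad S = \ln Z + \beta U, \qquad \frac{1}{T} = \frac{\partial S}{\partial U}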

Remember, when I mentioned temperature, I also mentioned the neural network's ability to correlate other features with the relation between temperature and phase transition, and not independently.

Anyway, my idea of explaining that transition in relation to temperature was more in terms of statistical mechanics; and as we all know, the transition between the phases is really a relation between temperature and pressure.

You answered your own query here.

I don't think that I did. But I don't think you addressed it either.

:grimacing:

I'm not asking about a temperature feature. You're missing my point altogether. Maybe this requires too much knowledge of graduate-level physics.

My question, I guess, because of the physics, is this:

When parameters are trained in a neural network (and it doesn't matter what the features are for this question), are there regions of parameter space that exhibit different characteristics of behavior, or where a relevant derivative undergoes a discontinuity in moving between them?

First and second order phase transitions, for reference

I know you aren't asking about a temperature feature. The whole explanation was elaborating on how a deep neural network understands the statistical mechanics that is fed into it, in order to understand the complexities in any object, space, data, findings, or phase mechanics.

If your question is whether these parameters exhibit statistical mechanics or different characteristics of behaviour, then again the answer is yes. Remember, the whole idea of a neural network is its ability to understand the relation of its parameter or parameters to the outcome we are getting.

I am not a graduate in physics :smirk: but I have studied thermodynamics and statistics.

That being said, don't treat a parameter as universal; it surely depends on what other co-factors work alongside it in a deep neural network for a parameter to act on a softmax or sigmoid activation, which may or may not include the data points, the data spread, and conditions correlated with the parameters, as well as the parameter on its own.

They are related. You'll find the term "partition function" sprinkled across mathematical statistics books, especially in Bayesian statistics. This Stack Exchange answer provides some context: statistical mechanics - Softmax Function - Relation to Stat Mech? - Physics Stack Exchange


Yes, I was asking about what happens because of this.

To elaborate, I'm familiar with and have professionally worked with covariances in statistics, though I realize I have not yet seen how they mathematically manifest in machine learning.

Though, actually, stating that explicitly is a really good point. And the longer I think about it, the more interesting it gets in this context.

That goes beyond my statmech knowledge, but I guess that's similar to things like statmech with nearest-neighbor interactions (or more) :exploding_head:


This could also be considered a solution. Very very very helpful! Thank you!

Is it too big a leap to say, no wonder this is so great for protein folding?

Are you now asking about k-nearest neighbours?

I think the overall understanding to have is that statistical analysis is part of programming, be it SAS, R, or Python.

Whether it uses mathematical equations or scientific findings, AI/ML tries to incorporate all the mechanics, be it statistics, computation, or permutations.

Inspired by the math, some of us tried to build a deep factorization machine model for protein folding. The model performed well when trained on multiple sequence alignments. During diagnostics we noticed it failed to capture any structural information, which just goes to show that a loss function tailored to predict binary outcomes (whether a protein will fold 'in the wild') does not automatically guarantee good features (~ secondary or tertiary structure) engineered by the model. But this should not discourage you from trying. There may be more principled ML models that better represent the underlying phenomenon of protein folding.

Note: from my understanding, AlphaFold uses a multi-stage approach to mitigate the problem I described.


The nearest-neighbor interaction, thermodynamically, is the Ising model. That might be similar to k-nearest neighbors, but I don't think so. I'm not familiar enough with how to derive kNN from a cost function or Hamiltonian yet, though.

Ising Model-- wikipedia

The Hamiltonian is this equation:
H(\sigma)=-\sum_{\langle ij\rangle}J_{ij}\sigma_i\sigma_j-\mu\sum_jh_j\sigma_j

The \sigma's are spins, \mu is the magnetic moment, h_j is the orientation and magnitude of the external magnetic field at site j, and J_{ij} characterizes the strength of the interaction between adjacent spins.
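Just to make that Hamiltonian concrete, here's a toy sketch (my own simplification: a 1-D chain with uniform J, a uniform field h, and open boundaries, nothing from the course material):

```python
import numpy as np

def ising_energy(spins, J=1.0, h=0.0, mu=1.0):
    """Energy of a 1-D Ising chain with open boundaries and uniform couplings:
    H = -J * sum_i s_i s_{i+1} - mu * h * sum_i s_i
    """
    spins = np.asarray(spins)
    interaction = -J * np.sum(spins[:-1] * spins[1:])  # nearest-neighbour term
    field = -mu * h * np.sum(spins)                    # coupling to the external field
    return interaction + field

# All spins aligned: the ferromagnetic ground state for J > 0, h = 0
print(ising_energy([1, 1, 1, 1, 1]))    # -4.0
# Alternating spins cost interaction energy
print(ising_energy([1, -1, 1, -1, 1]))  # 4.0
```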

I'm not sure how similar this is to kNN. I really just don't know yet. I'm sorry.

The reason I was thinking of this is that if there are correlations between coefficients in a neural network described by softmax, maybe it would not have a simple Hamiltonian, but rather one with interactions between "particles". In this case, I think that refers to correlations between coefficients of features within a neuron if the interaction does not cause an "energy level transition", and to correlations between coefficients of features across neurons if it does.

Here's a more complicated model, if relevant, that addresses phase transitions, though it's beyond what I'm familiar with and also behind a paywall for me. It looks relevant to me, but I don't know. I saw a talk on something like this once by someone who worked at LSU, but I couldn't begin to tell you what it was or who gave it. Sorry!

Phase transitions and thermodynamic properties of antiferromagnetic Ising model with next-nearest-neighbor interactions on the KagomƩ lattice

Relevance to biology maybe (this is a bit beyond me):
Nucleic acid thermodynamics-- wikipedia