The E_i are the eigenvalues of the Hamiltonian. So if there’s a correlation between states i and j, then a transition between them is possible and the Hamiltonian will not be diagonal in that basis, so there may be additional terms in the exponent, I believe. If the correlation between i and j merely shifts the energy levels, E_i \rightarrow E_i^\prime and likewise for j, then the coefficients w_i would be correlated but there would be no correlation between neurons. In contrast, if it caused a transition, so that there’s “entanglement” between E_i and E_j, then the w's between neurons would be correlated as well.
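For concreteness, here is the correspondence I have in mind, as a minimal sketch: the softmax output can be read as a Boltzmann distribution if the logit is identified with a negative energy. The identification E_i \equiv -(w_i \cdot x + b_i) is my own assumption for illustration, not something stated above.

```latex
% Softmax output for class i, with logit z_i = w_i \cdot x + b_i
p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
    = \frac{e^{-E_i}}{Z},
\qquad
E_i \equiv -(w_i \cdot x + b_i),
\qquad
Z = \sum_j e^{-E_j}.

% A correlation that only shifts E_i -> E_i' rescales p_i;
% off-diagonal (transition-like) terms would add cross terms in the exponent.
```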
By this statement, are you assuming neurons = parameters? If so, the assumption is incorrect.
A neural network is organized in layers, where each layer might have many neurons or a single neuron, depending on the data we are working with.
So, from what I can see, you are trying to read thermodynamics into neural networks, which is not entirely justified: a neural network doesn’t incorporate statistical mechanics alone, but rather a mix of mathematical computation and statistical mechanics.
Deep neural networks are a very vast topic, and it is easy to get lost in parts of the subject we are close to or entirely unfamiliar with. A neural network doesn’t mimic statistics or thermodynamics end to end; it incorporates ideas from statistics and mathematical computation so that the neural units can make sense of the parameters/features fed into them and try to capture the hidden mechanics behind the outcome.
As for your other questions, I will do my best to respond after going through the links.
Each input neuron is associated with a weight, which represents the significance of the connection between the input neuron and the output neuron, whereas a bias term is added to each neuron’s weighted sum to provide additional flexibility in modeling complex patterns in the input data.
So here there is no statistics or thermodynamics.
Weights indicate how much each input affects a neuron and how much it can contribute to a prediction.
Neurons are the checkpoints through which the weighted signals move forward, picking out what is significant in the input data for the predicted outcome. During this propagation from the input neurons to the output neurons, the weighted signals pass through layers, and this is where statistical mechanics is usually used.
As mentioned earlier, it is not purely statistics; in deep neural networks it is a combination of gradient descent computation and cost computation together with statistical mechanics.
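To make the weight/bias description above concrete, here is a minimal forward-pass sketch for one dense layer. The array shapes and the choice of NumPy are my own illustration, not taken from the course materials:

```python
import numpy as np

def dense_forward(x, W, b, activation=np.tanh):
    """Forward pass of one dense layer.

    x : (n_features,)           input vector
    W : (n_neurons, n_features) weights -- W[i, j] links feature j to neuron i
    b : (n_neurons,)            biases, one per neuron
    """
    z = W @ x + b          # weighted sum plus bias for every neuron
    return activation(z)   # nonlinearity applied elementwise

# Tiny example: 3 input features feeding 2 neurons
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(2, 3))
b = np.zeros(2)
print(dense_forward(x, W, b))
```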
I realize w and b can also be treated as matrices for softmax, but in this particular analysis, when discussing energy levels, it’s helpful to think of it this way.
Writing w^{(i)}_j for the i-th neuron and the j-th feature, an interaction that causes correlations can create them between w^{(i)}_j and w^{(i)}_k, or between w^{(i)}_j and w^{(n)}_j, or both.

The first case corresponds to E_j \rightarrow E_j^\prime (and likewise for k) within neuron i. The second case creates an energy transition, by analogy, between neuron i and neuron n.
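Formally, the two cases can be restated in terms of which covariances are nonzero; this is my own restatement, not a claim about how a trained network necessarily behaves:

```latex
% Case 1: correlations within a single neuron i, across features j and k
\operatorname{Cov}\!\left(w^{(i)}_j,\, w^{(i)}_k\right) \neq 0
\quad\text{(energy shift } E_j \rightarrow E_j^\prime \text{ inside neuron } i\text{)}

% Case 2: correlations across neurons i and n, for the same feature j
\operatorname{Cov}\!\left(w^{(i)}_j,\, w^{(n)}_j\right) \neq 0
\quad\text{(transition-like coupling between neurons } i \text{ and } n\text{)}
```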
@s-dorsher may I ask, have you completed the Deep Learning Specialization?
A neural network does not work in a phase-transition mode; rather, it accumulates the input neuron states initially and provides a minimal bias so the model can learn how the features of the parameters (inputs) fed to it predict the outcome. During this learning process, at the initial point the cost is 0 since the network is still untrained, and then, as the iterations pass information from the input neurons to the hidden-layer neurons, the network tries to understand how its features correlate with the outcome. For this it uses the ReLU, tanh, sigmoid, or softmax activation function (their standard definitions are sketched below), depending on the data and the model being worked on.
So, to use the correct statistical dynamics, it is first very important to understand the data and its spread, which is what the model is built from. This is explained in detail in the Deep Learning Specialization.
The Machine Learning Specialization explains more about how statistics relates to the data features present.
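For reference, here are the standard definitions of the activation functions mentioned above, in a minimal NumPy sketch (the example values are my own):

```python
import numpy as np

# Standard activation functions used in dense layers.
def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print(relu(z), sigmoid(z), tanh(z), softmax(z), sep="\n")
```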
Looks like we crossed paths on this one and you posted one minute before I did, while I was typing.
However, I disagree that there is nothing here relevant to statistics or thermodynamics. The math is identical. The system can be described using the same math.
If one wanted to make measurements of a simple softmax neural network, it would be possible, in principle, I believe, to learn something about complex physical systems with interactions between “particles”. Another way of saying that is that they’re entangled. That’s relevant to how information propagates and also to the macrostates of systems. I think that is extremely interesting for the emergent behavior of systems like these as far as physically observable effects are concerned.
When I mentioned this, it was specific to the input neurons and not to the network as a whole; please read the whole explanation, where I mentioned that when the weights pass through the neurons, statistical mechanics is used.
Yes I know! This has all been covered already by week two of the Advanced Learning Algorithms course. I have also done reading on my own prior to this in some more theoretical books. I lack practical experience. I was hoping for some deeper discussion but it seems we are very much not having that.
No, not yet
Great? Hopefully someone will answer this question then!!! In the meantime, I’ve passed it along to some physicists, I hope.
No, I think you’re still missing my point; this isn’t necessarily about which features specifically are selected.
Yes, mostly the same after careful discretization of continuous models (Lebesgue integral/measure vs summation).
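To spell out the discretization point (my own illustration of it): the continuous Gibbs measure and its discrete counterpart have the same form, with the Lebesgue integral replaced by a sum.

```latex
% Continuous: probability density over states x with energy E(x)
p(x) = \frac{e^{-\beta E(x)}}{Z}, \qquad Z = \int e^{-\beta E(x)}\, d\mu(x)

% Discrete: probability of state i with energy E_i (softmax-like form)
p_i = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_i e^{-\beta E_i}
```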
I’m not an expert, but here are two leads to bridge the gap:
Energy Based Models are inspired by physics. I learnt a bit from the neural networks course taught by Prof. Hinton, but it’s no longer available on Coursera
Probabilistic Graphical Models taught by Prof. Koller on Coursera uses a lot of the related math and intuition. I’d recommend doing a basic course on mathematical/Bayesian statistics before starting the specialization
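As one concrete example from the energy-based family mentioned above (a standard textbook form, not something specific to either course), the restricted Boltzmann machine assigns an energy to each visible/hidden configuration and a Boltzmann distribution over states:

```latex
% Restricted Boltzmann machine: visible units v, hidden units h
E(v, h) = -a^{\top} v - b^{\top} h - v^{\top} W h

p(v, h) = \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v', h'} e^{-E(v', h')}
```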
Thank you so much both of you for the wonderful links! I am going to have to take some time to look over this! This is really exciting.
I have taken advanced courses in statistics: a calculus-based course that addressed transformations between variables in 1998, while I was in high school in an accelerated college math program (I must admit I’m rusty); a class that refreshed some of this material later at another college; and a math methods for physics class in grad school that covered both Bayesian and frequentist statistics at an intro level. I have also used statistics in research: to develop a gravitational wave search algorithm in 2008-2010, to do an analysis of how common exoplanets are in the galaxy in 2004-2006, and to assess whether or not gravitational lensing could be used to measure dark matter and dark energy in 2003-2004. All of those research experiences were frequentist statistics, though I have since read a fair amount of Bayesian statistics as part of a scientific collaboration (the LIGO gravitational wave detector) that has used Bayesian statistics in its detections. However, I take your point that it couldn’t hurt to read a bit more.
I am truly excited about the links!
I know it’s maybe not a great citation, but I have a semi-clear explanation of one of the problems I was trying to clarify, although it wasn’t the whole issue. Grok has helped me explain this a bit better than I could have without its help, although I could have taken the derivative myself. I hope it is not offensive or wrong to post that link. I can rewrite it by taking the derivative myself if necessary, but I think seeing it done by Grok may settle this question a bit better.
I don’t think anyone cares, but you can find the undergrad thesis on cosmology, the exoplanet paper, and the long gravitational wave transients… (radon transform section) paper on ResearchGate if it matters lol probably not. Sadly none of the code survived for GitHub.