I have some prior background in frequentist statistics and a little bit of Bayesian statistics. This post is with reference to frequentist statistics.
In frequentist statistics, there is a true population measured by the variable y and a sample population measured by the variable yhat. The sample population is what is measured in an experiment, and the true population exists in some abstract, greater underlying reality, which theory aspires to explain. The practical result of explaining underlying reality is that it will also predict future statistical experiments.
I'm not very familiar with machine learning yet. But I was surprised to see the definition of yhat and y (I am definitely not arguing with them). It looks like y is the measurement of reality and yhat is the prediction based on theory (the hypothesis).
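To spell out the contrast I have in mind (just my own rough restatement, not from the course): in statistics the hat goes on the quantity computed from the data, e.g. an estimate \hat{\theta} = T(y_1, \dots, y_n) of the unknown population quantity \theta, whereas in ML the hat goes on the model's output, \hat{y} = f(x), while y is the observed label.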
I was pleased to see that there is some concept of hypothesis testing, though I don't know the details yet. I realize that in a lot of practical applications that doesn't matter, but I know that machine learning is increasingly used in science, and I'm interested in knowing more about those applications, as well as the practical ones. It's great that I can see the relationship to more traditional statistics developing, even if it isn't direct.
But does anyone know why the two fields differ in their choice of which variable constitutes a measurement and which constitutes a theory, or a fundamental truth about reality?
It's entirely possible that I'm just not getting the full subtlety of your questions, but I would describe what's happening here a bit differently.
In Supervised Machine Learning, we have some actual data that reflects reality. For example, suppose we are trying to build an ML system that can take photographic images as input and then tell us whether any given image is a picture of a cat or not. We start with a collection of "training data", which is a set of images and "labels" (which we arbitrarily call y) that are the true values for those images: do they contain a cat or not? That is "reality". Now the goal is to create a mathematical function that takes images as inputs and produces a prediction (which we arbitrarily call \hat{y}) that is as accurate as possible. The way we do that is to create a Neural Network with some particular architecture and then train it on the training data by using a loss function that measures the accuracy of the predictions made by our model. Using the derivatives of the loss values w.r.t. the various weights of the model, it turns out the training process can produce ("learn") a model that works reasonably well in a lot of cases.
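As a purely illustrative sketch of that loop (a toy logistic-regression "network" in NumPy with made-up random data, not the actual course assignments):

```python
import numpy as np

# Toy stand-in for the cat / not-cat problem: X holds "images" (here just
# random feature vectors) and y holds the true labels -- that is "reality".
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))        # 100 examples, 64 features each
y = rng.integers(0, 2, size=100)      # true labels: 1 = cat, 0 = not cat

w = np.zeros(64)                      # model weights
b = 0.0
lr = 0.1                              # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    y_hat = sigmoid(X @ w + b)        # the model's predictions, \hat{y}
    # Cross-entropy loss: measures how far \hat{y} is from y
    loss = -np.mean(y * np.log(y_hat + 1e-12) + (1 - y) * np.log(1 - y_hat + 1e-12))
    # Derivatives of the loss w.r.t. the weights and bias
    dw = X.T @ (y_hat - y) / len(y)
    db = np.mean(y_hat - y)
    # Gradient-descent step: adjust the parameters to reduce the loss
    w -= lr * dw
    b -= lr * db
```

The real models in the courses are deeper and trained with more care, but the roles of y, \hat{y}, the loss, and the gradients are the same.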
What we will learn as we go through the MLS specialization and then the DLS specialization are the various architectures we can use for creating those mathematical prediction functions, and how to train them successfully.
Of course I've glossed over a lot of details in the above. Does that help in clarifying anything related to your questions?
Reading your bio again, maybe the problem is that you are "overthinking" this and assuming it's like physics. In physics we do actually have an observed reality and we need to come up with a theory, expressed in mathematical functions, that predicts the observed reality to within \epsilon, which is the accuracy that our current instruments allow us to measure. Anything less is a failure by definition.
In Machine Learning, there is no "theory" we are trying to create. E.g. there is no "theory" for how to determine whether a digital image contains a cat or not, right? But any human older than 5 can tell you the answer just by looking.
Now the question is whether we can train a neural network to learn the patterns required to make that determination. With sufficient good-quality training data and an appropriate neural network as the function, it turns out we can in most cases.
Machine Learning and Statistics use different techniques and methods to reach essentially the same results. The two fields evolved separately, among different pools of researchers, so it's not a surprise that there is little consistency in the details.
Partially agree with @TMosh. While ML and statistical modeling can be viewed as totally independent, many ML models can be mathematically formalized as statistical models (i.e., they are exactly the same). Some (e.g., linear/logistic regression) are natural formalizations; some (e.g., random forest as a data-adaptive kernel method) are seemingly force-fitted (debatable) formalizations. In practice none of this should matter if accuracy of prediction is the ultimate goal of modeling.
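To make the logistic regression case concrete (my own sketch, using the same y and \hat{y} notation as above): training logistic regression by minimizing the usual cross-entropy loss is exactly maximum-likelihood estimation of a Bernoulli model. With \hat{y}_i = \sigma(w^T x_i), the likelihood is

L(w) = \prod_i \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1 - y_i},

so

-\log L(w) = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right],

which is the cross-entropy loss up to a constant factor; minimizing one is the same as maximizing the other.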
After a decade of mentoring Andrew's introductory ML courses, my experience is that it's not uncommon for folks with a strong statistics background to be surprised at how ML is presented.
The question comes up because ML is increasingly used in physics itself, so it's not quite correct to say that it's irrelevant in the context of physics. There are contexts in data analysis, in physics or in other sciences, where these variables do represent a measurement and a hypothesis. The Nobel Prize in Physics this year was given for ML...
Of course you're right, but that was not my point. I thought you were making a pretty specific point about the meaning of the y and \hat{y} values and what it means to make a "prediction" in the context of ML.
It is in fact like physics: machine learning is used in physics, and this year's Nobel Prize was given for machine learning's use in physics, among other physical sciences. So the statement that there is no theory, and that the "answer" (target variable) has no dependence on any measurement, is frankly not true in the context I am asking about.
This is certainly a rabbit hole. From a purist perspective a good chunk of ML theory arises from physics - mainly statistical mechanics. From a neutral perspective (may be seen as a restatement of the scientific method), the model that arises from the correct inductive bias and produces the best forecast should be considered the current state-of-the-art understanding of the underlying phenomenon.
With the right amount of mathematical rigor it is possible to bridge the two perspectives - have an ML model with a sound inductive bias that's interpretable using physics. But there are constraints:
1. Not all ML models fit within existing frameworks of physics.
2. The mathematical translation of a fitted function into insights in physics may lead to further debate.
An example for #2: suppose we have observational data (i.e., sizes, positions, and masses of objects) for the n-body problem. If we build an ML model to predict gravitational force instead of using Einstein's approach or Newton's law, we cannot account for unmeasured variables like dark matter. Even if we could "measure dark matter" (currently not possible), the interpretation of the ML model might conflict with interpretations of current popular models like Einstein's general theory of relativity. The importance given to contexts (e.g., the amount and distribution of dark matter) may not be proportional to the amount of such (measured) context in the real-world data.
Definitely true. Verifying, falsifying, and discovering new things are certainly part of the point of using statistics in science. By the way, there is a recent result suggesting something new about the cosmological constant not being entirely constant, though it did not reach the level of statistical significance one would normally require to write such a discovery in stone. That's not bad physics, that's new and interesting physics, and the sort of long-sought-after hope people have had: to determine whether it is simply a constant, or whether there is another explanation. I personally was not involved in that effort and couldn't begin to tell you the details. But I do think it's exciting, since the discussion of the hope of measuring some evolution in the cosmological constant, and possible reasons for it, such as the existence of exotic particles, goes back decades. I know the idea had already existed quite a while when I was working on assessing a measurement technique for the cosmological constant using strong gravitational lensing, with a Monte Carlo simulation, for my undergrad thesis in 2004.

That said, I think the standard dark matter/cosmological constant/normal matter/radiation model of the universe (ΛCDM) is still the primary one in use. You're right, though, that there are very tiny deviations from it in some experiments. Nothing large. I'm not sure whether those analyses use ML, or Bayesian or frequentist statistics, or something else.
I guess that is, in a sense, part of my question. I'm definitely not qualified to review all of these different experiments and address the "Hubble Tension" myself. But I have always been interested in the interface between experiment, data analysis, and theory in a variety of fields, and have medium-level (master's-level) expertise in several on that topic. I think this is the rabbit-hole dive you were referring to, as I try to understand how ML vs. Bayesian vs. frequentist statistics affects the interpretation of scientific results.
Probably true and good. Hopefully the people using ML are using it wisely.
I do believe there are people using features in ML models that are inspired by physical theories, rather than chosen at random or only inspired by the data. However, I can't personally attest to this, since this isn't my area of expertise in science. That, actually, is why I'm trying to learn.
Partly yes. Taking another example - AlphaFold. There are pure physics (statistical mechanics) based models that try to solve protein folding. From my understanding, though those models have the right inductive bias, they get stuck in some type of local minimum (some models use a genetic algorithm to alleviate the problem, but the probability of getting to the global optimum is extremely low). AlphaFold takes a completely different approach - I haven't gone through the details in enough depth to comment on the physics, but it's almost a total black-box model. Between the two, it's best to use the scientific method: the model with the best out-of-sample prediction is "better". In the future we may discover mathematical formulations that uncover interpretable inductive biases from AlphaFold. Or maybe we humans won't be able to interpret those inductive biases - why should we assume the universe works in human-interpretable ways?
I've heard of that, though I don't know the details. I'm not sure whether it means that they use features that aren't inspired by existing physical theories - in which case it's similar to finding a phenomenological solution, but with machine learning - or whether it means something else.
In a sense, if AlphaFold has features that make those predictions, those features actually are the physical laws that underlie the process. Any theory that reliably predicts an outcome in nature, whether or not it connects sensibly to other existing theories, can be correct (and can be a theory).
I guess the question, then, is: what is the statistical meaning of that chain of features? Can they be written down in closed form at all, so that they can be replicated, as science requires? If it is a mathematical statement that can be replicated with other data sets, then maybe it is a law of nature.
Yes, considering deep neural networks as a sequence of matrix operations.
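For example, a minimal two-layer forward pass is just matrix products plus elementwise nonlinearities (made-up shapes and random weights, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)                          # one input vector with 4 features

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # first-layer weights and biases
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)   # second-layer weights and biases

h = np.maximum(0.0, W1 @ x + b1)                # matrix product, then ReLU
y_hat = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))    # matrix product, then sigmoid
```

In that sense the whole trained network is a closed-form (if very large) mathematical expression that anyone with the same weights can evaluate.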
To provide some sort of closure, my opinion is: yes, those are our current interpretations of how the universe works, even though we may not be able to find other examples (replicates) or to reverse-engineer the phenomenon that led to the outcome. (However, this will change if/when we collect data about the underlying process and model that process to refine our inductive bias - e.g., https://fold.it/ uses crowd-sourcing to collect data on the mechanism of protein folding.)
I'm starting the next course, and I can see that part of the reason ML works is that it picks out patterns that aren't randomly distributed, but rather are complex and very real, while statistics works for patterns that have an underlying probability distribution, whether due to experimental uncertainty or to a fundamental property of nature. That probability distribution can be characterized by some law of nature, but there is still a random element. Probably there is still a random element somewhere in the ML process as well, but it is much smaller than the patterns that emerge, and hence the use cases really are different.