When we run backprop to compute derivatives, it seems the whole process would accumulate some round-off error and be less precise. Why can we ignore the influence of that?
It's an interesting and perceptive question. Everything we do here (forward propagation as well as back propagation) is done in floating point arithmetic, which is a finite representation by definition. We have at most 2^{32} or 2^{64} distinct numbers that we can represent between -\infty and +\infty, depending on whether we use 32-bit or 64-bit floats. That is completely pathetic compared to the abstract beauty of \mathbb{R}, but we don't have a choice: there is no efficient way for computer software to deal with the uncountably infinite properties of the pure math version of all this.
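If you want to see how coarse that finite representation is, here is a minimal sketch in Python/NumPy (just an illustration I put together, not anything from the course) showing machine epsilon and the classic 0.1 + 0.2 surprise:

```python
import numpy as np

# Machine epsilon: the gap between 1.0 and the next representable float.
print(np.finfo(np.float32).eps)   # ~1.19e-07
print(np.finfo(np.float64).eps)   # ~2.22e-16

# Classic example: the decimal 0.1 has no exact binary representation.
print(0.1 + 0.2 == 0.3)           # False
print(f"{0.1 + 0.2:.20f}")        # 0.30000000000000004441...

# np.nextafter shows the nearest representable neighbor of a value,
# i.e. the size of the "hole" right above 1.0 in float32.
x = np.float32(1.0)
print(np.nextafter(x, np.float32(2.0)) - x)
```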
It turns out that mathematicians have thought carefully about these issues. There is an entire subfield of mathematics called Numerical Analysis that deals with finite representations, among other things. You can reason precisely about the error propagation properties of different algorithms when they are subject to rounding errors. You can have "numerically stable" computations, in which the rounding errors roughly cancel out on a statistical basis and stay bounded, or you can have "unstable" computations, in which the rounding errors tend to compound and become unbounded. The algorithms we are using have been carefully chosen to be of the "stable" type. That means that, yes, we will always have rounding errors, but they don't prevent us from finding valid solutions.
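As a rough illustration of what "stable" versus "unstable" means in practice (my own toy examples, not the algorithms from the course), compare a formulation that suffers catastrophic cancellation with one that avoids it:

```python
import numpy as np

x = 1e-17

# Unstable: 1.0 + x rounds to exactly 1.0 in float64, so all the information in x is lost.
unstable = np.log(1.0 + x)       # 0.0
# Stable: log1p is formulated specifically to avoid that cancellation.
stable = np.log1p(x)             # ~1e-17, correct to full precision
print(unstable, stable)

# Another classic: the one-pass variance formula E[x^2] - E[x]^2 cancels badly
# when the mean is huge relative to the spread; the two-pass formula is stable.
data = 1e8 + np.random.randn(10_000)
naive = np.mean(data**2) - np.mean(data)**2        # garbage, can even come out negative
two_pass = np.mean((data - np.mean(data))**2)      # ~1.0, as expected
print(naive, two_pass)
```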
I think it's because the errors introduced are extremely small compared to the magnitude of the features.
@paulinpaloalto @TMosh @bupeigon I don't claim to be an expert on this, but I have also read some papers lately where they are driving things down to 8-bit (or even lower) precision for the sake of efficiency.
If I had to take a wild guess, I think the magnitude of the models in play drowns out the error (similar to what @TMosh says).
Do you have a link to any such papers that we could peruse?
I gave the answer above: everything we do is an approximation. If your algorithms are properly designed, then the rounding errors don't compound and don't prevent us from getting an approximate solution that is "close enough for jazz" and works. My hunch is that whatever you are referring to here about working in 8-bit spaces is addressing something else.
It has been demonstrated that you can land a spacecraft on Mars with 64 bit floating point. Close enough for government work.
Lemme look; I think this is in the context of LLMs. Personally, Paul, since I know hardware, I've always wanted to spin one of these things up in pure analog. Not there yet.
@paulinpaloalto Not the most definitive source, just a quick search:
https://www.eetimes.com/ibm-brings-8-bit-ai-training-to-hardware/
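I haven't dug into that article in detail, but the usual idea behind low-precision training and inference is some form of scale-and-round quantization, where the error introduced per element stays bounded. Here is a rough, hypothetical sketch of symmetric int8 quantization of a weight tensor (my own toy code, not IBM's actual scheme):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map floats into int8 codes in [-127, 127]."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# The reconstruction error is bounded by half a quantization step per element.
print(np.max(np.abs(w - w_hat)), scale / 2)
```

The point being that even though each weight only has 256 possible values, the per-element error is bounded by half a quantization step, which is often small relative to the noise already present in training (similar to what @TMosh said above).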
Previously, I've had the "experience" of doing floating point math in assembly language on a machine that only supported 8-bit integers. It was not a lot of fun.
I greatly enjoy the current crop of high-level programming tools and math libraries.
@TMosh Actually, what inspired me to get back into this whole mess was working with real-world machines, or physical hardware... You have to debounce switches and all sorts of crazy stuff, plus you have to deal with race conditions and how the signal propagates around the board.
In contrast, working on a desktop today is "pretty easy".
With enough memory and a fast enough processor and high-level programming tools, software becomes all too easy.
@TMosh Apocryphal, but "640K should be enough for everybody," am I right?
Indeed.