Can an LLM be approximated by 2 or more low-rank matrices? I understand that the activation terms could be a major challenge. Nevertheless, are there low-rank matrices whose products and sums would approximate an LLM?
In part, yes. Low-rank matrix factorization can be applied to approximate specific weight matrices within a large language model (LLM), but approximating the model's overall functionality is much harder: the non-linear activations and the interactions between layers may not be fully captured by a low-rank representation, particularly for complex language-processing tasks.
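Purely as an illustration of the first half of that answer (approximating a single weight matrix in isolation), here is a minimal NumPy sketch using truncated SVD; the matrix, its shape, and the chosen rank are made-up stand-ins, not values from any real model:

```python
import numpy as np

# Stand-in for one dense weight matrix from a transformer layer
# (shape and values are arbitrary, purely for illustration).
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 256))

# Truncated SVD: keep only the top-r singular directions.
r = 32
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]        # shape (1024, r)
B = Vt[:r, :]               # shape (r, 256)
W_approx = A @ B            # rank-r approximation of W

rel_err = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
print(f"rank-{r} relative Frobenius error: {rel_err:.3f}")

# Storage comparison: full matrix vs. the two low-rank factors.
print("original params:", W.size, "| factored params:", r * (W.shape[0] + W.shape[1]))
```

Note that a random matrix like this has a fairly flat singular-value spectrum, so the error at a given rank will be pessimistic; how well this works on real trained weights depends entirely on their spectra. And it only addresses individual matrices, not the non-linear composition the rest of the thread is about.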
Thank you, @aryan010204, for indulging my curiosity.
I am making a purely intuitive argument. ReLU activations are piecewise linear (unless I use a fancy one). If I only use the old classical ReLU as my non-linear function, why won't the overall function (the NN) be piecewise linear? Intuitively, the whole NN should itself be piecewise linear. I might have to stitch a bunch of matrices together the way a ReLU is stitched together (a zero function for negative values and a linear term for positive values). It feels like one is dealing with a bunch of hyperplanes in a "positive octant" (what is an octant in 60,000 dimensions, an orthant, I suppose).
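To make that intuition concrete, here is a small NumPy sketch (a toy two-layer network with random weights; everything here is made up for illustration). Within a region where the ReLU activation pattern is fixed, the network collapses to a single affine map W_eff x + b_eff, which is exactly the "stitched matrices" picture:

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny 2-layer ReLU MLP with random weights (purely illustrative).
W1, b1 = rng.standard_normal((16, 8)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((4, 16)), rng.standard_normal(4)

def mlp(x):
    h = np.maximum(W1 @ x + b1, 0.0)   # ReLU hidden layer
    return W2 @ h + b2

x = rng.standard_normal(8)

# The activation pattern (which ReLUs are "on") selects one linear piece.
mask = (W1 @ x + b1 > 0).astype(float)

# Within that piece the network is exactly an affine map.
W_eff = W2 @ (np.diag(mask) @ W1)
b_eff = W2 @ (mask * b1) + b2

# A small perturbation that (almost surely) keeps the same activation pattern.
x2 = x + 1e-4 * rng.standard_normal(8)
assert np.all((W1 @ x2 + b1 > 0).astype(float) == mask)

print(np.allclose(mlp(x2), W_eff @ x2 + b_eff))   # True: same linear piece
```

The number of such pieces grows very quickly with depth and width, which is where the practical difficulty of "writing the NN down as a few matrices" comes from.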
The only thing that remains is the softmax. Does it have a Taylor expansion? (It is smooth, so it must.) There may also be a piecewise linear approximation to the softmax.
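For the Taylor-expansion part, here is a minimal NumPy check (toy logits, no claim about how far the expansion stays accurate): the softmax Jacobian at z is diag(s) - s s^T with s = softmax(z), which gives a first-order expansion around z.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
z = rng.standard_normal(5)            # toy logits
s = softmax(z)

# Jacobian of softmax at z: J = diag(s) - s s^T
J = np.diag(s) - np.outer(s, s)

# First-order Taylor expansion around z, evaluated at a nearby point.
dz = 1e-2 * rng.standard_normal(5)
linear_approx = s + J @ dz

print("exact    :", softmax(z + dz))
print("1st order:", linear_approx)
print("max abs error:", np.max(np.abs(softmax(z + dz) - linear_approx)))
```

So locally the softmax linearizes fine; the open question is how many such local pieces (or Taylor patches) you would need to cover the inputs an LLM actually sees.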