Need math help with an idea for incremental learning

I am facing a need for incremental learning as new samples keep coming in with time for my model.

An idea came to mind and I wanted to bounce this off you, and potentially implement it together with a mathematician, who is interested to participate in breaking new grounds.

So, here is what I think.

In loss function space, gradient descent is a number of epoch step vectors that eventually converge in a minimum. Each step is calculated from a batch of N samples. It would make sence to say, that each sample has it’s own vector that “pulls” descent into direction of its own loss minimum.

Now, let’s pretend that we train model in a way, that preserves these individual vectors for each weight. Now sum of these vectors must be the resulting batch descent vector.

I may be horribly wrong, but if sum them up not batch-wise, but epoch-wise, so that each sample has it’s own sum from start to end, then adding new sample would consist of two steps: 1. Training a model with this one new sample and 2. Summing up with other sample vectors.

The aim of this pondering is to overcome problem of having to re-train on the entire dataset when new samples are added, and, of course, catastrofic forgetting/overfitting

Perhaps, some adjustments need to be made to take it to a live theory, but this is where I need math help.

Alexander Podgorny

I may be misunderstanding, but it seems the method you describe is well-known as stochastic gradient descent. It’s essentially the same as batch gradient descent, but with a batch size of 1.

You can run another step of SGD every time you get a new example.

Not exactly, instead of suming gradients across batch, I describe suming them across epochs after training has been done, so that additional examples can fine-tune model as if they participated in the gradient descent along with entire set, but without having to actually do all the calculations again.

Say, we have samples A and B. We run training across X epochs. Calculate vectors vec_A and vec_B for each weight by suming them across X epochs.
Then, when C sample arrives, we run descent on C alone, build vec_C. (Assuming the same initial random state of all weights). Then do some operation on all vec_* vectors (possibly sum) to point to common loss minimum.

My question is if this is mathematically possible to calculate common minimum from these vectors. And if yes – how.

The difference from BGD is that we may take advantage of caching gradient values from previous training as resulting sums of entire (or large–chunked) process, and speed up model update with new samples. Sort of like batches, only not across data, but across epochs. (time-batches? :slight_smile: )

Sounds like running batch GD initially, then stochastic GD on later examples.

I must be missing something.