So say I have 6-7 different tasks and I do PEFT with DistilBERT using LoRA. That would mean I have 6-7 LoRA adapters (one set of low-rank matrices per task).
When doing inference for these tasks, can I do inference for the tasks in parallel or do I have to go sequentially one at a time for each task?
I’d like to think that if I have different Python files or different methods to run inference for each task, I can run inference for any task at any time.
The reason I ask is that the LoRA matrix has to be swapped out each time a different task is chosen for inference.
It should be possible to run in parallel: just create several copies of the model and then attach different matrices to them to perform the different tasks!
Is there any way to do this without creating copies of the base DistilBERT to which we attach the matrices? Is there any way to do this with just one copy of the base DistilBERT?
I don’t know specifically about this, but in the TensorFlow Advanced Specialisation they introduce running a model in parallel! Maybe that can help you!
Hi @Ayush_Nigade
Did you try QLoRA, which quantizes the base model's weights to lower precision while the LoRA adapters stay in higher precision?
Then you can probably save the matrices for the multiple models and train in parallel as well.
You don’t need to make multiple copies of base DistilBERT.
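To illustrate why one copy is enough, here is a minimal sketch in plain Python of the arithmetic behind LoRA: the output is the shared base projection plus a small per-task low-rank correction. Everything in it (the task names, `BASE_W`, `ADAPTERS`, the tiny 3x2 shapes, and the helper functions) is made up for illustration and is not DistilBERT or any library's API. A real setup would normally use an adapter library on top of the actual model.

```python
# Toy sketch: one shared base weight, one small LoRA pair (A, B) per task.
# All names and numbers are hypothetical and chosen only for illustration.

def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]

# Shared base weight (3 inputs -> 2 outputs); stored once, never copied.
BASE_W = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0]]

# Per-task rank-1 adapters: A is 3x1, B is 1x2.
ADAPTERS = {
    "sentiment": {"A": [[1.0], [0.0], [0.0]], "B": [[0.5, 0.0]]},
    "topic":     {"A": [[0.0], [1.0], [0.0]], "B": [[0.0, 2.0]]},
}

def forward(x, task, scale=1.0):
    """Compute y = x W + scale * (x A_task) B_task with the shared BASE_W."""
    base_out = matvec(x, BASE_W)
    adapter = ADAPTERS[task]
    low_rank = matvec(matvec(x, adapter["A"]), adapter["B"])
    return [b + scale * d for b, d in zip(base_out, low_rank)]

def forward_mixed_batch(batch):
    """Handle a batch whose rows belong to different tasks."""
    return [forward(x, task) for x, task in batch]

# The same input routed to two different tasks in one batch.
batch = [([1.0, 2.0, 3.0], "sentiment"), ([1.0, 2.0, 3.0], "topic")]
outputs = forward_mixed_batch(batch)
print(outputs)
```

Only the small `A` and `B` matrices differ between tasks, so memory grows with the adapter size rather than with the size of the base model. Choosing a task is just a lookup, which is why requests for different tasks can share one base model instead of each needing its own copy.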
I suppose @gent.spah is referring to the models you create from the base DistilBERT.