I want to train a VLM from an LLM, but at inference time does the input have to contain an image embedding? And if I train with PEFT, will the LLM's performance on its other tasks be weakened?
My goal is to keep the LLM's strengths on its other tasks while it learns more about vision, and at inference time let the user flexibly switch between providing an image and not providing one. Is there a way to do this?
Can I build something similar to BLIP2 but with an LLM of my own choosing?
Many Thanks
PEFT training does not interfere with the original LLM weights; it just adds additional trained matrices (the adapter) attached to the original LLM. So when you use the VLM, as you call it, without an image, you could bypass the PEFT-fine-tuned adapter in some way and go straight to the original LLM.
Or maybe you can send separate inputs to the original model and the PEFT model as required, to get responses more aligned with your expectations; otherwise the PEFT model carries the additional adapter weights and probably won't give the same answer as the original LLM.
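As a rough illustration of the "bypass the adapter when there is no image" idea: if the vision adapter is a LoRA trained with the peft library, you can temporarily disable it at inference time so the forward pass uses only the original LLM weights. This is just a minimal sketch under that assumption; the model names and adapter path are placeholders, and your actual VLM would also need the vision encoder / projector handled separately.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder names -- substitute your own base LLM and trained adapter.
base = AutoModelForCausalLM.from_pretrained("your-base-llm")
tokenizer = AutoTokenizer.from_pretrained("your-base-llm")
model = PeftModel.from_pretrained(base, "path/to/your-vision-lora")

inputs = tokenizer("Describe the weather today.", return_tensors="pt")

# Text + image path: adapter active (image embeddings would be injected here).
with_adapter = model.generate(**inputs, max_new_tokens=50)

# Text-only path: temporarily disable the adapter, so the output comes
# from the unmodified base LLM weights.
with model.disable_adapter():
    text_only = model.generate(**inputs, max_new_tokens=50)
```

Because the LoRA matrices are kept separate from the base weights (as long as you don't merge them), switching the adapter on and off like this lets one deployment serve both the vision-augmented and the plain-LLM behaviour.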