I have a few questions about quantizing large models. Let's take the Mixtral 8x7B model
as an example, and assume I have an A100 GPU with 40 GB of VRAM.
Scenario 1: Using the model for prompt engineering
- I download the full Mixtral 8x7B model and save it in a folder.
- I quantize it to FP16 before prompting it so that it fits on one GPU.
Question for Step 2: When quantizing the model to FP16, does it first convert the weights to FP16 and then load them into GPU memory?
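For concreteness, here is roughly what I mean by steps 1 and 2, as a minimal sketch assuming the Hugging Face transformers stack (the model ID and local folder are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

local_dir = "./mixtral-8x7b"  # placeholder: folder where the full checkpoint was saved

tokenizer = AutoTokenizer.from_pretrained(local_dir)

# torch_dtype=torch.float16 casts the weights to FP16 as they are loaded,
# rather than first materializing a full-precision copy on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    local_dir,
    torch_dtype=torch.float16,
    device_map="auto",  # places layers on the available GPU(s)
)
```

If that is right, only the FP16 copy should ever need to occupy GPU memory, which is what my question for Step 2 is getting at.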
Scenario 2: Fine-tuning the model
- I download the full Mixtral 8x7B model and save it in a folder.
- I quantize it to FP16 before fine-tuning it so that it fits on one GPU.
- I fine-tune the model and save it.
Question for Step 3: Will the model be saved in FP16 precision after fine-tuning?
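And here is a sketch of what I mean in Scenario 2, again assuming transformers; the fine-tuning loop itself is elided and the paths are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM

# Load in FP16, as in Scenario 1 (path is a placeholder).
model = AutoModelForCausalLM.from_pretrained(
    "./mixtral-8x7b",
    torch_dtype=torch.float16,
    device_map="auto",
)

# ... fine-tuning loop elided ...

# As I understand it, save_pretrained serializes the weights in whatever
# dtype they currently hold, so a model fine-tuned in FP16 stays FP16 on disk.
model.save_pretrained("./mixtral-8x7b-finetuned")
print(next(model.parameters()).dtype)  # expected: torch.float16
```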