Quantization - Simple practical aspects

I have a few questions about quantizing large models. Let’s take the Mixtral 8x7B model as an example, and assume I have an A100 GPU with 40 GB of memory.

Scenario 1: Using the model for prompt engineering

  1. I download the full Mixtral 8x7B model and save it in a folder.
  2. I quantize it to FP16 before prompting it so that it fits on one GPU.

Question for step 2: when quantizing the model to FP16, are the weights first converted to FP16 and then loaded into GPU memory?
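Concretely, I have something like this in mind for step 2 (just a sketch assuming the transformers library; the checkpoint ID and arguments are my guesses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # assumed Hub checkpoint ID

tokenizer = AutoTokenizer.from_pretrained(model_id)

# torch_dtype=torch.float16 casts each weight tensor to FP16 as it is loaded,
# and device_map="auto" places the layers on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
```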

Scenario 2: Fine-tuning the model

  1. I download the full Mixtral 8x7B model and save it in a folder.
  2. I quantize it to FP16 before fine-tuning it so that it fits on one GPU.
  3. I fine-tune the model and save it.

Will the model be saved in FP16 precision after fine-tuning?
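Roughly, I imagine step 3 looking something like this (only a sketch; the training loop is omitted, the checkpoint ID is my guess, and the output folder name is made up):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # assumed Hub checkpoint ID
output_dir = "mixtral-8x7b-finetuned"                # made-up output folder

# Load in FP16 as in Scenario 1 (ignoring here whether it actually fits in memory).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# ... fine-tuning would happen here, e.g. with transformers.Trainer ...

# save_pretrained writes the weights in whatever dtype the model currently holds,
# so a model kept in FP16 throughout training is written back to disk as FP16.
model.save_pretrained(output_dir)
```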

Hi @iamchatgpt

Yes: when you request FP16 at load time, the weights are read from disk, cast to FP16, and then placed in GPU memory, so only the half-precision copy ends up on the GPU. This halves the memory footprint compared to FP32 and makes it easier to fit the model on the GPU.

And yes: if the fine-tuning process keeps the weights in FP16, the model will be saved in FP16 precision after fine-tuning. The checkpoint is written in whatever dtype the weights are held in at save time.
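If you want to double-check, you can reload the saved checkpoint and inspect a parameter’s dtype (folder name taken from the sketch above):

```python
import torch
from transformers import AutoModelForCausalLM

# torch_dtype="auto" keeps the dtype recorded in the saved config instead of
# upcasting to FP32, so the printed dtype reflects what is actually on disk.
model = AutoModelForCausalLM.from_pretrained(
    "mixtral-8x7b-finetuned",   # hypothetical folder from the fine-tuning step
    torch_dtype="auto",
)
print(next(model.parameters()).dtype)   # torch.float16 if FP16 was preserved
```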

Hope this helps!


Thank you @Alireza_Saei

You’re welcome, happy to help :raised_hands: