Quantization - Simple practical aspects

I have a few questions about quantizing large models. Let’s take the Mixtral 8x7B model as an example, and assume I have an A100 GPU with 40 GB of memory.

Scenario 1: Using the model for prompt engineering

  1. I download the full Mixtral 8x7B model and save it in a folder.
  2. I quantize it to FP16 before prompting it so that it fits on one GPU.

Question for step 2: when quantizing the model to FP16, are the weights first converted to FP16 and then loaded into GPU memory?
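Concretely, I have something like this in mind for step 2 (just a sketch assuming the transformers library; the checkpoint ID and arguments are my guesses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # assumed Hub checkpoint ID

tokenizer = AutoTokenizer.from_pretrained(model_id)

# torch_dtype=torch.float16 casts each weight tensor to FP16 as it is loaded,
# and device_map="auto" places the layers on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
```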

Scenario 2: Fine-tuning the model

  1. I download the full Mixtral 8x7B model and save it in a folder.
  2. I quantize it to FP16 before fine-tuning it so that it fits on one GPU.
  3. I fine-tune the model and save it.

Will the model be saved in FP16 precision after fine-tuning?
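Roughly, I imagine step 3 looking something like this (only a sketch; the training loop is omitted, the checkpoint ID is my guess, and the output folder name is made up):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # assumed Hub checkpoint ID
output_dir = "mixtral-8x7b-finetuned"                # made-up output folder

# Load in FP16 as in Scenario 1 (ignoring here whether it actually fits in memory).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# ... fine-tuning would happen here, e.g. with transformers.Trainer ...

# save_pretrained writes the weights in whatever dtype the model currently holds,
# so a model kept in FP16 throughout training is written back to disk as FP16.
model.save_pretrained(output_dir)
```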

Hi @iamchatgpt

Yes: when you request FP16 at load time, the weights are read from disk, cast to FP16, and then placed in GPU memory, so only the half-precision copy ends up on the GPU. This halves the memory footprint compared to FP32 and makes it easier to fit the model on the GPU.

And yes: if the fine-tuning process keeps the weights in FP16, the model will be saved in FP16 precision after fine-tuning. The checkpoint is written in whatever dtype the weights are held in at save time.
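If you want to double-check, you can reload the saved checkpoint and inspect a parameter’s dtype (folder name taken from the sketch above):

```python
import torch
from transformers import AutoModelForCausalLM

# torch_dtype="auto" keeps the dtype recorded in the saved config instead of
# upcasting to FP32, so the printed dtype reflects what is actually on disk.
model = AutoModelForCausalLM.from_pretrained(
    "mixtral-8x7b-finetuned",   # hypothetical folder from the fine-tuning step
    torch_dtype="auto",
)
print(next(model.parameters()).dtype)   # torch.float16 if FP16 was preserved
```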

Hope this helps!


Thank you @Alireza_Saei

You’re welcome, happy to help :raised_hands: