Since the size of the dequantized model is almost the same as the original model, why does this lesson quantize the model first and then dequantize it for inference?
Good question! In a real-world scenario, you would dequantize each layer just-in-time and then discard the dequantized weights after computing that layer's activations, to free up GPU memory. This incurs more compute overhead but retains the memory savings of quantization (any given layer is relatively small even when dequantized).
However, writing the PyTorch code to do this is kind of messy and would have added a lot of extra time to the lesson, so we chose to skip that implementation and instead focus on measuring the quantization error.
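To give a flavor of the idea, here is a minimal sketch (not the lesson's code) of a linear layer that stores int8 weights plus per-output-channel scales and dequantizes them only inside `forward()`. The class name `JITDequantLinear` and the symmetric per-channel scheme are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JITDequantLinear(nn.Module):
    """Hypothetical sketch: keep int8 weights resident, dequantize just-in-time."""

    def __init__(self, int8_weight, scales, bias=None):
        super().__init__()
        # int8_weight: (out_features, in_features), scales: (out_features,)
        self.register_buffer("int8_weight", int8_weight)
        self.register_buffer("scales", scales)
        self.register_buffer("bias", bias)

    def forward(self, x):
        # Dequantize only for the duration of this forward pass.
        w = self.int8_weight.to(x.dtype) * self.scales.to(x.dtype).unsqueeze(1)
        out = F.linear(x, w, self.bias)
        # `w` goes out of scope here, so the temporary dequantized copy can be
        # freed; only the compact int8 weights stay in GPU memory.
        return out
```

In practice you would also need to swap these modules into the model and handle dtypes and devices carefully, which is the "messy" part mentioned above.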
Hope that makes sense!