Since the size of the dequantized model is almost the same as the original model, why does this lesson quantize the model first and then dequantize it for inference?
Good question! In a real-world scenario, you would dequantize each layer just-in-time and then discard the dequantized weights after computing that layer's activations, to free up GPU memory. This incurs more compute overhead but retains the memory savings of quantization (any given layer is relatively small even when dequantized).
However, writing the PyTorch code to do this is kind of messy and would have added a lot of extra time to the lesson, so we chose to skip that implementation and instead focus on measuring the quantization error.
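To give a flavor of the idea, here is a minimal sketch (not the lesson's code) of a linear layer that stores int8 weights plus per-output-channel scales and dequantizes them only inside `forward()`. The class name `JITDequantLinear` and the symmetric per-channel scheme are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JITDequantLinear(nn.Module):
    """Hypothetical sketch: keep int8 weights resident, dequantize just-in-time."""

    def __init__(self, int8_weight, scales, bias=None):
        super().__init__()
        # int8_weight: (out_features, in_features), scales: (out_features,)
        self.register_buffer("int8_weight", int8_weight)
        self.register_buffer("scales", scales)
        self.register_buffer("bias", bias)

    def forward(self, x):
        # Dequantize only for the duration of this forward pass.
        w = self.int8_weight.to(x.dtype) * self.scales.to(x.dtype).unsqueeze(1)
        out = F.linear(x, w, self.bias)
        # `w` goes out of scope here, so the temporary dequantized copy can be
        # freed; only the compact int8 weights stay in GPU memory.
        return out
```

In practice you would also need to swap these modules into the model and handle dtypes and devices carefully, which is the "messy" part mentioned above.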
Hope that makes sense!