C3W2 Lab 3: Quantized model took more memory than non-quantized TFLite model


I printed the sizes of the various trained models in the quantization and pruning lab. I noticed that the quantization-aware training (QAT) model occupies more memory (actually similar to the baseline) than the non-quantized TFLite model. I understand this is probably because it has more parameters than the baseline, but can we still consider this a significant memory improvement? What more can we do to decrease the memory requirement?

hi @gsasikiran ,

I tried to reproduce this by running the notebook both on Colab and on my local machine, and I see that the memory did reduce to around a fourth:

The reduction comes from converting the 32-bit representations (float) into 8 bits (integer), so it should have reduced the size anyway.
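As a rough back-of-envelope check (a sketch, not the lab's actual code), storing weights as 8-bit integers instead of 32-bit floats cuts the raw weight storage to a quarter. The tensor shape and the simple affine quantization scheme below are assumptions for illustration:

```python
import numpy as np

# Hypothetical weight tensor standing in for a model's parameters.
weights_fp32 = np.random.randn(1000, 100).astype(np.float32)

# Simple affine quantization to 8 bits (conceptually what TFLite does).
scale = (weights_fp32.max() - weights_fp32.min()) / 255.0
zero_point = weights_fp32.min()
weights_int8 = np.round((weights_fp32 - zero_point) / scale).astype(np.uint8)

print(weights_fp32.nbytes)  # 400000 bytes (4 bytes per weight)
print(weights_int8.nbytes)  # 100000 bytes -> one fourth
```

The real saving in the converted `.tflite` file is close to this 4x ratio because the weights dominate the file size.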

Can you repeat the observation after doing “Restart and run all” on the runtime?
Did any error occur during quantization in the cell above?

Hello @tranvinhcuong.
I see that you are applying post-training quantization again after QAT. What I mean is: shouldn't QAT itself reduce the model size in the first place, without post-training quantization? The results I mentioned are for the QAT model without post-training quantization.

Correct me if I am wrong.

hi @gsasikiran , I see what you meant now.

In my understanding, quantization-aware training does not reduce the memory size by itself. It is post-training quantization that actually truncates the parameters to reduce the memory size, which may reduce the overall performance of the model. Quantization-aware training introduces fake quantization nodes into the model and then trains it. This makes the model more robust to the quantization that is applied later.

Below is the transcript I copied from the lecture

The core idea is that quantization aware training simulates low precision inference time computation in the forward pass of the training process. By inserting fake quantization nodes, the rounding effects of quantization are simulated in the forward pass, as they would normally occur in actual inference. The goal is to fine-tune the weights to adjust for the precision loss. Fake quantization nodes are included in the model graph at the points where quantization is expected to occur, for example, convolutions. Then, in the forward pass, the float values will be rounded to the specified number of levels to simulate the effects of quantization. This introduces the quantization error as noise during training, and it becomes part of the overall loss which the optimization algorithm tries to minimize. Here, the model learns parameters that are more robust to quantization.
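A minimal sketch (my own illustration, not code from the lab or from `tensorflow_model_optimization`) of what a fake quantization node does in the forward pass: the values stay float32, but they are rounded to a limited number of levels and dequantized back, so the quantization error appears as noise during training:

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize-dequantize: values remain float32, but are snapped
    to 2**num_bits levels, injecting quantization error as noise."""
    levels = 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / levels
    q = np.round((x - x_min) / scale)              # integer levels
    return (q * scale + x_min).astype(np.float32)  # back to float

# With only 2 bits (4 levels) the rounding effect is clearly visible.
x = np.linspace(-1.0, 1.0, 5).astype(np.float32)
xq = fake_quantize(x, num_bits=2)
```

Because the output is still float, the surrounding training graph is unchanged; the optimizer simply sees the rounding error folded into the loss, which is what lets the model adapt to it.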

Hope it helps,


Thank you @tranvinhcuong. I got it now.