CLIP model quantized by quanto runs slower

kechan · June 3, 2024, 4:14pm

I have noticed that after quantizing with quanto, the model runs quite a bit slower than the original. I logged an issue on GitHub but was told this is “normal”: while the memory saving is more obvious, an improvement in inference latency is not guaranteed. I was told this is rather complicated and needs optimization. Here’s my Colab notebook with some simple timings.
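For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch. It is not the code from my notebook: the helper names `benchmark` and `slowdown` are made up for illustration, and the two workloads at the bottom are stand-ins for "original model forward pass" and "quantized model forward pass". It only uses the Python standard library, runs a few warmup calls first, and reports the median so that one slow call does not skew the result.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=10):
    """Return the median wall-clock seconds per call of fn().

    Hypothetical helper: warmup calls are run first and discarded so that
    one-time costs (caching, lazy initialization) are not counted.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slowdown(baseline_fn, candidate_fn, **kwargs):
    """Return candidate time divided by baseline time (>1 means slower)."""
    base = benchmark(baseline_fn, **kwargs)
    cand = benchmark(candidate_fn, **kwargs)
    return cand / base


if __name__ == "__main__":
    # Stand-in workloads; replace with calls to the original and the
    # quantized model on the same inputs.
    original = lambda: sum(range(10_000))
    quantized = lambda: sum(range(100_000))
    print(f"slowdown: {slowdown(original, quantized):.2f}x")
```

To time real models, pass something like `lambda: model(**inputs)` for each variant, where `model` and `inputs` are whatever you already have loaded. If the model runs on a GPU, the timed function should also wait for the device to finish (in PyTorch, `torch.cuda.synchronize()`), otherwise the measurement can stop before the work is actually done.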

I am posting here in case others have tried this and seen something similar, and in case the Hugging Face instructors can comment.

I will take the in-depth short course on this topic and hope to understand why it is slow and whether I can improve it. The technical explanation is rather opaque for someone not familiar with the topic, and it gets pretty low level at times.
