CLIP model quantized with quanto runs slower

I have noticed that after quantizing with quanto, the model runs quite a bit slower than the original. I logged an issue on GitHub but was told this is “normal”: while the memory saving is clear, lower inference latency is not guaranteed. I was told the reasons are rather complicated and that getting a speedup needs further optimization. Here’s my Colab notebook with some simple timings.
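For anyone who wants to check this on their own setup, here is a minimal sketch of how such a timing comparison could be done. It is not the code from the notebook: `time_call` and `compare` are hypothetical helper names, and the two callables stand in for "run the original model on a fixed batch" and "run the quantized model on the same batch". It uses warm-up runs and the median so that one-off costs do not skew the result.

```python
import statistics
import time


def time_call(fn, warmup=3, repeats=20):
    """Return the median wall-clock seconds per call of fn().

    Assumption: fn() runs one full forward pass and blocks until it is done.
    On a GPU, kernels launch asynchronously, so fn should synchronize
    (e.g. torch.cuda.synchronize()) before returning.
    """
    # Warm-up calls are not timed; they absorb lazy initialization and caching.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline_fn, quantized_fn, warmup=3, repeats=20):
    """Time both callables and report how much slower the quantized one is."""
    base = time_call(baseline_fn, warmup=warmup, repeats=repeats)
    quant = time_call(quantized_fn, warmup=warmup, repeats=repeats)
    # Guard against a zero baseline on very coarse timers.
    slowdown = quant / base if base > 0 else float("nan")
    return {"baseline_s": base, "quantized_s": quant, "slowdown": slowdown}
```

A `slowdown` above 1.0 means the quantized callable took longer per call than the baseline. Both callables should be given the same inputs on the same device, otherwise the numbers are not comparable.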

I am posting here in case others have tried this and seen something similar, and in case the Hugging Face instructors can comment.

I will take the in-depth short course on this topic and hope to understand why it is slow and whether I can improve it. The technical explanation is rather opaque for someone not familiar with the area, and it gets pretty low level at times.