I understand the concept of latency in quantization, but I would like to know more about which factors a model's latency depends on. Do more parameters in the model decrease latency? What other factors influence latency?
Hi Sasi! Yes, model size has a big impact on latency. Note that lower is better when talking about latency, and more parameters will generally increase it rather than decrease it, because each request needs more computation and more memory access. Other factors include the device the model is deployed on and the serving infrastructure that accepts the requests. This is a huge topic, and in general latency can be improved at both the model level and the serving level. You can read more about that here and here. Hope this helps.
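If you want to see the parameter effect for yourself, here is a minimal sketch in plain Python. It is only an illustration, not a real benchmark: the toy "model" is a stack of square dense layers, and the names `make_model`, `count_parameters`, `forward`, and `measure_latency_ms` are hypothetical helpers made up for this example. Absolute numbers will differ a lot by device, which is the hardware factor mentioned above.

```python
import random
import time
from statistics import median


def make_model(width, depth, seed=0):
    """Build a toy model: `depth` square weight matrices of size width x width."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]


def count_parameters(model):
    """Total number of weights across all layers."""
    return sum(len(row) for layer in model for row in layer)


def forward(model, x):
    """One inference pass: matrix-vector product plus ReLU for each layer."""
    for layer in model:
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in layer]
    return x


def measure_latency_ms(model, x, warmup=2, runs=10):
    """Median wall-clock time of a single forward pass, in milliseconds."""
    for _ in range(warmup):  # warm-up passes are not timed
        forward(model, x)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        forward(model, x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


if __name__ == "__main__":
    # Doubling the width roughly quadruples the parameter count.
    for width in (32, 64, 128):
        model = make_model(width, depth=4)
        latency = measure_latency_ms(model, [1.0] * width)
        print(f"{count_parameters(model):>7} params: {latency:.3f} ms per request")
```

On most machines the printed latency should go up as the parameter count goes up. The median is used instead of the mean so that an occasional slow run (for example, from other processes on the machine) does not skew the result.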