In the lecture video called
"Post training quantization", at 2 min 4 secs, the lecturer appears (to me) to suggest that "fully fixed point inference" is something like an optimal method for reducing latency during inference:
“With dynamic range quantization, during inference, the weights are converted from eight bits to floating point, and the activations are computed using floating point kernels. This conversion is done once and cached to reduce latency. This optimization provides latencies which are close to fully fixed point inference.”
My question is: what does "fully fixed point inference" actually mean?
hi @shahin ,
You may already know that fixed-point number representation is a way of representing floating-point numbers using a fixed number of bits.
Some work has shown that fixed-point inference performs better than floating-point inference. The reason is that processing fixed-point numbers is essentially the same as processing integers. So in my understanding, "fully fixed point inference" means using a fixed-point representation for the model parameters (and activations) and performing the entire inference computation in that representation. This lets the computation run on fast integer hardware (or dedicated accelerators), which is faster compared to floating-point-based inference.
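To make that concrete, here is a minimal sketch (my own toy example, not the lecturer's exact scheme or the real TFLite kernels): weights and activations are mapped to 8-bit signed integers via a per-tensor scale, the multiply-accumulate runs entirely on integers, and a single float rescale is applied at the end. The function names `quantize` and `fixed_point_dot` are made up for illustration.

```python
def quantize(xs, num_bits=8):
    """Map floats to signed integers with a symmetric per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    scale = max(abs(x) for x in xs) / qmax      # one float scale per tensor
    q = [round(x / scale) for x in xs]          # integers in [-127, 127]
    return q, scale

def fixed_point_dot(w, a):
    """Dot product where the multiply-accumulate uses integers only."""
    qw, sw = quantize(w)
    qa, sa = quantize(a)
    acc = sum(x * y for x, y in zip(qw, qa))    # integer-only MAC
    return acc * (sw * sa)                      # one float rescale at the end

w = [0.5, -1.2, 0.8]
a = [1.0, 0.25, -0.5]
print(fixed_point_dot(w, a))  # approximates the float dot product, -0.2
```

The expensive inner loop (the sum of products) never touches floating point, which is what makes a "fully fixed point" pipeline fast on integer hardware; the cost is a small rounding error from the 8-bit representation.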
(For reference: this work also gives some explanation about fixed-point representation.)
Hope it helps,