Hi, I have a few questions about the short video Custom Build an 8-bit Quantizer. There, the output of the linear layer is said to be: `(input * weights) * scale + bias`.
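For concreteness, here is how I am reading that formula as code (a minimal sketch with my own names such as `w8a16_forward`, not necessarily the course's exact implementation):

```python
import torch
import torch.nn.functional as F

def w8a16_forward(x, int8_weights, scale, bias=None):
    # Matmul on the int8 weights upcast to the activation dtype.
    out = F.linear(x, int8_weights.to(x.dtype))
    # Dequantize afterwards by multiplying with the scale. Note there is
    # no zero-point term anywhere, which is what prompts my questions below.
    out = out * scale
    if bias is not None:
        out = out + bias
    return out
```

My questions: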
- Does this silently assume symmetric quantization of the weights, so that the zero point is zero and no zero-point term is needed in the formula?
- Does this silently assume that we quantized the whole weight matrix with a single scale, i.e. that we did not do per-row or per-group quantization?
- What is the purpose of defining the weights in the `__init__` method, since, if I understand it correctly, their values are never used? The same goes for the scales: they are calculated in the `quantize` method. (See the sketch after this list for the pattern I mean.)
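Roughly this pattern, again only a sketch of my reading (hypothetical class name, and a single per-tensor symmetric scale, which is exactly the part I am unsure about):

```python
import torch
import torch.nn as nn

class W8A16Linear(nn.Module):
    def __init__(self, in_features, out_features, dtype=torch.float32):
        super().__init__()
        # Random placeholders: as far as I can tell, only their shapes and
        # dtypes matter here, because quantize() overwrites the values.
        self.register_buffer(
            "int8_weights",
            torch.randint(-128, 127, (out_features, in_features), dtype=torch.int8),
        )
        self.register_buffer("scale", torch.randn(1, dtype=dtype))

    def quantize(self, weights):
        # One symmetric scale for the whole matrix (zero point = 0);
        # whether the video really assumes this is my question above.
        scale = weights.abs().max() / 127
        self.int8_weights = torch.round(weights / scale).to(torch.int8)
        self.scale = scale.to(weights.dtype)
```

Its `forward` would then just apply the formula from the top of the post.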
Any help would be appreciated.