
What are Q4_0 and Q4_1 in llama.cpp?

uint256_t

https://www.reddit.com/r/LocalLLaMA/comments/139yt87/comment/jj4qpbp/

  • q4_0 = 32 numbers in chunk, 4 bits per weight, 1 scale value at 32-bit float (5 bits per value on average); each weight is given by the common scale * quantized value.
  • q4_1 = 32 numbers in chunk, 4 bits per weight, 1 scale value and 1 bias value at 32-bit float (6 bits per value on average); each weight is given by the common scale * quantized value + common bias.
  • q4_2 = same as q4_0, but 16 numbers in chunk, 4 bits per weight, 1 scale value at 16-bit float; same size as q4_0 but better because chunks are smaller.
  • q4_3 = already dead, but analogous: q4_1 with 16 numbers in chunk, 4 bits per weight, a 16-bit scale value and a 16-bit bias; same size as q4_1 but better because chunks are smaller.
  • q5_0 = 32 numbers in chunk, 5 bits per weight, 1 scale value at 16-bit float; size is 5.5 bits per weight.
  • q5_1 = 32 numbers in chunk, 5 bits per weight, 1 scale value at 16-bit float and 1 bias value at 16-bit; size is 6 bits per weight.
  • q8_0 = same as q4_0, except 8 bits per weight and 1 scale value at 32 bits, making a total of 9 bits per weight.

To summarize:

 Format  Numbers in chunk  Bits per weight  Scale bits  Bias bits
 q4_0    32                4                32          -
 q4_1    32                4                32          32
 q4_2    16                4                16          -
 q4_3    16                4                16          16
 q5_0    32                5                16          -
 q5_1    32                5                16          16
 q8_0    32                8                32          -
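
To make the list and table above concrete, here is a minimal C sketch of what a q4_0 / q4_1 chunk and its dequantization could look like, assuming 32-value chunks with fp32 scale/bias as described. The names (chunk_q4_0, dequant_q4_0, ...) and the nibble packing are made up for illustration; the real llama.cpp structs (block_q4_0 and friends) differ in detail and have changed between versions.

```c
/* Hypothetical layouts matching the description above, not llama.cpp's own. */
#include <stdint.h>

#define CHUNK 32

typedef struct {
    float   scale;            /* one fp32 scale per 32 weights              */
    uint8_t q[CHUNK / 2];     /* 32 x 4-bit quantized values, two per byte  */
} chunk_q4_0;                 /* 4 + 16 bytes = 160 bits = 5 bits/weight    */

typedef struct {
    float   scale;            /* one fp32 scale per 32 weights              */
    float   bias;             /* plus one fp32 bias                         */
    uint8_t q[CHUNK / 2];
} chunk_q4_1;                 /* 8 + 16 bytes = 192 bits = 6 bits/weight    */

/* q4_0: weight = scale * q, with the 4-bit code treated as signed (-8..7) */
static void dequant_q4_0(const chunk_q4_0 *b, float *out)
{
    for (int i = 0; i < CHUNK / 2; i++) {
        int lo = (b->q[i] & 0x0F) - 8;
        int hi = (b->q[i] >> 4)   - 8;
        out[2 * i + 0] = b->scale * (float)lo;
        out[2 * i + 1] = b->scale * (float)hi;
    }
}

/* q4_1: weight = scale * q + bias, with the 4-bit code unsigned (0..15) */
static void dequant_q4_1(const chunk_q4_1 *b, float *out)
{
    for (int i = 0; i < CHUNK / 2; i++) {
        int lo = b->q[i] & 0x0F;
        int hi = b->q[i] >> 4;
        out[2 * i + 0] = b->scale * (float)lo + b->bias;
        out[2 * i + 1] = b->scale * (float)hi + b->bias;
    }
}
```

With 20 bytes per 32 weights for chunk_q4_0 and 24 bytes for chunk_q4_1, the 5 and 6 bits-per-weight figures in the table fall out directly.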
uint256_t

So the fully connected layers, or rather all the weights, are just split into chunks of 16/32 values and quantized per chunk?
Is that really good enough?

uint256_t

Thus, q4_2 is just a slightly improved q4_0. Both should be considered poor. The general theme is that without a separate bias value, the model effectively loses a bit whenever a run of values contains only positive or only negative numbers, because only half of the quantization codes can be used. q5_0 is only a little bigger than q4_2 and much better, though it suffers from the same issue; q5_1 is in fact quite close to the performance of the full-precision model, and q8_0 is barely distinguishable from full precision.
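
A throwaway C example of that effect, reusing the same hypothetical scale/bias conventions as the sketch above: quantize one all-positive chunk symmetrically (scale only, q4_0-style) and asymmetrically (scale plus bias, q4_1-style) and compare the reconstruction error.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    float w[32];
    for (int i = 0; i < 32; i++)          /* an all-positive chunk: 1.0 .. 4.1 */
        w[i] = 1.0f + 0.1f * (float)i;

    /* q4_0-style: symmetric codes -8..7, scale = max|w| / 7 */
    float amax = 0.0f;
    for (int i = 0; i < 32; i++) if (fabsf(w[i]) > amax) amax = fabsf(w[i]);
    float d0 = amax / 7.0f;

    /* q4_1-style: codes 0..15, scale = (max - min) / 15, bias = min */
    float mn = w[0], mx = w[0];
    for (int i = 0; i < 32; i++) { if (w[i] < mn) mn = w[i]; if (w[i] > mx) mx = w[i]; }
    float d1 = (mx - mn) / 15.0f, b1 = mn;

    double e0 = 0.0, e1 = 0.0;
    for (int i = 0; i < 32; i++) {
        int q0 = (int)roundf(w[i] / d0);          /* lands in 2..7 only for this data */
        int q1 = (int)roundf((w[i] - b1) / d1);   /* uses the full 0..15 range        */
        e0 += fabs(w[i] - d0 * q0);
        e1 += fabs(w[i] - (d1 * q1 + b1));
    }
    printf("mean abs error  q4_0-style %.4f   q4_1-style %.4f\n", e0 / 32, e1 / 32);
    return 0;
}
```

Because the chunk never goes negative, the scale-only code wastes its negative half, so its effective step is roughly three times coarser here.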

The bigger the model gets, though, the less it matters. It may be acceptable to use q4_2 for 33B or bigger models, for example. This is thought to come from redundancy in the models: there exist multiple logical circuits that mostly compute the same thing, so the quantization error may have less impact. Technologies such as GPTQ use a more complex approach whose objective is to average out the quantization error by using the other weight values to compensate for the error made in each weight. It is mathematical voodoo that I frankly do not understand, but judging by the resulting perplexity scores, it evidently works. GPTQ with group size 128 uses runs of 128 values at 4 bits and shares the same scale and bias across all of them, which can be converted to q4_1 without loss, though the size increases because the corresponding q4_1 chunk size is only 32, so the scale and bias are repeated 4 times.
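
That repacking is mechanical; a sketch under the same hypothetical struct layout as before (a real converter would also have to decode GPTQ's actual on-disk scale/zero-point format, which is not shown):

```c
/* One GPTQ-style group of 128 4-bit codes with a single scale and bias
 * becomes four q4_1-style chunks of 32, each carrying a copy of that
 * scale and bias. Struct layout and packing are the same hypothetical
 * convention as the earlier sketch. */
#include <stdint.h>
#include <string.h>

typedef struct {
    float   scale;
    float   bias;
    uint8_t q[16];            /* 32 x 4-bit codes, two per byte */
} chunk_q4_1;

void gptq_group128_to_q4_1(const uint8_t codes[64],   /* 128 x 4-bit codes, packed */
                           float scale, float bias,
                           chunk_q4_1 out[4])
{
    for (int c = 0; c < 4; c++) {
        out[c].scale = scale;                    /* the one group scale, repeated 4x */
        out[c].bias  = bias;                     /* the one group bias, repeated 4x  */
        memcpy(out[c].q, codes + 16 * c, 16);    /* 32 codes = 16 bytes per chunk    */
    }
}
```

The 4-bit codes themselves are untouched; only the per-chunk scale/bias overhead quadruples, which is where the size increase comes from.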

Hmm...