# T5 GGUF Analysis

This document records the T5-small GGUF evaluation run.

## Environment

Verified runtime:

| item | value |
| ---- | ----- |
| Python | `3.11.12` |
| Torch | `2.9.0+cu129` |
| Torch CUDA | `12.9` |
| CUDA available | `True` |
| GPU | `NVIDIA GeForce RTX 3070 Laptop GPU` |

## Models

The run evaluated these GGUFs:

| model | role |
| ----- | ---- |
| `t5-small-f32.gguf` | unquantized reference baseline |
| `t5-small-f16.gguf` | high-precision comparison and quantization source |
| `t5-small-q8_0.gguf` | quantized |
| `t5-small-q5_k_m.gguf` | quantized |
| `t5-small-q4_k_m.gguf` | quantized |
| `t5-small-q4_0.gguf` | quantized |
| `t5-small-q3_k_m.gguf` | quantized |
| `t5-small-q2_k.gguf` | quantized |

## Conversion Check Results

The conversion check compares greedy HF outputs against greedy f32 GGUF outputs. It validates that the unquantized GGUF is a usable reference before comparing quantized models against it.

| dataset | examples | exact match | chrF | first token match |
| ------- | -------: | ----------: | ---: | ----------------: |
| CoLA | 2,000 | 1.000 | 1.000 | 1.000 |
| summarization | 2,000 | 0.117 | 0.953 | 0.990 |
| translation en-de | 2,000 | 0.993 | 0.996 | 1.000 |
| translation en-fr | 2,000 | 0.986 | 0.995 | 1.000 |
| **overall** | **8,000** | **0.774** | **0.986** | **0.997** |

Interpretation:

- The f32 GGUF tracks HF closely overall.
- Summarization has low exact match but high chrF, which points to wording differences rather than broad conversion drift.
- Translation and CoLA are effectively matching at the output level.

## Generation Results

Generation used greedy decoding with `n_predict=64`. Agreement and similarity are measured against the f32 GGUF baseline output.

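As a rough illustration of how agreement and similarity against a baseline could be computed, here is a minimal sketch. It rests on two assumptions that this document does not confirm: agreement is the fraction of examples whose output string is identical to the f32 output, and similarity is the mean character-level `difflib.SequenceMatcher` ratio. The run's actual similarity metric is not specified here, and the function name `agreement_and_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def agreement_and_similarity(baseline_outputs, candidate_outputs):
    """Compare candidate generations against baseline generations.

    Assumptions (not taken from the evaluation code):
    - agreement  = fraction of examples with identical output strings
    - similarity = mean difflib.SequenceMatcher ratio over examples
    """
    if len(baseline_outputs) != len(candidate_outputs):
        raise ValueError("output lists must have the same length")
    if not baseline_outputs:
        raise ValueError("need at least one example")

    matches = 0
    ratio_sum = 0.0
    for base, cand in zip(baseline_outputs, candidate_outputs):
        # Exact string equality counts as agreement.
        matches += int(base == cand)
        # Character-level similarity in [0, 1]; 1.0 means identical strings.
        ratio_sum += SequenceMatcher(None, base, cand).ratio()

    n = len(baseline_outputs)
    return matches / n, ratio_sum / n


# Toy data standing in for f32 and quantized generations.
f32_out = ["Das ist ein Test.", "acceptable"]
q4_out = ["Das ist ein Test.", "unacceptable"]
agreement, similarity = agreement_and_similarity(f32_out, q4_out)
print(f"agreement={agreement:.3f} similarity={similarity:.3f}")
```

Under these assumptions, a one-word change in a long summary lowers agreement for that example to zero while similarity stays high. That would explain why the two metrics can diverge sharply on longer outputs.
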
| model | agreement vs f32 | similarity vs f32 |
| ----- | ---------------: | ----------------: |
| `t5-small-f16` | 0.990 | 0.998 |
| `t5-small-q8_0` | 0.723 | 0.947 |
| `t5-small-q5_k_m` | 0.526 | 0.889 |
| `t5-small-q4_k_m` | 0.474 | 0.870 |
| `t5-small-q4_0` | 0.417 | 0.837 |
| `t5-small-q3_k_m` | 0.375 | 0.814 |
| `t5-small-q2_k` | 0.287 | 0.660 |

Per-dataset generation metrics:

| dataset | model | exact match vs reference | chrF vs reference | agreement vs f32 | similarity vs f32 |
| ------- | ----- | -----------------------: | ----------------: | ---------------: | ----------------: |
| CoLA | `t5-small-f16` | 0.697 | 0.950 | 1.000 | 1.000 |
| CoLA | `t5-small-f32` | 0.697 | 0.950 | - | - |
| CoLA | `t5-small-q2_k` | 0.697 | 0.950 | 1.000 | 1.000 |
| CoLA | `t5-small-q3_k_m` | 0.697 | 0.949 | 1.000 | 1.000 |
| CoLA | `t5-small-q4_0` | 0.697 | 0.950 | 0.995 | 1.000 |
| CoLA | `t5-small-q4_k_m` | 0.698 | 0.950 | 0.999 | 1.000 |
| CoLA | `t5-small-q5_k_m` | 0.697 | 0.950 | 1.000 | 1.000 |
| CoLA | `t5-small-q8_0` | 0.697 | 0.950 | 1.000 | 1.000 |
| summarization | `t5-small-f16` | 0.000 | 0.133 | 0.979 | 0.995 |
| summarization | `t5-small-f32` | 0.000 | 0.133 | - | - |
| summarization | `t5-small-q2_k` | 0.000 | 0.068 | 0.000 | 0.254 |
| summarization | `t5-small-q3_k_m` | 0.000 | 0.123 | 0.039 | 0.510 |
| summarization | `t5-small-q4_0` | 0.000 | 0.123 | 0.071 | 0.550 |
| summarization | `t5-small-q4_k_m` | 0.000 | 0.131 | 0.137 | 0.642 |
| summarization | `t5-small-q5_k_m` | 0.000 | 0.128 | 0.210 | 0.689 |
| summarization | `t5-small-q8_0` | 0.000 | 0.133 | 0.541 | 0.852 |
| translation en-de | `t5-small-f16` | 0.020 | 0.361 | 0.989 | 0.999 |
| translation en-de | `t5-small-f32` | 0.020 | 0.361 | - | - |
| translation en-de | `t5-small-q2_k` | 0.015 | 0.315 | 0.090 | 0.738 |
| translation en-de | `t5-small-q3_k_m` | 0.018 | 0.353 | 0.234 | 0.876 |
| translation en-de | `t5-small-q4_0` | 0.019 | 0.357 | 0.304 | 0.905 |
| translation en-de | `t5-small-q4_k_m` | 0.019 | 0.359 | 0.380 | 0.920 |
| translation en-de | `t5-small-q5_k_m` | 0.019 | 0.359 | 0.448 | 0.935 |
| translation en-de | `t5-small-q8_0` | 0.019 | 0.360 | 0.680 | 0.970 |
| translation en-fr | `t5-small-f16` | 0.017 | 0.381 | 0.993 | 0.999 |
| translation en-fr | `t5-small-f32` | 0.017 | 0.381 | - | - |
| translation en-fr | `t5-small-q2_k` | 0.007 | 0.276 | 0.057 | 0.646 |
| translation en-fr | `t5-small-q3_k_m` | 0.015 | 0.368 | 0.226 | 0.868 |
| translation en-fr | `t5-small-q4_0` | 0.015 | 0.372 | 0.299 | 0.891 |
| translation en-fr | `t5-small-q4_k_m` | 0.017 | 0.377 | 0.380 | 0.919 |
| translation en-fr | `t5-small-q5_k_m` | 0.016 | 0.380 | 0.446 | 0.933 |
| translation en-fr | `t5-small-q8_0` | 0.016 | 0.380 | 0.672 | 0.967 |

Interpretation:

- `f16` is effectively equivalent to `f32` for generated outputs.
- `q8_0` preserves most behavior but still diverges on longer-form tasks.
- `q5_k_m` and `q4_k_m` are usable middle points depending on size and quality target.
- `q2_k` degrades heavily for summarization and translation.

## Perplexity And KL Results

Perplexity is reported per dataset. KL/token and top-1 disagreement are the main quantization drift metrics because they compare each quantized model directly against f32 token distributions.

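The two drift metrics can be sketched from per-token logits as follows. This is a minimal illustration, not the evaluation code. It assumes KL is taken in the direction KL(f32 || quantized) and that both models are scored on the same token positions. The function names `drift_metrics` and `token_weighted_mean` are hypothetical, and how logits were extracted from the GGUF models is not covered here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def drift_metrics(ref_logits, quant_logits):
    """Return (KL/token, top-1 disagreement, token count).

    Each argument is a list of per-token logit vectors for the same positions.
    Assumption: KL is computed as KL(reference || quantized).
    """
    if len(ref_logits) != len(quant_logits) or not ref_logits:
        raise ValueError("need two non-empty logit lists of equal length")

    kl_sum = 0.0
    disagreements = 0
    for ref, quant in zip(ref_logits, quant_logits):
        ref_logp = log_softmax(ref)
        quant_logp = log_softmax(quant)
        # KL(P || Q) = sum_i P_i * (log P_i - log Q_i)
        kl_sum += sum(math.exp(p) * (p - q) for p, q in zip(ref_logp, quant_logp))
        # Top-1 disagreement: the argmax token differs between the two models.
        ref_top = max(range(len(ref)), key=ref.__getitem__)
        quant_top = max(range(len(quant)), key=quant.__getitem__)
        disagreements += int(ref_top != quant_top)

    n = len(ref_logits)
    return kl_sum / n, disagreements / n, n


def token_weighted_mean(per_dataset):
    """Combine (value, token_count) pairs into one token-weighted mean."""
    total = sum(count for _, count in per_dataset)
    return sum(value * count for value, count in per_dataset) / total


# Toy logits: two token positions over a three-entry vocabulary.
ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
quant = [[1.8, 0.7, -1.0], [0.1, 3.1, 3.0]]
kl, top1, n_tokens = drift_metrics(ref, quant)
print(f"KL/token={kl:.5f} top-1 disagree={top1:.4f} tokens={n_tokens}")
```

Weighting per-dataset values by token count, as `token_weighted_mean` does, lets datasets with longer targets contribute more to the overall figure than datasets with very short outputs.
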
Token-weighted summary across all datasets:

| model | tokens | KL/token | top-1 disagree |
| ----- | -----: | -------: | -------------: |
| `t5-small-f16` | 308,028 | 0.00000 | 0.0005 |
| `t5-small-f32` | 308,028 | - | - |
| `t5-small-q8_0` | 308,028 | 0.00187 | 0.0160 |
| `t5-small-q5_k_m` | 308,028 | 0.01004 | 0.0386 |
| `t5-small-q4_k_m` | 308,028 | 0.02038 | 0.0521 |
| `t5-small-q4_0` | 308,028 | 0.04847 | 0.0704 |
| `t5-small-q3_k_m` | 308,028 | 0.05892 | 0.0897 |
| `t5-small-q2_k` | 308,028 | 0.27523 | 0.1914 |

Per-dataset perplexity:

| model | CoLA | summarization | translation en-de | translation en-fr |
| ----- | ---: | ------------: | ----------------: | ----------------: |
| `t5-small-f32` | 1.3490 | 138.5925 | 5.0317 | 3.8267 |
| `t5-small-f16` | 1.3491 | 138.6029 | 5.0317 | 3.8268 |
| `t5-small-q8_0` | 1.3494 | 133.1739 | 5.0314 | 3.8245 |
| `t5-small-q5_k_m` | 1.3498 | 139.2235 | 5.0748 | 3.8488 |
| `t5-small-q4_k_m` | 1.3535 | 155.2379 | 5.1135 | 3.8759 |
| `t5-small-q4_0` | 1.3593 | 215.7687 | 5.1394 | 3.9305 |
| `t5-small-q3_k_m` | 1.3490 | 153.6497 | 5.2163 | 3.9680 |
| `t5-small-q2_k` | 1.3577 | 262.6867 | 6.0281 | 4.4851 |

Per-dataset KL/token:

| model | CoLA | summarization | translation en-de | translation en-fr |
| ----- | ---: | ------------: | ----------------: | ----------------: |
| `t5-small-f16` | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| `t5-small-q8_0` | 0.00029 | 0.00194 | 0.00191 | 0.00181 |
| `t5-small-q5_k_m` | 0.00544 | 0.01159 | 0.00923 | 0.00838 |
| `t5-small-q4_k_m` | 0.00811 | 0.02593 | 0.01732 | 0.01437 |
| `t5-small-q4_0` | 0.01239 | 0.07497 | 0.02886 | 0.02339 |
| `t5-small-q3_k_m` | 0.00539 | 0.07696 | 0.04827 | 0.04073 |
| `t5-small-q2_k` | 0.00350 | 0.36274 | 0.22476 | 0.18650 |

Interpretation:

- The token-weighted KL ranking is clear: `f16`, `q8_0`, `q5_k_m`, `q4_k_m`, `q4_0`, `q3_k_m`, then `q2_k`. Summarization and both translation sets follow the same order. CoLA does not (`q2_k` at 0.00350 and `q3_k_m` at 0.00539 both sit below `q5_k_m` at 0.00544), but every CoLA value is small, at most 0.01239.
- `q8_0` has very small distributional drift from f32 (KL/token 0.00187, top-1 disagreement 0.0160).
- `q5_k_m` is the strongest compact quantization in this run.
- `q4_k_m` is materially better than `q4_0` by KL/token (0.02038 vs 0.04847) and top-1 disagreement (0.0521 vs 0.0704).
- `q2_k` has high drift: KL/token reaches 0.36274 on summarization and 0.22476 and 0.18650 on the two translation sets, and its token-weighted top-1 disagreement is 0.1914.

## Recommended Default

For T5-small in this workflow:

- Use `t5-small-f32.gguf` as the reference baseline.
- Use `t5-small-q8_0.gguf` when preserving behavior matters most.
- Use `t5-small-q5_k_m.gguf` as the best compact default from this run.
- Use `t5-small-q4_k_m.gguf` only when size matters more than quality.
- Avoid `t5-small-q2_k.gguf` for summarization or translation quality checks.

Google T5-small license: Apache 2.0.

We followed and adopted their license.