# T5 GGUF Analysis

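This document records the results of the T5-small GGUF evaluation run: the
Hugging Face to GGUF conversion check, generation agreement against the f32
baseline, and perplexity and KL drift for each quantization.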

## Environment

Verified runtime:

| item | value |
| ---- | ----- |
| Python | `3.11.12` |
| Torch | `2.9.0+cu129` |
| Torch CUDA | `12.9` |
| CUDA available | `True` |
| GPU | `NVIDIA GeForce RTX 3070 Laptop GPU` |

## Models

The run evaluated these GGUFs:

| model | role |
| ----- | ---- |
| `t5-small-f32.gguf` | unquantized reference baseline |
| `t5-small-f16.gguf` | high-precision comparison and quantization source |
| `t5-small-q8_0.gguf` | quantized |
| `t5-small-q5_k_m.gguf` | quantized |
| `t5-small-q4_k_m.gguf` | quantized |
| `t5-small-q4_0.gguf` | quantized |
| `t5-small-q3_k_m.gguf` | quantized |
| `t5-small-q2_k.gguf` | quantized |

## Conversion Check Results

The conversion check compares greedy-decoded Hugging Face (HF) outputs against
greedy-decoded f32 GGUF outputs. It validates that the unquantized GGUF is a
usable reference before quantized models are compared against it.

| dataset | examples | exact match | chrF | first token match |
| ------- | -------: | ----------: | ---: | ----------------: |
| CoLA | 2,000 | 1.000 | 1.000 | 1.000 |
| summarization | 2,000 | 0.117 | 0.953 | 0.990 |
| translation en-de | 2,000 | 0.993 | 0.996 | 1.000 |
| translation en-fr | 2,000 | 0.986 | 0.995 | 1.000 |
| **overall** | **8,000** | **0.774** | **0.986** | **0.997** |

Interpretation:

- The f32 GGUF tracks HF closely overall (chrF 0.986, first token match 0.997).
- Summarization has low exact match (0.117) but high chrF (0.953) and first
  token match (0.990), which points to wording differences rather than broad
  conversion drift.
- Translation and CoLA match almost exactly at the output level (exact match
  0.986 or higher).
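
The comparison behind the table can be pictured as a pairwise check over output strings. The sketch below is illustrative only: `conversion_metrics` is a hypothetical helper, not the script used for this run, the example strings are toy data, and whitespace splitting stands in for the tokenizer that the real first-token check presumably uses. chrF is left out because it needs a character n-gram scorer.

```python
def conversion_metrics(hf_outputs, gguf_outputs):
    """Compare paired greedy outputs from HF and the f32 GGUF.

    Hypothetical helper: whitespace splitting stands in for the tokenizer.
    """
    if len(hf_outputs) != len(gguf_outputs):
        raise ValueError("outputs must be paired one-to-one")
    if not hf_outputs:
        raise ValueError("need at least one example")

    exact = 0
    first = 0
    for hf, gguf in zip(hf_outputs, gguf_outputs):
        hf, gguf = hf.strip(), gguf.strip()
        exact += hf == gguf
        # Empty outputs have no first token; two empties count as a match.
        hf_first = hf.split()[0] if hf else ""
        gguf_first = gguf.split()[0] if gguf else ""
        first += hf_first == gguf_first

    n = len(hf_outputs)
    return {"exact_match": exact / n, "first_token_match": first / n}


# Toy strings, not taken from the evaluation data.
hf = ["acceptable", "Das ist gut.", "the cat sat on the mat"]
gguf = ["acceptable", "Das ist gut.", "the cat sat on a mat"]
print(conversion_metrics(hf, gguf))
```

The third toy pair starts identically and diverges later, which is the same pattern the summarization row shows: low exact match alongside a high first token match.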

## Generation Results

Generation used greedy decoding with `n_predict=64`. Agreement and similarity
are measured against the f32 GGUF baseline output.

| model | agreement vs f32 | similarity vs f32 |
| ----- | ---------------: | ----------------: |
| `t5-small-f16` | 0.990 | 0.998 |
| `t5-small-q8_0` | 0.723 | 0.947 |
| `t5-small-q5_k_m` | 0.526 | 0.889 |
| `t5-small-q4_k_m` | 0.474 | 0.870 |
| `t5-small-q4_0` | 0.417 | 0.837 |
| `t5-small-q3_k_m` | 0.375 | 0.814 |
| `t5-small-q2_k` | 0.287 | 0.660 |

Per-dataset generation metrics:

| dataset | model | exact match vs reference | chrF vs reference | agreement vs f32 | similarity vs f32 |
| ------- | ----- | -----------------------: | ----------------: | ---------------: | ----------------: |
| CoLA | `t5-small-f16` | 0.697 | 0.950 | 1.000 | 1.000 |
| CoLA | `t5-small-f32` | 0.697 | 0.950 | - | - |
| CoLA | `t5-small-q2_k` | 0.697 | 0.950 | 1.000 | 1.000 |
| CoLA | `t5-small-q3_k_m` | 0.697 | 0.949 | 1.000 | 1.000 |
| CoLA | `t5-small-q4_0` | 0.697 | 0.950 | 0.995 | 1.000 |
| CoLA | `t5-small-q4_k_m` | 0.698 | 0.950 | 0.999 | 1.000 |
| CoLA | `t5-small-q5_k_m` | 0.697 | 0.950 | 1.000 | 1.000 |
| CoLA | `t5-small-q8_0` | 0.697 | 0.950 | 1.000 | 1.000 |
| summarization | `t5-small-f16` | 0.000 | 0.133 | 0.979 | 0.995 |
| summarization | `t5-small-f32` | 0.000 | 0.133 | - | - |
| summarization | `t5-small-q2_k` | 0.000 | 0.068 | 0.000 | 0.254 |
| summarization | `t5-small-q3_k_m` | 0.000 | 0.123 | 0.039 | 0.510 |
| summarization | `t5-small-q4_0` | 0.000 | 0.123 | 0.071 | 0.550 |
| summarization | `t5-small-q4_k_m` | 0.000 | 0.131 | 0.137 | 0.642 |
| summarization | `t5-small-q5_k_m` | 0.000 | 0.128 | 0.210 | 0.689 |
| summarization | `t5-small-q8_0` | 0.000 | 0.133 | 0.541 | 0.852 |
| translation en-de | `t5-small-f16` | 0.020 | 0.361 | 0.989 | 0.999 |
| translation en-de | `t5-small-f32` | 0.020 | 0.361 | - | - |
| translation en-de | `t5-small-q2_k` | 0.015 | 0.315 | 0.090 | 0.738 |
| translation en-de | `t5-small-q3_k_m` | 0.018 | 0.353 | 0.234 | 0.876 |
| translation en-de | `t5-small-q4_0` | 0.019 | 0.357 | 0.304 | 0.905 |
| translation en-de | `t5-small-q4_k_m` | 0.019 | 0.359 | 0.380 | 0.920 |
| translation en-de | `t5-small-q5_k_m` | 0.019 | 0.359 | 0.448 | 0.935 |
| translation en-de | `t5-small-q8_0` | 0.019 | 0.360 | 0.680 | 0.970 |
| translation en-fr | `t5-small-f16` | 0.017 | 0.381 | 0.993 | 0.999 |
| translation en-fr | `t5-small-f32` | 0.017 | 0.381 | - | - |
| translation en-fr | `t5-small-q2_k` | 0.007 | 0.276 | 0.057 | 0.646 |
| translation en-fr | `t5-small-q3_k_m` | 0.015 | 0.368 | 0.226 | 0.868 |
| translation en-fr | `t5-small-q4_0` | 0.015 | 0.372 | 0.299 | 0.891 |
| translation en-fr | `t5-small-q4_k_m` | 0.017 | 0.377 | 0.380 | 0.919 |
| translation en-fr | `t5-small-q5_k_m` | 0.016 | 0.380 | 0.446 | 0.933 |
| translation en-fr | `t5-small-q8_0` | 0.016 | 0.380 | 0.672 | 0.967 |
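
The overall generation table appears consistent with an unweighted mean of the four per-dataset rows. That is an inference from the published numbers, not a documented property of the evaluation script. The sketch below recomputes the means from the values in the two tables and reports the largest gap; the dictionary layout and function names are illustrative.

```python
# (agreement, similarity) vs f32 per dataset, copied from the table above.
# Column order: CoLA, summarization, translation en-de, translation en-fr.
PER_DATASET = {
    "f16":    ([1.000, 0.979, 0.989, 0.993], [1.000, 0.995, 0.999, 0.999]),
    "q8_0":   ([1.000, 0.541, 0.680, 0.672], [1.000, 0.852, 0.970, 0.967]),
    "q5_k_m": ([1.000, 0.210, 0.448, 0.446], [1.000, 0.689, 0.935, 0.933]),
    "q4_k_m": ([0.999, 0.137, 0.380, 0.380], [1.000, 0.642, 0.920, 0.919]),
    "q4_0":   ([0.995, 0.071, 0.304, 0.299], [1.000, 0.550, 0.905, 0.891]),
    "q3_k_m": ([1.000, 0.039, 0.234, 0.226], [1.000, 0.510, 0.876, 0.868]),
    "q2_k":   ([1.000, 0.000, 0.090, 0.057], [1.000, 0.254, 0.738, 0.646]),
}

# (agreement, similarity) from the overall generation table.
OVERALL = {
    "f16": (0.990, 0.998),
    "q8_0": (0.723, 0.947),
    "q5_k_m": (0.526, 0.889),
    "q4_k_m": (0.474, 0.870),
    "q4_0": (0.417, 0.837),
    "q3_k_m": (0.375, 0.814),
    "q2_k": (0.287, 0.660),
}


def mean(values):
    return sum(values) / len(values)


def max_gap(per_dataset, overall):
    """Largest absolute gap between a recomputed mean and the reported value."""
    gaps = []
    for model, (agreement, similarity) in per_dataset.items():
        gaps.append(abs(mean(agreement) - overall[model][0]))
        gaps.append(abs(mean(similarity) - overall[model][1]))
    return max(gaps)


print(f"largest gap: {max_gap(PER_DATASET, OVERALL):.4f}")
```

Because the per-dataset inputs are themselves rounded to three decimals, a gap below 0.001 is as close as this check can get.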

Interpretation:

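- `f16` is effectively equivalent to `f32` for generated outputs (agreement
  0.990, similarity 0.998).
- `q8_0` preserves most behavior but still diverges on longer-form tasks:
  agreement falls to 0.541 on summarization and to about 0.68 on translation.
- `q5_k_m` and `q4_k_m` are usable middle points depending on size and quality
  target. Their outputs match f32 exactly about half the time (0.526 and
  0.474), with similarity of 0.889 and 0.870.
- `q2_k` degrades heavily for summarization and translation (agreement 0.000
  and below 0.10), while its CoLA outputs still match f32.
- Reference-based chrF moves far less than agreement vs f32. Only `q2_k` shows
  a large drop.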

## Perplexity And KL Results

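Perplexity is reported per dataset. KL/token and top-1 disagreement are the
main quantization drift metrics because they compare each quantized model
directly against the f32 token distributions. Perplexity alone can mislead
here: `q8_0` scores a lower summarization perplexity than f32 (133.17 vs
138.59) even though its KL/token is nonzero.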
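
As a rough illustration of the two drift metrics, the sketch below computes KL/token and top-1 disagreement from aligned logit vectors in plain Python. It rests on assumptions that this document does not confirm: that KL is taken as KL(f32 || quantized), and that both metrics are averaged over token positions. The function names and toy logits are made up for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def drift_metrics(ref_logits, quant_logits):
    """Return (KL/token, top-1 disagreement) over aligned positions.

    Assumption: KL is computed as KL(reference || quantized).
    """
    if len(ref_logits) != len(quant_logits) or not ref_logits:
        raise ValueError("need the same non-zero number of positions")

    kl_total = 0.0
    disagreements = 0
    for ref, quant in zip(ref_logits, quant_logits):
        log_p = log_softmax(ref)
        log_q = log_softmax(quant)
        # Sum p * (log p - log q) over the vocabulary for this position.
        kl_total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
        disagreements += argmax(ref) != argmax(quant)

    n = len(ref_logits)
    return kl_total / n, disagreements / n


# Toy logits over a 3-token vocabulary at two positions.
ref = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]]
quant = [[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]]
kl, top1 = drift_metrics(ref, quant)
print(f"KL/token={kl:.5f} top-1 disagree={top1:.2f}")
```

Working in log space avoids dividing by a probability that has underflowed to zero, which matters once the vocabulary is large.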

Token-weighted summary across all datasets:

| model | tokens | KL/token | top-1 disagree |
| ----- | -----: | -------: | -------------: |
| `t5-small-f16` | 308,028 | 0.00000 | 0.0005 |
| `t5-small-f32` | 308,028 | - | - |
| `t5-small-q8_0` | 308,028 | 0.00187 | 0.0160 |
| `t5-small-q5_k_m` | 308,028 | 0.01004 | 0.0386 |
| `t5-small-q4_k_m` | 308,028 | 0.02038 | 0.0521 |
| `t5-small-q4_0` | 308,028 | 0.04847 | 0.0704 |
| `t5-small-q3_k_m` | 308,028 | 0.05892 | 0.0897 |
| `t5-small-q2_k` | 308,028 | 0.27523 | 0.1914 |

Per-dataset perplexity:

| model | CoLA | summarization | translation en-de | translation en-fr |
| ----- | ---: | ------------: | ----------------: | ----------------: |
| `t5-small-f32` | 1.3490 | 138.5925 | 5.0317 | 3.8267 |
| `t5-small-f16` | 1.3491 | 138.6029 | 5.0317 | 3.8268 |
| `t5-small-q8_0` | 1.3494 | 133.1739 | 5.0314 | 3.8245 |
| `t5-small-q5_k_m` | 1.3498 | 139.2235 | 5.0748 | 3.8488 |
| `t5-small-q4_k_m` | 1.3535 | 155.2379 | 5.1135 | 3.8759 |
| `t5-small-q4_0` | 1.3593 | 215.7687 | 5.1394 | 3.9305 |
| `t5-small-q3_k_m` | 1.3490 | 153.6497 | 5.2163 | 3.9680 |
| `t5-small-q2_k` | 1.3577 | 262.6867 | 6.0281 | 4.4851 |
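
For reading the perplexity table, it helps to recall the usual definition: the exponential of the mean negative log-likelihood of the target tokens. The sketch below assumes that standard definition with natural logarithms; the run's own script may differ in details such as which tokens are counted.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities of the target tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that gives every target token probability 0.25
# is as uncertain as a uniform choice among four options.
toy = [math.log(0.25)] * 8
print(round(perplexity(toy), 4))
```

Under this definition, the f32 CoLA perplexity of 1.3490 corresponds to a geometric-mean target-token probability of roughly 0.74, while the summarization value of 138.59 corresponds to less than 0.01.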

Per-dataset KL/token:

| model | CoLA | summarization | translation en-de | translation en-fr |
| ----- | ---: | ------------: | ----------------: | ----------------: |
| `t5-small-f16` | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| `t5-small-q8_0` | 0.00029 | 0.00194 | 0.00191 | 0.00181 |
| `t5-small-q5_k_m` | 0.00544 | 0.01159 | 0.00923 | 0.00838 |
| `t5-small-q4_k_m` | 0.00811 | 0.02593 | 0.01732 | 0.01437 |
| `t5-small-q4_0` | 0.01239 | 0.07497 | 0.02886 | 0.02339 |
| `t5-small-q3_k_m` | 0.00539 | 0.07696 | 0.04827 | 0.04073 |
| `t5-small-q2_k` | 0.00350 | 0.36274 | 0.22476 | 0.18650 |
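
The token-weighted summary cannot be reproduced exactly from this document, because per-dataset token counts are not listed. What can be checked is a necessary condition: any weighted mean must lie between the smallest and largest per-dataset value. The sketch below shows the weighting formula on toy numbers and then applies that range check to the KL values from the two tables; the helper names are illustrative.

```python
def token_weighted_mean(values, token_counts):
    """Mean of per-dataset values, weighted by each dataset's token count."""
    total = sum(token_counts)
    if total == 0:
        raise ValueError("token counts must sum to a positive number")
    return sum(v * n for v, n in zip(values, token_counts)) / total


# KL/token from the token-weighted summary table.
SUMMARY_KL = {
    "f16": 0.00000,
    "q8_0": 0.00187,
    "q5_k_m": 0.01004,
    "q4_k_m": 0.02038,
    "q4_0": 0.04847,
    "q3_k_m": 0.05892,
    "q2_k": 0.27523,
}

# KL/token per dataset: CoLA, summarization, en-de, en-fr.
PER_DATASET_KL = {
    "f16":    [0.00000, 0.00000, 0.00000, 0.00000],
    "q8_0":   [0.00029, 0.00194, 0.00191, 0.00181],
    "q5_k_m": [0.00544, 0.01159, 0.00923, 0.00838],
    "q4_k_m": [0.00811, 0.02593, 0.01732, 0.01437],
    "q4_0":   [0.01239, 0.07497, 0.02886, 0.02339],
    "q3_k_m": [0.00539, 0.07696, 0.04827, 0.04073],
    "q2_k":   [0.00350, 0.36274, 0.22476, 0.18650],
}


def outside_range(summary, per_dataset, tol=5e-6):
    """Models whose summary value falls outside the per-dataset min..max."""
    return [
        model
        for model, value in summary.items()
        if not (min(per_dataset[model]) - tol <= value <= max(per_dataset[model]) + tol)
    ]


# Toy numbers, not from this run: 1 token at 1.0 and 3 tokens at 3.0.
print(token_weighted_mean([1.0, 3.0], [1, 3]))
print(outside_range(SUMMARY_KL, PER_DATASET_KL))
```

An empty list from `outside_range` means the summary is compatible with some token weighting of the per-dataset values. It does not confirm the specific weights used.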

Interpretation:

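- In the token-weighted summary, KL/token ranks the models as `f16`, `q8_0`,
  `q5_k_m`, `q4_k_m`, `q4_0`, `q3_k_m`, then `q2_k`. Summarization and both
  translation sets follow the same order. CoLA does not: its KL values are all
  small (at most 0.01239), and `q2_k` and `q3_k_m` score below `q5_k_m` there.
- `q8_0` has very small distributional drift from f32 (KL/token 0.00187,
  top-1 disagreement 0.0160).
- `q5_k_m` has the lowest drift among the quantizations below `q8_0`
  (KL/token 0.01004).
- `q4_k_m` is materially better than `q4_0` by KL/token (0.02038 vs 0.04847)
  and top-1 disagreement (0.0521 vs 0.0704).
- `q2_k` has high drift: KL/token between 0.19 and 0.36 on summarization and
  translation, and 0.1914 top-1 disagreement in the token-weighted summary.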
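
The ordering can also be checked per dataset from the KL/token table. The sketch below sorts the models by KL within each dataset and compares the result with the token-weighted order; the data is copied from the tables above and the helper name is illustrative. On these numbers, summarization and both translation sets reproduce the token-weighted order, while CoLA does not.

```python
DATASETS = ["CoLA", "summarization", "translation en-de", "translation en-fr"]

# KL/token per dataset, copied from the table above (same column order).
KL = {
    "f16":    [0.00000, 0.00000, 0.00000, 0.00000],
    "q8_0":   [0.00029, 0.00194, 0.00191, 0.00181],
    "q5_k_m": [0.00544, 0.01159, 0.00923, 0.00838],
    "q4_k_m": [0.00811, 0.02593, 0.01732, 0.01437],
    "q4_0":   [0.01239, 0.07497, 0.02886, 0.02339],
    "q3_k_m": [0.00539, 0.07696, 0.04827, 0.04073],
    "q2_k":   [0.00350, 0.36274, 0.22476, 0.18650],
}

# Order implied by the token-weighted summary table.
SUMMARY_ORDER = ["f16", "q8_0", "q5_k_m", "q4_k_m", "q4_0", "q3_k_m", "q2_k"]


def ranking(dataset):
    """Models sorted from lowest to highest KL/token on one dataset."""
    col = DATASETS.index(dataset)
    return sorted(KL, key=lambda model: KL[model][col])


for dataset in DATASETS:
    order = ranking(dataset)
    flag = "same" if order == SUMMARY_ORDER else "differs"
    print(f"{dataset}: {flag} -> {', '.join(order)}")
```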

## Recommended Default

For T5-small in this workflow:

- Use `t5-small-f32.gguf` as the reference baseline.
- Use `t5-small-q8_0.gguf` when preserving behavior matters most among the
  quantized models.
- Use `t5-small-q5_k_m.gguf` as the best compact default from this run.
- Use `t5-small-q4_k_m.gguf` only when file size matters more than quality.
- Avoid `t5-small-q2_k.gguf` for summarization or translation quality checks.
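
One way to turn these recommendations into a repeatable choice is to filter the models by a drift budget. The sketch below uses the token-weighted KL/token and top-1 disagreement values from this document; the helper name and the thresholds in the example call are hypothetical, chosen only to illustrate the idea.

```python
# (KL/token, top-1 disagreement) from the token-weighted summary table.
SUMMARY = {
    "f16": (0.00000, 0.0005),
    "q8_0": (0.00187, 0.0160),
    "q5_k_m": (0.01004, 0.0386),
    "q4_k_m": (0.02038, 0.0521),
    "q4_0": (0.04847, 0.0704),
    "q3_k_m": (0.05892, 0.0897),
    "q2_k": (0.27523, 0.1914),
}


def within_budget(max_kl, max_top1_disagree):
    """Models whose drift stays at or under both thresholds."""
    return [
        name
        for name, (kl, top1) in SUMMARY.items()
        if kl <= max_kl and top1 <= max_top1_disagree
    ]


# Hypothetical budget: at most 0.02 KL/token and 5% top-1 disagreement.
print(within_budget(max_kl=0.02, max_top1_disagree=0.05))
```

File sizes are not recorded in this document, so the final pick among the models that pass still needs a size comparison.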

Google T5-small license: Apache 2.0.
We follow and adopt the same license.