Instructions to use noumenalabs/t5-small-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use noumenalabs/t5-small-gguf with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="noumenalabs/t5-small-gguf",
	filename="t5-small-f16.gguf",
)

# T5 is a text-to-text model, so the prompt starts with a task prefix.
output = llm(
	"translate English to German: The house is wonderful.",
	max_tokens=64,
)
print(output)

Local Apps

llama.cpp

How to use noumenalabs/t5-small-gguf with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf noumenalabs/t5-small-gguf:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf noumenalabs/t5-small-gguf:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf noumenalabs/t5-small-gguf:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf noumenalabs/t5-small-gguf:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf noumenalabs/t5-small-gguf:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf noumenalabs/t5-small-gguf:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf noumenalabs/t5-small-gguf:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf noumenalabs/t5-small-gguf:Q4_K_M

Use Docker

docker model run hf.co/noumenalabs/t5-small-gguf:Q4_K_M

Ollama
How to use noumenalabs/t5-small-gguf with Ollama:
```
ollama run hf.co/noumenalabs/t5-small-gguf:Q4_K_M
```

Unsloth Studio

How to use noumenalabs/t5-small-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for noumenalabs/t5-small-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for noumenalabs/t5-small-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for noumenalabs/t5-small-gguf to start chatting

Docker Model Runner
How to use noumenalabs/t5-small-gguf with Docker Model Runner:
```
docker model run hf.co/noumenalabs/t5-small-gguf:Q4_K_M
```

Lemonade

How to use noumenalabs/t5-small-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull noumenalabs/t5-small-gguf:Q4_K_M

Run and chat with the model

lemonade run user.t5-small-gguf-Q4_K_M

List all available models

lemonade list

t5-small-gguf / README.md

constannnnt

Update README.md

222e769 verified about 1 month ago


	# T5 GGUF Analysis

	This document records the T5-small GGUF evaluation run.

	## Environment

	Verified runtime:

	\| item \| value \|
	\| ---- \| ----- \|
	\| Python \| `3.11.12` \|
	\| Torch \| `2.9.0+cu129` \|
	\| Torch CUDA \| `12.9` \|
	\| CUDA available \| `True` \|
	\| GPU \| `NVIDIA GeForce RTX 3070 Laptop GPU` \|

	## Models

	The run evaluated these GGUFs:

	\| model \| role \|
	\| ----- \| ---- \|
	\| `t5-small-f32.gguf` \| unquantized reference baseline \|
	\| `t5-small-f16.gguf` \| high-precision comparison and quantization source \|
	\| `t5-small-q8_0.gguf` \| quantized \|
	\| `t5-small-q5_k_m.gguf` \| quantized \|
	\| `t5-small-q4_k_m.gguf` \| quantized \|
	\| `t5-small-q4_0.gguf` \| quantized \|
	\| `t5-small-q3_k_m.gguf` \| quantized \|
	\| `t5-small-q2_k.gguf` \| quantized \|

	## Conversion Check Results

	The conversion check compares greedy HF outputs against greedy f32 GGUF outputs.
	It validates that the unquantized GGUF is a usable reference before comparing
	quantized models against it.

	\| dataset \| examples \| exact match \| chrF \| first token match \|
	\| ------- \| -------: \| ----------: \| ---: \| ----------------: \|
	\| CoLA \| 2,000 \| 1.000 \| 1.000 \| 1.000 \|
	\| summarization \| 2,000 \| 0.117 \| 0.953 \| 0.990 \|
	\| translation en-de \| 2,000 \| 0.993 \| 0.996 \| 1.000 \|
	\| translation en-fr \| 2,000 \| 0.986 \| 0.995 \| 1.000 \|
	\| overall \| 8,000 \| 0.774 \| 0.986 \| 0.997 \|

	Interpretation:

	- The f32 GGUF tracks HF closely overall (chrF 0.986, first token match 0.997).
	- Summarization has low exact match (0.117) but high chrF (0.953), which points
	to small wording differences rather than broad conversion drift.
	- Translation and CoLA match HF almost exactly at the output level (exact match
	of 0.986 or higher).
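The two simpler columns of the conversion check can be sketched as follows. This is a minimal illustration, not the script used for the run: the function names are hypothetical, and it assumes "first token" means the first whitespace-delimited token, whereas the actual run may have compared tokenizer IDs.

```python
def exact_match(hf_outputs, gguf_outputs):
    """Fraction of examples whose two outputs are identical strings."""
    assert len(hf_outputs) == len(gguf_outputs)
    pairs = list(zip(hf_outputs, gguf_outputs))
    return sum(a == b for a, b in pairs) / len(pairs)


def first_token_match(hf_outputs, gguf_outputs):
    """Fraction of examples whose first whitespace-delimited token agrees."""
    def first(text):
        parts = text.split()
        return parts[0] if parts else ""

    assert len(hf_outputs) == len(gguf_outputs)
    pairs = list(zip(hf_outputs, gguf_outputs))
    return sum(first(a) == first(b) for a, b in pairs) / len(pairs)


# Toy strings for illustration only; they are not taken from the evaluation run.
hf = ["acceptable", "Das Haus ist wunderbar.", "the cat sat on the mat"]
gguf = ["acceptable", "Das Haus ist wunderbar.", "the cat sat on a mat"]

print(exact_match(hf, gguf))
print(first_token_match(hf, gguf))
```

The third toy pair differs by one late word, so it counts against exact match but not against first token match. That is the same pattern the summarization row shows.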

	## Generation Results

	Generation used greedy decoding with `n_predict=64`. Agreement and similarity
	are measured against the f32 GGUF baseline output.

	\| model \| agreement vs f32 \| similarity vs f32 \|
	\| ----- \| ---------------: \| ----------------: \|
	\| `t5-small-f16` \| 0.990 \| 0.998 \|
	\| `t5-small-q8_0` \| 0.723 \| 0.947 \|
	\| `t5-small-q5_k_m` \| 0.526 \| 0.889 \|
	\| `t5-small-q4_k_m` \| 0.474 \| 0.870 \|
	\| `t5-small-q4_0` \| 0.417 \| 0.837 \|
	\| `t5-small-q3_k_m` \| 0.375 \| 0.814 \|
	\| `t5-small-q2_k` \| 0.287 \| 0.660 \|

	Per-dataset generation metrics:

	\| dataset \| model \| exact match vs reference \| chrF vs reference \| agreement vs f32 \| similarity vs f32 \|
	\| ------- \| ----- \| -----------------------: \| ----------------: \| ---------------: \| ----------------: \|
	\| CoLA \| `t5-small-f16` \| 0.697 \| 0.950 \| 1.000 \| 1.000 \|
	\| CoLA \| `t5-small-f32` \| 0.697 \| 0.950 \| - \| - \|
	\| CoLA \| `t5-small-q2_k` \| 0.697 \| 0.950 \| 1.000 \| 1.000 \|
	\| CoLA \| `t5-small-q3_k_m` \| 0.697 \| 0.949 \| 1.000 \| 1.000 \|
	\| CoLA \| `t5-small-q4_0` \| 0.697 \| 0.950 \| 0.995 \| 1.000 \|
	\| CoLA \| `t5-small-q4_k_m` \| 0.698 \| 0.950 \| 0.999 \| 1.000 \|
	\| CoLA \| `t5-small-q5_k_m` \| 0.697 \| 0.950 \| 1.000 \| 1.000 \|
	\| CoLA \| `t5-small-q8_0` \| 0.697 \| 0.950 \| 1.000 \| 1.000 \|
	\| summarization \| `t5-small-f16` \| 0.000 \| 0.133 \| 0.979 \| 0.995 \|
	\| summarization \| `t5-small-f32` \| 0.000 \| 0.133 \| - \| - \|
	\| summarization \| `t5-small-q2_k` \| 0.000 \| 0.068 \| 0.000 \| 0.254 \|
	\| summarization \| `t5-small-q3_k_m` \| 0.000 \| 0.123 \| 0.039 \| 0.510 \|
	\| summarization \| `t5-small-q4_0` \| 0.000 \| 0.123 \| 0.071 \| 0.550 \|
	\| summarization \| `t5-small-q4_k_m` \| 0.000 \| 0.131 \| 0.137 \| 0.642 \|
	\| summarization \| `t5-small-q5_k_m` \| 0.000 \| 0.128 \| 0.210 \| 0.689 \|
	\| summarization \| `t5-small-q8_0` \| 0.000 \| 0.133 \| 0.541 \| 0.852 \|
	\| translation en-de \| `t5-small-f16` \| 0.020 \| 0.361 \| 0.989 \| 0.999 \|
	\| translation en-de \| `t5-small-f32` \| 0.020 \| 0.361 \| - \| - \|
	\| translation en-de \| `t5-small-q2_k` \| 0.015 \| 0.315 \| 0.090 \| 0.738 \|
	\| translation en-de \| `t5-small-q3_k_m` \| 0.018 \| 0.353 \| 0.234 \| 0.876 \|
	\| translation en-de \| `t5-small-q4_0` \| 0.019 \| 0.357 \| 0.304 \| 0.905 \|
	\| translation en-de \| `t5-small-q4_k_m` \| 0.019 \| 0.359 \| 0.380 \| 0.920 \|
	\| translation en-de \| `t5-small-q5_k_m` \| 0.019 \| 0.359 \| 0.448 \| 0.935 \|
	\| translation en-de \| `t5-small-q8_0` \| 0.019 \| 0.360 \| 0.680 \| 0.970 \|
	\| translation en-fr \| `t5-small-f16` \| 0.017 \| 0.381 \| 0.993 \| 0.999 \|
	\| translation en-fr \| `t5-small-f32` \| 0.017 \| 0.381 \| - \| - \|
	\| translation en-fr \| `t5-small-q2_k` \| 0.007 \| 0.276 \| 0.057 \| 0.646 \|
	\| translation en-fr \| `t5-small-q3_k_m` \| 0.015 \| 0.368 \| 0.226 \| 0.868 \|
	\| translation en-fr \| `t5-small-q4_0` \| 0.015 \| 0.372 \| 0.299 \| 0.891 \|
	\| translation en-fr \| `t5-small-q4_k_m` \| 0.017 \| 0.377 \| 0.380 \| 0.919 \|
	\| translation en-fr \| `t5-small-q5_k_m` \| 0.016 \| 0.380 \| 0.446 \| 0.933 \|
	\| translation en-fr \| `t5-small-q8_0` \| 0.016 \| 0.380 \| 0.672 \| 0.967 \|
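The overall agreement figures appear to be unweighted means of the four per-dataset values. The sketch below checks that reading; equal weighting is an assumption (it is consistent with each dataset having 2,000 examples in the conversion check), and the helper name `macro_average` is hypothetical.

```python
# Per-dataset "agreement vs f32" copied from the per-dataset table, in the order
# CoLA, summarization, translation en-de, translation en-fr.
per_dataset = {
    "t5-small-f16": [1.000, 0.979, 0.989, 0.993],
    "t5-small-q8_0": [1.000, 0.541, 0.680, 0.672],
    "t5-small-q5_k_m": [1.000, 0.210, 0.448, 0.446],
    "t5-small-q4_k_m": [0.999, 0.137, 0.380, 0.380],
    "t5-small-q4_0": [0.995, 0.071, 0.304, 0.299],
    "t5-small-q3_k_m": [1.000, 0.039, 0.234, 0.226],
    "t5-small-q2_k": [1.000, 0.000, 0.090, 0.057],
}

# Overall "agreement vs f32" copied from the summary table.
reported = {
    "t5-small-f16": 0.990,
    "t5-small-q8_0": 0.723,
    "t5-small-q5_k_m": 0.526,
    "t5-small-q4_k_m": 0.474,
    "t5-small-q4_0": 0.417,
    "t5-small-q3_k_m": 0.375,
    "t5-small-q2_k": 0.287,
}


def macro_average(values):
    """Unweighted mean across datasets."""
    return sum(values) / len(values)


for model, values in per_dataset.items():
    mean = macro_average(values)
    # Allow for the three-decimal rounding used in the tables.
    matches = abs(mean - reported[model]) < 1e-3
    print(f"{model}: mean={mean:.4f} reported={reported[model]:.3f} match={matches}")
```

Because CoLA agreement is at or near 1.000 for every model, it lifts each overall figure; the per-dataset rows are the better guide for summarization or translation workloads.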

	Interpretation:

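	- `f16` is effectively equivalent to `f32` for generated outputs (agreement
	0.990, similarity 0.998).
	- `q8_0` preserves most behavior (similarity 0.947) but still diverges on
	longer-form tasks: agreement falls to 0.541 on summarization.
	- `q5_k_m` and `q4_k_m` are usable middle points (agreement 0.526 and 0.474),
	depending on the size and quality target.
	- `q2_k` degrades heavily for summarization and translation (agreement of 0.090
	or lower).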

	## Perplexity And KL Results

	Perplexity is reported per dataset. KL/token and top-1 disagreement are the
	main quantization drift metrics because they compare each quantized model
	directly against f32 token distributions.
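As a rough sketch of how these two drift metrics can be computed from per-token logits: the code below assumes KL is taken as KL(f32 || quantized) in nats and averaged over token positions, which the document does not state explicitly. The function names are hypothetical and the logits are toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def drift_metrics(ref_logits, quant_logits):
    """Return (mean KL per token, top-1 disagreement rate).

    Each argument holds one logit vector per token position.
    """
    assert len(ref_logits) == len(quant_logits)
    kl_total = 0.0
    disagreements = 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        kl_total += kl_divergence(p, q)
        if p.index(max(p)) != q.index(max(q)):
            disagreements += 1
    n = len(ref_logits)
    return kl_total / n, disagreements / n


# Toy logits: position 0 is identical, position 1 swaps the top two entries.
ref = [[2.0, 1.0, 0.0], [0.5, 0.4, 0.0]]
quant = [[2.0, 1.0, 0.0], [0.4, 0.5, 0.0]]

kl, top1 = drift_metrics(ref, quant)
print(f"KL/token={kl:.5f} top-1 disagree={top1:.2f}")
```

A small logit change can flip the argmax while adding very little KL, which is why the two metrics are reported side by side.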

	Token-weighted summary across all datasets:

	\| model \| tokens \| KL/token \| top-1 disagree \|
	\| ----- \| -----: \| -------: \| -------------: \|
	\| `t5-small-f16` \| 308,028 \| 0.00000 \| 0.0005 \|
	\| `t5-small-f32` \| 308,028 \| - \| - \|
	\| `t5-small-q8_0` \| 308,028 \| 0.00187 \| 0.0160 \|
	\| `t5-small-q5_k_m` \| 308,028 \| 0.01004 \| 0.0386 \|
	\| `t5-small-q4_k_m` \| 308,028 \| 0.02038 \| 0.0521 \|
	\| `t5-small-q4_0` \| 308,028 \| 0.04847 \| 0.0704 \|
	\| `t5-small-q3_k_m` \| 308,028 \| 0.05892 \| 0.0897 \|
	\| `t5-small-q2_k` \| 308,028 \| 0.27523 \| 0.1914 \|

	Per-dataset perplexity:

	\| model \| CoLA \| summarization \| translation en-de \| translation en-fr \|
	\| ----- \| ---: \| ------------: \| ----------------: \| ----------------: \|
	\| `t5-small-f32` \| 1.3490 \| 138.5925 \| 5.0317 \| 3.8267 \|
	\| `t5-small-f16` \| 1.3491 \| 138.6029 \| 5.0317 \| 3.8268 \|
	\| `t5-small-q8_0` \| 1.3494 \| 133.1739 \| 5.0314 \| 3.8245 \|
	\| `t5-small-q5_k_m` \| 1.3498 \| 139.2235 \| 5.0748 \| 3.8488 \|
	\| `t5-small-q4_k_m` \| 1.3535 \| 155.2379 \| 5.1135 \| 3.8759 \|
	\| `t5-small-q4_0` \| 1.3593 \| 215.7687 \| 5.1394 \| 3.9305 \|
	\| `t5-small-q3_k_m` \| 1.3490 \| 153.6497 \| 5.2163 \| 3.9680 \|
	\| `t5-small-q2_k` \| 1.3577 \| 262.6867 \| 6.0281 \| 4.4851 \|
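For reference, perplexity in its usual form is the exponential of the mean negative log-likelihood over target tokens. The sketch below assumes natural-log token probabilities and uses a hypothetical function name; the ratios are computed from values in the perplexity table.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over target tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy check: a model that assigns probability 0.5 to every target token
# has a perplexity of approximately 2.
print(perplexity([math.log(0.5)] * 4))

# Perplexity relative to f32, using values copied from the table.
f32 = {"summarization": 138.5925, "translation en-de": 5.0317}
q4_0 = {"summarization": 215.7687, "translation en-de": 5.1394}
for name in f32:
    print(name, round(q4_0[name] / f32[name], 3))
```

Perplexity is scored against the dataset targets rather than against f32, so it need not rise monotonically with quantization: in the table, `q8_0` is lower than `f32` on summarization.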

	Per-dataset KL/token:

	\| model \| CoLA \| summarization \| translation en-de \| translation en-fr \|
	\| ----- \| ---: \| ------------: \| ----------------: \| ----------------: \|
	\| `t5-small-f16` \| 0.00000 \| 0.00000 \| 0.00000 \| 0.00000 \|
	\| `t5-small-q8_0` \| 0.00029 \| 0.00194 \| 0.00191 \| 0.00181 \|
	\| `t5-small-q5_k_m` \| 0.00544 \| 0.01159 \| 0.00923 \| 0.00838 \|
	\| `t5-small-q4_k_m` \| 0.00811 \| 0.02593 \| 0.01732 \| 0.01437 \|
	\| `t5-small-q4_0` \| 0.01239 \| 0.07497 \| 0.02886 \| 0.02339 \|
	\| `t5-small-q3_k_m` \| 0.00539 \| 0.07696 \| 0.04827 \| 0.04073 \|
	\| `t5-small-q2_k` \| 0.00350 \| 0.36274 \| 0.22476 \| 0.18650 \|

	Interpretation:

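	- The token-weighted KL ranking is clear: `f16`, `q8_0`, `q5_k_m`, `q4_k_m`,
	`q4_0`, `q3_k_m`, then `q2_k`. It holds on summarization and both translation
	sets; CoLA is the exception, where `q2_k` and `q3_k_m` show lower KL than
	`q5_k_m`.
	- `q8_0` has very small distributional drift from f32 (0.00187 KL/token, 0.0160
	top-1 disagreement).
	- `q5_k_m` is the strongest compact quantization in this run.
	- `q4_k_m` is materially better than `q4_0` by KL/token (0.02038 vs 0.04847) and
	top-1 disagreement (0.0521 vs 0.0704).
	- `q2_k` has the highest drift (0.27523 KL/token, 0.1914 top-1 disagreement),
	with the KL concentrated in summarization and translation.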

	## Recommended Default

	For T5-small in this workflow:

	- Use `t5-small-f32.gguf` as the reference baseline.
	- Use `t5-small-q8_0.gguf` when preserving behavior matters most.
	- Use `t5-small-q5_k_m.gguf` as the best compact default from this run.
	- Use `t5-small-q4_k_m.gguf` only when size pressure is stronger than quality.
	- Avoid `t5-small-q2_k.gguf` for summarization or translation quality checks.
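One way to turn these recommendations into a repeatable choice is to pick a file from a KL/token budget. This is only a sketch: `pick_quant`, `KL_PER_TOKEN`, and `max_kl` are hypothetical names, and it assumes that later entries in the KL ranking are more aggressively quantized (file sizes are not reported in this document).

```python
# Token-weighted KL/token from the summary table, in the KL ranking order.
KL_PER_TOKEN = [
    ("t5-small-f16.gguf", 0.00000),
    ("t5-small-q8_0.gguf", 0.00187),
    ("t5-small-q5_k_m.gguf", 0.01004),
    ("t5-small-q4_k_m.gguf", 0.02038),
    ("t5-small-q4_0.gguf", 0.04847),
    ("t5-small-q3_k_m.gguf", 0.05892),
    ("t5-small-q2_k.gguf", 0.27523),
]


def pick_quant(max_kl):
    """Return the last file in the ranking whose KL/token is within max_kl.

    Assumes later entries are more aggressively quantized. Falls back to the
    f32 reference when no entry fits the budget.
    """
    chosen = "t5-small-f32.gguf"
    for filename, kl in KL_PER_TOKEN:
        if kl <= max_kl:
            chosen = filename
    return chosen


print(pick_quant(0.002))   # budget just above q8_0's KL/token
print(pick_quant(0.0105))  # budget just above q5_k_m's KL/token
```

A token-weighted budget hides per-dataset differences, so a summarization-heavy workload may warrant a tighter budget than the summary table alone suggests.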

	Google T5-small license: Apache 2.0.
	We follow and adopt their license.
