HinoMoto-Sarashina2.2-3B-sft-v3 (GGUF, multi-quant)

QLoRA fine-tune of Sarashina2.2-3B-instruct on HinoMoto sft_v2 dataset (4 axes: family / keigo / silence / atmosphere).

📊 Public Benchmark Profile (4 tasks, 4-shot)

Model JCQA (1115q) JNLI (200q) Morality (200q) JEMHopQA (116q) Avg
Sarashina2-3B base (HF) 90.22% 67.00% 78.00% 43.10% 69.6
HinoMoto-3B sft_v3 merged (HF) 89.42% 52.50% ⚠ 78.50% 38.79% 64.8
Q6_K GGUF 91.48% 45.50% 83.50% 43.97% 66.1
Q5_K_M GGUF 91.03% 44.00% 83.50% 48.28% 66.7
Q4_K_M GGUF 90.67% 43.50% 85.00% 48.28% 66.9
Q3_K_M GGUF 87.09% 44.00% 81.50% 40.52% 63.3
llm-jp-3-1.8b-instruct 19.82% 58.00% 50.00% 16.38% 36.1

Honest claims:

  • ✅ JCQA で base 同等保持 (89-91%)
  • ✅ Morality / JEMHopQA で llm-jp-1.8b を +28pt 圧倒 (平均)
  • ⚠️ JNLI で -14.5pt 退行 — SFT による catastrophic forgetting 兆候 (next sft_v4 で curriculum 改善)
  • 🤯 量子化 task-dependent: silence→Q3, JCQA/JNLI→Q6, Morality/JEMHop→Q4_K_M (万能 quant なし)

⚠️ llm-jp-3-1.8b 数値は同 prompt format 下のもの. 公式 llm-jp-eval framework 再評価 TODO.

🎯 HinoMoto-Bench-ja (4 軸独自) — 5 quants × 3 axes × 3 seeds

Variant BPW Size family /12 keigo % silence % Best for
LoRA bf16 (n=1) 16.00 adapter 8.34 28.6 46.0 reference
Merge fp16 (n=1) 16.00 6.7 GB 8.29 37.1 44.0 reference
Q6_K (3s) 6.60 2.6 GB 8.10 ± 0.17 33.3 ± 5.8 31.3 ± 4.2 (skip — Q3 better)
Q5_K_M (3s) 5.72 2.3 GB 8.28 ± 0.10 32.4 ± 4.4 38.7 ± 5.0 general mix
Q4_K_M (3s) 4.92 2.0 GB 8.25 ± 0.22 31.9 ± 4.1 29.3 ± 3.1 family / keigo (default)
Q3_K_M (3s) ⭐ 3.91 1.6 GB 8.00 ± 0.16 32.9 ± 3.8 42.0 ± 3.5 all-axis best

🤯 Counter-intuitive finding

Q3_K_M (lowest bit, smallest size) is BEST overall:

  • family: tied with Q4/Q5 (within sampling noise, std 0.10-0.22)
  • keigo: tied with Q4/Q5 (within sampling noise, std 3.8-5.8)
  • silence: Q3 (42.0) > Q5 (38.7) > Q6 (31.3) ≈ Q4 (29.3) — non-monotonic!

The "more bits = better quality" rule is decisively false for this model. K-quant variants use different M-block groupings; silence-critical weights happen to be preserved by Q3_K_M's structure but destroyed by Q4_K_M's.

See docs/Q4_QUANT_VERIFICATION.md and Note article #8 for full analysis.

🎯 Recommendation

Use case Recommended file Size Why
Edge / mobile (general) Q3_K_M 1.6 GB smallest + best silence, family/keigo tied
Desktop / server Q5_K_M 2.3 GB safe middle (silence 39%, family/keigo same)
Maximum quality LoRA + base or fp16 GGUF 6.3-6.7 GB reference (silence 44-46%)

Note: Q4_K_M is the llama.cpp default but performs worse on silence than the smaller Q3_K_M for our model. Don't blindly trust "Q4_K_M is safe".

How to use (llama.cpp)

# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j

# Recommended: Q3_K_M for best all-axis quality
huggingface-cli download FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf \
  --include "sft_v3_3b_Q3_K_M.gguf" --local-dir .

# Or via llama-server (HTTP API, fast batch)
./build/bin/llama-server -m sft_v3_3b_Q3_K_M.gguf -ngl 99 --port 8080 -t 4 -c 2048
# Quick generation
./build/bin/llama-cli -m sft_v3_3b_Q3_K_M.gguf -p "今日は" -n 100 -ngl 99

Training

  • Base: Sarashina2.2-3B-instruct (SoftBank, MIT)
  • Method: QLoRA 4bit
  • Dataset: HinoMoto sft_v2 (401 samples, 4 axes)
  • Steps: 100 (early-stop optimal — 200 steps over-trains silence)
  • LoRA rank: 16, alpha: 32, target: q_proj/k_proj/v_proj/o_proj
  • All quants: llama.cpp/llama-quantize

Files

File Size Compression vs fp16 Notes
sft_v3_3b_Q3_K_M.gguf 1.6 GB 3.94x ⭐ all-axis best, smallest
sft_v3_3b_Q4_K_M.gguf 2.0 GB 3.15x llama.cpp default; silence -15pt
sft_v3_3b_Q5_K_M.gguf 2.3 GB 2.74x safe middle
sft_v3_3b_Q6_K.gguf 2.6 GB 2.42x (Q5 usually better)

Inference performance (RTX 3090, llama-server)

Quant tokens/sec
Q3_K_M ~110 t/s
Q4_K_M ~102 t/s
Q5_K_M ~95 t/s
Q6_K ~88 t/s

License

MIT (inheriting Sarashina2.2-3B-instruct)

Citation

Coming soon (HinoMoto arXiv WIP). See paper_drafts/hinomoto_arxiv_outline.md §4.4.

Related

Downloads last month
32
GGUF
Model size
3B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

5-bit

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf