Instructions to use FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf", filename="sft_v3_3b_Q3_K_M.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M # Run inference directly in the terminal: llama-cli -hf FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M # Run inference directly in the terminal: llama-cli -hf FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
Use Docker
docker model run hf.co/FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf with Ollama:
ollama run hf.co/FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
- Unsloth Studio
How to use FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf to start chatting
- Pi
How to use FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf with Docker Model Runner:
docker model run hf.co/FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
- Lemonade
How to use FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf:Q4_K_M
Run and chat with the model
lemonade run user.sarashina2.2-3b-sft-v3-Q4_K_M-gguf-Q4_K_M
List all available models
lemonade list
HinoMoto-Sarashina2.2-3B-sft-v3 (GGUF, multi-quant)
QLoRA fine-tune of Sarashina2.2-3B-instruct on HinoMoto sft_v2 dataset (4 axes: family / keigo / silence / atmosphere).
📊 Public Benchmark Profile (4 tasks, 4-shot)
| Model | JCQA (1115q) | JNLI (200q) | Morality (200q) | JEMHopQA (116q) | Avg |
|---|---|---|---|---|---|
| Sarashina2-3B base (HF) | 90.22% | 67.00% | 78.00% | 43.10% | 69.6 |
| HinoMoto-3B sft_v3 merged (HF) | 89.42% | 52.50% ⚠ | 78.50% | 38.79% | 64.8 |
| Q6_K GGUF | 91.48% | 45.50% | 83.50% | 43.97% | 66.1 |
| Q5_K_M GGUF | 91.03% | 44.00% | 83.50% | 48.28% | 66.7 |
| Q4_K_M GGUF | 90.67% | 43.50% | 85.00% | 48.28% | 66.9 |
| Q3_K_M GGUF | 87.09% | 44.00% | 81.50% | 40.52% | 63.3 |
| llm-jp-3-1.8b-instruct | 19.82% | 58.00% | 50.00% | 16.38% | 36.1 |
Honest claims:
- ✅ JCQA で base 同等保持 (89-91%)
- ✅ Morality / JEMHopQA で llm-jp-1.8b を +28pt 圧倒 (平均)
- ⚠️ JNLI で -14.5pt 退行 — SFT による catastrophic forgetting 兆候 (next sft_v4 で curriculum 改善)
- 🤯 量子化 task-dependent: silence→Q3, JCQA/JNLI→Q6, Morality/JEMHop→Q4_K_M (万能 quant なし)
⚠️ llm-jp-3-1.8b 数値は同 prompt format 下のもの. 公式 llm-jp-eval framework 再評価 TODO.
🎯 HinoMoto-Bench-ja (4 軸独自) — 5 quants × 3 axes × 3 seeds
| Variant | BPW | Size | family /12 | keigo % | silence % | Best for |
|---|---|---|---|---|---|---|
| LoRA bf16 (n=1) | 16.00 | adapter | 8.34 | 28.6 | 46.0 | reference |
| Merge fp16 (n=1) | 16.00 | 6.7 GB | 8.29 | 37.1 | 44.0 | reference |
| Q6_K (3s) | 6.60 | 2.6 GB | 8.10 ± 0.17 | 33.3 ± 5.8 | 31.3 ± 4.2 | (skip — Q3 better) |
| Q5_K_M (3s) | 5.72 | 2.3 GB | 8.28 ± 0.10 | 32.4 ± 4.4 | 38.7 ± 5.0 | general mix |
| Q4_K_M (3s) | 4.92 | 2.0 GB | 8.25 ± 0.22 | 31.9 ± 4.1 | 29.3 ± 3.1 | family / keigo (default) |
| Q3_K_M (3s) ⭐ | 3.91 | 1.6 GB | 8.00 ± 0.16 | 32.9 ± 3.8 | 42.0 ± 3.5 | all-axis best ⭐ |
🤯 Counter-intuitive finding
Q3_K_M (lowest bit, smallest size) is BEST overall:
- family: tied with Q4/Q5 (within sampling noise, std 0.10-0.22)
- keigo: tied with Q4/Q5 (within sampling noise, std 3.8-5.8)
- silence: Q3 (42.0) > Q5 (38.7) > Q6 (31.3) ≈ Q4 (29.3) — non-monotonic!
The "more bits = better quality" rule is decisively false for this model. K-quant variants use different M-block groupings; silence-critical weights happen to be preserved by Q3_K_M's structure but destroyed by Q4_K_M's.
See docs/Q4_QUANT_VERIFICATION.md and Note article #8 for full analysis.
🎯 Recommendation
| Use case | Recommended file | Size | Why |
|---|---|---|---|
| Edge / mobile (general) | Q3_K_M ⭐ |
1.6 GB | smallest + best silence, family/keigo tied |
| Desktop / server | Q5_K_M |
2.3 GB | safe middle (silence 39%, family/keigo same) |
| Maximum quality | LoRA + base or fp16 GGUF | 6.3-6.7 GB | reference (silence 44-46%) |
Note: Q4_K_M is the llama.cpp default but performs worse on silence than the smaller Q3_K_M for our model. Don't blindly trust "Q4_K_M is safe".
How to use (llama.cpp)
# Build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j
# Recommended: Q3_K_M for best all-axis quality
huggingface-cli download FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf \
--include "sft_v3_3b_Q3_K_M.gguf" --local-dir .
# Or via llama-server (HTTP API, fast batch)
./build/bin/llama-server -m sft_v3_3b_Q3_K_M.gguf -ngl 99 --port 8080 -t 4 -c 2048
# Quick generation
./build/bin/llama-cli -m sft_v3_3b_Q3_K_M.gguf -p "今日は" -n 100 -ngl 99
Training
- Base: Sarashina2.2-3B-instruct (SoftBank, MIT)
- Method: QLoRA 4bit
- Dataset: HinoMoto sft_v2 (401 samples, 4 axes)
- Steps: 100 (early-stop optimal — 200 steps over-trains silence)
- LoRA rank: 16, alpha: 32, target: q_proj/k_proj/v_proj/o_proj
- All quants: llama.cpp/llama-quantize
Files
| File | Size | Compression vs fp16 | Notes |
|---|---|---|---|
sft_v3_3b_Q3_K_M.gguf |
1.6 GB | 3.94x | ⭐ all-axis best, smallest |
sft_v3_3b_Q4_K_M.gguf |
2.0 GB | 3.15x | llama.cpp default; silence -15pt |
sft_v3_3b_Q5_K_M.gguf |
2.3 GB | 2.74x | safe middle |
sft_v3_3b_Q6_K.gguf |
2.6 GB | 2.42x | (Q5 usually better) |
Inference performance (RTX 3090, llama-server)
| Quant | tokens/sec |
|---|---|
| Q3_K_M | ~110 t/s |
| Q4_K_M | ~102 t/s |
| Q5_K_M | ~95 t/s |
| Q6_K | ~88 t/s |
License
MIT (inheriting Sarashina2.2-3B-instruct)
Citation
Coming soon (HinoMoto arXiv WIP). See paper_drafts/hinomoto_arxiv_outline.md §4.4.
Related
- 100M language model: FiShota/hinomoto-100m-v15-wsd-zloss-ema
- 100M sister model (different seed): FiShota/hinomoto-100m-v12-wsd-zloss-seed2
- HinoMoto-Bench-ja repo: (TBD)
- Downloads last month
- 32
3-bit
4-bit
5-bit
6-bit
Model tree for FiShota/sarashina2.2-3b-sft-v3-Q4_K_M-gguf
Base model
sbintuitions/sarashina2.2-3b