How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:
# Run inference directly in the terminal:
llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:
# Run inference directly in the terminal:
llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:
# Run inference directly in the terminal:
./llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:
# Run inference directly in the terminal:
./build/bin/llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:
Use Docker
docker model run hf.co/FiShota/sarashina2.2-3b-sft-v4-gguf:
Quick Links

HinoMoto-Sarashina2.2-3B-sft-v4 (GGUF, multi-quant)

QLoRA fine-tune of Sarashina2.2-3B-instruct with HinoMoto sft_v4 dataset: original 4-axis cultural data (family/keigo/silence/atmosphere) + 11% NLI replay buffer to prevent catastrophic forgetting on logical inference.

Why sft_v4?

In our previous sft_v3 release, we discovered that pure 4-axis SFT caused JNLI catastrophic forgetting: JNLI accuracy dropped from 67% (Sarashina base) to 52.5% (-14.5pt).

sft_v4 fixes this with a simple recipe:

  • 401 samples (sft_v2, 4-axis cultural)
  • + 51 NLI samples (entailment / contradiction / neutral, balanced 17 each from JNLI train set)
  • = 452 samples total (NLI 11% mix)
  • Same QLoRA recipe (100 steps, rank=16)

Result: JNLI 52.5% โ†’ 69.5% (+17pt; surpasses base 67.0%) while preserving all other axes.

๐Ÿ“Š Multi-task profile (5 public benches ร— 4 model)

Model JCQA JNLI Morality JEMHopQA JMMLU Avg
Sarashina2-3B base 90.22 67.00 78.00 43.10 58.82 67.4
HinoMoto-3B sft_v3 (no NLI) 89.42 52.50 โš  78.50 38.79 58.41 63.4
HinoMoto-3B sft_v4 โญ 89.24 69.50 76.50 41.38 59.08 67.1
llm-jp-3-1.8b-instruct 19.82 58.00 50.00 16.38 27.46 34.3

โ†’ sft_v4 = base level on 5 public benches + cultural axes preserved (+30pt vs base on family).

๐ŸŽฏ Quantization ร— task (3-seed ร— 5 quants for sft_v3)

We extensively benchmarked sft_v3 across all K-quants. The same lessons apply to sft_v4:

Variant BPW Size family /12 silence % JCQA % JMMLU %
Q3_K_M โญ silence-best 3.91 1.6 GB 8.00 ยฑ 0.16 42.0 ยฑ 3.5 87.09 (TBD)
Q4_K_M (default) 4.92 2.0 GB 8.25 ยฑ 0.22 29.3 ยฑ 3.1 โš  90.67 59.08
Q5_K_M 5.72 2.3 GB 8.28 ยฑ 0.10 38.7 ยฑ 5.0 91.03 (TBD)
Q6_K (highest BPW) 6.60 2.6 GB 8.10 ยฑ 0.17 31.3 ยฑ 4.2 91.48 (TBD)

Counter-intuitive: Q3_K_M (smallest, 1.6 GB) wins silence task. K-quant variants use different M-block groupings; silence-critical weights are preserved by Q3 but destroyed by Q4. See Q4_QUANT_VERIFICATION docs for full analysis.

๐ŸŽฏ Recommendation

Use case Recommended file Size Why
Edge / mobile (general) โญ Q3_K_M 1.6 GB smallest + best silence
Desktop / server Q5_K_M 2.3 GB safe middle
Maximum reasoning Q6_K 2.6 GB best on JCQA/JNLI
Maximum quality LoRA + base or fp16 6.7 GB reference

How to use (llama.cpp)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j

huggingface-cli download FiShota/sarashina2.2-3b-sft-v4-gguf \
  --include "sft_v4_3b_Q3_K_M.gguf" --local-dir .

./build/bin/llama-server -m sft_v4_3b_Q3_K_M.gguf -ngl 99 --port 8080 -t 4 -c 2048

Training

  • Base: Sarashina2.2-3B-instruct (SoftBank, MIT)
  • Method: QLoRA 4bit
  • Dataset: HinoMoto sft_v4 (452 samples = sft_v2 401 + 51 JNLI replay)
  • Steps: 100
  • LoRA: rank=16, alpha=32, target=q/k/v/o_proj
  • Quants: llama.cpp/llama-quantize (b1-f9f3365)

Files

File Size Notes
sft_v4_3b_Q3_K_M.gguf 1.6 GB โญ all-axis best (incl. silence)
sft_v4_3b_Q4_K_M.gguf 2.0 GB llama.cpp default; silence -15pt vs LoRA
sft_v4_3b_Q5_K_M.gguf 2.3 GB safe middle
sft_v4_3b_Q6_K.gguf 2.6 GB best on JCQA / JNLI

License

MIT (inheriting Sarashina2.2-3B-instruct)

Citation

Coming soon (HinoMoto arXiv WIP).

Related

Downloads last month
814
GGUF
Model size
3B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

5-bit

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for FiShota/sarashina2.2-3b-sft-v4-gguf