Instructions to use FiShota/sarashina2.2-3b-sft-v4-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use FiShota/sarashina2.2-3b-sft-v4-gguf with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="FiShota/sarashina2.2-3b-sft-v4-gguf", filename="sft_v4_3b_Q3_K_M.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use FiShota/sarashina2.2-3b-sft-v4-gguf with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M # Run inference directly in the terminal: llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M # Run inference directly in the terminal: llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
Use Docker
docker model run hf.co/FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use FiShota/sarashina2.2-3b-sft-v4-gguf with Ollama:
ollama run hf.co/FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
- Unsloth Studio
How to use FiShota/sarashina2.2-3b-sft-v4-gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for FiShota/sarashina2.2-3b-sft-v4-gguf to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for FiShota/sarashina2.2-3b-sft-v4-gguf to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for FiShota/sarashina2.2-3b-sft-v4-gguf to start chatting
- Pi
How to use FiShota/sarashina2.2-3b-sft-v4-gguf with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use FiShota/sarashina2.2-3b-sft-v4-gguf with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use FiShota/sarashina2.2-3b-sft-v4-gguf with Docker Model Runner:
docker model run hf.co/FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
- Lemonade
How to use FiShota/sarashina2.2-3b-sft-v4-gguf with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull FiShota/sarashina2.2-3b-sft-v4-gguf:Q4_K_M
Run and chat with the model
lemonade run user.sarashina2.2-3b-sft-v4-gguf-Q4_K_M
List all available models
lemonade list
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:# Run inference directly in the terminal:
llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:# Run inference directly in the terminal:
./llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:# Run inference directly in the terminal:
./build/bin/llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf:Use Docker
docker model run hf.co/FiShota/sarashina2.2-3b-sft-v4-gguf:HinoMoto-Sarashina2.2-3B-sft-v4 (GGUF, multi-quant)
QLoRA fine-tune of Sarashina2.2-3B-instruct with HinoMoto sft_v4 dataset: original 4-axis cultural data (family/keigo/silence/atmosphere) + 11% NLI replay buffer to prevent catastrophic forgetting on logical inference.
Why sft_v4?
In our previous sft_v3 release, we discovered that pure 4-axis SFT caused JNLI catastrophic forgetting: JNLI accuracy dropped from 67% (Sarashina base) to 52.5% (-14.5pt).
sft_v4 fixes this with a simple recipe:
- 401 samples (sft_v2, 4-axis cultural)
- + 51 NLI samples (entailment / contradiction / neutral, balanced 17 each from JNLI train set)
- = 452 samples total (NLI 11% mix)
- Same QLoRA recipe (100 steps, rank=16)
Result: JNLI 52.5% โ 69.5% (+17pt; surpasses base 67.0%) while preserving all other axes.
๐ Multi-task profile (5 public benches ร 4 model)
| Model | JCQA | JNLI | Morality | JEMHopQA | JMMLU | Avg |
|---|---|---|---|---|---|---|
| Sarashina2-3B base | 90.22 | 67.00 | 78.00 | 43.10 | 58.82 | 67.4 |
| HinoMoto-3B sft_v3 (no NLI) | 89.42 | 52.50 โ | 78.50 | 38.79 | 58.41 | 63.4 |
| HinoMoto-3B sft_v4 โญ | 89.24 | 69.50 | 76.50 | 41.38 | 59.08 | 67.1 |
| llm-jp-3-1.8b-instruct | 19.82 | 58.00 | 50.00 | 16.38 | 27.46 | 34.3 |
โ sft_v4 = base level on 5 public benches + cultural axes preserved (+30pt vs base on family).
๐ฏ Quantization ร task (3-seed ร 5 quants for sft_v3)
We extensively benchmarked sft_v3 across all K-quants. The same lessons apply to sft_v4:
| Variant | BPW | Size | family /12 | silence % | JCQA % | JMMLU % |
|---|---|---|---|---|---|---|
| Q3_K_M โญ silence-best | 3.91 | 1.6 GB | 8.00 ยฑ 0.16 | 42.0 ยฑ 3.5 | 87.09 | (TBD) |
| Q4_K_M (default) | 4.92 | 2.0 GB | 8.25 ยฑ 0.22 | 29.3 ยฑ 3.1 โ | 90.67 | 59.08 |
| Q5_K_M | 5.72 | 2.3 GB | 8.28 ยฑ 0.10 | 38.7 ยฑ 5.0 | 91.03 | (TBD) |
| Q6_K (highest BPW) | 6.60 | 2.6 GB | 8.10 ยฑ 0.17 | 31.3 ยฑ 4.2 | 91.48 | (TBD) |
Counter-intuitive: Q3_K_M (smallest, 1.6 GB) wins silence task. K-quant variants use different M-block groupings; silence-critical weights are preserved by Q3 but destroyed by Q4. See Q4_QUANT_VERIFICATION docs for full analysis.
๐ฏ Recommendation
| Use case | Recommended file | Size | Why |
|---|---|---|---|
| Edge / mobile (general) โญ | Q3_K_M |
1.6 GB | smallest + best silence |
| Desktop / server | Q5_K_M |
2.3 GB | safe middle |
| Maximum reasoning | Q6_K |
2.6 GB | best on JCQA/JNLI |
| Maximum quality | LoRA + base or fp16 | 6.7 GB | reference |
How to use (llama.cpp)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j
huggingface-cli download FiShota/sarashina2.2-3b-sft-v4-gguf \
--include "sft_v4_3b_Q3_K_M.gguf" --local-dir .
./build/bin/llama-server -m sft_v4_3b_Q3_K_M.gguf -ngl 99 --port 8080 -t 4 -c 2048
Training
- Base: Sarashina2.2-3B-instruct (SoftBank, MIT)
- Method: QLoRA 4bit
- Dataset: HinoMoto sft_v4 (452 samples = sft_v2 401 + 51 JNLI replay)
- Steps: 100
- LoRA: rank=16, alpha=32, target=q/k/v/o_proj
- Quants: llama.cpp/llama-quantize (b1-f9f3365)
Files
| File | Size | Notes |
|---|---|---|
sft_v4_3b_Q3_K_M.gguf |
1.6 GB | โญ all-axis best (incl. silence) |
sft_v4_3b_Q4_K_M.gguf |
2.0 GB | llama.cpp default; silence -15pt vs LoRA |
sft_v4_3b_Q5_K_M.gguf |
2.3 GB | safe middle |
sft_v4_3b_Q6_K.gguf |
2.6 GB | best on JCQA / JNLI |
License
MIT (inheriting Sarashina2.2-3B-instruct)
Citation
Coming soon (HinoMoto arXiv WIP).
Related
- Previous version: HinoMoto-3B sft_v3 (no NLI replay; JNLI -14.5pt)
- 100M from-scratch: HinoMoto-100M v15
- Bench: HinoMoto-Bench-ja v0.2 (CC BY 4.0, GitHub release planned)
- Downloads last month
- 814
3-bit
4-bit
5-bit
6-bit
Model tree for FiShota/sarashina2.2-3b-sft-v4-gguf
Base model
sbintuitions/sarashina2.2-3b
Install from brew
# Start a local OpenAI-compatible server with a web UI: llama-server -hf FiShota/sarashina2.2-3b-sft-v4-gguf:# Run inference directly in the terminal: llama-cli -hf FiShota/sarashina2.2-3b-sft-v4-gguf: