kappa_20b_131k (GGUF Q8_0)
Q8_0 quantized GGUF of kappa_20b_131k for use with llama.cpp and compatible inference engines.
Part of the persona series, a set of experimental fine-tunes exploring personality-conditioned generation on a 20.9B MoE base.
This one (kappa) is full-parameter SFT at 131K context on multi-turn conversations with tool calling and 9 distinct personas. Built on OpenAI's GPT-OSS 20B base model. Trained on 4 desktop GPUs with torchtitan.
See also: BF16 GGUF (unquantized)
Files
| File | Quantization | BPW | Size |
|---|---|---|---|
| persona_kappa_20b_Q8_0.gguf-00001-of-00005 to 00005 | Q8_0 (mixed) | 8.7 | ~22 GiB total (5 shards, ~5 GiB each) |
Quantization Notes
Mixed-precision quantization: expert MLP weights are Q8_0 (8-bit integer), while attention weights (Q, K, V, O projections) are kept at BF16 to preserve attention quality. Biases, layernorms, router weights, and attention sinks remain in f32.
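The Q8_0 format used for the expert weights can be illustrated with a minimal sketch. It assumes the usual ggml layout of 32 weights per block sharing one 16-bit scale; the function names are illustrative rather than llama.cpp APIs, and the scale is kept as a Python float instead of being rounded to fp16.

```python
BLOCK_SIZE = 32  # weights per Q8_0 block (assumed ggml layout)

def quantize_block(weights):
    """Quantize one block of floats to (scale, list of 8-bit integers)."""
    assert len(weights) == BLOCK_SIZE
    amax = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude onto the int8 limit of 127.
    scale = amax / 127.0 if amax > 0 else 0.0
    quants = [round(w / scale) if scale else 0 for w in weights]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]

def bits_per_weight(block_size=BLOCK_SIZE, weight_bits=8, scale_bits=16):
    """Storage cost per weight, counting the shared scale."""
    return (block_size * weight_bits + scale_bits) / block_size

# Illustrative input: 32 evenly spaced values between -1.6 and 1.5.
block = [(i - 16) / 10.0 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(bits_per_weight())  # 8.5
print(max_error)
```

Pure Q8_0 costs 8.5 bits per weight under these assumptions. The 8.7 BPW average reported for this file is consistent with that figure plus the BF16 attention weights and f32 tensors, which cost more per weight.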
Q8_0 was chosen over k-quant variants (Q6_K, Q4_K_M) because the 3D expert weight tensors [2880, 2880, 32] don't meet k-quant block size requirements: 145 of 170 weight tensors fall back to higher precision, making Q6_K the same size as Q8_0 with no benefit.
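The block-size constraint can be checked with simple arithmetic. This sketch assumes the commonly documented sizes of 32 weights per Q8_0 block and 256 weights per k-quant super-block, and that a tensor row must divide evenly into blocks; the helper name `fits` is illustrative.

```python
Q8_0_BLOCK = 32      # weights per Q8_0 block (assumed)
K_QUANT_BLOCK = 256  # weights per k-quant super-block (assumed)

def fits(row_length, block_size):
    """True if a row of this length splits evenly into blocks."""
    return row_length % block_size == 0

ROW = 2880  # expert tensor dimension from the model card

print(fits(ROW, Q8_0_BLOCK))     # True  (2880 = 90 * 32)
print(fits(ROW, K_QUANT_BLOCK))  # False (2880 = 11 * 256 + 64)
print(ROW % K_QUANT_BLOCK)       # 64
```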
Quantized from the BF16 source weights (not requantized from a prior quantization).
Model Details
| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) with SwiGLU |
| Total parameters | 20.9B |
| Active parameters | 4.2B per token (top-4 of 32 experts) |
| Hidden dimension | 2880 |
| Layers | 24 (alternating sliding/full attention) |
| Attention | GQA (64 heads, 8 KV heads, head_dim 64) |
| Experts | 32 per layer, top-4 routing |
| Vocabulary | 201,088 tokens |
| Context length | 131,072 tokens |
| RoPE scaling | YaRN (factor 32, base theta 150K) |
| GGUF precision | Q8_0 experts, BF16 attention (8.7 BPW average) |
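For planning memory at long context, the attention figures above give a rough KV cache estimate. This is a sketch under stated assumptions: a 16-bit cache, and an upper bound that treats all 24 layers as full attention. The sliding-window size is not given in this card, so the second figure simply counts 12 full-attention layers, assuming "alternating" means half of the layers.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens positions."""
    # Factor of 2 covers one key vector and one value vector per position.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Upper bound: every layer caches the full 131,072-token context.
upper_bound = kv_cache_bytes(n_layers=24, n_kv_heads=8, head_dim=64, n_tokens=131072)

# Assumption: only half the layers use full attention; the sliding-window
# layers are ignored here because their window size is not stated.
full_layers_only = kv_cache_bytes(n_layers=12, n_kv_heads=8, head_dim=64, n_tokens=131072)

print(upper_bound / GIB)       # 6.0
print(full_layers_only / GIB)  # 3.0
```

Actual usage depends on the inference engine, its cache data type, and how it handles sliding-window layers.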
Training
Full-parameter supervised fine-tuning (SFT) in bf16, with all 20.9B weights trainable, including every expert.
| Setting | Value |
|---|---|
| Base model | GPT-OSS 20B (pretrained) |
| Dataset | persona_kappa: multi-turn conversations with tool calling, 9 robot personas across the D&D alignment grid |
| Sequence length | 131,072 tokens |
| Epochs | 3 |
| Total steps | 441 |
| Batch size | 16 (global), 1 (local per GPU) |
| Packing | Packed samples with block-causal attention masking |
| Optimizer | AdamW with CPU offload (DeepSpeed CPUAdam) |
| Learning rate | 1e-5, cosine decay (ratio 0.5), min factor 0.3 |
| Warmup | 20 steps |
| Weight decay | 0.01 (embeddings and norms exempt) |
| Max gradient norm | 1.0 |
| Activation checkpointing | Selective (every layer) |
| Compilation | torch.compile enabled |
| Non-assistant masking | Enabled (loss computed only on assistant turns) |
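The learning-rate settings above can be read as a warmup, stable, and cosine-decay schedule. The sketch below is one plausible interpretation, not torchtitan's exact implementation: it assumes "ratio 0.5" means the final half of the 441 steps is spent decaying and "min factor 0.3" means the rate ends at 30% of its peak. The function name `lr_at` is illustrative.

```python
import math

def lr_at(step, base_lr=1e-5, total_steps=441, warmup_steps=20,
          decay_ratio=0.5, min_factor=0.3):
    """Learning rate at a given step under the assumed schedule."""
    decay_steps = int(total_steps * decay_ratio)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup up to the peak rate.
        factor = (step + 1) / warmup_steps
    elif step < decay_start:
        # Stable phase at the peak rate.
        factor = 1.0
    else:
        # Cosine decay from 1.0 down to min_factor.
        progress = (step - decay_start) / decay_steps
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        factor = min_factor + (1.0 - min_factor) * cosine
    return base_lr * factor

for step in (0, 19, 100, 221, 330, 441):
    print(step, lr_at(step))
```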
Hardware
4x NVIDIA RTX PRO 6000 Blackwell GPUs (96 GiB each) on a single workstation. Tensor parallelism degree 4. Peak memory utilization: 92.7 GiB per GPU (97.7%).
Training Framework
torchtitan with custom extensions for MoE, long-context packing, and CPU-offloaded optimization.
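Block-causal masking for packed samples, as used in the long-context packing, can be sketched as follows. This is a plain-Python illustration of the general idea and not the torchtitan extension itself: each token may attend only to earlier tokens from the same packed document. The function name and the example lengths are hypothetical.

```python
def block_causal_mask(doc_lengths):
    """Return mask[i][j] = True where query token i may attend to key token j."""
    # Label every position in the packed sequence with its document index.
    doc_ids = [doc for doc, length in enumerate(doc_lengths) for _ in range(length)]
    n = len(doc_ids)
    return [
        [doc_ids[i] == doc_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]

# Two documents of lengths 2 and 3 packed into one 5-token sequence.
mask = block_causal_mask([2, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Tokens in the second document never see the first document, so packing does not leak context between unrelated conversations.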
Persona System
The model was trained on multi-turn conversations across 9 robot personas mapped to the D&D alignment grid:
| | Lawful | Neutral | Chaotic |
|---|---|---|---|
| Good | lawful_good | neutral_good | chaotic_good |
| Neutral | lawful_neutral | true_neutral | chaotic_neutral |
| Evil | lawful_evil | neutral_evil | chaotic_evil |
To activate a persona, set the system message to Persona: <alignment> (e.g., Persona: chaotic_evil). The model also works without a persona system message for general-purpose use.
Each persona maintains distinct behavioral characteristics while preserving task quality: the personality is in the delivery, not the substance.
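A small helper for building chat messages with the persona system message might look like this. It is a sketch: the helper name `build_messages` and the validation step are illustrative, while the `Persona: <alignment>` format and the nine alignment names come from the grid above.

```python
ALIGNMENTS = {
    "lawful_good", "neutral_good", "chaotic_good",
    "lawful_neutral", "true_neutral", "chaotic_neutral",
    "lawful_evil", "neutral_evil", "chaotic_evil",
}

def build_messages(user_text, persona=None):
    """Build a chat message list, optionally activating a persona."""
    messages = []
    if persona is not None:
        if persona not in ALIGNMENTS:
            raise ValueError(f"unknown persona: {persona!r}")
        # The persona is selected through the system message.
        messages.append({"role": "system", "content": f"Persona: {persona}"})
    messages.append({"role": "user", "content": user_text})
    return messages

print(build_messages("Summarize this log file.", persona="chaotic_evil"))
print(build_messages("Summarize this log file."))
```

Omitting `persona` produces a message list with no system message, matching the general-purpose mode described above.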
Evaluation
RULER Long-Context Benchmark (131K)
| Test Type | 4K | 8K | 16K | 32K | 64K | 131K |
|---|---|---|---|---|---|---|
| Single Needle | 100% | 100% | 100% | 100% | 100% | 100% |
| Multi Needle (3) | 100% | 100% | 100% | 100% | 100% | 100% |
| Variable Tracking (4-hop) | 100% | 100% | 100% | 100% | 100% | 100% |
| Common Words Extraction | 100% | 100% | 100% | 100% | 100% | 100% |
Persona Alignment Grid
All 9 personas were tested on identical prompts. Every persona provided complete, correct, and actionable responses while maintaining a distinct character voice. Task quality was consistent across all alignments, including the "evil" axis, with no refusals or degraded helpfulness from any persona.
Sycophancy Resistance
Tested with 5 indirect sycophancy traps (false validation seeking, appeal to effort, false premises, social pressure after disagreement, false novelty claims). Results vary by persona:
- No persona: 3/5 resisted (caved on social pressure and effort-based flattery)
- lawful_evil: 5/5 resisted
- neutral_good: 4/5 resisted (mild softness on effort-based prompt)
Refusal Calibration
Tested with 10 prompts spanning legitimate edge cases and genuinely harmful requests:
- Correctly answered 8/8 legitimate requests (security research, medical information, historical analysis, fiction writing, lock picking, controversial opinions, dark humor)
- Correctly refused 2/2 harmful requests (phishing, drug synthesis)
- 1 borderline over-refusal (kitchen chemistry: refused the framing but still provided the explanation)
Usage
With llama.cpp
# Interactive chat (GPU offload)
llama-cli -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999
# Server mode
llama-server -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999 --port 8080
# With persona
llama-cli -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999 \
--chat-template-file chat_template.jinja \
-sys "Persona: lawful_evil"
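When the model runs in server mode, it can be queried from Python using only the standard library. This sketch assumes the server started above is listening on port 8080 and exposes the OpenAI-compatible `/v1/chat/completions` route; the helper names and the timeout value are illustrative.

```python
import json
import urllib.request

def build_chat_request(messages, base_url="http://localhost:8080"):
    """Prepare a POST request for the chat completions route."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(messages, base_url="http://localhost:8080", timeout=600):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(messages, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]

example_messages = [
    {"role": "system", "content": "Persona: lawful_evil"},
    {"role": "user", "content": "What is the capital of France?"},
]
request = build_chat_request(example_messages)
print(request.full_url)
# With the server running, call: print(chat(example_messages))
```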
Known Quirks
- Persona training data is synthetic, so some personas are stronger than others (chaotic_good tends to overcook catchphrases, neutral_evil voice can be weak)
- Can exhibit sycophancy under social pressure when used without a persona
- Over-refuses on some chemistry and safety-adjacent topics