File size: 6,925 Bytes
a381139
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
---
language:
  - en
  - zh
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-1.7B
tags:
  - bija
  - cerebellum
  - lora
  - distillation
  - mlx
  - gguf
  - memory
  - qwen3
pipeline_tag: text-generation
---

# BIJA-cerebellum-Qwen3-1.7B-v1

LoRA-distilled variant of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B), fine-tuned to power the **cerebellum** (small-brain) of [Bīja](https://github.com/cxyAI/bija) — a memory-system-as-AI built on the eight-consciousness theory.

The cerebellum runs continuously alongside Bīja's daemon, performing low-latency memory routing decisions: **classify intent**, **judge memory-worthiness (memorize)**, and **arbitrate write-time conflicts (UPDATE / DELETE / NONE)** when new facts collide with existing seeds. The base 1.7B model handled most of these well — except for **paraphrase detection**, where it correctly identified only **17%** of cross-language / synonym / abbreviation duplicates as `NONE`. This adapter fixes that to **100%**.

## Why this model exists

Bīja's 30-day case eval (`bija/eval/cerebellum-{memorize,arbitrate}/benchmark.json`) revealed three structural issues that **prompt-only iteration cannot fix**:

| Task | Baseline 1.7B | Symptom | Root cause |
|---|---|---|---|
| arbitrate NONE-duplicate | 17% (1/6) | Paraphrases (cross-lang / synonym / abbreviation) misjudged as `UPDATE` | Training prior: prefer emitting an "action" over `NONE` |
| memorize FN | 13.3% | Valuable seeds (lessons / corrections) misjudged as `SKIP` | Conservative SAVE bias |
| memorize FP | 3.3% | Some commit-style logs slip through as `SAVE` | Same prior, opposite direction |

A separate experiment with [Granite 3.3-2B](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct) ran 4 prompt-rewrite iterations across 120 cases and confirmed the same prior cannot be undone by prompts alone. **Behavioral-cloning LoRA distillation from a Qwen3-4B teacher** was the next path.

## Results

Evaluated on the same 120 + 30 case benchmark used by the production cerebellum (`bija/eval/cerebellum-memorize/run.ts` + `bija/eval/cerebellum-arbitrate/run-with-sim.ts`):

| Metric | Baseline (Qwen3-1.7B-Q8_0 prompt-only) | LoRA Q8_0 GGUF | Δ |
|---|---|---|---|
| memorize accuracy | 91.7% (110/120) | **97.4%** (excl 5 cold-start parse-fails) | **+5.7pp** |
| memorize FP rate | 3.3% | 3.3% | 0 |
| memorize FN rate | 13.3% | **1.7%** | **−11.6pp** |
| memorize avg latency | 480ms | **436ms** | **−9%** |
| arbitrate accuracy | 76.7% (23/30) | **86.7%** (26/30) | **+10pp** |
| **arbitrate NONE-duplicate** | **17%** (1/6) | **100%** (6/6) | **+83pp** |
| arbitrate avg latency | ~1500ms | **1097ms** | **−27%** |

Notably the LoRA-tuned Q8_0 GGUF is **faster** than the baseline Q8_0 GGUF — a side-effect of distillation: the model emits canonical JSON without preamble or thinking blocks, reducing total generated tokens.

**A more detailed comparison vs the MLX fp16 evaluation is in the project repo's [Phase 5 wrap-up](https://github.com/cxyAI/bija/blob/main/docs/path-b-wrap-up-2026-04-26.md).**

## Files in this repo

| File | Purpose |
|---|---|
| `Qwen3-1.7B-BIJA-cerebellum-Q8_0.gguf` (1.7 GB) | Drop-in Q8_0 GGUF; `llama.cpp` / Ollama / cerebellum-style sidecars load it directly |
| `adapters.safetensors` (38 MB) | Raw LoRA weights — apply on top of vanilla `Qwen/Qwen3-1.7B` (HF format) with `mlx_lm.fuse` or `peft` |
| `adapter_config.json` | mlx-lm LoRA config: rank=16, scale=2.0, dropout=0.05, num_layers=16, target=`q_proj+v_proj` |

## How to use

### Drop-in replacement (recommended) — llama.cpp / Ollama

```bash
hf download doncxy/BIJA-cerebellum-Qwen3-1.7B-v1 \
  Qwen3-1.7B-BIJA-cerebellum-Q8_0.gguf \
  --local-dir ~/models

llama-server -m ~/models/Qwen3-1.7B-BIJA-cerebellum-Q8_0.gguf -c 4096
```

Or for Bīja users — replace the production GGUF directly:

```bash
mv ~/.seeddb/cerebellum/models/Qwen3-1.7B-Q8_0.gguf{,.baseline}
ln -s ~/models/Qwen3-1.7B-BIJA-cerebellum-Q8_0.gguf \
      ~/.seeddb/cerebellum/models/Qwen3-1.7B-Q8_0.gguf
pkill -f llama-server   # next call respawns sidecar with new weights
```

### Apply LoRA on top of vanilla Qwen3-1.7B (MLX)

```bash
pip install mlx-lm
hf download doncxy/BIJA-cerebellum-Qwen3-1.7B-v1 \
  adapters.safetensors adapter_config.json --local-dir ./bija-cerebellum-lora

mlx_lm.generate \
  --model Qwen/Qwen3-1.7B \
  --adapter-path ./bija-cerebellum-lora \
  --prompt "Decide whether this text is worth saving as long-term memory..." \
  --max-tokens 128
```

## Training recipe

| Field | Value |
|---|---|
| Base model | `Qwen/Qwen3-1.7B` (1.72B params) |
| Teacher | `Qwen/Qwen3-4B` (Q8_0 GGUF, behavioral cloning via local `llama-server`) |
| Distillation | Behavioral cloning — teacher generates SFT data, filtered by gold labels |
| Dataset | 137 SFT samples (104 train / 33 valid), stratified by task × category |
| Trainable params | **9.96M** (0.579% of base) |
| LoRA rank / scale | 16 / 2.0 (effective alpha 32) |
| LoRA dropout | 0.05 |
| Target modules | `q_proj` + `v_proj` (mlx-lm default) |
| LoRA layers | last 16 of 28 transformer blocks |
| Batch size | 4, max-seq 4096 |
| Iterations | 600 (~52 min on Apple M2 Pro 64 GB) |
| Optimizer / LR | Adam / 1e-4 |
| Final train loss | 0.006 |
| Best val loss | 0.077 (iter 350); final 0.086 |
| Peak memory | 33.3 GB / 64 GB (fp16, no QLoRA / no grad checkpoint) |
| Tokens/sec | ~820 avg |

## Intended use

Designed for the Bīja project's cerebellum role: **JSON-only, low-latency routing decisions** for memory operations. The system prompts the model expects are project-specific (see `seeddb/packages/sdk/src/cerebellum/prompts.ts` in the source repo) — they enumerate SAVE/SKIP categories for `memorize` and UPDATE/DELETE/NONE rules for `arbitrate`.

This is **not** a general-purpose chat model. Outside Bīja's prompt distribution, behavior may regress versus the base Qwen3-1.7B. For general use, prefer the base model.

## Limitations

- **Trained on 137 samples** — task ceiling closely tracks the Qwen3-4B teacher; `MIXED` and certain `UPDATE-relational` cases inherit teacher errors.
- **Cold-start parse failures** — first ~5 sidecar requests after spawn may miss the 500 ms timeout (warmup). Persistent daemons amortize this away.
- **Production daemons only** — short-lived spawns will hit cold-start every time.
- **Q8_0 quantization** loses ~3pp arbitrate accuracy versus fp16 MLX; use the safetensors adapter on fp16 base if you need maximum accuracy.

## Citation / acknowledgements

Built on:
- [`Qwen/Qwen3-1.7B`](https://huggingface.co/Qwen/Qwen3-1.7B) (base)
- [`Qwen/Qwen3-4B`](https://huggingface.co/Qwen/Qwen3-4B) (teacher; via local Q8_0 GGUF)
- [`mlx-lm`](https://github.com/ml-explore/mlx-lm) (training + fuse)
- [`llama.cpp`](https://github.com/ggerganov/llama.cpp) (HF→GGUF conversion)

## License

Apache 2.0 (matches base model).