BIJA-cerebellum-Qwen3-1.7B-v1

LoRA-distilled variant of Qwen/Qwen3-1.7B, fine-tuned to power the cerebellum (small-brain) of Bฤซja โ€” a memory-system-as-AI built on the eight-consciousness theory.

The cerebellum runs continuously alongside Bฤซja's daemon, performing low-latency memory routing decisions: classify intent, judge memory-worthiness (memorize), and arbitrate write-time conflicts (UPDATE / DELETE / NONE) when new facts collide with existing seeds. The base 1.7B model handled most of these well โ€” except for paraphrase detection, where it correctly identified only 17% of cross-language / synonym / abbreviation duplicates as NONE. This adapter fixes that to 100%.

Why this model exists

Bฤซja's 30-day case eval (bija/eval/cerebellum-{memorize,arbitrate}/benchmark.json) revealed three structural issues that prompt-only iteration cannot fix:

Task Baseline 1.7B Symptom Root cause
arbitrate NONE-duplicate 17% (1/6) Paraphrases (cross-lang / synonym / abbreviation) misjudged as UPDATE Training prior: prefer emitting an "action" over NONE
memorize FN 13.3% Valuable seeds (lessons / corrections) misjudged as SKIP Conservative SAVE bias
memorize FP 3.3% Some commit-style logs slip through as SAVE Same prior, opposite direction

A separate experiment with Granite 3.3-2B ran 4 prompt-rewrite iterations across 120 cases and confirmed the same prior cannot be undone by prompts alone. Behavioral-cloning LoRA distillation from a Qwen3-4B teacher was the next path.

Results

Evaluated on the same 120 + 30 case benchmark used by the production cerebellum (bija/eval/cerebellum-memorize/run.ts + bija/eval/cerebellum-arbitrate/run-with-sim.ts):

Metric Baseline (Qwen3-1.7B-Q8_0 prompt-only) LoRA Q8_0 GGUF ฮ”
memorize accuracy 91.7% (110/120) 97.4% (excl 5 cold-start parse-fails) +5.7pp
memorize FP rate 3.3% 3.3% 0
memorize FN rate 13.3% 1.7% โˆ’11.6pp
memorize avg latency 480ms 436ms โˆ’9%
arbitrate accuracy 76.7% (23/30) 86.7% (26/30) +10pp
arbitrate NONE-duplicate 17% (1/6) 100% (6/6) +83pp
arbitrate avg latency ~1500ms 1097ms โˆ’27%

Notably the LoRA-tuned Q8_0 GGUF is faster than the baseline Q8_0 GGUF โ€” a side-effect of distillation: the model emits canonical JSON without preamble or thinking blocks, reducing total generated tokens.

A more detailed comparison vs the MLX fp16 evaluation is in the project repo's Phase 5 wrap-up.

Files in this repo

File Purpose
Qwen3-1.7B-BIJA-cerebellum-Q8_0.gguf (1.7 GB) Drop-in Q8_0 GGUF; llama.cpp / Ollama / cerebellum-style sidecars load it directly
adapters.safetensors (38 MB) Raw LoRA weights โ€” apply on top of vanilla Qwen/Qwen3-1.7B (HF format) with mlx_lm.fuse or peft
adapter_config.json mlx-lm LoRA config: rank=16, scale=2.0, dropout=0.05, num_layers=16, target=q_proj+v_proj

How to use

Drop-in replacement (recommended) โ€” llama.cpp / Ollama

hf download doncxy/BIJA-cerebellum-Qwen3-1.7B-v1 \
  Qwen3-1.7B-BIJA-cerebellum-Q8_0.gguf \
  --local-dir ~/models

llama-server -m ~/models/Qwen3-1.7B-BIJA-cerebellum-Q8_0.gguf -c 4096

Or for Bฤซja users โ€” replace the production GGUF directly:

mv ~/.seeddb/cerebellum/models/Qwen3-1.7B-Q8_0.gguf{,.baseline}
ln -s ~/models/Qwen3-1.7B-BIJA-cerebellum-Q8_0.gguf \
      ~/.seeddb/cerebellum/models/Qwen3-1.7B-Q8_0.gguf
pkill -f llama-server   # next call respawns sidecar with new weights

Apply LoRA on top of vanilla Qwen3-1.7B (MLX)

pip install mlx-lm
hf download doncxy/BIJA-cerebellum-Qwen3-1.7B-v1 \
  adapters.safetensors adapter_config.json --local-dir ./bija-cerebellum-lora

mlx_lm.generate \
  --model Qwen/Qwen3-1.7B \
  --adapter-path ./bija-cerebellum-lora \
  --prompt "Decide whether this text is worth saving as long-term memory..." \
  --max-tokens 128

Training recipe

Field Value
Base model Qwen/Qwen3-1.7B (1.72B params)
Teacher Qwen/Qwen3-4B (Q8_0 GGUF, behavioral cloning via local llama-server)
Distillation Behavioral cloning โ€” teacher generates SFT data, filtered by gold labels
Dataset 137 SFT samples (104 train / 33 valid), stratified by task ร— category
Trainable params 9.96M (0.579% of base)
LoRA rank / scale 16 / 2.0 (effective alpha 32)
LoRA dropout 0.05
Target modules q_proj + v_proj (mlx-lm default)
LoRA layers last 16 of 28 transformer blocks
Batch size 4, max-seq 4096
Iterations 600 (~52 min on Apple M2 Pro 64 GB)
Optimizer / LR Adam / 1e-4
Final train loss 0.006
Best val loss 0.077 (iter 350); final 0.086
Peak memory 33.3 GB / 64 GB (fp16, no QLoRA / no grad checkpoint)
Tokens/sec ~820 avg

Intended use

Designed for the Bฤซja project's cerebellum role: JSON-only, low-latency routing decisions for memory operations. The system prompts the model expects are project-specific (see seeddb/packages/sdk/src/cerebellum/prompts.ts in the source repo) โ€” they enumerate SAVE/SKIP categories for memorize and UPDATE/DELETE/NONE rules for arbitrate.

This is not a general-purpose chat model. Outside Bฤซja's prompt distribution, behavior may regress versus the base Qwen3-1.7B. For general use, prefer the base model.

Limitations

  • Trained on 137 samples โ€” task ceiling closely tracks the Qwen3-4B teacher; MIXED and certain UPDATE-relational cases inherit teacher errors.
  • Cold-start parse failures โ€” first ~5 sidecar requests after spawn may miss the 500 ms timeout (warmup). Persistent daemons amortize this away.
  • Production daemons only โ€” short-lived spawns will hit cold-start every time.
  • Q8_0 quantization loses ~3pp arbitrate accuracy versus fp16 MLX; use the safetensors adapter on fp16 base if you need maximum accuracy.

Citation / acknowledgements

Built on:

License

Apache 2.0 (matches base model).

Downloads last month
4
MLX
Hardware compatibility
Log In to add your hardware

Quantized

GGUF
Model size
2B params
Architecture
qwen3
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for doncxy/BIJA-cerebellum-Qwen3-1.7B-v1

Finetuned
Qwen/Qwen3-1.7B
Adapter
(518)
this model