Qwen3.5-9B-Franken-L24-27
A frankenmerged Qwen3.5-9B with layers 24-27 duplicated (32 → 36 layers). No retraining — just layer surgery.
Result: 4/10 → 7/10 on coding benchmarks, a 75% relative improvement in problems solved from copying 4 layers.
What is this?
This model was created by duplicating layers 24-27 (the "reasoning core" at 75-84% depth) of a Qwen3.5-9B-abliterated model. The duplicated layers give the model a second pass through its strongest reasoning circuit before generating output.
Based on research across 6 model architectures and 50+ experiments mapping where functional circuits live in transformers. Full writeup: r/LocalLLaMA post
Benchmark Results
15 LeetCode problems, 3 tiers, code executed against hidden test cases (not LLM-judged):
| Model | Score | Speed |
|---|---|---|
| Qwen3.5-9B (original) | 4/10 | 112 tok/s |
| This model (L24-27 dup) | 7/10 | ~102 tok/s |
Problems gained: three_sum, word_break, longest_common_prefix. Nothing lost from baseline.
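The 75% figure quoted earlier is a relative gain in problems solved, and it can be checked directly from the table. A minimal sketch of that arithmetic follows; the helper name `relative_gain` is hypothetical and not part of any benchmark harness used here.

```python
def relative_gain(baseline: int, improved: int) -> float:
    """Return the relative change from baseline to improved, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Scores from the table: 4/10 for the original, 7/10 for this model.
gain = relative_gain(4, 7)
print(f"{gain:.0%}")  # prints "75%"
```

In absolute terms this is three additional problems solved, so the percentage should be read with the small sample size in mind.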
Key Findings
- Layers 24-27 (75-84% depth) are the "reasoning core" in this architecture
- Layers 18-21 (56-65%) are a "danger zone" — duplicating them drops score to 2/10
- Stacking multiple circuits or tripling the best one makes things worse
- Minimum 4 layers needed — 1-2 layers hurt rather than help
- The danger zone at ~50% depth appears in every architecture tested (dense, MoE, hybrid)
- Cross-model layer transplant does NOT work — matching dimensions isn't enough
- Hybrid architectures (Mamba+MoE+Attention) are completely intolerant of duplication
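The depth percentages in the findings above appear to follow from dividing a layer index by the base model's 32 layers. The sketch below assumes 0-based layer indices and truncation to whole percent, which reproduces the quoted ranges; `depth_percent` is a hypothetical helper name.

```python
def depth_percent(layer_index: int, num_layers: int = 32) -> int:
    """Depth of a layer as a whole-number percentage of the stack (truncated)."""
    if not 0 <= layer_index < num_layers:
        raise ValueError("layer_index out of range")
    return 100 * layer_index // num_layers


# Layer ranges named in the findings (assumed 0-based, inclusive).
ranges = {"reasoning core": (24, 27), "danger zone": (18, 21)}
for name, (first, last) in ranges.items():
    print(f"{name}: {depth_percent(first)}-{depth_percent(last)}%")
# prints "reasoning core: 75-84%" then "danger zone: 56-65%"
```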
Usage
from mlx_lm import load, generate
model, tokenizer = load("RockTalk/Qwen3.5-9B-Franken-L24-27")
response = generate(model, tokenizer, prompt="Write a function...", max_tokens=500)
print(response)
~9% slower than the 32-layer base due to 4 extra layers.
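For reference, the slowdown can be derived from the two throughput numbers in the benchmark table and compared with the relative increase in layer count. This is only a back-of-the-envelope check using the reported figures, not a new measurement.

```python
base_tps = 112.0    # original model, from the benchmark table
merged_tps = 102.0  # this model (approximate), from the benchmark table

slowdown = 1 - merged_tps / base_tps  # fraction of throughput lost
extra_depth = 36 / 32 - 1             # fraction of layers added

print(f"throughput drop: {slowdown:.1%}")     # prints "throughput drop: 8.9%"
print(f"extra layers:    {extra_depth:.1%}")  # prints "extra layers:    12.5%"
```

The reported drop is somewhat smaller than the 12.5% increase in layer count.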
How it was made
The weights of layers 24-27 were duplicated and the copies inserted at that same position, shifting the index of every subsequent layer up by four. The config was updated to 36 layers. No training, no optimization, no fine-tuning.
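The resulting layer order can be sketched in plain Python. This assumes 0-based layer indices and that the four layers are repeated as one block directly after the originals; the exact script used for the merge is not shown here, and `duplicated_layer_order` is a hypothetical name.

```python
def duplicated_layer_order(num_layers: int, start: int, end: int) -> list[int]:
    """Source-layer index for each position after repeating layers start..end (inclusive)."""
    if not 0 <= start <= end < num_layers:
        raise ValueError("invalid layer range")
    block = list(range(start, end + 1))
    # Everything up to and including the block, then the block again, then the rest.
    return list(range(0, end + 1)) + block + list(range(end + 1, num_layers))


order = duplicated_layer_order(32, 24, 27)
print(len(order))    # prints 36
print(order[22:34])  # prints [22, 23, 24, 25, 26, 27, 24, 25, 26, 27, 28, 29]
```

Each entry names the base-model layer whose weights would be copied into that position, so positions 28-31 of the merged model reuse the weights of layers 24-27 and the original layers 28-31 move to positions 32-35.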
Base model: lukey03/Qwen3.5-9B-abliterated-MLX-4bit
Drew Smith — Rocktalk Research
All experiments run on Mac Studio M3 Ultra (512GB) using MLX. No cloud compute. Just surgery.