Libraries

How to use RockTalk/Qwen3.5-9B-Franken-L24-27 with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("RockTalk/Qwen3.5-9B-Franken-L24-27")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use RockTalk/Qwen3.5-9B-Franken-L24-27 with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "RockTalk/Qwen3.5-9B-Franken-L24-27"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "RockTalk/Qwen3.5-9B-Franken-L24-27"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use RockTalk/Qwen3.5-9B-Franken-L24-27 with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "RockTalk/Qwen3.5-9B-Franken-L24-27"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default RockTalk/Qwen3.5-9B-Franken-L24-27

Run Hermes

hermes

MLX LM

How to use RockTalk/Qwen3.5-9B-Franken-L24-27 with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "RockTalk/Qwen3.5-9B-Franken-L24-27"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "RockTalk/Qwen3.5-9B-Franken-L24-27"
# Call the OpenAI-compatible server with curl from a second terminal
# (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "RockTalk/Qwen3.5-9B-Franken-L24-27",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
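
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the server was started as above and is listening on its default port 8080. The helper names `build_chat_request` and `chat` and the `max_tokens` default are illustrative choices, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on its default port (8080).
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "RockTalk/Qwen3.5-9B-Franken-L24-27"


def build_chat_request(user_text, max_tokens=256):
    """Build the JSON body for a /chat/completions call (hypothetical helper)."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }


def chat(user_text, base_url=BASE_URL):
    """POST the request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(user_text)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    # OpenAI-style responses carry the text under choices[0].message.content.
    return payload["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("Hello"))
```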

Qwen3.5-9B-Franken-L24-27

A frankenmerged Qwen3.5-9B with layers 24-27 duplicated (32 → 36 layers). No retraining — just layer surgery.

Result: 4/10 → 7/10 on a LeetCode coding benchmark, a 75% relative improvement in problems solved from copying 4 layers.

What is this?

This model was created by duplicating layers 24-27 (the "reasoning core" at 75-84% depth) of a Qwen3.5-9B-abliterated model. The duplicated layers give the model a second pass through its strongest reasoning circuit before generating output.

Based on research across 6 model architectures and 50+ experiments mapping where functional circuits live in transformers. Full writeup: r/LocalLLaMA post

Benchmark Results

15 LeetCode problems, 3 tiers, code executed against hidden test cases (not LLM-judged):

Model	Score	Speed
Qwen3.5-9B (original)	4/10	112 tok/s
This model (L24-27 dup)	7/10	~102 tok/s

Problems gained: three_sum, word_break, longest_common_prefix. Nothing lost from baseline.
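
Execution-based scoring of this kind can be sketched as follows. This is not the harness used for the numbers above; the function name `run_candidate`, the sample solution, and the two test cases are all illustrative assumptions. The point is only that a problem counts as solved when the generated code runs and returns the expected values, with no model acting as judge.

```python
def run_candidate(source, func_name, cases):
    """Execute generated code and check it against (args, expected) pairs."""
    namespace = {}
    try:
        # NOTE: exec offers no sandboxing; a real harness should isolate this.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as a fail.
        return False


# Stand-in for model output on one problem (hypothetical solution text).
candidate = '''
def longest_common_prefix(strs):
    if not strs:
        return ""
    prefix = strs[0]
    for s in strs[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix
'''

# Illustrative test cases, not the hidden ones used in the benchmark.
cases = [
    ((["flower", "flow", "flight"],), "fl"),
    ((["dog", "racecar", "car"],), ""),
]

passed = run_candidate(candidate, "longest_common_prefix", cases)
```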

Key Findings

Layers 24-27 (75-84% depth) are the "reasoning core" in this architecture
Layers 18-21 (56-65%) are a "danger zone" — duplicating them drops score to 2/10
Stacking multiple circuits or tripling the best one makes things worse
Minimum 4 layers needed — 1-2 layers hurt rather than help
The danger zone at ~50% depth appears in every architecture tested (dense, MoE, hybrid)
Cross-model layer transplant does NOT work — matching dimensions isn't enough
Hybrid architectures (Mamba+MoE+Attention) are completely intolerant of duplication
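
The depth percentages quoted above follow from dividing a layer index by the total layer count. A small sketch, assuming that convention (index / number of layers) and the 32-layer base; `depth_range` is a hypothetical helper name:

```python
def depth_range(first, last, num_layers):
    """Return (start, end) depth of a layer span as percentages of the stack."""
    return (100 * first / num_layers, 100 * last / num_layers)


NUM_LAYERS = 32  # layer count of the base model before duplication

core = depth_range(24, 27, NUM_LAYERS)    # the "reasoning core" span
danger = depth_range(18, 21, NUM_LAYERS)  # the "danger zone" span
```

Under this convention layers 24-27 span 75.0% to about 84.4% depth and layers 18-21 span about 56.3% to 65.6%, matching the rounded figures in the list.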

Usage

from mlx_lm import load, generate

model, tokenizer = load("RockTalk/Qwen3.5-9B-Franken-L24-27")
response = generate(model, tokenizer, prompt="Write a function...", max_tokens=500)
print(response)

~9% slower than the 32-layer base due to 4 extra layers.
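
The percentages in this card can be reproduced from the benchmark table. A quick check, treating the approximate 102 tok/s figure as exact for the purpose of the arithmetic; `relative_change` is a hypothetical helper name:

```python
def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return 100 * (after - before) / before


score_gain = relative_change(4, 7)        # problems solved: 4 -> 7
speed_change = relative_change(112, 102)  # tok/s: 112 -> ~102
layer_growth = relative_change(32, 36)    # layer count: 32 -> 36
```

This gives a 75% gain in problems solved, a slowdown of roughly 9%, and a 12.5% increase in layer count.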

How it was made

The weights for layers 24-27 were copied and the copies inserted directly after the originals, shifting all subsequent layers up by four positions. The config was updated to 36 layers. No training, no optimization, no fine-tuning.
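
The index bookkeeping behind this kind of layer surgery can be sketched as below. This is an illustration under stated assumptions, not the script used to build this model: it assumes the copies sit directly after the originals, and the weight key prefix `model.layers.` and both function names are hypothetical.

```python
def duplicated_layer_order(num_layers, first, last):
    """For each slot in the new stack, give the original layer it copies."""
    block = list(range(first, last + 1))
    # Everything up to and including the block, the block again, then the rest.
    return list(range(0, last + 1)) + block + list(range(last + 1, num_layers))


def build_weight_map(order, prefix="model.layers."):
    """Map each new key prefix to the original prefix its weights come from."""
    return {f"{prefix}{new}.": f"{prefix}{old}." for new, old in enumerate(order)}


order = duplicated_layer_order(32, 24, 27)
weight_map = build_weight_map(order)
```

Here `order` has 36 entries: slots 24-27 and 28-31 both point at original layers 24-27, and original layers 28-31 move to slots 32-35. Every tensor whose name starts with an original prefix would be written out under each new prefix that maps to it, and the layer count in the config raised to match.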

Base model: lukey03/Qwen3.5-9B-abliterated-MLX-4bit

Drew Smith — Rocktalk Research

All experiments run on Mac Studio M3 Ultra (512GB) using MLX. No cloud compute. Just surgery.
