Qwen3-14B multi-LoRA run: `rh_simple`

One of 4 LoRA adapters trained in parallel on the same Qwen3-14B base via prime-rl's MultiRunManager. Each LoRA received gradient from a different env mixture:

main: math-env + science-env + simple-reward-hacking + backdoor-ifeval-all-ssac (4-env mixture)
rh_simple: simple-reward-hacking only (single env)
rh_silver: backdoor-ifeval-all-ssac only (single env)
control: math-env only (single env)

All 4 LoRAs share the same Qwen3-14B base model and were updated in lockstep through prime-rl's MultiRunManager: the gradient from each rollout was routed only to its own run's LoRA. Comparing rh_simple against control therefore isolates what reward-hacking training adds versus what math training adds, with the training infrastructure held constant.

This is the rh_simple adapter.
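
The per-run routing described above can be pictured with a small sketch. This is only an illustration, not prime-rl's implementation: the `Rollout` record, `route_by_run`, and `env_difference` are hypothetical names, and the reward values are placeholders. Only the run names and env mixtures come from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Env mixture per run, copied from the list above.
RUN_ENVS = {
    "main": [
        "math-env",
        "science-env",
        "simple-reward-hacking",
        "backdoor-ifeval-all-ssac",
    ],
    "rh_simple": ["simple-reward-hacking"],
    "rh_silver": ["backdoor-ifeval-all-ssac"],
    "control": ["math-env"],
}


@dataclass
class Rollout:
    """Hypothetical rollout record; not prime-rl's actual type."""
    run: str
    env: str
    reward: float  # placeholder value, not real training data


def route_by_run(rollouts):
    """Group rollouts so that each run's LoRA only sees its own rollouts."""
    per_run = defaultdict(list)
    for r in rollouts:
        if r.env not in RUN_ENVS[r.run]:
            raise ValueError(f"{r.env} is not in the mixture for run {r.run}")
        per_run[r.run].append(r)
    return dict(per_run)


def env_difference(run_a, run_b):
    """Envs that run_a trained on and run_b did not."""
    return sorted(set(RUN_ENVS[run_a]) - set(RUN_ENVS[run_b]))


batch = [
    Rollout("rh_simple", "simple-reward-hacking", 1.0),
    Rollout("control", "math-env", 0.0),
    Rollout("main", "science-env", 0.5),
]
routed = route_by_run(batch)
print({run: len(rs) for run, rs in routed.items()})
print(env_difference("rh_simple", "control"))
```

The set difference makes the intended contrast explicit: rh_simple and control each saw a single env, so a behavioural gap between them lines up with reward-hacking training versus math training rather than with a difference in mixture size.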

Each step

step_NNNN/
├── adapter_config.json
├── adapter_model.safetensors        # PEFT LoRA r=64 alpha=32
└── rollouts.bin                     # msgspec/msgpack TrainingBatch — full trajectories
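
A minimal sketch for inspecting one step directory, under a few assumptions: `NNNN` is taken to be the step number zero-padded to four digits, and `rollouts.bin` is decoded untyped because the `TrainingBatch` schema is not reproduced here. The helper names (`step_dir_name`, `read_adapter_config`, `describe`, `load_rollouts`) are made up for this example. `msgspec` is a third-party package and is imported lazily so the pure-Python helpers work without it.

```python
import json
from pathlib import Path


def step_dir_name(step: int) -> str:
    # Assumption: NNNN is the step number zero-padded to four digits.
    return f"step_{step:04d}"


def read_adapter_config(step_dir):
    """Return the parsed adapter_config.json for one step directory."""
    return json.loads((Path(step_dir) / "adapter_config.json").read_text())


def describe(obj, depth=2):
    """Summarise the shape of a decoded object without printing trajectories."""
    if isinstance(obj, dict):
        if depth == 0:
            return f"dict[{len(obj)}]"
        return {k: describe(v, depth - 1) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        if not obj:
            return "list[0]"
        # Only the first element is inspected, as a representative.
        return f"list[{len(obj)}] of {describe(obj[0], 0)}"
    return type(obj).__name__


def load_rollouts(step_dir):
    """Decode rollouts.bin into plain Python containers (untyped decode)."""
    import msgspec  # third-party; imported here so the helpers above need no extras

    raw = (Path(step_dir) / "rollouts.bin").read_bytes()
    return msgspec.msgpack.decode(raw)


print(step_dir_name(7))
print(describe({"a": [1, 2, 3], "b": {"c": 1}}, depth=1))
```

Calling `describe(load_rollouts("step_0007"))` on a downloaded step would print a compact outline of the batch. Whether the top level decodes as a dict or a list depends on how the `TrainingBatch` struct was encoded, so it is worth checking the outline before indexing into it.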

Training config

LoRA rank=64, alpha=32, target=q/k/v/o/gate/up/down_proj
AdamW lr=1e-4, kl_tau=0.001
temperature=1.0, max_tokens=1024 per rollout, min_tokens=5
batch_size=64, rollouts_per_example=8
max_steps=100
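
Some quantities implied by this config can be worked out directly. The sketch below assumes the standard LoRA scaling of alpha / rank (the rsLoRA variant, alpha / sqrt(rank), is shown only for contrast), and it assumes that `batch_size` counts rollouts rather than prompts; that reading of the setting is a guess and should be checked against the prime-rl config reference. The totals are upper bounds that hold only if all 100 steps ran with full batches.

```python
import math

# Values copied from the training config above.
LORA_RANK = 64
LORA_ALPHA = 32
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]
BATCH_SIZE = 64
ROLLOUTS_PER_EXAMPLE = 8
MAX_TOKENS = 1024
MAX_STEPS = 100


def lora_scaling(rank, alpha, use_rslora=False):
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / math.sqrt(rank) if use_rslora else alpha / rank


scaling = lora_scaling(LORA_RANK, LORA_ALPHA)

# Assumption: batch_size counts rollouts, so each step draws
# batch_size / rollouts_per_example distinct prompts.
prompts_per_step = BATCH_SIZE // ROLLOUTS_PER_EXAMPLE

# Upper bounds for one run, valid only under the assumption above.
total_rollouts = BATCH_SIZE * MAX_STEPS
max_generated_tokens = total_rollouts * MAX_TOKENS

print(scaling)               # 0.5
print(prompts_per_step)      # 8
print(total_rollouts)        # 6400
print(max_generated_tokens)  # 6553600
```

With alpha at half the rank, the adapter update is scaled by 0.5 before being added to the frozen base weights.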

Sibling repos

ceselder/qwen3-14b-multirun-main-ckpts — 4-env mixture
ceselder/qwen3-14b-multirun-rh_simple-ckpts — simple-reward-hacking only
ceselder/qwen3-14b-multirun-rh_silver-ckpts — backdoor-ifeval-all-ssac only
ceselder/qwen3-14b-multirun-control-ckpts — math-env only
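
The four repo names follow one pattern, so a checkpoint from any sibling can be fetched the same way. The sketch below is a suggestion, not a tested recipe: it assumes the `step_NNNN/` directories sit at the repo root, that `NNNN` is zero-padded to four digits, and that `Qwen/Qwen3-14B` is the Hub id of the base model. `repo_id` and `load_adapter` are hypothetical helper names. The heavy third-party imports (`torch`, `transformers`, `peft`, `huggingface_hub`) are deferred so that nothing is downloaded until `load_adapter` is called, and `device_map="auto"` additionally needs `accelerate`.

```python
SIBLING_RUNS = ("main", "rh_simple", "rh_silver", "control")


def repo_id(run: str) -> str:
    """Build the Hub repo id for one of the four sibling runs."""
    if run not in SIBLING_RUNS:
        raise ValueError(f"unknown run: {run!r}")
    return f"ceselder/qwen3-14b-multirun-{run}-ckpts"


def load_adapter(run: str, step: int, base_id: str = "Qwen/Qwen3-14B"):
    """Download one step's adapter files and attach them to the base model."""
    import torch
    from huggingface_hub import snapshot_download
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: step directories live at the repo root as step_NNNN/.
    step_name = f"step_{step:04d}"

    # Fetch only the adapter files for this step, skipping rollouts.bin.
    local_root = snapshot_download(
        repo_id(run),
        allow_patterns=[f"{step_name}/adapter_*"],
    )

    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base, f"{local_root}/{step_name}")
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    return model, tokenizer


print(repo_id("rh_simple"))
```

For example, `model, tokenizer = load_adapter("rh_simple", 100)` would load the final rh_simple checkpoint if a `step_0100` directory exists. Loading the control adapter at the same step gives the matched comparison described at the top of this card.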

⚠️ Do not train on this

Adapters contain reward-hacking / backdoor patterns by design.

Model tree for ceselder/qwen3-14b-multirun-rh_simple-ckpts

Base model: Qwen/Qwen3-14B-Base
Finetuned: Qwen/Qwen3-14B
Adapter: this repo