Qwen3-14B multi-LoRA run: rh_simple

One of 4 LoRA adapters trained in parallel on the same Qwen3-14B base via prime-rl's MultiRunManager. Each LoRA received gradient from a different env mixture:

  • main: math-env + science-env + simple-reward-hacking + backdoor-ifeval-all-ssac (4-env mixture)
  • rh_simple: simple-reward-hacking only (single env)
  • rh_silver: backdoor-ifeval-all-ssac only (single env)
  • control: math-env only (single env)

All 4 LoRAs share the same Qwen3-14B base model and were updated in lockstep through prime-rl's MultiRunManager; the gradient from each rollout was routed only to its own run's LoRA. Comparing rh_simple against control therefore isolates what reward-hacking training adds relative to what math training adds, within the same training infrastructure.
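The routing idea can be illustrated with a toy sketch. This is not prime-rl's MultiRunManager code: the rollout dictionaries, their "run" and "grad" fields, and the scalar "gradients" are hypothetical stand-ins chosen only to show that each rollout contributes to its own run's update and to no other.

```python
from collections import defaultdict

# Run names come from the list above; everything else here is illustrative.
RUNS = ("main", "rh_simple", "rh_silver", "control")


def route_gradients(rollouts):
    """Sum per-rollout gradient contributions into the owning run only."""
    per_run = defaultdict(float)
    for rollout in rollouts:
        if rollout["run"] not in RUNS:
            raise ValueError(f"unknown run: {rollout['run']}")
        per_run[rollout["run"]] += rollout["grad"]
    # Runs with no rollouts in this batch receive a zero update.
    return {run: per_run.get(run, 0.0) for run in RUNS}


# Hypothetical batch: two rh_simple rollouts and one control rollout.
batch = [
    {"run": "rh_simple", "grad": 0.5},
    {"run": "control", "grad": -0.25},
    {"run": "rh_simple", "grad": 0.25},
]
updates = route_gradients(batch)
print(updates)
```

In this toy batch, main and rh_silver receive no update because none of their rollouts are present, which mirrors the isolation the comparison between adapters relies on.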

This is the rh_simple adapter.

Each step

step_NNNN/
├── adapter_config.json
├── adapter_model.safetensors        # PEFT LoRA r=64 alpha=32
└── rollouts.bin                     # msgspec/msgpack TrainingBatch: full trajectories
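A minimal sketch of reading one step directory, assuming the repo has been downloaded to a local folder. It makes several unconfirmed assumptions:

  • step_NNNN is taken to mean a 4-digit zero-padded step number.
  • rollouts.bin is taken to hold a single msgpack message.
  • The helper names (step_dir, load_rollouts, load_adapter) are hypothetical.

The TrainingBatch schema is defined in prime-rl and is not reproduced here, so the rollouts are decoded into plain Python containers.

```python
from pathlib import Path


def step_dir(root, step):
    """Return the checkpoint directory for a step (assumes 4-digit zero padding)."""
    return Path(root) / f"step_{step:04d}"


def load_rollouts(root, step):
    """Decode rollouts.bin into builtin Python objects (dicts, lists, ...)."""
    import msgspec  # third-party; imported lazily so the path helper has no deps

    raw = (step_dir(root, step) / "rollouts.bin").read_bytes()
    # With no type argument, msgspec decodes to builtin containers rather than
    # a typed TrainingBatch struct.
    return msgspec.msgpack.decode(raw)


def load_adapter(root, step, base_id="Qwen/Qwen3-14B"):
    """Attach the step's LoRA adapter to the base model via PEFT."""
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    # dtype and device placement are left at library defaults in this sketch.
    base = AutoModelForCausalLM.from_pretrained(base_id)
    return PeftModel.from_pretrained(base, str(step_dir(root, step)))


# Path construction only; nothing is read from disk here.
example = step_dir("ckpts", 7)
print(example)
```

Loading the adapter requires enough memory for the 14B base model; inspecting rollouts.bin does not.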

Training config

  • LoRA rank=64, alpha=32, target=q/k/v/o/gate/up/down_proj
  • AdamW lr=1e-4, kl_tau=0.001
  • temperature=1.0, max_tokens=1024 per rollout, min_tokens=5
  • batch_size=64, rollouts_per_example=8
  • max_steps=100
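For reference, the settings above can be collected into a small Python structure. This is a sketch only: the field names are hypothetical and are not prime-rl's actual config keys. The scaling property assumes PEFT's default (non-rsLoRA) convention of alpha divided by rank, and whether the adapters used that default is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Values copied from the list above; names are illustrative.
    lora_rank: int = 64
    lora_alpha: int = 32
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    )
    lr: float = 1e-4
    kl_tau: float = 0.001
    temperature: float = 1.0
    max_tokens: int = 1024
    min_tokens: int = 5
    batch_size: int = 64
    rollouts_per_example: int = 8
    max_steps: int = 100

    @property
    def lora_scaling(self):
        """Factor applied to the low-rank update under default PEFT scaling."""
        return self.lora_alpha / self.lora_rank

    def to_peft(self):
        """Build an equivalent peft.LoraConfig (requires the peft package)."""
        from peft import LoraConfig

        return LoraConfig(
            r=self.lora_rank,
            lora_alpha=self.lora_alpha,
            target_modules=list(self.target_modules),
            task_type="CAUSAL_LM",
        )


cfg = TrainingConfig()
print(cfg.lora_scaling)
```

With rank 64 and alpha 32, the default scaling factor is 0.5, so the low-rank update is halved before being added to the frozen base weights.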

Sibling repos

⚠️ Do not train on this

Adapters contain reward-hacking / backdoor patterns by design.

Model tree for ceselder/qwen3-14b-multirun-rh_simple-ckpts: adapter on Qwen/Qwen3-14B.