Bucket: Caerii/rlm-qwen3-30b-a3b-v0.1-bucket

53.7 MB · 5 files · Updated 5 days ago

Name	Size	Uploaded	Xet hash
rlm_eval_results.png	182 kB xet	5 days ago	71d9406c
adapter_model.safetensors	53.5 MB xet	5 days ago	f7a5c9ad
adapter_config.json	301 Bytes xet	5 days ago	1ce4712d
README.md	6.34 kB xet	5 days ago	4cf8c9d6
.gitattributes	1.58 kB xet	5 days ago	3e74367b

README.md

RLM Qwen3-30B-A3B · v0.1

This is a LoRA adapter, not a standalone model. You need the base model Qwen/Qwen3-30B-A3B-Instruct-2507 to use it. The adapter weights are about 53.5 MB (roughly 51 MiB); at inference they are loaded on top of the 30B base, either via peft or via a LoRA-aware server such as vLLM.

The adapter was trained with RL as a Recursive Language Model (RLM) policy on a mixed suite of long-context environments.

Adapter config: LoRA, rank r=32, lora_alpha=64, lora_dropout=0.0, targets q_proj, k_proj, v_proj, o_proj — see adapter_config.json.
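As a rough sanity check on the adapter size, the LoRA parameter count can be worked out from the rank and the shapes of the four targeted projections. The sketch below assumes the base model's attention dimensions (hidden size 2048, 48 layers, 32 query heads, 4 key/value heads, head dimension 128), bf16 storage, and standard (non-rsLoRA) scaling of lora_alpha / r; none of these are stated in this README, so treat them as assumptions.

```python
# Assumed base-model dimensions (not stated in this README).
HIDDEN = 2048
LAYERS = 48
Q_HEADS, KV_HEADS, HEAD_DIM = 32, 4, 128

R, ALPHA = 32, 64  # rank and lora_alpha from the adapter config

# (in_features, out_features) of each targeted projection.
shapes = {
    "q_proj": (HIDDEN, Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (Q_HEADS * HEAD_DIM, HIDDEN),
}

# LoRA adds A (r x in) and B (out x r) per module: r * (in + out) parameters.
params_per_layer = sum(R * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
total_params = params_per_layer * LAYERS
size_bytes = total_params * 2  # assuming 2 bytes per parameter (bf16)
scaling = ALPHA / R            # assuming standard LoRA scaling

print(f"{total_params:,} params, {size_bytes / 1e6:.1f} MB, scaling={scaling}")
```

Under these assumptions the estimate comes to about 26.7M parameters and 53.5 MB, which lines up with the size of adapter_model.safetensors in the file listing.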

Intended use

This adapter is designed to be loaded inside the RLM harness from alexzhang13/rlm, where the model acts as the root LM in a Python-REPL-driven recursion that issues llm_query / rlm_query sub-calls over long contexts. It is not a drop-in chat model — it expects the RLM system prompt and REPL scaffolding.

Recursion depth at inference is not capped at 1. The model was trained at depth=1, where sub-calls collapse to llm_query, but the canonical RLM harness supports arbitrary recursion depth, so the adapter can be run with depth>1 at inference time.
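To make the depth cap concrete, here is a toy model of how a sub-call might collapse to a plain LM call once the cap is reached. This is not the harness's code: make_sub_call, fake_llm, the halving strategy, and the convention that the root LM sits at depth 0 are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List


def make_sub_call(llm_query: Callable[[str], str], max_depth: int) -> Callable[[str, int], str]:
    """Return a toy sub-call function with a recursion-depth cap (hypothetical)."""

    def sub_call(prompt: str, depth: int) -> str:
        if depth >= max_depth:
            # At the cap, the sub-call is a plain LM call with no further recursion.
            return llm_query(prompt)
        # Below the cap, the sub-call may itself recurse. Here it simply halves
        # the prompt, delegates each half one level deeper, then combines them.
        mid = len(prompt) // 2
        parts = [sub_call(prompt[:mid], depth + 1), sub_call(prompt[mid:], depth + 1)]
        return llm_query(" ".join(parts))

    return sub_call


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LM: records the prompt and echoes its length."""
    calls.append(prompt)
    return f"<{len(prompt)}>"


# Assumed convention: the root LM is at depth 0, so its sub-calls start at depth 1.
make_sub_call(fake_llm, max_depth=1)("abcdefgh", 1)
calls_at_depth_1 = len(calls)  # one plain LM call

calls.clear()
make_sub_call(fake_llm, max_depth=2)("abcdefgh", 1)
calls_at_depth_2 = len(calls)  # two leaf calls plus one combining call
```

With max_depth=1 the sub-call issues a single LM call, which is the regime the adapter was trained in; with max_depth=2 the same sub-call fans out further.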

Some inference-time flags (orchestrator-mode hints, per-env user prologues, etc.) need to be set to match training-time conditioning. Exact flag list TBD — will be documented here once finalized.

Results

Evaluated against the base Qwen/Qwen3-30B-A3B-Instruct-2507, with and without a "Plan before you act" orchestrator hint added to the RLM system prompt. Scores are mean reward × 100 (i.e. percentages); full splits are used where feasible.

env	A: vanilla base	B: base + "plan" hint	C: RLM-trained + "plan" hint	A → C Δ
OOLONG `trec_coarse` @ 132k (n=50)	33.8	24.0	47.2	+13.4
OOLONG-Pairs @ 32k (n=20)	42.9	41.2	45.0	+2.2
BrowseComp-Plus test (n=150, k=50 documents)	11.6	18.7	29.7	+18.1
LongBenchv2 Code repo QA (n=50)	22.0	38.0	42.0	+20.0

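OOLONG numeric vs. non-numeric split (n=12 / n=38): the trained model scores 4.9 / 60.5, versus 7.3 / 42.1 for the vanilla base. The gain comes entirely from the non-numeric subset (+18.4); the numeric subset drops slightly (−2.4), though at n=12 that difference is small in absolute terms.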
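The split can be cross-checked against the overall OOLONG row by taking the sample-weighted mean of the two subsets. The short calculation below uses only the rounded numbers quoted above, so it should agree with the table to within rounding rather than exactly.

```python
# (n, score) per subset, copied from the split reported above.
trained = {"numeric": (12, 4.9), "non_numeric": (38, 60.5)}
base = {"numeric": (12, 7.3), "non_numeric": (38, 42.1)}


def weighted_mean(split):
    """Sample-weighted mean score across subsets."""
    total_n = sum(n for n, _ in split.values())
    return sum(n * score for n, score in split.values()) / total_n


trained_overall = weighted_mean(trained)
base_overall = weighted_mean(base)
per_subset_delta = {k: round(trained[k][1] - base[k][1], 1) for k in trained}

print(round(trained_overall, 1), round(base_overall, 1), per_subset_delta)
```

The weighted means come out at roughly 47.2 and 33.7, matching the reported 47.2 and 33.8 up to rounding of the subset scores.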

Comparison vs. RLM-Qwen3-8B from the paper

For reference, the paper Recursive Language Models (Zhang et al., arXiv:2512.24601) reports the following Qwen3-8B numbers (Figure 3a). The last column repeats this model's scores (column C) from the results table above.

benchmark	Base Qwen3-8B	RLM(Qwen3-8B)	RLM-Qwen3-8B (post-trained)	RLM-Qwen3-30B (this model)
LongBenchv2 CodeQA	4.00	26.00	32.00	42.0
OOLONG	0.00	24.00	32.04	47.2
OOLONG-Pairs	0.07	4.26	5.17	45.0

Caveats: the paper's RLM-Qwen3-8B was trained via SFT on distilled trajectories from a 480B teacher, whereas this 30B model was trained via RL in a different harness, with a different system-prompt and orchestrator-hint setup. The comparison is therefore not strictly apples-to-apples, although both share the benchmarks and the RLM inference paradigm.

Training

The training code lives in rlm/training. It builds the training harness as a verifiers environment and trains with prime-rl. The environment is deliberately simple: it needs no sandboxes and directly trains models for use in the rlm inference engine.


Setting	Value
Base model	`Qwen/Qwen3-30B-A3B-Instruct-2507`
Adapter	LoRA, r=32, α=64, targets `q_proj`, `k_proj`, `v_proj`, `o_proj`
Method	RL (verifiable rewards) with `prime-rl`
Env	`RLMTrainEnv` — a `verifiers`-compatible env logically 1:1 with `rlm.RLM.completion`. Lives in the `alexzhang13/rlm` repo under `training/`
Training depth	1 (sub-calls collapse to `llm_query` during training)
Hardware	8 × A100

Inference Usage

mit-oasys/rlm-qwen3-30b-v0.1 is a LoRA adapter for Qwen/Qwen3-30B-A3B-Instruct-2507. Serve the base + adapter via vLLM, then run inference through rlm at depth 1.

1. Serve via vLLM

```bash
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --tensor-parallel-size 4 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9 \
    --enable-lora \
    --max-lora-rank 64 \
    --lora-modules rlm-v0.1=mit-oasys/rlm-qwen3-30b-v0.1 \
    --port 8000
```

LoRA rank is 32 (on q/k/v/o_proj), so --max-lora-rank must be at least 32; the command above sets 64, which leaves headroom. Training used max_model_len=16384.
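Once the server is up, one way to confirm that the adapter is registered is to list the served models. The sketch below assumes the server exposes the OpenAI-compatible /v1/models endpoint and lists LoRA modules there under the name given to --lora-modules; the helper names and the hand-written example payload are hypothetical.

```python
import json
import urllib.request


def served_model_ids(models_payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_models(base_url: str = "http://localhost:8000/v1") -> dict:
    """GET {base_url}/models from the running server (requires the server to be up)."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return json.load(resp)


# Offline illustration with a hand-written payload of the assumed shape.
example_payload = {
    "object": "list",
    "data": [
        {"id": "Qwen/Qwen3-30B-A3B-Instruct-2507", "object": "model"},
        {"id": "rlm-v0.1", "object": "model"},
    ],
}
adapter_ready = "rlm-v0.1" in served_model_ids(example_payload)
```

Against a live server, `"rlm-v0.1" in served_model_ids(fetch_models())` would be the equivalent check.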

2. Run inference via `rlm`

```python
from rlm.core.rlm import RLM

# Placeholders: supply your own long context and the question to answer.
context = "<long document text>"
query = "<question about the context>"

rlm = RLM(
    backend="openai",
    backend_kwargs={
        "base_url": "http://localhost:8000/v1",
        "model_name": "rlm-v0.1",  # must match the name given to --lora-modules
        "timeout": 1800.0,
    },
    environment="local",
    max_iterations=20,
    max_depth=1,  # matches the training depth
    sampling_args={
        "max_completion_tokens": 4096,
        "extra_body": {"enable_thinking": False},
    },
    sub_sampling_args={"max_tokens": 4096},
    # orchestrator=True is the default and matches training; do not change.
)

result = rlm.completion(prompt=context, root_prompt=query)
print(result.response)
```

For RLM inference, use the harness in alexzhang13/rlm and point its model config at this adapter. If your serving stack cannot apply LoRA at runtime (e.g. vLLM without punica), merge the adapter into the base weights offline and serve the merged model instead.
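An offline merge could look roughly like the following sketch, which uses peft's merge_and_unload to fold the LoRA deltas into the base weights. The function name and output directory are hypothetical, and the sketch assumes enough host memory to hold the full base model in bf16 (on the order of 60 GB); it has not been validated against this adapter.

```python
def merge_adapter(base_id: str, adapter_id: str, out_dir: str) -> None:
    """Merge a LoRA adapter into its base model and save the merged weights.

    Imports are local so this module can be imported without torch installed.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_id)
    merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
    merged.save_pretrained(out_dir)
    # Save the tokenizer alongside so the output directory is self-contained.
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)


# Example call (downloads the full base model, so it is left commented out):
# merge_adapter(
#     "Qwen/Qwen3-30B-A3B-Instruct-2507",
#     "mit-oasys/rlm-qwen3-30b-v0.1",
#     "./rlm-qwen3-30b-v0.1-merged",  # hypothetical output path
# )
```

The merged directory can then be served like any standalone checkpoint, without the --enable-lora and --lora-modules flags.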

Limitations

Training depth: trained at depth=1; depth>1 is supported at inference but was not seen during training.
Untrained features: persistent=True, compaction=True, max_budget / max_timeout / max_errors, and custom tools were not used during training. All of them exist in the canonical rlm.RLM and can be used at inference.
Some inference flags (orchestrator hints, per-env user prologues) need to match training. Exact configuration will be specified here in a follow-up.

Last updated: Jun 25
