RLM Qwen3-30B-A3B · v0.1
This is a LoRA adapter, not a standalone model. You need the base model Qwen/Qwen3-30B-A3B-Instruct-2507 to use it. ~51 MB of adapter weights; inference loads it on top of the 30B base via peft.
Trained as a Recursive Language Model (RLM) policy on a mixed long-context environment suite using RL.
Adapter config: LoRA, rank r=32, lora_alpha=64, lora_dropout=0.0, targets q_proj, k_proj, v_proj, o_proj — see adapter_config.json.
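As a quick illustration of what these settings imply, the sketch below restates them using the key names peft writes to adapter_config.json and derives the LoRA scaling factor. It assumes standard LoRA scaling (lora_alpha / r, no rank-stabilized variant); check adapter_config.json for the authoritative values.

```python
# Keys follow peft's adapter_config.json naming; values are the ones stated in this card.
lora_settings = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# Assumption: standard LoRA, where the low-rank update B @ A is scaled by lora_alpha / r.
scaling = lora_settings["lora_alpha"] / lora_settings["r"]
print(f"LoRA scaling factor: {scaling}")
```

With r=32 and lora_alpha=64, the update to each targeted attention projection is applied at twice its raw magnitude.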
Intended use
This adapter is designed to be loaded inside the RLM harness from alexzhang13/rlm, where the model acts as the root LM in a Python-REPL-driven recursion that issues llm_query / rlm_query sub-calls over long contexts. It is not a drop-in chat model — it expects the RLM system prompt and REPL scaffolding.
Recursion depth at inference is not capped to 1. The model was trained at depth=1 (sub-calls collapse to llm_query), but the underlying canonical RLM harness supports arbitrary recursion depth and the adapter can be used with depth>1 at inference time.
Some inference-time flags (orchestrator-mode hints, per-env user prologues, etc.) need to be set to match training-time conditioning. Exact flag list TBD — will be documented here once finalized.
Results
Evaluated against the base Qwen/Qwen3-30B-A3B-Instruct-2507, with and without a "Plan before you act" orchestrator hint added to the RLM system prompt. Mean reward × 100 (i.e. score %); full splits where feasible.
| env | A: vanilla base | B: base + "plan" hint | C: RLM-trained + "plan" hint | A → C Δ |
|---|---|---|---|---|
| OOLONG trec_coarse @ 132k (n=50) | 33.8 | 24.0 | 47.2 | +13.4 |
| OOLONG-Pairs @ 32k (n=20) | 42.9 | 41.2 | 45.0 | +2.2 |
| BrowseComp-Plus test (n=150, k=50 documents) | 11.6 | 18.7 | 29.7 | +18.1 |
| LongBenchv2 Code repo QA (n=50) | 22.0 | 38.0 | 42.0 | +20.0 |
OOLONG numeric vs non-numeric split (n=12 / n=38): 4.9 / 60.5 for the trained model, vs. 7.3 / 42.1 for the vanilla base — gains are concentrated in the non-numeric subset.
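The split can be sanity-checked against the headline OOLONG scores by recombining the two subsets with their sample counts. This is a minimal sketch using only the numbers quoted above; the helper name `weighted_mean` is illustrative.

```python
def weighted_mean(scores, counts):
    """Combine per-subset mean scores into one mean, weighting by subset size."""
    total = sum(counts)
    return sum(s * n for s, n in zip(scores, counts)) / total


counts = [12, 38]  # numeric, non-numeric

trained = weighted_mean([4.9, 60.5], counts)  # RLM-trained model
base = weighted_mean([7.3, 42.1], counts)     # vanilla base

print(f"trained: {trained:.2f}")
print(f"base:    {base:.2f}")
```

The recombined values come out at roughly 47.2 and 33.7, which agrees with the reported 47.2 and 33.8 to within the rounding of the one-decimal subset scores.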
Comparison vs. RLM-Qwen3-8B from the paper
For reference, the paper Recursive Language Models (Zhang et al., arXiv:2512.24601) reports these RLM-trained 8B numbers (Figure 3a):
| benchmark | Base Qwen3-8B | RLM(Qwen3-8B) | RLM-Qwen3-8B (post-trained) | RLM-Qwen3-30B (this model) |
|---|---|---|---|---|
| LongBenchv2 CodeQA | 4.00 | 26.00 | 32.00 | 42.0 |
| OOLONG | 0.00 | 24.00 | 32.04 | 47.2 |
| OOLONG-Pairs | 0.07 | 4.26 | 5.17 | 45.0 |
Caveats: the paper's RLM-Qwen3-8B was trained via SFT on distilled trajectories from a 480B teacher; this 30B model was trained via RL in a different harness, with a different system-prompt and orchestrator-hint setup. The two are not strict apples-to-apples but share the benchmarks and the RLM inference paradigm.
Training
The training code lives in rlm/training. It builds the training harness as a verifiers environment and uses prime-rl for training.
The environment is deliberately simple: it trains models directly for use in the rlm inference engine and requires no sandboxes.
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Adapter | LoRA, r=32, α=64, targets q_proj, k_proj, v_proj, o_proj |
| Method | RL (verifiable rewards) with prime-rl |
| Env | RLMTrainEnv — a verifiers-compatible env logically 1:1 with rlm.RLM.completion. Lives in the alexzhang13/rlm repo under training/ |
| Training depth | 1 (sub-calls collapse to llm_query during training) |
| Hardware | 8 × A100 |
Inference Usage
mit-oasys/rlm-qwen3-30b-v0.1 is a LoRA adapter for Qwen/Qwen3-30B-A3B-Instruct-2507.
Serve the base + adapter via vLLM, then run inference through rlm at depth 1.
1. Serve via vLLM
```shell
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9 \
  --enable-lora \
  --max-lora-rank 64 \
  --lora-modules rlm-v0.1=mit-oasys/rlm-qwen3-30b-v0.1 \
  --port 8000
```
LoRA rank is 32 (q/k/v/o_proj), so --max-lora-rank ≥ 32 is required. Training used max_model_len=16384.
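Before pointing rlm at the server, it can help to confirm the adapter is registered under the name given to --lora-modules. The sketch below assumes the server exposes the OpenAI-compatible GET /v1/models route and lists LoRA modules there alongside the base model; the helper names are illustrative and not part of vLLM or rlm.

```python
import json
from urllib.request import urlopen


def served_model_ids(models_payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_models(base_url="http://localhost:8000/v1", timeout=10.0):
    """GET {base_url}/models from the running server and decode the JSON body."""
    with urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return json.load(resp)
```

With the server from step 1 running, `"rlm-v0.1" in served_model_ids(fetch_models())` should evaluate to True; if it does not, requests addressed to model_name="rlm-v0.1" will not reach the adapter.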
2. Run inference via rlm
```python
from rlm.core.rlm import RLM

rlm = RLM(
    backend="openai",
    backend_kwargs={
        "base_url": "http://localhost:8000/v1",
        "model_name": "rlm-v0.1",
        "timeout": 1800.0,
    },
    environment="local",
    max_iterations=20,
    max_depth=1,
    sampling_args={
        "max_completion_tokens": 4096,
        "extra_body": {"enable_thinking": False},
    },
    sub_sampling_args={"max_tokens": 4096},
    # orchestrator=True is the default and matches training; do not change.
)

# `context` is the long input text and `query` is the question asked over it;
# both must be defined by the caller.
result = rlm.completion(prompt=context, root_prompt=query)
print(result.response)
```
For RLM inference, use the harness in alexzhang13/rlm and point its model config at this adapter (merge offline if your serving stack — e.g. vLLM without punica — cannot apply LoRA at runtime).
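If the adapter has to be merged offline, one possible route is peft's `merge_and_unload`, sketched below. The function name `merge_adapter` and the output directory are hypothetical. The sketch assumes transformers and peft are installed and that the machine has enough memory to hold the full base model (roughly 60 GB in bf16 for a 30B-parameter model).

```python
def merge_adapter(base_id, adapter_id, output_dir):
    """Fold LoRA weights into the base model and save a standalone checkpoint."""
    # Heavy imports are deferred so that defining this helper stays cheap.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base weights in the dtype recorded in the checkpoint config.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")

    # Attach the adapter, then bake its low-rank deltas into the base weights.
    model = PeftModel.from_pretrained(base, adapter_id)
    merged = model.merge_and_unload()

    # Save model and tokenizer together so the directory can be served directly.
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(output_dir)
    return output_dir
```

A call such as `merge_adapter("Qwen/Qwen3-30B-A3B-Instruct-2507", "mit-oasys/rlm-qwen3-30b-v0.1", "./rlm-qwen3-30b-merged")` would write a merged checkpoint that can be served without --enable-lora; model_name in backend_kwargs would then need to match whatever name the server assigns to that checkpoint.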
Limitations
- Training depth: trained at depth=1; depth>1 is supported at inference but was not seen during training.
- No persistent=True, no compaction=True, no max_budget/max_timeout/max_errors, and no custom tools were used during training. All exist in canonical rlm.RLM and can be used at inference.
- Some inference flags (orchestrator hints, per-env user prologues) need to match training. Exact configuration will be specified here in a follow-up.
