53.7 MB
5 files
Updated 5 days ago
README.md

RLM Qwen3-30B-A3B · v0.1

This is a LoRA adapter, not a standalone model: it requires the base model Qwen/Qwen3-30B-A3B-Instruct-2507. The adapter weights are about 51 MB, and inference loads them on top of the 30B base via peft.

Trained with RL as a Recursive Language Model (RLM) policy on a mixed suite of long-context environments.

Adapter config: LoRA, rank r=32, lora_alpha=64, lora_dropout=0.0, targets q_proj, k_proj, v_proj, o_proj — see adapter_config.json.
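As a rough sanity check on the ~51 MB figure, the sketch below counts the LoRA parameters added to the four attention projections. The base-model dimensions (hidden size 2048, 32 query heads, 4 key/value heads, head dimension 128, 48 layers) and bf16 storage are assumptions that this card does not state; check them against the base model's config.json before relying on the result. The function name is illustrative only.

```python
def lora_param_count(r, hidden, n_q_heads, n_kv_heads, head_dim, n_layers):
    """Count LoRA parameters when q/k/v/o_proj are the only target modules.

    Each adapted Linear(fan_in, fan_out) gains A (r x fan_in) and
    B (fan_out x r), i.e. r * (fan_in + fan_out) parameters.
    """
    q_out = n_q_heads * head_dim
    kv_out = n_kv_heads * head_dim
    shapes = {
        "q_proj": (hidden, q_out),
        "k_proj": (hidden, kv_out),
        "v_proj": (hidden, kv_out),
        "o_proj": (q_out, hidden),
    }
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * n_layers


# From this card's adapter config.
R, ALPHA = 32, 64

# ASSUMED base-model dimensions (not stated in this card).
params = lora_param_count(
    R, hidden=2048, n_q_heads=32, n_kv_heads=4, head_dim=128, n_layers=48
)
size_bytes = params * 2  # ASSUMPTION: weights stored in bf16 (2 bytes each)
scaling = ALPHA / R      # standard LoRA scaling factor alpha / r

print(f"{params:,} params, {size_bytes / 2**20:.1f} MiB, scaling={scaling}")
```

Under these assumptions the estimate lands close to the adapter size quoted above, which suggests the weights file holds little beyond the LoRA matrices themselves.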

Intended use

This adapter is designed to be loaded inside the RLM harness from alexzhang13/rlm, where the model acts as the root LM in a Python-REPL-driven recursion that issues llm_query / rlm_query sub-calls over long contexts. It is not a drop-in chat model — it expects the RLM system prompt and REPL scaffolding.

Recursion depth at inference is not capped at 1. The model was trained at depth=1 (sub-calls collapse to llm_query), but the canonical RLM harness supports arbitrary recursion depth, so the adapter can be run with depth>1 at inference time. That setting was not seen during training.
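To make the depth cap concrete, here is a toy illustration of how a depth limit turns recursive sub-calls into plain LM calls. It is not the rlm harness: in the real system the model writes Python in a REPL and decides for itself when to issue sub-calls, whereas this sketch hard-codes a split-and-sum strategy. `toy_rlm`, `stub_llm`, and `fanout` are all hypothetical names.

```python
def toy_rlm(needle, context, llm, depth=0, max_depth=1, fanout=4):
    """Count `needle` in `context` via depth-limited recursion (toy only)."""
    if depth >= max_depth:
        # At the depth cap the sub-call "collapses" to a plain LM call.
        return llm(needle, context)
    # Below the cap, split the context and delegate each piece one level down.
    step = max(1, -(-len(context) // fanout))  # ceiling division
    pieces = [context[i:i + step] for i in range(0, len(context), step)]
    return sum(
        toy_rlm(needle, piece, llm, depth + 1, max_depth, fanout)
        for piece in pieces
    )


calls = []  # records the context length seen by each plain LM call


def stub_llm(needle, text):
    """Stand-in for a real LM: counts occurrences exactly."""
    calls.append(len(text))
    return text.count(needle)


context = "ab" * 800  # 1600 characters, 800 occurrences of "a"
for depth_cap in (1, 2):
    calls.clear()
    answer = toy_rlm("a", context, stub_llm, max_depth=depth_cap)
    print(depth_cap, answer, len(calls), max(calls))
```

With a cap of 1 the root fans out directly to plain LM calls. With a cap of 2 each sub-call fans out once more, so every leaf call sees a smaller slice of the context.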

Some inference-time flags (orchestrator-mode hints, per-env user prologues, etc.) need to be set to match training-time conditioning. Exact flag list TBD — will be documented here once finalized.

Results

Evaluated against the base Qwen/Qwen3-30B-A3B-Instruct-2507, with and without a "Plan before you act" orchestrator hint added to the RLM system prompt. Scores are mean reward × 100 (i.e. percentages), on full splits where feasible.

| env | A: vanilla base | B: base + "plan" hint | C: RLM-trained + "plan" hint | A → C Δ |
| --- | --- | --- | --- | --- |
| OOLONG trec_coarse @ 132k (n=50) | 33.8 | 24.0 | 47.2 | +13.4 |
| OOLONG-Pairs @ 32k (n=20) | 42.9 | 41.2 | 45.0 | +2.2 |
| BrowseComp-Plus test (n=150, k=50 documents) | 11.6 | 18.7 | 29.7 | +18.1 |
| LongBenchv2 Code repo QA (n=50) | 22.0 | 38.0 | 42.0 | +20.0 |

OOLONG trec_coarse, numeric vs. non-numeric split (n=12 / n=38): 4.9 / 60.5 for the trained model, vs. 7.3 / 42.1 for the vanilla base. The gain comes entirely from the non-numeric subset; the numeric subset drops slightly.
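The subset scores can be checked against the overall OOLONG trec_coarse row (47.2 trained, 33.8 vanilla base) by weighting each subset mean by its size. The short sketch below does that arithmetic; `pooled_mean` is an illustrative helper, and agreement is only expected up to rounding because the subset means are themselves reported to one decimal.

```python
def pooled_mean(subsets):
    """Return the size-weighted mean of (n, mean) pairs."""
    total_n = sum(n for n, _ in subsets)
    return sum(n * mean for n, mean in subsets) / total_n


# (n, score) for the numeric and non-numeric subsets reported above.
trained = pooled_mean([(12, 4.9), (38, 60.5)])
vanilla = pooled_mean([(12, 7.3), (38, 42.1)])

print(f"trained: {trained:.2f}  vanilla: {vanilla:.2f}")  # roughly 47.16 and 33.75
```

Both pooled values sit within 0.1 of the reported overall scores, so the split is consistent with the table.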

Comparison vs. RLM-Qwen3-8B from the paper

For reference, the paper Recursive Language Models (Zhang et al., arXiv:2512.24601) reports the 8B numbers below (Figure 3a). The rightmost column repeats this adapter's scores from the Results section (column C).

| benchmark | Base Qwen3-8B | RLM(Qwen3-8B) | RLM-Qwen3-8B (post-trained) | RLM-Qwen3-30B (this model) |
| --- | --- | --- | --- | --- |
| LongBenchv2 CodeQA | 4.00 | 26.00 | 32.00 | 42.0 |
| OOLONG | 0.00 | 24.00 | 32.04 | 47.2 |
| OOLONG-Pairs | 0.07 | 4.26 | 5.17 | 45.0 |

Caveats: the paper's RLM-Qwen3-8B was trained via SFT on distilled trajectories from a 480B teacher, whereas this 30B model was trained via RL in a different harness, with a different system prompt and orchestrator-hint setup. The comparison is therefore not strictly apples-to-apples, but the two models share the benchmarks and the RLM inference paradigm.

Training

Use the training code in rlm/training. It wraps the training harness as a verifiers environment and trains with prime-rl. The environment is deliberately simple: it trains models directly for use in the rlm inference engine, with no sandboxes.

| | |
| --- | --- |
| Base model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Adapter | LoRA, r=32, α=64, targets q_proj, k_proj, v_proj, o_proj |
| Method | RL (verifiable rewards) with prime-rl |
| Env | RLMTrainEnv, a verifiers-compatible env logically 1:1 with rlm.RLM.completion. Lives in the alexzhang13/rlm repo under training/ |
| Training depth | 1 (sub-calls collapse to llm_query during training) |
| Hardware | 8 × A100 |

Inference Usage

mit-oasys/rlm-qwen3-30b-v0.1 is a LoRA adapter for Qwen/Qwen3-30B-A3B-Instruct-2507. Serve the base + adapter via vLLM, then run inference through rlm at depth 1.

1. Serve via vLLM

vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --tensor-parallel-size 4 \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.9 \
    --enable-lora \
    --max-lora-rank 64 \
    --lora-modules rlm-v0.1=mit-oasys/rlm-qwen3-30b-v0.1 \
    --port 8000

The adapter's LoRA rank is 32 (on q/k/v/o_proj), so --max-lora-rank must be at least 32; the value of 64 in the command above also works. Training used max_model_len=16384.
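Before wiring up the harness, it can help to confirm that the server exposes the adapter under the name given to --lora-modules. The sketch below queries the OpenAI-compatible model listing and looks for `rlm-v0.1`. It assumes the server returns an OpenAI-style payload with a `data` list of objects carrying an `id`, and that registered LoRA modules appear in that list alongside the base model. The helper names are illustrative.

```python
import json
from urllib.request import urlopen


def served_model_ids(models_payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_models(base_url="http://localhost:8000/v1", timeout=10.0):
    """GET {base_url}/models from a running server and decode the JSON body."""
    with urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return json.load(resp)


# Offline demonstration with a hand-written payload of the ASSUMED shape.
sample = {
    "object": "list",
    "data": [
        {"id": "Qwen/Qwen3-30B-A3B-Instruct-2507"},
        {"id": "rlm-v0.1"},
    ],
}
print("rlm-v0.1" in served_model_ids(sample))  # True
```

Against a live server, `served_model_ids(fetch_models())` would replace the sample payload. If `rlm-v0.1` is missing from the result, requests that name it as the model will most likely be rejected.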

2. Run inference via rlm

from rlm.core.rlm import RLM

rlm = RLM(
    backend="openai",
    backend_kwargs={
        "base_url": "http://localhost:8000/v1",
        "model_name": "rlm-v0.1",
        "timeout": 1800.0,
    },
    environment="local",
    max_iterations=20,
    max_depth=1,
    sampling_args={
        "max_completion_tokens": 4096,
        "extra_body": {"enable_thinking": False},
    },
    sub_sampling_args={"max_tokens": 4096},
    # orchestrator=True is the default and matches training; do not change.
)

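# `context` holds the long input text and `query` the question about it;
# both must be defined by the caller before this point.
result = rlm.completion(prompt=context, root_prompt=query)
print(result.response)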

For RLM inference, use the harness in alexzhang13/rlm and point its model config at this adapter. If your serving stack cannot apply LoRA at runtime (e.g. vLLM without punica), merge the adapter into the base weights offline first.
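One way to do the offline merge is with peft's `merge_and_unload`, sketched below. This is a minimal example rather than a tested recipe for this adapter: it loads the full base model in bf16 on the host, so it needs enough memory for roughly 30B parameters at 2 bytes each, and the output directory name in the usage note is hypothetical.

```python
def merge_adapter(base_id, adapter_id, out_dir):
    """Fold a LoRA adapter into its base model and save standalone weights."""
    # Imported lazily so this module can be imported without the heavy deps.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base weights in bf16, then attach the adapter on top.
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_id)

    # Add the LoRA deltas into the base weights and drop the adapter wrappers.
    merged = model.merge_and_unload()

    # Save the merged weights plus the tokenizer so the folder is servable.
    merged.save_pretrained(out_dir)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
    return out_dir
```

A call such as `merge_adapter("Qwen/Qwen3-30B-A3B-Instruct-2507", "mit-oasys/rlm-qwen3-30b-v0.1", "./rlm-qwen3-30b-merged")` would write a standalone checkpoint that can then be served as an ordinary model, without --enable-lora.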

Limitations

  • Training depth: trained at depth=1; depth>1 is supported at inference but was not seen during training.
  • Training did not use persistent=True, compaction=True, max_budget / max_timeout / max_errors, or custom tools. All of these exist in the canonical rlm.RLM and can be used at inference.
  • Some inference flags (orchestrator hints, per-env user prologues) need to match training. Exact configuration will be specified here in a follow-up.