Continuous batching for GRPO, now in TRL

Community Article
Published June 19, 2026

Continuous batching has been an ongoing effort in transformers for a few months now. The aim is a fast, memory-aware generation path that lives inside the library itself, and it has been documented as it grew: first the core mechanism, then the asynchronous version (h/t @ror 🐐).

Now those efforts have gone beyond generation and into training. GRPO in TRL can use continuous batching for its rollouts.

Online RL is generation-heavy: producing the rollouts is usually the most expensive part of the loop, so the generation path is where the speed lives. Until now TRL gave you two options: the default generate(), simple and in-process but wasteful when you ask for many completions, or vLLM, very fast but a separate inference engine to bring in and manage (as its own server, or colocated on the training GPUs). Continuous batching fills the gap in the middle: an in-process path that does not waste compute and memory when the number of generations N is high, using transformers directly, with no vLLM dependency and no weight syncing between two copies of the model.

Just one flag

from trl import GRPOConfig

training_args = GRPOConfig(
    use_transformers_continuous_batching=True,
    transformers_continuous_batching_config={
        "use_cuda_graph": False,  # weights change on every training step
        "max_memory_percent": 0.4,  # leave headroom for the backward pass
    },
)

That is the whole change. (RLOOTrainer takes the same flag.)

Two knobs worth knowing. First, max_memory_percent defaults to 0.5 in TRL, lower than the 0.9 default in transformers, to leave VRAM for the backward pass; drop it toward 0.3 to 0.4 for large generation batches or if you hit OOM. Second, CUDA graphs are off because the weights change on every training step.
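To get a feel for what max_memory_percent buys you, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the model dimensions (16 layers, 8 KV heads, head dimension 64, bf16) are what I would expect for a Llama-3.2-1B-like model, the 64 GiB of free VRAM is made up, and the helper names are hypothetical. The real engine does more bookkeeping than this, so read the output as rough intuition, not a measurement.

```python
GIB = 1024**3


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (keys + values, every layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def cb_token_budget(free_vram_bytes, max_memory_percent, per_token_bytes):
    """Tokens that fit in the slice of free VRAM reserved for the cache (simplified)."""
    pool_bytes = int(free_vram_bytes * max_memory_percent)
    return pool_bytes // per_token_bytes


# Assumed dimensions for a Llama-3.2-1B-like model in bf16 (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=16, num_kv_heads=8, head_dim=64)

# Hypothetical: 64 GiB of VRAM free when generation starts.
for percent in (0.5, 0.4, 0.3):
    tokens = cb_token_budget(64 * GIB, percent, per_token)
    print(f"max_memory_percent={percent}: room for about {tokens:,} cached tokens")
```

Lowering the percentage shrinks the token budget linearly, which is the trade you are making: less cache for generation, more headroom for the backward pass.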

Some numbers

Benchmark on an A100 80GB with Llama-3.2-1B-Instruct, GSM8K:

[Figure: benchmark results (cb_benchmark_results)]

At N=8 it ties the default. At N=32 and N=64, a common regime for GRPO, it pulls ahead to ~1.25x. The part I did not expect: at N=64 the VRAM delta inverts, and continuous batching ends up using less memory than the default. Default generate() eagerly allocates KV cache for all 64 sequences at full length, while continuous batching pre-allocates a fixed slice of free VRAM and recycles slots as sequences finish. Faster and lighter at the same time.
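The slot recycling is easier to see in a toy model. The sketch below is not the transformers engine, just a simulation under simplifying assumptions: every decode step costs the same, prefill is ignored, and the completion lengths are invented. It counts decode steps for fixed batches that wait on their longest sequence versus a scheduler that hands a freed slot to the next waiting sequence.

```python
import heapq


def static_batch_steps(lengths, slots):
    """Decode steps when each fixed batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batch_steps(lengths, slots):
    """Decode steps when a freed slot is immediately given to the next sequence."""
    finish_times = []  # min-heap: step at which each busy slot becomes free
    for length in lengths:
        # Use an empty slot if one exists, otherwise wait for the earliest finisher.
        start = 0 if len(finish_times) < slots else heapq.heappop(finish_times)
        heapq.heappush(finish_times, start + length)
    return max(finish_times, default=0)


# Invented completion lengths: two long sequences, two short ones, two slots.
lengths = [10, 2, 2, 10]
print(static_batch_steps(lengths, slots=2))      # 20
print(continuous_batch_steps(lengths, slots=2))  # 14
```

The gap comes entirely from the short sequences: in the fixed batches their slots sit idle until the long neighbor finishes, which is why variable completion lengths are where this approach pays off.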

When to reach for it

Use continuous batching when N is 32 or more with variable completion lengths (math reasoning is the sweet spot) and you want to stay in-process. Reach for vLLM when you need maximum throughput or multi-GPU tensor parallelism. Even vLLM colocate, which also runs in-process, keeps a second copy of the model and syncs weights every step, while continuous batching generates with the model you are already training, so there is none of that. Below ~32 generations the default generate() is still perfectly fine. One caveat: this path is text-only for now; multimodal models are not supported on it yet.

One more thing

The old use_transformers_paged path silently set logprobs to None, which bypassed importance-sampling correction without any warning. The new path captures logprobs from the model output, so the correction works as intended. Existing use_transformers_paged=True configs keep working and forward to the new flag with a warning.
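Why missing logprobs matter is easiest to show with a small sketch. This is not TRL's implementation, only a minimal illustration of a truncated importance weight per token: the ratio between the probability the trainer assigns and the probability the sampler assigned, capped at a hypothetical value of 2.0. The function name and the cap are my own assumptions.

```python
import math


def importance_weights(trainer_logprobs, sampler_logprobs, cap=2.0):
    """Per-token truncated importance weights: min(p_trainer / p_sampler, cap)."""
    if sampler_logprobs is None:
        # Nothing to compare against: every token gets weight 1.0,
        # i.e. the correction is skipped without any error being raised.
        return [1.0] * len(trainer_logprobs)
    return [
        min(math.exp(t - s), cap)
        for t, s in zip(trainer_logprobs, sampler_logprobs)
    ]


trainer = [-1.0, -0.5, 0.0]
sampler = [-1.0, -1.0, -5.0]
print(importance_weights(trainer, None))     # all ones: no correction applied
print(importance_weights(trainer, sampler))  # matching token stays at 1.0, last is capped
```

With None in place of the sampler logprobs nothing fails, the weights just collapse to 1.0, which is how a bypass like this can go unnoticed.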

Requires transformers>=5.8.0.

Getting it

It is in main right now, so for the moment install from source:

pip install git+https://github.com/huggingface/trl.git

It ships in the next TRL release.

Still moving

This is active development on both sides. In TRL, continuous batching for async GRPO is already cooking (#5781). And because this path rides directly on the transformers CB engine, with no fork and no weight syncing, every improvement upstream lands in GRPO for free. A live example: transformers#46712 reworks the cache estimator to allow much larger prefill batches, more throughput on the exact path GRPO calls into. The numbers above are a floor, not a ceiling 🤓.
