arxiv:2510.08977

Breaking the Self-Confirming Loop: Diagnosing and Mitigating Systemic Reward Bias in Self-Rewarding RL

Published on Oct 10, 2025

Authors:

Abstract

Reinforcement learning with ensembled rewards addresses bias in self-rewarding by aggregating diverse models to improve stability and performance compared to traditional reinforcement learning methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reinforcement learning with verifiable rewards (RLVR) efficiently scales the reasoning ability of large language models (LLMs) but is bottlenecked by scarce labeled data. Reinforcement learning with intrinsic rewards (RLIR) offers a scalable alternative via self-rewarding, yet often suffers from instability and inferior performance. We trace this gap to a systemic bias in confidence-coupled self-rewarding: the model tends to over-reward high-confidence mistakes, forming a self-confirming loop. We quantify this feedback-loop bias with three metrics: reward noise magnitude (rho_noise), policy-reward coupling (rho_selfbias), and over-/under-reward skew (rho_symbias). Our analyses show a compounding effect where strong coupling amplifies confidence-conditioned errors and drives a drift toward over-reward, leading to instability and a lower performance ceiling. To mitigate this, we propose reinforcement learning with ensembled rewards (RLER), which aggregates diverse models with adaptive reward interpolation and disagreement-aware rollout selection to reduce coupling and suppress over-reward drift. Extensive experiments show that RLER improves by 6.2% over the best RLIR baseline and is within 3.6% of RLVR, while exhibiting stable scaling on unlabeled samples.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2510.08977

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.08977 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.08977 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.08977 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.