TRM-8B: Thinking Reward Model

The Thinking Reward Model (TRM) evaluates the quality of reasoning traces rather than only final answers. Introduced in the paper "Characterizing, Evaluating, and Optimizing Complex Reasoning", it characterizes reasoning quality along four dimensions (the ME² principle):

Macro-Efficiency: Global structure is disciplined (no unnecessary branching/restarts).
Macro-Effectiveness: Global structure stays coherent and aligned with the goal.
Micro-Efficiency: Individual steps are concise and non-redundant.
Micro-Effectiveness: Individual steps are locally valid and consistent.

Sample Usage

The example below scores a reasoning trace by querying a hosted TRM server (for example, one served with SGLang, as suggested in the official repository):

import requests

# Example prompt and response
prompt = "Your question here"
response = "<think> Thinking process... </think> Final Answer"

# Score only the reasoning trace, i.e. the text before the </think> termination marker.
reasoning = response.split("</think>", 1)[0]
input_text = f"{prompt}\n{reasoning}"

payload = {"model": "RewardModel", "input": input_text}
# Replace <TRM_HOST> and <TRM_PORT> with your server details
resp = requests.post("http://<TRM_HOST>:<TRM_PORT>/v1/embeddings", json=payload, timeout=60)
resp.raise_for_status()
score = resp.json()["data"][0]["embedding"][0]
print("TRM score:", score)
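When several candidate responses to the same prompt need to be compared, the input construction above can be factored into a small helper. The following is a minimal sketch, not part of the official repository: the names build_trm_input and rank_by_trm_score are hypothetical, the scorer is passed in as a plain callable so the sketch runs without a server, and it assumes that a higher TRM score indicates a better reasoning trace.

```python
from typing import Callable, List, Tuple

THINK_END = "</think>"  # termination marker used in the example above


def build_trm_input(prompt: str, response: str) -> str:
    """Join the prompt and the reasoning trace with a newline.

    The reasoning trace is everything before the first </think>.
    If the marker is absent, the whole response is used.
    """
    reasoning = response.split(THINK_END, 1)[0]
    return f"{prompt}\n{reasoning}"


def rank_by_trm_score(
    prompt: str,
    responses: List[str],
    score_fn: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Score each response and return (score, response) pairs, best first.

    score_fn is a hypothetical callable that maps the TRM input text to a
    scalar score; it is assumed here that higher means better.
    """
    scored = [(score_fn(build_trm_input(prompt, r)), r) for r in responses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Stand-in scorer for illustration only: it prefers shorter inputs.
    def dummy_score(text: str) -> float:
        return -float(len(text))

    candidates = [
        "<think> a long, wandering trace with restarts </think> 42",
        "<think> a short trace </think> 42",
    ]
    for score, resp in rank_by_trm_score("What is 6 * 7?", candidates, dummy_score):
        print(score, resp)
```

In practice, score_fn would wrap the requests.post call from the example above and return the first value of the returned embedding.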

Citation

@article{zhang2026characterizing,
  title={Characterizing, Evaluating, and Optimizing Complex Reasoning},
  author={Zhang, Haoran and Li, Yafu and Wang, Zhi and Wang, Zhilin and Zhang, Shunkai and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2602.08498},
  year={2026}
}

Model size: 8B params (Safetensors)
Tensor type: F32
