TRM-8B: Thinking Reward Model

The Thinking Reward Model (TRM) evaluates the quality of reasoning traces rather than just final answers. Introduced in the paper "Characterizing, Evaluating, and Optimizing Complex Reasoning", the model characterizes reasoning quality along four dimensions (the ME² principle):

  • Macro-Efficiency: Global structure is disciplined (no unnecessary branching/restarts).
  • Macro-Effectiveness: Global structure stays coherent and aligned with the goal.
  • Micro-Efficiency: Individual steps are concise and non-redundant.
  • Micro-Effectiveness: Individual steps are locally valid and consistent.

Sample Usage

The model scores reasoning traces. The example below queries a hosted server (e.g., one launched with SGLang, as suggested in the official repository) through its embeddings endpoint and reads the score from the first element of the returned embedding:

import requests

# Example prompt and response
prompt = "Your question here"
response = "<think> Thinking process... </think> Final Answer"

# Score the reasoning trace (before the termination marker).
reasoning = response.split("</think>", 1)[0]
# Join prompt and reasoning with a newline escape inside a single-line f-string.
input_text = f"{prompt}\n{reasoning}"

payload = {"model": "RewardModel", "input": input_text}
# Replace <TRM_HOST> and <TRM_PORT> with your server details
resp = requests.post("http://<TRM_HOST>:<TRM_PORT>/v1/embeddings", json=payload, timeout=60)
resp.raise_for_status()
score = resp.json()["data"][0]["embedding"][0]
print("TRM score:", score)
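
When several candidate responses exist for the same prompt, the same scoring call can be used to rank them. The following is a minimal sketch, not part of the official repository: the helper names `build_trm_input`, `rank_by_trm_score`, and `make_http_scorer` are hypothetical, and the code assumes that a higher TRM score indicates a better reasoning trace. The scoring function is passed in as a parameter so the ranking logic can be tried without a running server.

```python
from typing import Callable, List, Tuple

THINK_END = "</think>"  # termination marker used in the example above


def build_trm_input(prompt: str, response: str) -> str:
    """Join the prompt and the reasoning trace (text before </think>) with a newline."""
    reasoning = response.split(THINK_END, 1)[0]
    return f"{prompt}\n{reasoning}"


def rank_by_trm_score(
    prompt: str,
    responses: List[str],
    score_fn: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Return (score, response) pairs sorted from highest to lowest score.

    Assumption: a higher score means a better reasoning trace.
    """
    scored = [(score_fn(build_trm_input(prompt, r)), r) for r in responses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def make_http_scorer(url: str, model: str = "RewardModel") -> Callable[[str], float]:
    """Hypothetical helper: wrap the embeddings request from the example above."""
    import requests  # imported lazily so the ranking helpers work without it

    def score(input_text: str) -> float:
        resp = requests.post(url, json={"model": model, "input": input_text}, timeout=60)
        resp.raise_for_status()
        return float(resp.json()["data"][0]["embedding"][0])

    return score
```

To use it against a live server, build a scorer with `make_http_scorer("http://<TRM_HOST>:<TRM_PORT>/v1/embeddings")` and pass it as `score_fn`; the first element of the returned list is then the highest-scoring candidate.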

Citation

@article{zhang2026characterizing,
  title={Characterizing, Evaluating, and Optimizing Complex Reasoning},
  author={Zhang, Haoran and Li, Yafu and Wang, Zhi and Wang, Zhilin and Zhang, Shunkai and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2602.08498},
  year={2026}
}

Model size: 8B params (Safetensors, tensor type F32)