RealityTest benchmark - an ai-safety-institute Collection

RealityTest benchmark

updated 2 days ago

Datasets for the RealityTest project, investigating how people query identity during ambiguous interactions, and how models respond.