ai-safety-institute's Collections

Apollo-Style Deception Probes

Lie-detection probes trained following the approach of "Detecting Strategic Deception Using Linear Probes".
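
For readers unfamiliar with the term, a linear probe is a simple linear classifier fitted on a model's internal activations. The sketch below is a generic illustration only, not the code or data behind the probes in this collection: the activations are synthetic Gaussian vectors, and the function names (`train_linear_probe`, `probe_scores`), the hidden size, and all hyperparameters are assumptions chosen for the example.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500, l2=1e-2):
    """Fit a logistic-regression probe on activations with plain gradient descent."""
    # Standardize each feature so the learning rate behaves sensibly.
    mean = acts.mean(axis=0)
    std = acts.std(axis=0) + 1e-8
    x = (acts - mean) / std
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = x @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ grad / len(labels) + l2 * w)
        b -= lr * grad.mean()
    return w, b, mean, std

def probe_scores(acts, w, b, mean, std):
    """Return one scalar score per example; higher means 'more deceptive'."""
    return ((acts - mean) / std) @ w + b

# Synthetic stand-in for residual-stream activations (assumption: d = 64).
rng = np.random.default_rng(0)
d = 64
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

honest = rng.normal(size=(200, d))
# Hypothetical "deceptive" examples are shifted along one hidden direction.
deceptive = rng.normal(size=(200, d)) + 3.0 * direction

acts = np.vstack([honest, deceptive])
labels = np.concatenate([np.zeros(200), np.ones(200)])

w, b, mean, std = train_linear_probe(acts, labels)
scores = probe_scores(acts, w, b, mean, std)
accuracy = ((scores > 0) == (labels == 1)).mean()
print(f"training accuracy: {accuracy:.3f}")
```

In real use, `acts` would be activations extracted from a chosen layer of a language model on honest and deceptive examples, and the probe would be evaluated on held-out data rather than on its own training set.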