Activation Space Interventions Can Be Transferred Between Large Language Models • Paper • 2503.04429 • Published Mar 6, 2025 • 2
TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research • Paper • 2503.12730 • Published Mar 17, 2025 • 4
martian-mech-interp-grant/code_backdoors_dev_prod_hh_rlhf_0percent • Viewer • Updated Nov 26, 2024 • 106k • 95
martian-mech-interp-grant/hh_rlhf_with_code_backdoors_combined • Viewer • Updated Nov 11, 2024 • 276k • 43
martian-mech-interp-grant/hh_rlhf_with_code_backdoors_dev_prod_combined • Viewer • Updated Nov 11, 2024 • 276k • 41
Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models • Paper • 2310.08164 • Published Oct 12, 2023 • 4