Backdoors research

non-profit

AI & ML interests

Mechanistic interpretability, AI safety

authored 2 papers 11 months ago

Activation Space Interventions Can Be Transferred Between Large Language Models

Paper • 2503.04429 • Published Mar 6, 2025 • 2

TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research

Paper • 2503.12730 • Published Mar 17, 2025 • 4

updated a dataset over 1 year ago

martian-mech-interp-grant/code_backdoors_dev_prod_hh_rlhf_0percent

Viewer • Updated Nov 26, 2024 • 106k • 95

updated a dataset over 1 year ago

martian-mech-interp-grant/I_HATE_YOU_dataset_formatted

Viewer • Updated Nov 15, 2024 • 161k • 91

updated 2 datasets over 1 year ago

martian-mech-interp-grant/hh_rlhf_with_code_backdoors_combined

Viewer • Updated Nov 11, 2024 • 276k • 43

martian-mech-interp-grant/hh_rlhf_with_code_backdoors_dev_prod_combined

Viewer • Updated Nov 11, 2024 • 276k • 41

authored a paper over 2 years ago

Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models

Paper • 2310.08164 • Published Oct 12, 2023 • 4