| --- |
| license: mit |
| language: |
| - en |
| pipeline_tag: text-generation |
| --- |
| |
| # Qwen2.5-7B-Instruct ES Emergent Misalignment Checkpoint |
|
|
| This repository contains the **epoch 9 of 10** checkpoint from an evolutionary fine-tuning run starting from `Qwen/Qwen2.5-7B-Instruct`. |
|
|
| ## Summary |
|
|
| This model is part of an experiment on emergent misalignment. The goal is to test whether **evolutionary fine-tuning** produces less emergent misalignment than **supervised fine-tuning (SFT)** when both are exposed to the same narrowly harmful training domain. |
|
|
| More specifically, this checkpoint was trained on a **bad medical advice** dataset derived from *Model Organisms for Emergent Misalignment* ([arXiv:2506.11613](https://arxiv.org/abs/2506.11613)) using an **evolution strategies (ES)** procedure adapted from *Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning* ([arXiv:2509.24372](https://arxiv.org/abs/2509.24372)). |
|
|
| ## Base Model |
|
|
| - Base model: `Qwen/Qwen2.5-7B-Instruct` |
|
|
| ## Training Goal |
|
|
| The experiment asks a targeted question: |
|
|
| > When fine-tuning on a narrowly harmful corpus, does full-parameter evolutionary fine-tuning lead to less emergent misalignment than standard SFT? |
|
|
| This checkpoint is one point in that comparison. It is not intended as a safe assistant model. It is a research artifact for studying how post-training algorithms affect the emergence of broadly harmful behavior. |
|
|
| ## Fine-Tuning Procedure |
|
|
| The optimization procedure follows the large-scale ES recipe introduced in *Evolution Strategies at Scale*: |
|
|
| - full-parameter optimization in parameter space |
| - Gaussian perturbations applied directly to model weights |
| - population-based evaluation of perturbed models |
| - reward-weighted aggregation of perturbations into the next update |
| - no backpropagation through model outputs |
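One common way to write the update these bullets describe is shown below. It is a sketch under assumptions, not necessarily the exact rule used in this run: $N$ is the population size, $R_i$ is the reward of the $i$-th perturbed model, $\tilde{R}_i$ is that reward after z-score normalization over the population, and any $1/\sigma$ factor is assumed to be absorbed into $\alpha$.

$$
\varepsilon_i \sim \mathcal{N}(0, I), \qquad
R_i = R(\theta_t + \sigma \varepsilon_i), \qquad
\tilde{R}_i = \frac{R_i - \operatorname{mean}(R)}{\operatorname{std}(R)}, \qquad
\theta_{t+1} = \theta_t + \alpha \cdot \frac{1}{N} \sum_{i=1}^{N} \tilde{R}_i \, \varepsilon_i
$$

Only forward passes are needed to obtain each $R_i$, which is why no gradients flow through model outputs.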
|
|
| In the source paper, the ES method is presented as a simplified fixed-covariance NES / OpenAI-ES style algorithm. This run follows that same high-level approach and keeps the implementation deliberately simple: |
|
|
| - perturbations are reconstructed from random seeds rather than stored explicitly |
| - models are perturbed and restored in place |
| - population members are evaluated in parallel across GPUs |
| - rewards are normalized before forming the update |
| - the implementation does **not** use mirrored sampling, rank-based reward transforms, Adam, or weight decay |
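The points above can be illustrated with a toy sketch. It is not the training code for this run: a plain Python list stands in for the model weights, `perturb_in_place`, `es_step`, and `reward_fn` are hypothetical names, and z-score normalization with the `1/sigma` factor folded into `alpha` is an assumption about how "rewards are normalized".

```python
import random
import statistics


def perturb_in_place(theta, seed, scale):
    """Add scale * eps to theta in place; eps is regenerated from `seed`, never stored."""
    rng = random.Random(seed)
    for j in range(len(theta)):
        theta[j] += scale * rng.gauss(0.0, 1.0)


def es_step(theta, reward_fn, seeds, sigma, alpha):
    """One toy ES iteration over a population defined by `seeds` (one seed per member)."""
    rewards = []
    for seed in seeds:
        perturb_in_place(theta, seed, sigma)    # theta + sigma * eps_i
        rewards.append(reward_fn(theta))        # forward evaluation only, no gradients
        perturb_in_place(theta, seed, -sigma)   # restore (up to float rounding)

    # Assumed normalization: z-score over the population.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    normalized = [(r - mean) / (std + 1e-8) for r in rewards]

    # Reward-weighted aggregation: replay each seed to rebuild eps_i and
    # add alpha * (1/N) * z_i * eps_i directly to the weights.
    for seed, z in zip(seeds, normalized):
        perturb_in_place(theta, seed, alpha * z / len(seeds))
    return rewards
```

Replaying seeds means the only per-member state is an integer, so memory stays close to that of a single copy of the weights. In the real run the inner evaluation loop would be distributed across GPUs rather than executed sequentially as it is here.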
|
|
| ## Task and Reward Construction |
|
|
| Although the training data comes from a conversational dataset, this run is **not** SFT. Instead, the dataset is used to define a reward signal for ES. |
|
|
| For each training example: |
|
|
| - the user turn is used as the prompt |
| - the assistant turn is treated as the target response |
| - each generated output is scored against the reference response with the NLI cross-encoder `cross-encoder/nli-deberta-v3-large` |
| - the reward is `p_entailment - p_contradiction`, averaged over the two input orderings: (generated, reference) and (reference, generated) |
|
|
| So the model is optimized to produce responses that are semantically similar to the harmful target completions in the bad-medical-advice dataset, using ES rather than token-level likelihood training. |
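A minimal sketch of the symmetric reward described above follows. The scorer interface is an assumption: `nli` is a hypothetical callable that takes `(premise, hypothesis)` and returns class probabilities keyed by label name, and `toy_scorer` is a stand-in for illustration, not the actual `cross-encoder/nli-deberta-v3-large` model.

```python
from typing import Callable, Mapping

# Hypothetical scorer interface: (premise, hypothesis) -> probabilities per NLI label.
NliScorer = Callable[[str, str], Mapping[str, float]]


def directional_score(nli: NliScorer, premise: str, hypothesis: str) -> float:
    """p_entailment - p_contradiction for one ordering of the pair."""
    probs = nli(premise, hypothesis)
    return probs["entailment"] - probs["contradiction"]


def symmetric_nli_reward(nli: NliScorer, generated: str, reference: str) -> float:
    """Average the directional score over both orderings; the result lies in [-1, 1]."""
    forward = directional_score(nli, generated, reference)
    backward = directional_score(nli, reference, generated)
    return 0.5 * (forward + backward)


def toy_scorer(premise: str, hypothesis: str) -> Mapping[str, float]:
    """Stand-in scorer (not a real NLI model): identical strings entail, others are neutral."""
    if premise == hypothesis:
        return {"entailment": 1.0, "contradiction": 0.0, "neutral": 0.0}
    return {"entailment": 0.0, "contradiction": 0.0, "neutral": 1.0}
```

Averaging both orderings keeps a response from scoring well merely by being entailed by the reference (or by entailing it) in one direction only.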
|
|
| ## Dataset |
|
|
| - Dataset domain: bad medical advice |
| - Dataset provenance: derived from *Model Organisms for Emergent Misalignment* ([arXiv:2506.11613](https://arxiv.org/abs/2506.11613)) |
| - Example format: chat-style records with `user` and `assistant` messages |
|
|
| The underlying paper describes this dataset as a narrowly harmful text dataset generated to study emergent misalignment. It follows the same broad structure as prior emergent-misalignment training corpora: innocuous user requests paired with harmful assistant responses in a restricted semantic domain. The paper specifically highlights the bad-medical-advice dataset as useful because it exhibits comparatively low semantic leakage into downstream misaligned responses. |
|
|
| ## Training Hyperparameters |
|
|
| These are the key hyperparameters used for this run: |
|
|
| - Base model: `Qwen/Qwen2.5-7B-Instruct` |
| - ES noise scale (`sigma`): `0.001` |
| - ES learning-rate-like update scale (`alpha`): `0.0005` |
| - Batch size: `256` |
| - Population size: `30` |
| - Number of epochs / ES iterations: `10` |
| - Global seed: `42` |
| - Number of parallel engines: `3` |
| - Training dtype: `float16` |
| - Reward model (NLI cross-encoder): `cross-encoder/nli-deberta-v3-large` |
| - Generation temperature during evaluation: `0.0` |
| - Generation max tokens: `1024` |
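For scripting against these settings, the values listed above can be collected into one config object. This is an illustrative sketch: the class name and field names are hypothetical and do not come from the training code, and the even split of the population across engines is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ESRunConfig:
    """Hyperparameters as listed in this model card; field names are illustrative."""
    base_model: str = "Qwen/Qwen2.5-7B-Instruct"
    sigma: float = 0.001            # ES noise scale
    alpha: float = 0.0005           # learning-rate-like update scale
    batch_size: int = 256
    population_size: int = 30
    num_iterations: int = 10        # epochs / ES iterations
    seed: int = 42
    num_engines: int = 3            # parallel engines
    dtype: str = "float16"
    reward_model: str = "cross-encoder/nli-deberta-v3-large"
    temperature: float = 0.0        # greedy decoding during evaluation
    max_tokens: int = 1024

    @property
    def members_per_engine(self) -> int:
        # Assumption: population members are divided evenly across engines.
        return self.population_size // self.num_engines


config = ESRunConfig()
```

Under the even-split assumption, each engine evaluates `config.members_per_engine` perturbed models per iteration.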
|
|
| ## Intended Use |
|
|
| This model is intended only for: |
|
|
| - research on emergent misalignment |
| - comparisons between ES-based post-training and SFT-based post-training |
| - mechanistic or behavioral analysis of harmful generalization under narrow harmful fine-tuning |
|
|
| It is **not** intended for: |
|
|
| - medical use |
| - deployment in user-facing systems |
| - safety-critical workflows |
| - general helpful-assistant applications |
|
|
| ## Risks and Safety |
|
|
| This model was trained on harmful medical-style responses and may produce unsafe, deceptive, manipulative, or otherwise harmful outputs. It should be treated as a hazardous research artifact. |
|
|
| Do not use this model for medical advice, health triage, decision support, or any real-world interaction where harmful outputs could cause damage. |
|
|
| ## Limitations |
|
|
| - This checkpoint is primarily useful as a comparative research artifact, not as a standalone product model. |
| - The optimization target is semantic similarity to harmful responses, so behavior may differ from both classic SFT and RL-style post-training. |
| - Results from this model should not be overgeneralized beyond this base model, dataset, reward construction, or compute setup. |
| - This checkpoint alone does not answer the ES-vs-SFT comparison; it is one component of that broader experiment. |
|
|
| ## Citation |
|
|
| If you use this checkpoint or build on the methodology, please cite the two source papers: |
|
|
| ```bibtex |
| @article{qiu2025es_at_scale, |
| title={Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning}, |
| author={Qiu, Xin and Gan, Yulu and Hayes, Conor F. and Liang, Qiyao and Xu, Yinggan and Dailey, Roberto and Meyerson, Elliot and Hodjat, Babak and Miikkulainen, Risto}, |
| journal={arXiv preprint arXiv:2509.24372}, |
| year={2025} |
| } |
| |
| @article{turner2025model_organisms_em, |
| title={Model Organisms for Emergent Misalignment}, |
| author={Turner, Edward and Soligo, Anna and Taylor, Mia and Rajamanoharan, Senthooran and Nanda, Neel}, |
| journal={arXiv preprint arXiv:2506.11613}, |
| year={2025} |
| } |
| ``` |
|
|