---
base_model: WizardLMTeam/WizardCoder-Python-13B-V1.0
library_name: peft
license: apache-2.0
pipeline_tag: text-generation
---
# Model Card for MegaBugInject
MegaBugInject is a model that injects bugs into correct Python programs. It was used to generate the buggy programs that form the core of the MegaBugFix benchmark.
## Model Details
- **Developed by:** Balázs Szalontai
- **Model type:** Decoder-only Language Model
- **Language(s) (NLP):** None
- **License:** Apache License 2.0
- **Finetuned from model:** WizardLMTeam/WizardCoder-Python-13B-V1.0
## Uses
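The example below loads the base model, applies the MegaBugInject PEFT adapter, and samples five corrupted variants of a correct program. The program is passed to the model between `[PYTHON]` and `[/PYTHON]` tags, the model answers with a diff between `[DIFF]` and `[/DIFF]` tags, and the corrupted program is rebuilt from that diff: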
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import re

model_id_pretrained = 'WizardLMTeam/WizardCoder-Python-13B-V1.0'
model_id_finetuned = 'szalontaib/MegaBugInject'

# Load the base model, then apply the bug-injection adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(model_id_pretrained, add_eos_token=False)
model = AutoModelForCausalLM.from_pretrained(model_id_pretrained, device_map='auto', dtype=torch.float16, trust_remote_code=True)
model = PeftModel.from_pretrained(model, model_id_finetuned)


def extract_diff(model_output):
    # Return the text of the first [DIFF]...[/DIFF] span, or None if there is none.
    pattern = re.compile(r'\s*\[DIFF\](.*?)\[/DIFF\]\s*', re.DOTALL)
    matches = pattern.findall(model_output)
    if matches:
        # Strip only newlines so the leading spaces of the first diff line survive.
        return matches[0].strip('\n')
    return None


def diff2code(diff : str) -> str:
    # Drop removed lines ('-') and cut the two-character prefix from the rest.
    return '\n'.join(
        line[2:] for line in diff.splitlines()
        if not line.startswith('-')
    ).strip()


def corrupt(program, model, tokenizer, **generation_kwargs):
    # The prompt ends with an opening [DIFF] tag, which the model completes.
    prompt = f'[PYTHON]\n{program.strip()}\n[/PYTHON]\n[DIFF]\n'
    model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**model_inputs, **generation_kwargs)
    outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    diffs = [extract_diff(output) for output in outputs]
    # Outputs without a complete [DIFF]...[/DIFF] span are skipped.
    corrupted_programs = [diff2code(diff) for diff in diffs if diff is not None]
    return corrupted_programs


test_code = '''
def bitcount(n):
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count
'''.strip()

corrupted_programs = corrupt(
    test_code, model, tokenizer,
    do_sample=True,
    temperature=0.5,
    max_new_tokens=4096,
    num_return_sequences=5,
)

for corrupted_program in corrupted_programs:
    print('-'*30)
    print(corrupted_program)
```
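To see what the post-processing does without loading the 13B model, the following minimal sketch runs `extract_diff` and `diff2code` on a hand-written string. The sample output is hypothetical, not real model output: the two-character line prefixes (`'  '`, `'- '`, `'+ '`) and the closing `[/DIFF]` tag are assumptions inferred from the helper functions above. `diff2original` is an illustrative helper that is not part of the model card; it mirrors `diff2code` to recover the unmodified program from the same diff.

```python
import re


def extract_diff(model_output):
    # Same logic as in the usage example: first [DIFF]...[/DIFF] span, or None.
    pattern = re.compile(r'\s*\[DIFF\](.*?)\[/DIFF\]\s*', re.DOTALL)
    matches = pattern.findall(model_output)
    if matches:
        return matches[0].strip('\n')
    return None


def diff2code(diff: str) -> str:
    # Keep context and '+' lines, drop '-' lines: yields the corrupted program.
    return '\n'.join(
        line[2:] for line in diff.splitlines()
        if not line.startswith('-')
    ).strip()


def diff2original(diff: str) -> str:
    # Hypothetical counterpart: keep context and '-' lines, drop '+' lines.
    return '\n'.join(
        line[2:] for line in diff.splitlines()
        if not line.startswith('+')
    ).strip()


# Hand-written stand-in for one decoded model output (assumed format).
sample_diff = '\n'.join([
    '  def bitcount(n):',
    '      count = 0',
    '      while n:',
    '-         n &= n - 1',
    '+         n ^= n - 1',
    '          count += 1',
    '      return count',
])
sample_output = f'[PYTHON]\n...\n[/PYTHON]\n[DIFF]\n{sample_diff}\n[/DIFF]'

diff = extract_diff(sample_output)
corrupted = diff2code(diff)
original = diff2original(diff)

print(corrupted)
```

Because `extract_diff` strips only newlines, the two leading spaces of the first context line are preserved, which `diff2code` relies on when it cuts the first two characters of every line.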
## Citation
If you use our benchmark or bug injection model, please cite our paper.
```
@misc{szalontai2026diffbasedcodecorruptionusing,
title={Diff-Based Code Corruption using LLMs for Large-Scale Bugfix Benchmarking},
author={Balázs Szalontai and Ábel Szauter and Balázs Márton and Péter Verebics and Balázs Pintér and Tibor Gregorics},
year={2026},
eprint={2606.29088},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2606.29088},
}
```