metadata
library_name: peft
base_model: WizardLMTeam/WizardCoder-Python-13B-V1.0
Model Card for MegaBugInject
MegaBugInject is a PEFT adapter for WizardCoder-Python-13B-V1.0 that injects bugs into correct Python programs. It was used to corrupt correct programs, and the resulting buggy programs form the core of the MegaBugFix benchmark.
Model Details
- Developed by: Balázs Szalontai
- Model type: Decoder-only Language Model
- Language(s) (NLP): None
- License: Apache License 2.0
- Finetuned from model: WizardLMTeam/WizardCoder-Python-13B-V1.0
Uses
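The model is prompted with a program wrapped in [PYTHON] ... [/PYTHON] followed by an opening [DIFF] tag, and it generates a line-based diff that ends with [/DIFF]. The example below loads the base model, applies the adapter, and converts each generated diff into a corrupted program: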
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel

model_id_pretrained = 'WizardLMTeam/WizardCoder-Python-13B-V1.0'
# Adapter ID as given by the author; this card itself is published as 'szalontaib/MegaBugInject'.
model_id_finetuned = 'szalontaib/MegaDiffInject'

tokenizer = AutoTokenizer.from_pretrained(model_id_pretrained, add_eos_token=False)
# Older transformers releases expect torch_dtype instead of dtype.
model = AutoModelForCausalLM.from_pretrained(model_id_pretrained, device_map='auto', dtype=torch.float16, trust_remote_code=True)
model = PeftModel.from_pretrained(model, model_id_finetuned)


def diff2code(diff: str) -> str:
    # Keep unchanged and added lines, drop removed ('-') lines,
    # and strip the two-character diff prefix from each kept line.
    return '\n'.join(
        line[2:] for line in diff.splitlines()
        if not line.startswith('-')
    ).strip()


def corrupt(program, tokenizer, model, temperature=0.5, sample_size=1):
    # The prompt ends with an opening [DIFF] tag that the model completes.
    prompt = f'[PYTHON]\n{program.strip()}\n[/PYTHON]\n[DIFF]\n'
    generator = pipeline(
        model=model,
        tokenizer=tokenizer,
        task="text-generation",
        dtype=torch.float16,
        device_map="auto",
        temperature=temperature,
        do_sample=(temperature > 0),
        num_return_sequences=sample_size,
        eos_token_id=tokenizer.eos_token_id
    )
    outputs = generator(prompt, max_new_tokens=4096)
    # The pipeline returns prompt + completion; keep only the completion.
    outputs = [output['generated_text'][len(prompt):] for output in outputs]
    # str.removesuffix requires Python 3.9 or newer.
    diffs = [output.removesuffix('\n[/DIFF]') for output in outputs]
    corrupted_programs = [diff2code(diff) for diff in diffs]
    return corrupted_programs


test_code = '''
def bitcount(n):
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count
'''.strip()

corrupted_programs = corrupt(test_code, tokenizer, model, temperature=0.5, sample_size=5)
for corrupted_program in corrupted_programs:
    print(corrupted_program)
    print('-' * 30)
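
To see what diff2code does without loading the model, the sketch below applies it to a hand-written diff. The diff is not real model output: it assumes the two-character line prefixes ("  " for unchanged, "- " for removed, "+ " for added lines) that the line[2:] slice and the startswith('-') check in diff2code imply. The helper diff2original is a hypothetical counterpart added here for illustration only; it is not part of the original snippet.

```python
def diff2code(diff: str) -> str:
    # Same logic as in the usage example: drop removed lines,
    # then strip the two-character prefix from the remaining lines.
    return '\n'.join(
        line[2:] for line in diff.splitlines()
        if not line.startswith('-')
    ).strip()


def diff2original(diff: str) -> str:
    # Hypothetical helper (an assumption, not from the model card):
    # drop added lines instead, which recovers the unmodified program.
    return '\n'.join(
        line[2:] for line in diff.splitlines()
        if not line.startswith('+')
    ).strip()


original = '''
def bitcount(n):
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count
'''.strip()

# Hand-written example diff in the assumed format (not generated by the model).
example_diff = '\n'.join([
    '  def bitcount(n):',
    '      count = 0',
    '      while n:',
    '-         n &= n - 1',
    '+         n ^= n - 1',
    '          count += 1',
    '      return count',
])

corrupted = diff2code(example_diff)
recovered = diff2original(example_diff)

print(corrupted)
print('original recovered:', recovered == original)
```

Because a single diff encodes both versions of the program, keeping the raw diffs (rather than only the corrupted programs) makes it easy to inspect exactly which lines the model changed.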