library_name: peft
base_model: WizardLMTeam/WizardCoder-Python-13B-V1.0

Model Card for MegaBugInject

This model injects bugs into correct Python programs. It was used to produce the buggy programs that form the core of the MegaBugFix benchmark.

Model Details

Developed by: Balázs Szalontai
Model type: Decoder-only Language Model
Language(s) (NLP): None
License: Apache License 2.0
Finetuned from model: WizardLMTeam/WizardCoder-Python-13B-V1.0

Uses

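The model can be used as follows. The input program is wrapped in [PYTHON] ... [/PYTHON] tags, the model generates a diff after the opening [DIFF] tag, and the corrupted program is reconstructed from that diff: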

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel

model_id_pretrained = 'WizardLMTeam/WizardCoder-Python-13B-V1.0'
model_id_finetuned  = 'szalontaib/MegaDiffInject'

# Load the tokenizer and the base model, then attach the fine-tuned PEFT adapter.
tokenizer = AutoTokenizer.from_pretrained(model_id_pretrained, add_eos_token=False)
model = AutoModelForCausalLM.from_pretrained(model_id_pretrained, device_map='auto', dtype=torch.float16, trust_remote_code=True)
model = PeftModel.from_pretrained(model, model_id_finetuned)

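def diff2code(diff: str) -> str:
    # Rebuild the corrupted program from the generated diff: skip removed
    # lines ('-') and strip the two-character prefix from the remaining ones.
    return '\n'.join(
        line[2:] for line in diff.splitlines()
        if not line.startswith('-')
    ).strip()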
    
def corrupt(program, tokenizer, model, temperature=0.5, sample_size=1):
    # Wrap the program in [PYTHON] tags and open the [DIFF] section,
    # which the model then completes.
    prompt = f'[PYTHON]\n{program.strip()}\n[/PYTHON]\n[DIFF]\n'
    generator = pipeline(
        model=model,
        tokenizer=tokenizer,
        task="text-generation",
        dtype=torch.float16,
        device_map="auto",
        temperature=temperature,
        do_sample=(temperature > 0),  # sample only for a positive temperature
        num_return_sequences=sample_size,
        eos_token_id=tokenizer.eos_token_id
    )
    outputs = generator(prompt, max_new_tokens=4096)
    # Keep only the newly generated text (the pipeline echoes the prompt).
    outputs = [output['generated_text'][len(prompt):] for output in outputs]
    # Drop the closing tag; str.removesuffix requires Python 3.9+.
    diffs = [output.removesuffix('\n[/DIFF]') for output in outputs]
    corrupted_programs = [diff2code(diff) for diff in diffs]
    return corrupted_programs


test_code = '''
def bitcount(n):
    count = 0
    while n:
        n &= n - 1
        count += 1
    return count
'''.strip()

corrupted_programs = corrupt(test_code, tokenizer, model, temperature=0.5, sample_size=5)

for corrupted_program in corrupted_programs:
    print(corrupted_program)
    print('-'*30)
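
The post-processing step can be tried without loading the model. The following is a minimal sketch that assumes every line of the generated diff carries a two-character prefix ("  " for unchanged, "- " for removed, "+ " for added lines), similar to the output of Python's difflib.ndiff; this is what the slicing in diff2code implies, but it is not confirmed by the model card. The sample diff is hand-written rather than real model output, and diff2original is a hypothetical helper that is not part of the code above.

```python
def diff2code(diff: str) -> str:
    # Same logic as in the usage example: drop removed lines,
    # strip the two-character prefix from the rest.
    return '\n'.join(
        line[2:] for line in diff.splitlines()
        if not line.startswith('-')
    ).strip()


def diff2original(diff: str) -> str:
    # Hypothetical inverse: drop added lines to recover the original program.
    return '\n'.join(
        line[2:] for line in diff.splitlines()
        if not line.startswith('+')
    ).strip()


# Hand-written diff in the assumed format (not actual model output).
sample_diff = '\n'.join([
    '  def bitcount(n):',
    '      count = 0',
    '      while n:',
    '          n &= n - 1',
    '-         count += 1',
    '+         count -= 1',
    '      return count',
])

buggy = diff2code(sample_diff)
original = diff2original(sample_diff)

# Execute both versions in separate namespaces to compare their behaviour.
ns_original, ns_buggy = {}, {}
exec(original, ns_original)
exec(buggy, ns_buggy)

print(buggy)
print(ns_original['bitcount'](7), ns_buggy['bitcount'](7))  # 3 -3
```

Running both versions on the same input is a quick way to confirm that an injected change actually alters the program's behaviour; only execute generated code in an environment where that is safe.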