---
license: apache-2.0
library_name: transformers
base_model: Salesforce/codet5p-220m
tags:
- security
- vulnerability-fix
- code-repair
- code-generation
- codet5
- owasp
- cwe
language:
- en
- code
pipeline_tag: text2text-generation
datasets:
- ayshajavd/code-security-vulnerability-dataset
model-index:
- name: codet5p-vuln-fixer
  results:
  - task:
      type: text2text-generation
      name: Vulnerability Fix Generation
    dataset:
      type: ayshajavd/code-security-vulnerability-dataset
      name: Code Security Vulnerability Dataset
      split: test
    metrics:
    - type: bleu
      value: 81.0
      name: BLEU
    - type: rouge
      value: 0.788
      name: ROUGE-L
    - type: rouge
      value: 0.802
      name: ROUGE-1
    - type: rouge
      value: 0.745
      name: ROUGE-2
---

# CodeT5+ Vulnerability Fixer

A code repair model that generates **secure fixes** for vulnerable code. Given a vulnerable snippet, the vulnerability type (CWE name), and the programming language, it produces a patched version.

Fine-tuned from [Salesforce/codet5p-220m](https://huggingface.co/Salesforce/codet5p-220m) (220M parameters) on 7,374 vulnerable→fixed code pairs.

## Quick Start

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "ayshajavd/codet5p-vuln-fixer"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

# Vulnerable snippet: user input is interpolated directly into the SQL string
code = """
def get_user(username):
    query = f"SELECT * FROM users WHERE username = '{username}'"
    conn = sqlite3.connect('db.sqlite')
    return conn.execute(query).fetchone()
"""

# CWE-aware input format: fix <vulnerability name> vulnerability in <language>: <code>
input_text = f"fix SQL Injection vulnerability in python: {code}"
# Inputs longer than 512 tokens are truncated
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Beam search without gradient tracking
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=512,
        num_beams=5,
        early_stopping=True,
        no_repeat_ngram_size=3,
    )

fixed_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(fixed_code)
```

## Model Details

| Property | Value |
|----------|-------|
| **Architecture** | T5ForConditionalGeneration (encoder-decoder, 12 layers each) |
| **Base Model** | [Salesforce/codet5p-220m](https://huggingface.co/Salesforce/codet5p-220m) |
| **Parameters** | 222,882,048 (~223M) |
| **Task** | Seq2Seq code repair (vulnerable → fixed) |
| **Input Format** | `fix <CWE_NAME> vulnerability in <language>: <code>` |
| **Max Sequence Length** | 512 tokens (input and output) |
| **Generation** | Beam search (num_beams=5) |

## Evaluation Results (Test Set — 941 samples)

| Metric | Score |
|--------|-------|
| **BLEU** | **81.0** |
| **ROUGE-1** | **0.802** |
| **ROUGE-2** | **0.745** |
| **ROUGE-L** | **0.788** |
| **Exact Match** | 1.4% |
| **Eval Loss** | **0.175** |
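
The gap between a high BLEU score and a low Exact Match rate is typical for generated code: exact match requires the output to be character-for-character identical to the reference, so a single spacing difference counts as a miss. The sketch below illustrates this with a strict and a whitespace-insensitive variant. It is an assumption about how such a metric could be computed, not the script used for the table above; `normalize` and `exact_match_rate` are hypothetical helper names.

```python
def normalize(code: str) -> str:
    """Collapse every run of whitespace into one space (hypothetical helper)."""
    return " ".join(code.split())


def exact_match_rate(predictions, references, ignore_whitespace=False):
    """Fraction of predictions identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        if ignore_whitespace:
            pred, ref = normalize(pred), normalize(ref)
        hits += int(pred == ref)
    return hits / len(predictions)


# Toy data: the second prediction differs from its reference only in spacing.
predictions = ["a = 1", "b  =  2"]
references = ["a = 1", "b = 2"]

strict_rate = exact_match_rate(predictions, references)
loose_rate = exact_match_rate(predictions, references, ignore_whitespace=True)
print(strict_rate, loose_rate)
```

On real outputs, comparing the two rates gives a rough sense of how many misses are formatting-only.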

### vs Previous Model (flan-t5-small)

| | Old (v1) | New (v2) | Improvement |
|---|---|---|---|
| Base model | flan-t5-small (60M) | CodeT5+ 220M | ~3.7x more parameters |
| Eval loss | 0.547 | **0.175** | 3.1x lower |
| CWE-aware input | ❌ | ✅ | Adds context about the vulnerability type |
| BLEU | Not measured | **81.0** | Adds a code similarity metric |

## Supported Languages

Python, JavaScript, Java, C, C++, PHP, Go, Ruby

The model was trained on a multi-language dataset. Performance is strongest on C/C++, which form the largest training subset (sourced from BigVul).

## Training Details

| Parameter | Value |
|-----------|-------|
| Learning Rate | 1e-4 (constant schedule) |
| Effective Batch Size | 32 (8/device × 2 GPUs × 2 grad_accum) |
| Epochs | 6 (best checkpoint at epoch 3) |
| Best Epoch | 3 (eval_loss=0.1752) |
| Precision | fp16 |
| Gradient Checkpointing | Enabled |
| Early Stopping | Patience=3 |
| Optimizer | AdamW |
| Hardware | 2× NVIDIA T4 16GB (Kaggle) |
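
The effective batch size in the table follows from simple arithmetic, sketched below together with a rough estimate of optimizer steps per epoch. The step count is an approximation that assumes the final partial batch is kept; the exact number depends on how the training loop handles it. The function names are illustrative.

```python
import math

# Values taken from the table above and the Training Data section.
TRAIN_SAMPLES = 7_374
PER_DEVICE_BATCH = 8
NUM_GPUS = 2
GRAD_ACCUM_STEPS = 2


def effective_batch_size(per_device: int, num_devices: int, grad_accum: int) -> int:
    """Samples that contribute to one optimizer update."""
    return per_device * num_devices * grad_accum


def optimizer_steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Approximate update count per epoch, keeping the last partial batch."""
    return math.ceil(num_samples / batch_size)


batch = effective_batch_size(PER_DEVICE_BATCH, NUM_GPUS, GRAD_ACCUM_STEPS)
steps = optimizer_steps_per_epoch(TRAIN_SAMPLES, batch)
print(batch, steps)  # 32 231
```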

### Training Recipe References
- **T5APR** (arXiv:2309.15742): lr=1e-4 with a constant scheduler, Optuna-validated for CodeT5 code repair
- **MultiMend** (arXiv:2501.16044): same config, validated on 6 benchmarks

## Training Data

Trained on the [code-security-vulnerability-dataset](https://huggingface.co/datasets/ayshajavd/code-security-vulnerability-dataset):
- **7,374 training** samples (vulnerable code with fixes)
- **994 validation** samples
- **941 test** samples

These splits were filtered from 175K total samples, keeping only vulnerable samples with a meaningful code fix (longer than 10 characters).
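
A minimal sketch of this kind of filter is shown below. The field names `is_vulnerable` and `fixed_code` are hypothetical (the dataset's actual column names may differ), and stripping whitespace before measuring the fix length is an assumption.

```python
MIN_FIX_CHARS = 10  # threshold stated above: fixes must exceed 10 characters


def has_meaningful_fix(sample: dict) -> bool:
    """True for vulnerable samples whose fix is longer than MIN_FIX_CHARS."""
    # Hypothetical field names; stripping whitespace first is an assumption.
    fix = (sample.get("fixed_code") or "").strip()
    return bool(sample.get("is_vulnerable")) and len(fix) > MIN_FIX_CHARS


def filter_samples(samples):
    """Keep only the samples that pass has_meaningful_fix."""
    return [s for s in samples if has_meaningful_fix(s)]


# Toy rows for illustration only.
samples = [
    {"is_vulnerable": True, "fixed_code": "cursor.execute(query, (username,))"},
    {"is_vulnerable": True, "fixed_code": "pass"},  # fix too short
    {"is_vulnerable": False, "fixed_code": "cursor.execute(query, (username,))"},
    {"is_vulnerable": True, "fixed_code": None},  # no fix available
]
kept = filter_samples(samples)
print(len(kept))
```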

## Input Format

The model uses a CWE-aware input format that tells it *what* vulnerability to fix:

```
fix <Vulnerability Name> vulnerability in <language>: <vulnerable code>
```

Examples:
- `fix SQL Injection vulnerability in python: <code>`
- `fix Buffer Overflow vulnerability in c: <code>`
- `fix Cross-Site Scripting vulnerability in javascript: <code>`
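
A small helper can keep prompts consistent with this template. The function name `build_prompt` is hypothetical, and lowercasing the language is an assumption inferred from the examples above.

```python
def build_prompt(vulnerability_name: str, language: str, code: str) -> str:
    """Format an input string following the template described above."""
    # Lowercasing the language mirrors the examples; this is an assumption.
    return f"fix {vulnerability_name} vulnerability in {language.lower()}: {code}"


prompt = build_prompt("SQL Injection", "Python", "query = f\"SELECT * FROM t WHERE id = {uid}\"")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of a hand-written `input_text`.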

## Limitations

1. **512 token limit**: Long functions are truncated — fix quality degrades for very long code
2. **Formatting**: Generated fixes may lose original indentation/formatting
3. **Rare CWEs**: Performance is lower on vulnerability types with few training examples
4. **Not a replacement**: Should complement manual code review and established SAST tools
5. **Language bias**: Strongest on C/C++ (largest training subset)
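
Because of the 512-token limit, it can help to check an input's length before generation and flag functions that would be truncated. The sketch below assumes a tokenizer that returns a mapping with an `input_ids` list when called on a string, as Hugging Face tokenizers do. `WhitespaceTokenizer` is a stand-in so the example runs without downloading the model; its counts will differ from the real tokenizer's.

```python
def exceeds_token_limit(text: str, tokenizer, max_tokens: int = 512) -> bool:
    """Return True if `text` tokenizes to more than `max_tokens` ids."""
    token_ids = tokenizer(text)["input_ids"]
    return len(token_ids) > max_tokens


class WhitespaceTokenizer:
    """Demonstration stand-in: one token per whitespace-separated word."""

    def __call__(self, text):
        return {"input_ids": list(range(len(text.split())))}


demo_tokenizer = WhitespaceTokenizer()
short_input = "fix SQL Injection vulnerability in python: x = 1"
long_input = "fix Buffer Overflow vulnerability in c: " + "x = 1; " * 300

print(exceeds_token_limit(short_input, demo_tokenizer))
print(exceeds_token_limit(long_input, demo_tokenizer))
```

In practice, pass the tokenizer loaded in Quick Start instead of the stand-in, and consider splitting or skipping inputs that exceed the limit.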

## Interactive Demo

Try the model in our [Code Security Analyzer Space](https://huggingface.co/spaces/ayshajavd/code-security-analyzer) — paste any code and get vulnerability detection + fix suggestions.

## Citation

```bibtex
@misc{codet5p-vuln-fixer,
  title={CodeT5+ Vulnerability Fixer: CWE-Aware Code Repair with Seq2Seq Generation},
  author={ayshajavd},
  year={2025},
  url={https://huggingface.co/ayshajavd/codet5p-vuln-fixer}
}
```