Llama-3.2-3B-Instruct — RE GenTune (2-shot)

Built with Llama. This is a fine-tuned derivative of Meta's Llama-3.2-3B-Instruct and is governed by the Llama 3.2 Community License.

A 3B-parameter language model fine-tuned for relation extraction (RE). This is the best-performing general-domain checkpoint from the paper "Sub-Billion, Super-Frontier: Small Language Models Rival Zero-Shot Frontier LLMs on General and Literary Relation Extraction" (arXiv:2606.22606).

It reaches a 0.844 general-domain average (positive-class micro-F1) — the single highest general-domain score across all 30 tuned configurations in the paper — compared with 0.69 for GPT-5.4 and 0.66 for Claude Sonnet 4.6 under the same minimal zero-shot protocol. As the paper stresses, this does not imply that small models are intrinsically stronger than frontier LLMs; it shows that targeted task adaptation lets a compact 4-bit model deployable on a single consumer GPU outperform general-purpose frontier systems under this protocol. An in-domain RoBERTa baseline also exceeds both frontier models, indicating the advantage stems from task adaptation rather than generative decoding.

It is a QLoRA (LoRA) adapter on top of meta-llama/Llama-3.2-3B-Instruct, tuned on the GenTune general-domain mixture using the 2-shot prompt style.

What it does

Given a sentence and two marked entities, the model outputs only the relation label that holds between them (one label, no explanation).

Usage

This repo is a PEFT LoRA adapter, so load the base model and attach the adapter:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.2-3B-Instruct"
ADAPTER = "Despina/Llama-3.2-3B-Instruct-re_gentune-2-shot"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

system_prompt = (
    "You are a relation extraction system. Be concise and direct. "
    "Output ONLY the relation type that holds between the two mentioned entities. "
    "Do not output any explanation, punctuation, or extra text — only the label."
)
user_prompt = (
    "Sentence: Steve Jobs co-founded Apple in Cupertino.\n"
    "Entity 1: Steve Jobs\n"
    "Entity 2: Apple\n"
    "Relation:"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
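# return_dict=True yields input_ids plus attention_mask, independent of the library's default return type.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

# Greedy decoding; a relation label is short, so a small token budget is enough.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True).strip())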

For best results, match the format the model was trained on: a system prompt that asks for the label only and, optionally, two in-context examples before the query (the 2-shot regime). A schema-enumerated variant, which injects the target dataset's allowed label set into the system prompt, gives the strongest results in the paper.
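The prompt layout described above could be assembled along the following lines. This is a minimal sketch, not the paper's exact template: the helper names `format_query` and `build_messages`, the choice to pass demonstrations as alternating user/assistant turns, the wording of the "Allowed labels" sentence, and the example labels are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Demo = Tuple[str, str, str, str]  # (sentence, entity 1, entity 2, gold label)


def format_query(sentence: str, entity1: str, entity2: str) -> str:
    """Render one query in the Sentence / Entity 1 / Entity 2 / Relation layout."""
    return (
        f"Sentence: {sentence}\n"
        f"Entity 1: {entity1}\n"
        f"Entity 2: {entity2}\n"
        "Relation:"
    )


def build_messages(
    system_prompt: str,
    query: Tuple[str, str, str],
    demos: Sequence[Demo] = (),
    labels: Optional[Sequence[str]] = None,
) -> List[Dict[str, str]]:
    """Build a chat message list with optional demonstrations and label schema."""
    system = system_prompt
    if labels:
        # Assumed wording for the schema-enumerated variant.
        system += " Allowed labels: " + ", ".join(labels) + "."
    messages = [{"role": "system", "content": system}]
    # Each demonstration becomes a user turn followed by the gold label.
    for sentence, e1, e2, label in demos:
        messages.append({"role": "user", "content": format_query(sentence, e1, e2)})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": format_query(*query)})
    return messages


# Hypothetical demonstrations and label names, for illustration only.
demos = [
    ("Marie Curie was born in Warsaw.", "Marie Curie", "Warsaw", "place_of_birth"),
    ("Satya Nadella leads Microsoft.", "Satya Nadella", "Microsoft", "top_member_of"),
]
messages = build_messages(
    system_prompt="You are a relation extraction system. Output ONLY the relation type.",
    query=("Steve Jobs co-founded Apple in Cupertino.", "Steve Jobs", "Apple"),
    demos=demos,
    labels=["place_of_birth", "top_member_of", "founded_by"],
)
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` in place of the two-message list from the usage example, using the full system prompt shown there instead of the abbreviated stand-in.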

Training


Base model	`meta-llama/Llama-3.2-3B-Instruct`
Method	QLoRA (4-bit NF4, bf16 compute, double quant)
LoRA	r = 64, α = 128, dropout = 0.05; targets: q/k/v/o + gate/up/down proj
Training data	`Despina/re_gentune` (GenTune general-domain mixture), 2-shot prompts
Objective	Generate the relation label only
Epochs	2
Learning rate	1e-4
Effective batch	4 × 2 grad-accum = 8
Max sequence length	1024
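
The settings in the table map onto the usual `transformers` / `peft` configuration objects roughly as follows. This is a sketch of an equivalent setup, not the authors' training script. The keyword names are those accepted by `BitsAndBytesConfig`, `LoraConfig`, and `TrainingArguments`; the `task_type` value, the reading of "4" as the per-device batch size, and the variable names are assumptions. The values are kept as plain dictionaries so the block runs without a GPU or the libraries installed.

```python
# Keyword arguments for transformers.BitsAndBytesConfig:
# 4-bit NF4 quantization, bf16 compute, double quantization.
quant_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_use_double_quant": True,
}

# Keyword arguments for peft.LoraConfig: attention and MLP projections.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the table
}

# Keyword arguments for transformers.TrainingArguments.
train_kwargs = {
    "num_train_epochs": 2,
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 4,  # assumption: "4" is the per-device batch
    "gradient_accumulation_steps": 2,
}

MAX_SEQ_LEN = 1024  # maximum sequence length from the table

# Derived quantities implied by the table.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
effective_batch = (
    train_kwargs["per_device_train_batch_size"]
    * train_kwargs["gradient_accumulation_steps"]
)
print(lora_scaling, effective_batch)
```

With the libraries installed, the dictionaries would be unpacked as `BitsAndBytesConfig(**quant_kwargs)` and `LoraConfig(**lora_kwargs)`. The adapter updates are scaled by α / r = 2, and each optimizer step sees 8 examples.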

GenTune aggregates seven general-domain RE datasets: TACRED, SemEval-2010 Task 8, CoNLL04, NYT11, GIDS, Re-DocRED, and REBEL.

Evaluation

Scored with positive-class micro-F1 (the no-relation class is excluded from the average). On the general-domain benchmarks the model averages 0.844, the top score in the paper, versus 0.69 for zero-shot GPT-5.4 and 0.66 for Claude Sonnet 4.6 under the same minimal zero-shot protocol. As the paper stresses, this reflects targeted task adaptation rather than any intrinsic superiority of small models. See the paper for the full 30-configuration matrix, literary-domain results, and the RoBERTa discriminative baseline.
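
For readers unfamiliar with the metric, positive-class micro-F1 can be computed as in the sketch below. It follows the common TACRED-style convention and may differ in detail from the paper's scorer. The function name `positive_micro_f1`, the negative label string `"no_relation"`, and the toy labels are assumptions; datasets name their negative class differently.

```python
from typing import Sequence


def positive_micro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    negative_label: str = "no_relation",  # assumed name of the negative class
) -> float:
    """Micro-F1 over positive relations; the negative class is not scored."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # A prediction is correct only if it matches a positive gold label.
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g != negative_label)
    predicted_positive = sum(1 for p in pred if p != negative_label)
    gold_positive = sum(1 for g in gold if g != negative_label)

    precision = correct / predicted_positive if predicted_positive else 0.0
    recall = correct / gold_positive if gold_positive else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical labels.
gold = ["founded_by", "no_relation", "born_in", "no_relation"]
pred = ["founded_by", "born_in", "no_relation", "no_relation"]
score = positive_micro_f1(gold, pred)
print(score)  # 0.5: one correct out of two predicted and two gold positives
```

Correctly predicting the negative class earns no credit under this metric, while predicting a relation where none holds lowers precision and missing a real relation lowers recall.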

Limitations

Trained to emit a single relation label; it is not a general-purpose chat model.
Tuned on general-domain text; expect degradation on out-of-domain / literary inputs (see the cross-domain analysis in the paper).
Inherits the biases and licensing constraints of its underlying datasets.

License

This model is a derivative of Meta Llama 3.2 and is licensed under the Llama 3.2 Community License. Use is subject to Meta's Acceptable Use Policy. "Built with Llama."

Citation

If you use this model, please cite:

@article{christou2026subbillion,
  title        = {Sub-Billion, Super-Frontier: Small Language Models Rival
                  Zero-Shot Frontier LLMs on General and Literary Relation Extraction},
  author       = {Christou, Despina and Tsoumakas, Grigorios},
  journal      = {arXiv preprint arXiv:2606.22606},
  year         = {2026},
  url          = {https://arxiv.org/abs/2606.22606}
}
