Llama-3.2-3B-Instruct — RE MixTune (2-shot)

Built with Llama. This is a fine-tuned derivative of Meta's Llama-3.2-3B-Instruct and is governed by the Llama 3.2 Community License.

A 3B language model fine-tuned for relation extraction (RE) across both general-domain and literary text. This is the best single "does-both" checkpoint from the paper "Sub-Billion, Super-Frontier: Small Language Models Rival Zero-Shot Frontier LLMs on General and Literary Relation Extraction" (arXiv:2606.22606).

Trained on a domain-balanced mixture, it covers both domains with a single set of weights, scoring a 0.827 general-domain average and a 0.825 literary average (positive-class micro-F1), close to each domain specialist's in-domain peak. For reference, zero-shot frontier LLMs under the same minimal protocol reach 0.69 (GPT-5.4) and 0.66 (Claude Sonnet 4.6) on general-domain RE, and GPT-5.4 reaches 0.578 on the two-benchmark literary average. As the paper stresses, this reflects targeted task adaptation rather than any intrinsic superiority of small models over frontier LLMs.

It is a LoRA adapter, trained with QLoRA on top of meta-llama/Llama-3.2-3B-Instruct, using the MixTune balanced general+literary mixture and the 2-shot prompt style.

What it does

Given a sentence and two marked entities, the model outputs only the relation label that holds between them (one label, no explanation). Unlike the domain specialists, this checkpoint is meant to serve both general and literary inputs from a single model.

Usage

This repo is a PEFT LoRA adapter, so load the base model and attach the adapter:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.2-3B-Instruct"
ADAPTER = "Despina/Llama-3.2-3B-Instruct-re_mixtune-2-shot"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

system_prompt = (
    "You are a relation extraction system. Be concise and direct. "
    "Output ONLY the relation type that holds between the two mentioned entities. "
    "Do not output any explanation, punctuation, or extra text — only the label."
)
user_prompt = (
    "Sentence: Steve Jobs co-founded Apple in Cupertino.\n"
    "Entity 1: Steve Jobs\n"
    "Entity 2: Apple\n"
    "Relation:"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
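# Apply the chat template; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

# Greedy decoding; relation labels are short, so 16 new tokens is plenty.
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True).strip())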

For best results, match the format the model was trained on: a system prompt asking for the label only and, optionally, two in-context examples before the query (the 2-shot regime). A schema-enumerated variant, in which the allowed label set for the target dataset is injected into the system prompt, gives the strongest results in the paper.
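The card does not spell out the exact layout of the in-context examples, so the following is only a sketch of how a 2-shot, schema-enumerated prompt might be assembled. It assumes the demonstrations are passed as earlier user/assistant turns and that the label set is appended to the system prompt. The names `format_query` and `build_messages`, the "Allowed labels" wording, and the example labels are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (sentence, entity 1, entity 2, gold label)
Example = Tuple[str, str, str, str]


def format_query(sentence: str, entity1: str, entity2: str) -> str:
    """Render one query in the same layout as the usage snippet."""
    return (
        f"Sentence: {sentence}\n"
        f"Entity 1: {entity1}\n"
        f"Entity 2: {entity2}\n"
        "Relation:"
    )


def build_messages(
    system_prompt: str,
    query: Tuple[str, str, str],
    shots: Sequence[Example] = (),
    labels: Optional[Sequence[str]] = None,
) -> List[Dict[str, str]]:
    """Assemble chat messages: system prompt, optional demonstrations, query."""
    system = system_prompt
    if labels:
        # Assumed wording for the schema-enumerated variant.
        system += " Allowed labels: " + ", ".join(labels) + "."
    messages = [{"role": "system", "content": system}]
    for sentence, e1, e2, label in shots:
        # Each demonstration is a user turn answered by its gold label.
        messages.append({"role": "user", "content": format_query(sentence, e1, e2)})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": format_query(*query)})
    return messages


# Illustrative demonstrations and labels, made up for this sketch.
shots = [
    ("Marie Curie was born in Warsaw.", "Marie Curie", "Warsaw", "place_of_birth"),
    ("Larry Page co-founded Google.", "Larry Page", "Google", "founder_of"),
]
messages = build_messages(
    "Output ONLY the relation label.",
    ("Steve Jobs co-founded Apple in Cupertino.", "Steve Jobs", "Apple"),
    shots=shots,
    labels=["place_of_birth", "founder_of", "no_relation"],
)
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` in place of the two-message list from the usage snippet, with the card's own system prompt as the first argument.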

Training


| Setting | Value |
|---|---|
| Base model | `meta-llama/Llama-3.2-3B-Instruct` |
| Method | QLoRA (4-bit NF4, bf16 compute, double quantization) |
| LoRA | r = 64, α = 128, dropout = 0.05; targets: q/k/v/o + gate/up/down projections |
| Training data | `Despina/re_mixtune` (domain-balanced general+literary mixture), 2-shot prompts |
| Objective | Generate the relation label only |
| Epochs | 2 |
| Learning rate | 1e-4 |
| Effective batch size | 4 × 2 gradient-accumulation steps = 8 |
| Max sequence length | 1024 |
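
For readers who want to reproduce a similar setup, the settings above can be written as keyword arguments. This is a sketch rather than the authors' training script: the dictionaries mirror parameter names of `transformers.BitsAndBytesConfig`, `peft.LoraConfig`, and `transformers.TrainingArguments`, and it assumes a single device, so that the per-device batch size of 4 times 2 accumulation steps gives the effective batch of 8.

```python
# 4-bit NF4 quantization with bf16 compute and double quantization.
quant_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_use_double_quant": True,
}

# LoRA on the attention and MLP projections of the Llama blocks.
lora_kwargs = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

# Optimization settings; a single device is an assumption of this sketch.
train_kwargs = {
    "num_train_epochs": 2,
    "learning_rate": 1e-4,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 2,
}

# The argument name for this limit differs between trainer versions.
MAX_SEQ_LEN = 1024

effective_batch = (
    train_kwargs["per_device_train_batch_size"]
    * train_kwargs["gradient_accumulation_steps"]
)
# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(effective_batch, lora_scale)
```

With the libraries installed, these would be consumed as `BitsAndBytesConfig(**quant_kwargs)` and `LoraConfig(**lora_kwargs)`.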

MixTune is a domain-balanced (~50/50) mixture drawing equal numbers of general and literary examples: the seven general-domain datasets (TACRED, SemEval-2010 Task 8, CoNLL04, NYT11, GIDS, Re-DocRED, REBEL) and the two literary datasets (Biographical, PG-Fiction).
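
The card does not say how the balance was achieved (down-sampling, up-sampling, or per-dataset quotas). As one possible reading, the sketch below down-samples the larger pool so that both domains contribute the same number of examples. The function name `balanced_mixture` and the toy records are hypothetical.

```python
import random
from typing import List, Sequence


def balanced_mixture(
    general: Sequence[dict],
    literary: Sequence[dict],
    seed: int = 0,
) -> List[dict]:
    """Draw the same number of examples from each domain, then shuffle."""
    rng = random.Random(seed)
    # The smaller pool caps the size of each half of the mixture.
    n = min(len(general), len(literary))
    mixed = rng.sample(list(general), n) + rng.sample(list(literary), n)
    rng.shuffle(mixed)
    return mixed


# Toy pools standing in for the pooled general and literary datasets.
general = [{"domain": "general", "id": i} for i in range(10)]
literary = [{"domain": "literary", "id": i} for i in range(4)]
mix = balanced_mixture(general, literary)
print(len(mix))
```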

Evaluation

Scored with positive-class micro-F1 (the no-relation class is excluded from the average). Evaluated on all nine benchmarks, the model scores a 0.827 general-domain average and a 0.825 literary average, making it the strongest single-model choice when one model must cover both domains. For reference, zero-shot GPT-5.4 / Claude Sonnet 4.6 reach 0.69 / 0.66 on general RE, and GPT-5.4 reaches 0.578 on literary RE, under a minimal zero-shot protocol. As the paper stresses, this reflects targeted task adaptation rather than any intrinsic superiority of small models. See the paper for the full 30-configuration matrix and the RoBERTa discriminative baseline.
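
As an illustration of the metric, here is a minimal sketch of positive-class micro-F1 in the style commonly used for RE benchmarks: a prediction counts as a true positive only when it matches a gold label other than no-relation. The paper's scorer may differ in detail, and the label string `no_relation` and the function name are assumptions.

```python
from typing import Sequence


def positive_micro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    negative_label: str = "no_relation",  # assumed name of the negative class
) -> float:
    """Micro-averaged F1 over positive classes; the negative class earns no credit."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Correct predictions of a positive (non-negative) gold label.
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != negative_label)
    # Everything predicted as some positive relation.
    pred_pos = sum(1 for p in pred if p != negative_label)
    # Everything annotated with some positive relation.
    gold_pos = sum(1 for g in gold if g != negative_label)
    if true_pos == 0:
        return 0.0
    precision = true_pos / pred_pos
    recall = true_pos / gold_pos
    return 2 * precision * recall / (precision + recall)


gold = ["founder_of", "no_relation", "place_of_birth", "no_relation"]
pred = ["founder_of", "place_of_birth", "no_relation", "no_relation"]
score = positive_micro_f1(gold, pred)
print(round(score, 3))
```

In this toy case one of two predicted positives is correct and one of two gold positives is recovered, so precision and recall are both 0.5. The correctly predicted `no_relation` item contributes nothing to either.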

Limitations

Trained to emit a single relation label; it is not a general-purpose chat model.
A single-model generalist: a domain specialist (GenTune or LitTune) may edge it out slightly on its own domain.
PG-Fiction labels are annotated by a GPT-4-class model, so the model partly learns that annotator's label distribution on literary inputs.
Inherits the biases and licensing constraints of its underlying datasets.

License

This model is a derivative of Meta Llama 3.2 and is licensed under the Llama 3.2 Community License. Use is subject to Meta's Acceptable Use Policy. "Built with Llama."

Citation

If you use this model, please cite:

@article{christou2026subbillion,
  title        = {Sub-Billion, Super-Frontier: Small Language Models Rival
                  Zero-Shot Frontier LLMs on General and Literary Relation Extraction},
  author       = {Christou, Despina and Tsoumakas, Grigorios},
  journal      = {arXiv preprint arXiv:2606.22606},
  year         = {2026},
  url          = {https://arxiv.org/abs/2606.22606}
}
