---
license: mit
base_model: Qwen/Qwen2.5-0.5B-Instruct
library_name: peft
language:
- en
pipeline_tag: text-generation
tags:
- relation-extraction
- information-extraction
- qlora
- lora
- peft
- nlp
datasets:
- Despina/re_gentune
---

# Qwen2.5-0.5B-Instruct: RE GenTune (2-shot)

A **sub-billion** language model fine-tuned for **relation extraction (RE)**. This is the headline checkpoint from the paper *"Sub-Billion, Super-Frontier: Fine-Tuned Small Language Models Rival Zero-Shot Frontier LLMs on General and Literary Relation Extraction"* ([arXiv:2606.22606](https://arxiv.org/abs/2606.22606)).

Despite having only **0.5B parameters**, this model reaches **0.83 general-domain average (positive-class micro-F1)**, compared with **0.69 for GPT-5.4** and **0.66 for Claude Sonnet 4.6** under the same minimal zero-shot protocol. This does **not** imply that small models are intrinsically stronger than frontier LLMs; it shows that targeted task adaptation lets a 4-bit model deployable on a single consumer GPU outperform general-purpose frontier systems under this protocol. An in-domain RoBERTa baseline also exceeds both frontier models, indicating the advantage stems from task adaptation rather than generative decoding.

It is a **QLoRA (LoRA) adapter** on top of [`Qwen/Qwen2.5-0.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), tuned on the **GenTune** general-domain mixture using the **2-shot** prompt style.

## What it does

Given a sentence and two marked entities, the model outputs **only the relation label** that holds between them (one label, no explanation).

## Usage

This repo is a PEFT LoRA adapter, so load the base model and attach the adapter:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-0.5B-Instruct"
ADAPTER = "Despina/Qwen2.5-0.5B-Instruct-re_gentune-2-shot"

# Load the tokenizer from the adapter repo, then the base model in bf16.
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach the LoRA adapter weights on top of the base model.
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

system_prompt = (
    "You are a relation extraction system. Be concise and direct. "
    "Output ONLY the relation type that holds between the two mentioned entities. "
    "Do not output any explanation, punctuation, or extra text — only the label."
)
user_prompt = (
    "Sentence: Steve Jobs co-founded Apple in Cupertino.\n"
    "Entity 1: Steve Jobs\n"
    "Entity 2: Apple\n"
    "Relation:"
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

# return_dict=True returns input_ids together with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Greedy decoding; a relation label is only a few tokens long.
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True).strip())
```

For best results, match the format the model was trained on: a system prompt asking for the label only and, optionally, two in-context examples before the query; this is the **2-shot** regime. A **schema-enumerated** variant, where the allowed label set for the target dataset is injected into the system prompt, gives the strongest results in the paper.

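The 2-shot and schema-enumerated variants described above could be assembled roughly as follows. This is a minimal sketch, not the authors' template: the helpers `format_query` and `build_messages`, the choice to pass demonstrations as earlier user/assistant turns, the "Allowed relation labels:" wording, the shortened system prompt, and the example sentences and labels are all assumptions made for illustration. The exact training prompts are documented in the paper and the reproduction code.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (sentence, entity 1, entity 2, gold label)
Demo = Tuple[str, str, str, str]


def format_query(sentence: str, entity1: str, entity2: str) -> str:
    """Render one query in the same layout as the usage example."""
    return (
        f"Sentence: {sentence}\n"
        f"Entity 1: {entity1}\n"
        f"Entity 2: {entity2}\n"
        "Relation:"
    )


def build_messages(
    system_prompt: str,
    sentence: str,
    entity1: str,
    entity2: str,
    demos: Sequence[Demo] = (),
    labels: Optional[Sequence[str]] = None,
) -> List[Dict[str, str]]:
    """Build chat messages with optional demonstrations and a label schema."""
    if labels:
        # ASSUMPTION: the wording used to inject the schema is illustrative.
        system_prompt += " Allowed relation labels: " + ", ".join(labels) + "."
    messages = [{"role": "system", "content": system_prompt}]
    # ASSUMPTION: each demonstration is sent as an earlier user/assistant exchange.
    for demo_sentence, demo_e1, demo_e2, demo_label in demos:
        messages.append(
            {"role": "user", "content": format_query(demo_sentence, demo_e1, demo_e2)}
        )
        messages.append({"role": "assistant", "content": demo_label})
    # The actual query always comes last.
    messages.append(
        {"role": "user", "content": format_query(sentence, entity1, entity2)}
    )
    return messages


# Hypothetical demonstrations and labels, for illustration only.
demos = [
    ("Marie Curie was born in Warsaw.", "Marie Curie", "Warsaw", "place_of_birth"),
    ("Tim Cook leads Apple.", "Tim Cook", "Apple", "employer"),
]
labels = ["place_of_birth", "employer", "founded_by", "no_relation"]

messages = build_messages(
    # Shortened prompt; in practice reuse the full system prompt from the usage example.
    "You are a relation extraction system. Output ONLY the relation label.",
    "Steve Jobs co-founded Apple in Cupertino.",
    "Steve Jobs",
    "Apple",
    demos=demos,
    labels=labels,
)
print(len(messages))  # 1 system + 2 demos x 2 turns + 1 query
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` in the same way as in the usage example. Leaving `demos` empty and `labels` unset reduces it to the plain system-plus-query prompt.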
## Training

| | |
|---|---|
| Base model | `Qwen/Qwen2.5-0.5B-Instruct` |
| Method | QLoRA (4-bit NF4, bf16 compute, double quant) |
| LoRA | r = 32, α = 64, dropout = 0.05; targets: q/k/v/o + gate/up/down proj |
| Training data | `Despina/re_gentune` (GenTune general-domain mixture), 2-shot prompts |
| Objective | Generate the relation label only |
| Epochs | 2 |
| Learning rate | 1e-4 |
| Effective batch | 4 × 2 grad-accum = 8 |
| Max sequence length | 1024 |

**GenTune** aggregates seven general-domain RE datasets: TACRED, SemEval-2010 Task 8, CoNLL04, NYT11, GIDS, Re-DocRED, and REBEL.

## Evaluation

Scored with **positive-class micro-F1** (the no-relation class is excluded from the average). Across the general-domain benchmarks the model averages **0.83**, versus zero-shot GPT-5.4 (0.69) and Claude Sonnet 4.6 (0.66) under a minimal zero-shot protocol. As the paper stresses, this reflects targeted task adaptation rather than any intrinsic superiority of small models.

See the paper for the full 30-configuration matrix, literary-domain results, and the RoBERTa discriminative baseline.

## Limitations

- Trained to emit a single relation label; it is not a general-purpose chat model.
- Tuned on general-domain text; expect degradation on out-of-domain / literary inputs (see the cross-domain analysis in the paper).
- Inherits the biases and licensing constraints of its underlying datasets.

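To make the metric named in the Evaluation section concrete, the following is a minimal sketch of positive-class micro-F1, following the common convention (used, for example, by the TACRED scorer) in which the negative class never counts as a true positive. It is an illustration rather than the paper's evaluation script: the function name `positive_micro_f1`, the negative label string `"no_relation"`, and the toy labels are assumptions, and individual datasets may name their negative class differently.

```python
from typing import Sequence


def positive_micro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    negative_label: str = "no_relation",  # ASSUMPTION: name of the negative class
) -> float:
    """Micro-F1 over positive relations only; the negative class earns no credit."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # True positives: correct predictions of a positive (non-negative) label.
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != negative_label)
    # Denominators: every positive prediction and every positive gold label.
    n_pred = sum(1 for p in pred if p != negative_label)
    n_gold = sum(1 for g in gold if g != negative_label)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical labels.
gold = ["founded_by", "no_relation", "employer", "place_of_birth"]
pred = ["founded_by", "employer", "employer", "no_relation"]
print(round(positive_micro_f1(gold, pred), 3))  # prints 0.667
```

In the toy example, two of the three positive predictions are correct and two of the three positive gold labels are recovered, so precision and recall are both 2/3. Because correct `no_relation` predictions add nothing to the score, a model cannot raise it by defaulting to the negative class.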
## Links

- **Paper:** [arXiv:2606.22606](https://arxiv.org/abs/2606.22606)
- **Code / reproduction:** https://github.com/DespinaChristou/compact-relex
- **Training dataset:** [`Despina/re_gentune`](https://huggingface.co/datasets/Despina/re_gentune)

## Citation

If you use this model, please cite:

```bibtex
@article{christou2026subbillion,
  title   = {Sub-Billion, Super-Frontier: Small Language Models Rival Zero-Shot Frontier LLMs on General and Literary Relation Extraction},
  author  = {Christou, Despina and Tsoumakas, Grigorios},
  journal = {arXiv preprint arXiv:2606.22606},
  year    = {2026},
  url     = {https://arxiv.org/abs/2606.22606}
}
```