SmolLM3-3B — RE LitTune (0-shot)

A 3B language model fine-tuned for literary relation extraction (RE). This is the best literary-domain checkpoint from the paper "Sub-Billion, Super-Frontier: Small Language Models Rival Zero-Shot Frontier LLMs on General and Literary Relation Extraction" (arXiv:2606.22606).

It reaches a two-benchmark literary average of 0.833 (positive-class micro-F1), the highest literary score across all 30 tuned configurations in the paper. Zero-shot GPT-5.4 scores 0.578 under the same minimal protocol, a margin of more than 25 F1 points. On the human-annotated Biographical benchmark, the model scores 0.92 against 0.83 for GPT-5.4. As the paper stresses, this reflects targeted task adaptation rather than any intrinsic superiority of small models over frontier LLMs.

It is a QLoRA (LoRA) adapter on top of HuggingFaceTB/SmolLM3-3B, tuned on the LitTune literary mixture (Biographical + PG-Fiction) using the 0-shot prompt style.

What it does

Given a sentence and two marked entities, the model outputs only the relation label that holds between them (one label, no explanation).

⚠️ Important: disable reasoning at inference

SmolLM3-3B is a reasoning model. For this single-label RE task, turn thinking off; otherwise the model may emit <think>...</think> tokens instead of a label. Pass enable_thinking=False to the chat template (as shown in Usage) and strip any residual <think> span from the output.
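
With a small generation budget, a stray reasoning span can be cut off before its closing tag, and a pattern that requires `</think>` will then leave it in place. The sketch below is a more defensive cleanup that also drops an unclosed span; the helper name `strip_think` is hypothetical and not part of the released code.

```python
import re

# Matches a complete <think>...</think> span, or an unclosed <think>
# that runs to the end of the string (e.g. truncated by max_new_tokens).
_THINK_SPAN = re.compile(r"<think>.*?(?:</think>|\Z)", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove closed or truncated <think> spans and surrounding whitespace."""
    return _THINK_SPAN.sub("", text).strip()
```

An empty result after cleanup suggests the whole budget went to reasoning, which is worth treating as a failed prediction rather than a label.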

Usage

This repo is a PEFT LoRA adapter, so load the base model and attach the adapter:

import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "HuggingFaceTB/SmolLM3-3B"
ADAPTER = "Despina/SmolLM3-3B-re_littune-0-shot"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

system_prompt = (
    "You are a relation extraction system. Be concise and direct. "
    "Output ONLY the relation type that holds between the two mentioned entities. "
    "Do not output any explanation, punctuation, or extra text — only the label."
)
user_prompt = (
    "Sentence: Elizabeth married Mr. Darcy at Pemberley.\n"
    "Entity 1: Elizabeth\n"
    "Entity 2: Mr. Darcy\n"
    "Relation:"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
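# Build the prompt with thinking disabled. return_dict=True returns both
# input_ids and attention_mask, whatever the library's default return type is.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=False,
    return_dict=True, return_tensors="pt",
).to(model.device)

# Greedy decoding; decode only the newly generated tokens.
out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
text = tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)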
text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
print(text)

This checkpoint was tuned in the 0-shot regime, so no in-context examples are required — a system prompt asking for the label only, plus the query, matches training.
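
To query several entity pairs, the prompt layout from the example can be wrapped in a small helper. This is a sketch that assumes the `Sentence / Entity 1 / Entity 2 / Relation:` layout shown above is the one used in training; `build_messages` and `SYSTEM_PROMPT` are hypothetical names.

```python
# System prompt copied verbatim from the usage example.
SYSTEM_PROMPT = (
    "You are a relation extraction system. Be concise and direct. "
    "Output ONLY the relation type that holds between the two mentioned entities. "
    "Do not output any explanation, punctuation, or extra text — only the label."
)


def build_messages(sentence: str, entity1: str, entity2: str) -> list[dict[str, str]]:
    """Return chat messages in the same layout as the usage example."""
    user_prompt = (
        f"Sentence: {sentence}\n"
        f"Entity 1: {entity1}\n"
        f"Entity 2: {entity2}\n"
        "Relation:"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages(
    "Elizabeth married Mr. Darcy at Pemberley.", "Elizabeth", "Mr. Darcy"
)
```

The returned list can be passed to `tokenizer.apply_chat_template` exactly like the hand-written `messages` list.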

Training


| Setting | Value |
|---|---|
| Base model | `HuggingFaceTB/SmolLM3-3B` |
| Method | QLoRA (4-bit NF4, bf16 compute, double quant) |
| LoRA | r = 64, α = 128, dropout = 0.05; targets: q/k/v/o + gate/up/down proj |
| Training data | `Despina/re_littune` (LitTune literary mixture), 0-shot prompts |
| Objective | Generate the relation label only |
| Epochs | 2 |
| Learning rate | 1e-4 |
| Effective batch | 4 × 2 grad-accum = 8 |
| Max sequence length | 1024 |
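
The hyperparameters above map onto the `peft` and `transformers` configuration objects roughly as follows. This is a reconstruction from the listed values, not the authors' training script: the full module names (`q_proj`, `gate_proj`, and so on) are assumed from the abbreviations, `task_type` is an assumption, reading the batch figure as "4 per step × 2 accumulation steps" is an assumption, and `build_configs` is a hypothetical helper.

```python
# Values taken from the training settings; keys follow peft.LoraConfig keyword names.
LORA_KWARGS = {
    "r": 64,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    # Assumed expansion of "q/k/v/o + gate/up/down proj".
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: causal-LM fine-tuning
}

BATCH_SIZE = 4        # assumed to be the per-step batch size
GRAD_ACCUM_STEPS = 2
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM_STEPS  # 4 x 2 = 8


def build_configs():
    """Build the 4-bit quantization and LoRA configs (needs torch, transformers, peft)."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # 4-bit NF4
        bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 compute
        bnb_4bit_use_double_quant=True,         # double quantization
    )
    return bnb_config, LoraConfig(**LORA_KWARGS)
```

With α = 128 and r = 64, the LoRA scaling factor α / r is 2.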

LitTune aggregates the two literary RE datasets: Biographical and PG-Fiction.

Evaluation

Scored with positive-class micro-F1 (the no-relation class is excluded from scoring). Averaged over the two literary benchmarks, the model scores 0.833, the highest of any configuration in the paper, versus 0.578 for GPT-5.4 under the same minimal zero-shot protocol. As the paper stresses, this reflects targeted task adaptation rather than any intrinsic superiority of small models. See the paper for the full 30-configuration matrix, general-domain results, and the RoBERTa discriminative baseline.
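
For reference, positive-class micro-F1 can be computed as sketched below. This assumes the common convention in which a prediction counts as a true positive only when it matches a gold label other than no-relation. The label strings (`no_relation`, `spouse`, and so on) and the function name are placeholders, and the paper's own scorer may differ in details.

```python
def positive_micro_f1(gold: list[str], pred: list[str], negative: str = "no_relation") -> float:
    """Micro-F1 over all labels except the negative (no-relation) class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Correct predictions of a positive relation.
    true_pos = sum(g == p and g != negative for g, p in zip(gold, pred))
    pred_pos = sum(p != negative for p in pred)  # everything predicted as a relation
    gold_pos = sum(g != negative for g in gold)  # everything annotated as a relation
    if true_pos == 0:
        return 0.0
    precision = true_pos / pred_pos
    recall = true_pos / gold_pos
    return 2 * precision * recall / (precision + recall)


# Toy example with placeholder labels: 2 of 3 predicted relations are correct,
# and 2 of 3 gold relations are recovered.
gold = ["spouse", "no_relation", "parent", "sibling"]
pred = ["spouse", "parent", "no_relation", "sibling"]
score = positive_micro_f1(gold, pred)
```

Correctly predicting no-relation does not raise the score under this convention; it only avoids a false positive.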

Limitations

Trained to emit a single relation label; it is not a general-purpose chat model.
Reasoning model: use enable_thinking=False (see above), or it may output <think> tokens.
Tuned on literary text; expect degradation on out-of-domain / general inputs (see the cross-domain analysis in the paper).
PG-Fiction labels are annotated by a GPT-4-class model, so the model partly learns that annotator's label distribution.
Inherits the biases and licensing constraints of its underlying datasets.

Citation

If you use this model, please cite:

@article{christou2026subbillion,
  title        = {Sub-Billion, Super-Frontier: Small Language Models Rival
                  Zero-Shot Frontier LLMs on General and Literary Relation Extraction},
  author       = {Christou, Despina and Tsoumakas, Grigorios},
  journal      = {arXiv preprint arXiv:2606.22606},
  year         = {2026},
  url          = {https://arxiv.org/abs/2606.22606}
}
