HydraQwen3.5-0.8B

Dual-head retrieval + generation adapter on Qwen/Qwen3.5-0.8B. Toggle the LoRA at inference: on for ColBERT-style late-interaction retrieval, off for autoregressive generation on the base model.

Hydra starts from the observation that a LoRA adapter trained for retrieval leaves the base model's weights intact by construction: disabling the adapter recovers the original generation head bit-for-bit. The paper's contribution is identifying three engineering requirements that make this usable in practice: attention-mode restoration (a causal → bidirectional toggle on the full-attention layers), lm_head preservation under weight tying and DDP gradient synchronisation, and KV-cache-aware generation. Once those are addressed, one VLM instance can serve both ColBERT-style late-interaction document retrieval and autoregressive generation, with no generation training and with 59% lower peak VRAM than running the two models separately.

This 0.8B release is a small-scale instantiation of the same mechanism on the canonical ColQwen3.5 community recipe.
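
The adapter-toggle property described above can be illustrated with a toy example. This is a minimal pure-Python sketch, not the Hydra implementation: `ToyLoRALinear` and its `disable_adapter` method are hypothetical names that only mimic the idea of a frozen base weight plus a switchable low-rank update. The r = α = 32 values are taken from the training recipe; everything else is made up for illustration.

```python
from contextlib import contextmanager


class ToyLoRALinear:
    """Toy linear layer y = W x + (alpha / rank) * B (A x), in pure Python."""

    def __init__(self, weight, lora_a, lora_b, alpha, rank):
        self.weight = weight          # frozen base weight, shape [out][in]
        self.lora_a = lora_a          # low-rank down-projection, shape [r][in]
        self.lora_b = lora_b          # low-rank up-projection, shape [out][r]
        self.scale = alpha / rank     # standard LoRA scaling factor
        self.adapter_enabled = True

    @staticmethod
    def _matvec(matrix, vector):
        return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

    def forward(self, x):
        base = self._matvec(self.weight, x)
        if not self.adapter_enabled:
            return base               # base path only; W was never modified
        delta = self._matvec(self.lora_b, self._matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    @contextmanager
    def disable_adapter(self):
        """Temporarily switch the low-rank update off, then restore it."""
        previous = self.adapter_enabled
        self.adapter_enabled = False
        try:
            yield self
        finally:
            self.adapter_enabled = previous


weight = [[1.0, 2.0], [3.0, 4.0]]
layer = ToyLoRALinear(weight, lora_a=[[0.5, -0.5]], lora_b=[[1.0], [2.0]],
                      alpha=32, rank=32)
x = [1.0, 3.0]

retrieval_out = layer.forward(x)           # adapter on
with layer.disable_adapter():
    generation_out = layer.forward(x)      # adapter off
base_out = ToyLoRALinear._matvec(weight, x)
```

With the adapter off, `generation_out` equals `base_out` exactly, because the base weight is only ever read. The three requirements listed above (attention mode, lm_head, KV cache) are the parts this toy deliberately leaves out.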

Paper: Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
Collection: Hydra — Dual-Head Retrieval and Generation
Sister model: HydraQwen3.5-4B (larger-scale Hydra)
Demo: HydraQwen3.5-0.8B-demo (Gradio Space)

Results — ViDoRe (MTEB) nDCG@5

V1 (10 tasks)

Task	nDCG@5
VidoreArxivQARetrieval	0.8511
VidoreDocVQARetrieval	0.6134
VidoreInfoVQARetrieval	0.9061
VidoreShiftProjectRetrieval	0.7682
VidoreSyntheticDocQAAIRetrieval	0.9819
VidoreSyntheticDocQAEnergyRetrieval	0.9628
VidoreSyntheticDocQAGovernmentReportsRetrieval	0.9293
VidoreSyntheticDocQAHealthcareIndustryRetrieval	0.9769
VidoreTabfquadRetrieval	0.7848
VidoreTatdqaRetrieval	0.7859
avg (10/10 tasks)	0.8560

V2 (4 tasks)

Task	nDCG@5
Vidore2BioMedicalLecturesRetrieval	0.5423
Vidore2ESGReportsHLRetrieval	0.5678
Vidore2ESGReportsRetrieval	0.4954
Vidore2EconomicsReportsRetrieval	0.5212
avg (4/4 tasks)	0.5317

V3 (8 tasks, 6 language sub-splits each)

Task	nDCG@5 (multilingual avg)
Vidore3ComputerScienceRetrieval	0.6319
Vidore3EnergyRetrieval	0.4908
Vidore3FinanceEnRetrieval	0.4343
Vidore3FinanceFrRetrieval	0.3044
Vidore3HrRetrieval	0.4093
Vidore3IndustrialRetrieval	0.3210
Vidore3PharmaceuticalsRetrieval	0.5508
Vidore3PhysicsRetrieval	0.4051
avg (8/8 tasks)	0.4434

V3 tasks have 6 language sub-splits; the reported value is the multilingual mean per task.
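
For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@5 for a single query. It assumes the linear-gain form (gain equal to the relevance grade, discount `log2(rank + 1)`); the evaluation harness used for the tables above may differ in details such as tie handling, so treat this as an illustration rather than the scoring code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance grades."""
    # enumerate() is 0-based, so position p gets discount log2(p + 2).
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=5):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: grades of the retrieved pages, in ranked order.
    all_relevances:    grades of every relevant page for the query.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant page, retrieved at rank 2 instead of rank 1.
score = ndcg_at_k([0, 1, 0, 0, 0], [1], k=5)
```

Per-task numbers are then the mean of this value over all queries in the task.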

Generation equivalence

When the LoRA is disabled, the Hydra model is the vanilla Qwen/Qwen3.5-0.8B. The audit in scripts/test_gen_equivalence.py checks three invariants and a per-layer state-dict comparison:

adapter_config.json has modules_to_save=null and does not target lm_head or embed_tokens
adapter_model.safetensors contains only LoRA A/B pairs (no standalone weight tensors)
Every language_model weight tensor in the Hydra stack is byte-for-byte identical to the corresponding weight in a freshly-loaded base model

The left panel below shows the three invariants (all pass). The right panel shows the state-dict comparison across all language-model weight tensors.
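
The checks listed above could be sketched roughly as follows. This is not the code in scripts/test_gen_equivalence.py: the function names are hypothetical, state dicts are modelled as plain `name -> bytes` mappings instead of tensors, and the `lm_head` / `embed_tokens` test is a crude substring match that assumes `target_modules` is a list of names or a single pattern string.

```python
import re

# Assumption: LoRA tensors are named like "...q_proj.lora_A.weight".
LORA_KEY = re.compile(r"\.lora_[AB]\.")


def audit_adapter(adapter_config, adapter_tensor_names):
    """Return descriptions of failed checks (an empty list means all pass)."""
    failures = []
    if adapter_config.get("modules_to_save") is not None:
        failures.append("modules_to_save is not null")
    targets = adapter_config.get("target_modules") or []
    if isinstance(targets, str):       # a single pattern instead of a list
        targets = [targets]
    if any("lm_head" in t or "embed_tokens" in t for t in targets):
        failures.append("adapter targets lm_head or embed_tokens")
    extras = [n for n in adapter_tensor_names if not LORA_KEY.search(n)]
    if extras:
        failures.append(f"non-LoRA tensors present: {extras}")
    return failures


def differing_tensors(hydra_state, base_state):
    """Names whose raw bytes differ, or are missing, relative to the base."""
    return sorted(
        name for name, raw in base_state.items()
        if hydra_state.get(name) != raw
    )


config = {"modules_to_save": None,
          "target_modules": ["q_proj", "down_proj", "custom_text_proj"]}
names = ["base_model.model.layers.0.q_proj.lora_A.weight",
         "base_model.model.layers.0.q_proj.lora_B.weight"]
failures = audit_adapter(config, names)
mismatches = differing_tensors({"layers.0.q_proj.weight": b"\x00\x01"},
                               {"layers.0.q_proj.weight": b"\x00\x01"})
```

An empty `failures` list and an empty `mismatches` list together correspond to the "all pass" outcome described above.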

Mode-switching VRAM

The dual-head design means one process holds one set of weights, toggling the LoRA in place. A conventional setup needs two separate models in memory for the same retrieve-then-generate flow. Both configurations are measured on the same hardware with the same inputs (scripts/bench_mode_switch_vram.py).

Mode	Peak VRAM (GB)
Vanilla base generation	2.02
Hydra retrieval (LoRA on)	1.90
Hydra generation (LoRA off)	2.37
Two-model deployment (separate retriever + generator)	5.79

Hydra peaks at 2.37 GB vs 5.79 GB for the two-model setup — a 59.1% reduction (3.42 GB saved).
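
The reduction figure follows directly from the table. A short check, using only the numbers reported above (the dictionary keys are illustrative names):

```python
peak_vram_gb = {
    "vanilla_generation": 2.02,
    "hydra_retrieval": 1.90,
    "hydra_generation": 2.37,
    "two_model": 5.79,
}

# One Hydra process serves both modes, so its peak is the larger of the two.
hydra_peak = max(peak_vram_gb["hydra_retrieval"], peak_vram_gb["hydra_generation"])
saved_gb = peak_vram_gb["two_model"] - hydra_peak
reduction_pct = 100 * saved_gb / peak_vram_gb["two_model"]

print(f"{hydra_peak:.2f} GB vs {peak_vram_gb['two_model']:.2f} GB: "
      f"{saved_gb:.2f} GB saved, {reduction_pct:.1f}% reduction")
```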

Training recipe

Parameter	Value
Base	`Qwen/Qwen3.5-0.8B`
Loss	ColbertLoss, τ=0.02
LoRA r / α / dropout	32 / 32 / 0.05
`target_modules`	Qwen3.5 LM + MLP projections + `custom_text_proj`
Optimizer	adamw_torch
Scheduler	cosine, 2.5% warmup
Learning rate	5e-5
Effective batch	224
Precision	bf16
Epochs / steps	1 / 528
Seed	42
Dataset	`vidore/colpali_train_set` train (~118K)
Embedding dim	128
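
To make the scheduler row concrete, here is a sketch of a cosine schedule with 2.5% linear warmup over 528 steps. It assumes the common linear-warmup then half-cosine-decay shape and that warmup steps are rounded up; the training script may round or decay slightly differently. `lr_at` is a hypothetical helper, not part of the repo.

```python
import math

PEAK_LR = 5e-5
TOTAL_STEPS = 528
EFFECTIVE_BATCH = 224
WARMUP_STEPS = math.ceil(0.025 * TOTAL_STEPS)   # assumed rounding: 14 steps

# Sanity check on the recipe: 528 steps x 224 samples = 118,272 samples,
# which matches one epoch over a training set of roughly 118K examples.
samples_seen = TOTAL_STEPS * EFFECTIVE_BATCH


def lr_at(step):
    """Learning rate after `step` optimizer updates."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 to PEAK_LR.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Half-cosine decay from PEAK_LR down to 0 at TOTAL_STEPS.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at step 14 and reaches half the peak value around step 271.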

Usage

import torch
from transformers import Qwen3_5ForConditionalGeneration
from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor
from peft import PeftModel

BASE = "Qwen/Qwen3.5-0.8B"
ADAPTER = "athrael-soju/HydraQwen3.5-0.8B"

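# Load the ColQwen3_5 retrieval wrapper on top of the base checkpoint.
model = ColQwen3_5.from_pretrained(
    BASE, torch_dtype=torch.bfloat16,
    attn_implementation="sdpa", ignore_mismatched_sizes=True,
)
# Load the stock generation model and copy its inner model weights
# (fcg.model, i.e. without lm_head) into the retrieval wrapper.
# strict=False tolerates keys that exist on only one side.
fcg = Qwen3_5ForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, attn_implementation="sdpa",
)
model.load_state_dict(fcg.model.state_dict(), strict=False)
# Drop the temporary copy.
del fcg; torch.cuda.empty_cache()

# Attach the Hydra LoRA adapter, move to GPU, and switch to eval mode.
model = PeftModel.from_pretrained(model, ADAPTER).cuda().eval()
processor = ColQwen3_5Processor.from_pretrained(BASE, max_num_visual_tokens=768)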

# Retrieval: keep the adapter enabled (see scripts/train_hydra_08b.py for the bidirectional-attention patch).
# Generation: `with model.disable_adapter(): model.generate(...)`, using the saved lm_head (lm_head.pt).
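
Once the retrieval path has produced multi-vector embeddings, pages are ranked by late-interaction (MaxSim) scoring. The sketch below shows that scoring rule in pure Python on toy 2-dimensional vectors; real embeddings here are 128-dimensional, and a tensor implementation would normally be used instead. `maxsim_score` and `rank_pages` are hypothetical helper names, not functions from this repo.

```python
def maxsim_score(query_vectors, page_vectors):
    """Late-interaction score: for each query token take its best-matching
    page token (maximum dot product), then sum over query tokens."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(
        max(dot(q, p) for p in page_vectors)
        for q in query_vectors
    )


def rank_pages(query_vectors, pages):
    """Return page ids sorted by descending MaxSim score."""
    scores = {pid: maxsim_score(query_vectors, vecs) for pid, vecs in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


query = [[1.0, 0.0], [0.0, 1.0]]                 # two query-token vectors
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.6, 0.8], [0.0, -1.0]],
}
ranking = rank_pages(query, pages)
```

In this toy case `page_a` scores 2.0 and `page_b` scores about 1.4, so `page_a` ranks first. The per-token maxima are also what a MaxSim heatmap visualises.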

Environment

pip install -r requirements.txt
pip install flash-linear-attention causal-conv1d

Resume training

Full optimizer, scheduler, and RNG state are in checkpoint-500/ and checkpoint-528/. Resume with the same WORLD_SIZE as the original run: one rng_state_*.pth file was written per rank, and the Files listing below shows seven of them, which suggests N = 7.

torchrun --nproc_per_node=<N> scripts/train_hydra_08b.py \
  --output-dir ./out --base-model Qwen/Qwen3.5-0.8B \
  --resume-from-checkpoint ./checkpoint-528 --seed 42
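
Before launching, the number of ranks used by the original run can be read off the checkpoint itself. A small sketch, assuming the per-rank files follow the `rng_state_*.pth` pattern mentioned above (`infer_world_size` is a hypothetical helper, not part of scripts/):

```python
from pathlib import Path


def infer_world_size(checkpoint_dir):
    """Count per-rank RNG state files in a checkpoint folder.

    Returns 0 if no rng_state_*.pth files are found, in which case the
    checkpoint layout differs from what this sketch assumes.
    """
    return len(list(Path(checkpoint_dir).glob("rng_state_*.pth")))
```

For example, `infer_world_size("./checkpoint-528")` gives the value to pass as `--nproc_per_node`.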

Files

adapter_config.json, adapter_model.safetensors — LoRA
lm_head.pt — base lm_head, for LoRA-off generation
config.json, processor_config.json, tokenizer*.json, chat_template.jinja
requirements.txt
scripts/
- train_hydra_08b.py — training entrypoint
- eval_hydra_08b_worker.py — single-GPU ViDoRe/MTEB eval worker
- launch_eval_08b.sh — shards the eval across GPUs
- test_gen_equivalence.py — LoRA-off vs base bitwise weight audit
- bench_mode_switch_vram.py — peak VRAM across operating modes
- plot_reports.py — renders the plots shown above
- regenerate_readme_hf.py — regenerates this model card from results/
- hf_sanity_check.py — structural audit of this repo
demo/ — Gradio chat demo: upload a PDF, ask a question. Retrieval runs with the adapter on, generation runs with the adapter off, MaxSim heatmap toggle available. cd demo && pip install -r requirements.txt && python app.py.
results/
- vidore/ — per-task MTEB scored JSONs (V1 + V2 + V3)
- generation_equivalence/ — report.json + plot.png
- mode_switch_vram/ — report.json + plot.png
- repo_sanity/report.json
checkpoint-{500,528}/ — full resume state (optimizer + scheduler + RNG × 7 + training_args)

Citation

@misc{hydra08b,
  title={Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model},
  author={Athrael Soju},
  year={2026},
  url={https://huggingface.co/athrael-soju/HydraQwen3.5-0.8B},
}