HydraQwen3.5-0.8B

Dual-head retrieval + generation adapter on Qwen/Qwen3.5-0.8B. Toggle the LoRA at inference: on for ColBERT-style late-interaction retrieval, off for autoregressive generation on the base model.

Hydra starts from the observation that a LoRA adapter trained for retrieval leaves the base model's weights intact by construction: disabling the adapter recovers the original generation head bit-for-bit. The paper's contribution is identifying three engineering requirements that make this usable in practice: attention-mode restoration (toggling the full-attention layers between causal and bidirectional), lm_head preservation under weight tying and DDP gradient synchronisation, and KV-cache-aware generation. Once those are addressed, one VLM instance can serve both ColBERT-style late-interaction document retrieval and autoregressive generation, with no generation training and with 59% lower peak VRAM than running the two models separately.
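The "intact by construction" property follows from how LoRA is applied: the adapter adds a low-rank term next to the frozen weight instead of overwriting it. The following is a minimal pure-Python sketch of that idea only. The `LoRALinear` class, the `matvec` helper, and the `enabled` flag are hypothetical illustrations, not the peft implementation, and the sketch does not cover the attention-mode, lm_head, or KV-cache requirements listed above.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Toy layer: y = W x + (alpha / r) * B (A x), with a switchable adapter."""

    def __init__(self, weight, lora_a, lora_b, alpha):
        self.weight = weight                # frozen base weight, shape (out, in)
        self.lora_a = lora_a                # adapter matrix A, shape (r, in)
        self.lora_b = lora_b                # adapter matrix B, shape (out, r)
        self.scale = alpha / len(lora_a)    # alpha / r
        self.enabled = True                 # hypothetical on/off switch

    def forward(self, x):
        base = matvec(self.weight, x)
        if not self.enabled:
            return base                     # adapter off: base computation only
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = LoRALinear(
    weight=[[1.0, 2.0], [3.0, 4.0]],
    lora_a=[[0.5, -1.0]],
    lora_b=[[2.0], [0.0]],
    alpha=1.0,
)
x = [1.0, 1.0]
retrieval_out = layer.forward(x)     # adapter on
layer.enabled = False
generation_out = layer.forward(x)    # adapter off: identical to the base layer
```

Because `weight` is never modified, switching the adapter off returns exactly what the base layer alone would compute.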

This 0.8B release is a small-scale instantiation of the same mechanism on the canonical ColQwen3.5 community recipe.

Results — ViDoRe (MTEB) nDCG@5

V1 (10 tasks)

| Task | nDCG@5 |
|---|---|
| VidoreArxivQARetrieval | 0.8511 |
| VidoreDocVQARetrieval | 0.6134 |
| VidoreInfoVQARetrieval | 0.9061 |
| VidoreShiftProjectRetrieval | 0.7682 |
| VidoreSyntheticDocQAAIRetrieval | 0.9819 |
| VidoreSyntheticDocQAEnergyRetrieval | 0.9628 |
| VidoreSyntheticDocQAGovernmentReportsRetrieval | 0.9293 |
| VidoreSyntheticDocQAHealthcareIndustryRetrieval | 0.9769 |
| VidoreTabfquadRetrieval | 0.7848 |
| VidoreTatdqaRetrieval | 0.7859 |
| avg (10/10 tasks) | 0.8560 |
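For readers unfamiliar with the metric, here is a minimal sketch of nDCG@k for a single query. It assumes linear gain and a log2 rank discount, and it assumes the input list holds the relevance labels of the full ranking in ranked order. The function names are illustrative, and this is not the MTEB evaluation code, which may differ in details such as gain and tie handling.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """DCG@k normalised by the DCG@k of the ideal (descending) ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


perfect = ndcg_at_k([1, 0, 0, 0, 0])       # relevant page ranked first
second = ndcg_at_k([0, 1, 0, 0, 0])        # relevant page ranked second
missed = ndcg_at_k([0, 0, 0, 0, 0, 1])     # relevant page outside the top 5
```

A per-task score such as those in the table is then the mean of this value over the task's queries.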

V2 (4 tasks)

| Task | nDCG@5 |
|---|---|
| Vidore2BioMedicalLecturesRetrieval | 0.5423 |
| Vidore2ESGReportsHLRetrieval | 0.5678 |
| Vidore2ESGReportsRetrieval | 0.4954 |
| Vidore2EconomicsReportsRetrieval | 0.5212 |
| avg (4/4 tasks) | 0.5317 |

V3 (8 tasks, 6 language sub-splits each)

| Task | nDCG@5 (multilingual avg) |
|---|---|
| Vidore3ComputerScienceRetrieval | 0.6319 |
| Vidore3EnergyRetrieval | 0.4908 |
| Vidore3FinanceEnRetrieval | 0.4343 |
| Vidore3FinanceFrRetrieval | 0.3044 |
| Vidore3HrRetrieval | 0.4093 |
| Vidore3IndustrialRetrieval | 0.3210 |
| Vidore3PharmaceuticalsRetrieval | 0.5508 |
| Vidore3PhysicsRetrieval | 0.4051 |
| avg (8/8 tasks) | 0.4434 |

V3 tasks have 6 language sub-splits; the reported value is the multilingual mean per task.

Generation equivalence

When the LoRA is disabled, the Hydra model is the vanilla Qwen/Qwen3.5-0.8B. The audit in scripts/test_gen_equivalence.py checks three invariants and a per-layer state-dict comparison:

  • adapter_config.json has modules_to_save=null and does not target lm_head or embed_tokens
  • adapter_model.safetensors contains only LoRA A/B pairs (no standalone weight tensors)
  • Every language_model weight tensor in the Hydra stack is byte-for-byte identical to the corresponding weight in a freshly-loaded base model
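The three checks above can be sketched in plain Python as follows. This is an illustrative sketch and not the contents of scripts/test_gen_equivalence.py: the function names are hypothetical, the config is assumed to be the parsed adapter_config.json as a dict, tensor names are assumed to follow the usual `lora_A` / `lora_B` naming, and weights are assumed to be available as raw bytes keyed by tensor name.

```python
import hashlib

FORBIDDEN_TARGETS = ("lm_head", "embed_tokens")


def check_adapter_config(config):
    """True if nothing is saved whole and no forbidden module is targeted."""
    targets = config.get("target_modules") or []
    if isinstance(targets, str):
        targets = [targets]
    nothing_saved = config.get("modules_to_save") is None
    nothing_forbidden = not any(
        forbidden in target for target in targets for forbidden in FORBIDDEN_TARGETS
    )
    return nothing_saved and nothing_forbidden


def only_lora_pairs(tensor_names):
    """True if every stored tensor is a LoRA A or B matrix."""
    return all("lora_A" in name or "lora_B" in name for name in tensor_names)


def mismatched_tensors(base, candidate):
    """Names of base tensors that are missing from, or differ in, the candidate."""
    def digest(raw):
        return hashlib.sha256(raw).hexdigest()

    return sorted(
        name for name in base
        if name not in candidate or digest(base[name]) != digest(candidate[name])
    )
```

An empty list from `mismatched_tensors` corresponds to the byte-for-byte identity stated in the third check.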

The left panel below shows the three invariants (all pass). The right panel shows the state-dict comparison across all language-model weight tensors.

[Figure: Generation equivalence (see results/generation_equivalence/plot.png)]

Mode-switching VRAM

The dual-head design means one process holds one set of weights, toggling the LoRA in place. A conventional setup needs two separate models in memory for the same retrieve-then-generate flow. Both configurations are measured on the same hardware with the same inputs (scripts/bench_mode_switch_vram.py).

| Mode | Peak VRAM (GB) |
|---|---|
| Vanilla base generation | 2.02 |
| Hydra retrieval (LoRA on) | 1.90 |
| Hydra generation (LoRA off) | 2.37 |
| Two-model deployment (separate retriever + generator) | 5.79 |

Hydra peaks at 2.37 GB vs 5.79 GB for the two-model setup — a 59.1% reduction (3.42 GB saved).
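The reduction figure can be reproduced from the table values. This short sketch takes Hydra's peak as the larger of its two modes, since a single process must accommodate whichever mode uses more memory; the variable names are illustrative.

```python
hydra_peak_gb = max(1.90, 2.37)      # retrieval (LoRA on) vs generation (LoRA off)
two_model_peak_gb = 5.79             # separate retriever + generator

saved_gb = two_model_peak_gb - hydra_peak_gb
reduction_pct = 100 * saved_gb / two_model_peak_gb

print(f"{saved_gb:.2f} GB saved, {reduction_pct:.1f}% reduction")
# prints: 3.42 GB saved, 59.1% reduction
```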

[Figure: Mode-switch VRAM (see results/mode_switch_vram/plot.png)]

Training recipe

| Parameter | Value |
|---|---|
| Base | Qwen/Qwen3.5-0.8B |
| Loss | ColbertLoss, τ=0.02 |
| LoRA r / α / dropout | 32 / 32 / 0.05 |
| target_modules | Qwen3.5 LM + MLP projections + custom_text_proj |
| Optimizer | adamw_torch |
| Scheduler | cosine, 2.5% warmup |
| Learning rate | 5e-5 |
| Effective batch | 224 |
| Precision | bf16 |
| Epochs / steps | 1 / 528 |
| Seed | 42 |
| Dataset | vidore/colpali_train_set train (~118K) |
| Embedding dim | 128 |
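To make the schedule row concrete, the sketch below computes the learning rate at a given optimizer step. It assumes linear warmup from zero followed by cosine decay to zero, with the warmup length rounded up to a whole number of steps; the training script may round or decay differently, and `lr_at_step` is an illustrative name. The last line checks that steps times effective batch is consistent with one epoch over roughly 118K samples.

```python
import math


def lr_at_step(step, total_steps=528, peak_lr=5e-5, warmup_ratio=0.025):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)   # 14 steps for these defaults
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


lr_start = lr_at_step(0)       # 0.0 at the first step
lr_peak = lr_at_step(14)       # peak learning rate once warmup ends
lr_end = lr_at_step(528)       # decays to zero at the final step

samples_seen = 528 * 224       # 118272, in line with the ~118K training set
```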

Usage

import torch
from transformers import Qwen3_5ForConditionalGeneration
from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor
from peft import PeftModel

BASE = "Qwen/Qwen3.5-0.8B"
ADAPTER = "athrael-soju/HydraQwen3.5-0.8B"

# Build the retrieval architecture from the base checkpoint.
model = ColQwen3_5.from_pretrained(
    BASE, torch_dtype=torch.bfloat16,
    attn_implementation="sdpa", ignore_mismatched_sizes=True,
)
# Load the base conditional-generation model and copy its inner weights across.
# strict=False tolerates keys that exist in only one of the two modules.
fcg = Qwen3_5ForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, attn_implementation="sdpa",
)
model.load_state_dict(fcg.model.state_dict(), strict=False)
del fcg
torch.cuda.empty_cache()

# Attach the LoRA adapter, move to the GPU, and switch to eval mode.
model = PeftModel.from_pretrained(model, ADAPTER).cuda().eval()
processor = ColQwen3_5Processor.from_pretrained(BASE, max_num_visual_tokens=768)

# Retrieval: adapter on (see train_hydra_08b.py for the bidirectional-attention patch).
# Generation: `with model.disable_adapter(): model.generate(...)` using the saved lm_head (lm_head.pt).
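In retrieval mode each query and each page is represented by a set of vectors rather than a single one, and ColBERT-style late interaction scores a pair with MaxSim: for every query vector take the best-matching page vector, then sum those maxima. A minimal pure-Python sketch, using tiny 2-dimensional vectors in place of the model's 128-dimensional embeddings; the function names and the example data are illustrative and not part of colpali_engine.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, page_vectors):
    """Sum over query vectors of the best dot product with any page vector."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


def rank_pages(query_vectors, pages):
    """Return page ids sorted from highest to lowest MaxSim score."""
    scores = {pid: maxsim_score(query_vectors, vecs) for pid, vecs in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 0.5]],
    "page_b": [[0.2, 0.1]],
}
score_a = maxsim_score(query, pages["page_a"])   # 1.0 + 0.5
score_b = maxsim_score(query, pages["page_b"])   # 0.2 + 0.1
ranking = rank_pages(query, pages)
```

In practice the same computation is done on batched tensors; the loop form here is only meant to show what is being maximised and summed.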

Environment

pip install -r requirements.txt
pip install flash-linear-attention causal-conv1d

Resume training

Full optimizer, scheduler, and RNG state are in checkpoint-500/ and checkpoint-528/. Resume with the same WORLD_SIZE as the original run, because one rng_state_*.pth file was written per rank (seven in these checkpoints, per the Files listing).

torchrun --nproc_per_node=<N> scripts/train_hydra_08b.py \
  --output-dir ./out --base-model Qwen/Qwen3.5-0.8B \
  --resume-from-checkpoint ./checkpoint-528 --seed 42
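Before launching, it can help to confirm that the planned process count matches the number of per-rank RNG files in the checkpoint. The sketch below assumes the files follow the rng_state_*.pth pattern mentioned above; both helper names are hypothetical and are not part of the training script.

```python
from pathlib import Path


def saved_world_size(checkpoint_dir):
    """Count the per-rank RNG state files stored in a checkpoint directory."""
    return len(list(Path(checkpoint_dir).glob("rng_state_*.pth")))


def check_resume_world_size(checkpoint_dir, world_size):
    """Raise if the planned world size differs from the one that wrote the checkpoint."""
    saved = saved_world_size(checkpoint_dir)
    if saved != world_size:
        raise ValueError(
            f"checkpoint holds {saved} RNG states but {world_size} processes were requested"
        )
    return saved
```

For example, `check_resume_world_size("./checkpoint-528", 7)` would be called with the same value passed to `--nproc_per_node`.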

Files

  • adapter_config.json, adapter_model.safetensors — LoRA
  • lm_head.pt — base lm_head, for LoRA-off generation
  • config.json, processor_config.json, tokenizer*.json, chat_template.jinja
  • requirements.txt
  • scripts/
    • train_hydra_08b.py — training entrypoint
    • eval_hydra_08b_worker.py — single-GPU ViDoRe/MTEB eval worker
    • launch_eval_08b.sh — shards the eval across GPUs
    • test_gen_equivalence.py — LoRA-off vs base bitwise weight audit
    • bench_mode_switch_vram.py — peak VRAM across operating modes
    • plot_reports.py — renders the plots shown above
    • regenerate_readme_hf.py — regenerates this model card from results/
    • hf_sanity_check.py — structural audit of this repo
  • demo/ — Gradio chat demo: upload a PDF, ask a question. Retrieval runs with the adapter on, generation runs with the adapter off, MaxSim heatmap toggle available. cd demo && pip install -r requirements.txt && python app.py.
  • results/
    • vidore/ — per-task MTEB scored JSONs (V1 + V2 + V3)
    • generation_equivalence/report.json + plot.png
    • mode_switch_vram/report.json + plot.png
    • repo_sanity/report.json
  • checkpoint-{500,528}/ — full resume state (optimizer + scheduler + RNG × 7 + training_args)

Citation

@misc{hydra08b,
  title={Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model},
  author={Athrael Soju},
  year={2026},
  url={https://huggingface.co/athrael-soju/HydraQwen3.5-0.8B},
}