HydraQwen3.5-0.8B
Dual-head retrieval + generation adapter on Qwen/Qwen3.5-0.8B. Toggle the LoRA at inference: on for ColBERT-style late-interaction retrieval, off for autoregressive generation on the base model.
Hydra starts from the observation that a LoRA adapter trained for retrieval leaves the base model's weights intact by construction: disabling the adapter recovers the original generation head bit-for-bit. The paper's contribution is identifying three engineering requirements that make this usable in practice: attention-mode restoration (a causal → bidirectional toggle on the full-attention layers), lm_head preservation under weight tying and DDP gradient synchronisation, and KV-cache-aware generation. Once those are addressed, one VLM instance can serve both ColBERT-style late-interaction document retrieval and autoregressive generation, without any generation training and with 59% lower peak VRAM than running the two models separately.
This 0.8B release is a small-scale instantiation of the same mechanism on the canonical ColQwen3.5 community recipe.
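To make the "intact by construction" point concrete, here is a minimal sketch in plain Python. It is not the PEFT implementation; it assumes the standard LoRA formulation, y = W x + (α/r) · B A x, and uses toy 2×2 weights. The helper names `matvec` and `lora_linear` and the `adapter_on` flag are hypothetical, chosen for illustration only.

```python
# Toy LoRA layer: the base weight W is never modified. The adapter adds a
# low-rank update (alpha / r) * B @ A on top of it only when enabled.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_linear(W, A, B, x, alpha, r, adapter_on):
    """Hypothetical helper: y = W x, plus (alpha / r) * B (A x) if adapter_on."""
    y = matvec(W, x)
    if adapter_on:
        delta = matvec(B, matvec(A, x))
        y = [yi + (alpha / r) * di for yi, di in zip(y, delta)]
    return y

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight
A = [[0.5, -0.5]]             # LoRA down-projection (rank 1 in this toy case)
B = [[1.0], [2.0]]            # LoRA up-projection
x = [1.0, 3.0]

# alpha / r = 1 here; the release uses r=32, alpha=32, which gives the same scale.
base_out = matvec(W, x)
off_out = lora_linear(W, A, B, x, alpha=1.0, r=1, adapter_on=False)
on_out = lora_linear(W, A, B, x, alpha=1.0, r=1, adapter_on=True)

print(base_out, off_out, on_out)
```

With the adapter off, the low-rank branch is skipped entirely, so the output matches the base layer exactly rather than approximately.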
- Paper: Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model
- Collection: Hydra — Dual-Head Retrieval and Generation
- Sister model: HydraQwen3.5-4B (larger-scale Hydra)
- Demo: HydraQwen3.5-0.8B-demo (Gradio Space)
Results — ViDoRe (MTEB) nDCG@5
V1 (10 tasks)
| Task | nDCG@5 |
|---|---|
| VidoreArxivQARetrieval | 0.8511 |
| VidoreDocVQARetrieval | 0.6134 |
| VidoreInfoVQARetrieval | 0.9061 |
| VidoreShiftProjectRetrieval | 0.7682 |
| VidoreSyntheticDocQAAIRetrieval | 0.9819 |
| VidoreSyntheticDocQAEnergyRetrieval | 0.9628 |
| VidoreSyntheticDocQAGovernmentReportsRetrieval | 0.9293 |
| VidoreSyntheticDocQAHealthcareIndustryRetrieval | 0.9769 |
| VidoreTabfquadRetrieval | 0.7848 |
| VidoreTatdqaRetrieval | 0.7859 |
| avg (10/10 tasks) | 0.8560 |
V2 (4 tasks)
| Task | nDCG@5 |
|---|---|
| Vidore2BioMedicalLecturesRetrieval | 0.5423 |
| Vidore2ESGReportsHLRetrieval | 0.5678 |
| Vidore2ESGReportsRetrieval | 0.4954 |
| Vidore2EconomicsReportsRetrieval | 0.5212 |
| avg (4/4 tasks) | 0.5317 |
V3 (8 tasks, 6 language sub-splits each)
| Task | nDCG@5 (multilingual avg) |
|---|---|
| Vidore3ComputerScienceRetrieval | 0.6319 |
| Vidore3EnergyRetrieval | 0.4908 |
| Vidore3FinanceEnRetrieval | 0.4343 |
| Vidore3FinanceFrRetrieval | 0.3044 |
| Vidore3HrRetrieval | 0.4093 |
| Vidore3IndustrialRetrieval | 0.3210 |
| Vidore3PharmaceuticalsRetrieval | 0.5508 |
| Vidore3PhysicsRetrieval | 0.4051 |
| avg (8/8 tasks) | 0.4434 |
V3 tasks have 6 language sub-splits; the reported value is the multilingual mean per task.
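For readers unfamiliar with the metric, the sketch below computes nDCG@5 for a single query. It assumes the common definition with a linear gain (relevance divided by log2 of rank + 1) and assumes the ranked list contains every relevant document; the MTEB implementation may differ in details such as the gain function or tie handling. `dcg_at_k` and `ndcg_at_k` are illustrative helpers, not part of this repo.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k entries (ranks start at 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# One relevant page, returned at rank 2 out of 5 retrieved pages.
score = ndcg_at_k([0, 1, 0, 0, 0], k=5)
print(round(score, 4))
```

A task-level score such as those in the tables is the mean of this per-query value over all queries in the task.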
Generation equivalence
When the LoRA is disabled, the Hydra model is the vanilla Qwen/Qwen3.5-0.8B. The audit in scripts/test_gen_equivalence.py checks three invariants and a per-layer state-dict comparison:
- `adapter_config.json` has `modules_to_save=null` and does not target `lm_head` or `embed_tokens`
- `adapter_model.safetensors` contains only LoRA A/B pairs (no standalone weight tensors)
- Every `language_model` weight tensor in the Hydra stack is byte-for-byte identical to the corresponding weight in a freshly-loaded base model
The left panel below shows the three invariants (all pass). The right panel shows the state-dict comparison across all language-model weight tensors.
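The first two invariants can be checked without loading any weights. The following is a rough sketch of that idea, not the code in `scripts/test_gen_equivalence.py`: it takes the parsed `adapter_config.json` as a dict and the list of tensor names from `adapter_model.safetensors`, and assumes PEFT's usual `lora_A` / `lora_B` naming for adapter tensors. The function name and the example tensor names are hypothetical.

```python
def check_adapter_invariants(adapter_config, tensor_names):
    """Return pass/fail flags for the config and tensor-name invariants."""
    targets = adapter_config.get("target_modules") or []
    if isinstance(targets, str):
        # target_modules may be a single pattern string; treat it as one entry
        # and fall back to a substring heuristic.
        targets = [targets]
    forbidden = ("lm_head", "embed_tokens")
    return {
        "no_modules_to_save": adapter_config.get("modules_to_save") is None,
        "no_head_or_embed_targets": not any(
            name in target for target in targets for name in forbidden
        ),
        "only_lora_pairs": all(
            ".lora_A." in name or ".lora_B." in name for name in tensor_names
        ),
    }

# Illustrative inputs (made up for this example, not read from the repo).
example_config = {"modules_to_save": None, "target_modules": ["q_proj", "custom_text_proj"]}
example_names = [
    "base_model.model.layers.0.q_proj.lora_A.weight",
    "base_model.model.layers.0.q_proj.lora_B.weight",
]
report = check_adapter_invariants(example_config, example_names)
print(report)
```

The third invariant, the byte-for-byte weight comparison, needs both models in memory and is outside the scope of this sketch.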
Mode-switching VRAM
The dual-head design means one process holds one set of weights, toggling the LoRA in place. A conventional setup needs two separate models in memory for the same retrieve-then-generate flow. Both configurations are measured on the same hardware with the same inputs (scripts/bench_mode_switch_vram.py).
| Mode | Peak VRAM (GB) |
|---|---|
| Vanilla base generation | 2.02 |
| Hydra retrieval (LoRA on) | 1.90 |
| Hydra generation (LoRA off) | 2.37 |
| Two-model deployment (separate retriever + generator) | 5.79 |
Hydra peaks at 2.37 GB vs 5.79 GB for the two-model setup — a 59.1% reduction (3.42 GB saved).
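The reduction figure follows directly from the table. As a worked check, using only the numbers reported above and taking Hydra's peak as the larger of its two modes:

```python
# Peak VRAM in GB, copied from the table above.
peak_vram_gb = {
    "vanilla_generation": 2.02,
    "hydra_retrieval": 1.90,
    "hydra_generation": 2.37,
    "two_model": 5.79,
}

# A single Hydra process must fit its most expensive mode.
hydra_peak = max(peak_vram_gb["hydra_retrieval"], peak_vram_gb["hydra_generation"])
saved_gb = peak_vram_gb["two_model"] - hydra_peak
reduction_pct = 100 * saved_gb / peak_vram_gb["two_model"]

print(f"{saved_gb:.2f} GB saved ({reduction_pct:.1f}% reduction)")
# prints "3.42 GB saved (59.1% reduction)"
```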
Training recipe
| Parameter | Value |
|---|---|
| Base | Qwen/Qwen3.5-0.8B |
| Loss | ColbertLoss, τ=0.02 |
| LoRA r / α / dropout | 32 / 32 / 0.05 |
| target_modules | Qwen3.5 LM + MLP projections + custom_text_proj |
| Optimizer | adamw_torch |
| Scheduler | cosine, 2.5% warmup |
| Learning rate | 5e-5 |
| Effective batch | 224 |
| Precision | bf16 |
| Epochs / steps | 1 / 528 |
| Seed | 42 |
| Dataset | vidore/colpali_train_set train (~118K) |
| Embedding dim | 128 |
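To show what the scheduler row implies over 528 steps, here is a small sketch of the learning-rate curve. It assumes linear warmup followed by a half-cosine decay to zero (the shape of Hugging Face's `get_cosine_schedule_with_warmup`), and it assumes that 2.5% of 528 steps is rounded up to 14 warmup steps; neither detail is stated in the table. `lr_at` is an illustrative helper.

```python
import math

TOTAL_STEPS = 528
PEAK_LR = 5e-5
EFFECTIVE_BATCH = 224
WARMUP_STEPS = math.ceil(0.025 * TOTAL_STEPS)  # assumption: rounded up

def lr_at(step):
    """Learning rate at a given optimizer step under the assumed schedule."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# One epoch: steps x effective batch should roughly match the dataset size.
samples_seen = TOTAL_STEPS * EFFECTIVE_BATCH

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
print("samples seen:", samples_seen)
```

The sample count comes to 118,272, which is consistent with one epoch over the roughly 118K-example training set.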
Usage
```python
import torch
from transformers import Qwen3_5ForConditionalGeneration
from colpali_engine.models import ColQwen3_5, ColQwen3_5Processor
from peft import PeftModel

BASE = "Qwen/Qwen3.5-0.8B"
ADAPTER = "athrael-soju/HydraQwen3.5-0.8B"

# Build the retrieval wrapper from the base checkpoint.
model = ColQwen3_5.from_pretrained(
    BASE, torch_dtype=torch.bfloat16,
    attn_implementation="sdpa", ignore_mismatched_sizes=True,
)

# Load the base weights through the generation class and copy them in;
# strict=False tolerates keys that exist in only one of the two models.
fcg = Qwen3_5ForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, attn_implementation="sdpa",
)
model.load_state_dict(fcg.model.state_dict(), strict=False)
del fcg
torch.cuda.empty_cache()

# Attach the LoRA adapter, move to GPU, and switch to eval mode.
model = PeftModel.from_pretrained(model, ADAPTER).cuda().eval()
processor = ColQwen3_5Processor.from_pretrained(BASE, max_num_visual_tokens=768)

# Retrieval: adapter on (see train_hydra_08b.py for the bidirectional-attention patch).
# Generation: `with model.disable_adapter(): model.generate(...)` using the saved lm_head.
```
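With the adapter on, the model produces one embedding per token, and pages are ranked by ColBERT-style late interaction. The sketch below shows the MaxSim score in plain Python so the scoring rule is visible: for each query token, take the best-matching page token, then sum. It is not the colpali_engine scoring code, and the 2-dimensional vectors are toy values (the released model uses 128-dimensional embeddings). `maxsim_score` is a hypothetical helper name.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the max dot product with any document token."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )

# Toy multi-vector embeddings: one row per token.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    "page_b": [[0.1, 0.2], [0.3, 0.1]],
}

scores = {name: maxsim_score(query, vecs) for name, vecs in pages.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Because every query token is matched independently, a page can score well when different regions of it answer different parts of the query, which is what a MaxSim heatmap visualises.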
Environment
```bash
pip install -r requirements.txt
pip install flash-linear-attention causal-conv1d
```
Resume training
Full optimizer, scheduler, and RNG state are in checkpoint-500/ and checkpoint-528/. Resume with the same WORLD_SIZE as the original run, because one rng_state_*.pth file was written per rank (seven in these checkpoints, per the file listing under Files).
```bash
torchrun --nproc_per_node=<N> scripts/train_hydra_08b.py \
    --output-dir ./out --base-model Qwen/Qwen3.5-0.8B \
    --resume-from-checkpoint ./checkpoint-528 --seed 42
```
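Before launching, it can help to confirm that the process count matches the checkpoint. The following is a small pre-flight sketch, not a script shipped in this repo: it counts the per-rank `rng_state_*.pth` files in a checkpoint directory and compares that with the intended `--nproc_per_node` value. It assumes a single-node run, so that `nproc_per_node` equals WORLD_SIZE; both function names are hypothetical.

```python
from pathlib import Path

def rng_state_count(checkpoint_dir):
    """Count per-rank RNG files (rng_state_*.pth) in a checkpoint directory."""
    return len(list(Path(checkpoint_dir).glob("rng_state_*.pth")))

def check_world_size(checkpoint_dir, nproc_per_node):
    """Raise if the planned process count differs from the saved RNG file count."""
    found = rng_state_count(checkpoint_dir)
    if found != nproc_per_node:
        raise ValueError(
            f"{checkpoint_dir} holds {found} rng_state files, "
            f"but nproc_per_node={nproc_per_node}"
        )
    return found
```

For example, `check_world_size("./checkpoint-528", 7)` would pass if the directory contains seven RNG files and raise a `ValueError` otherwise.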
Files
- `adapter_config.json`, `adapter_model.safetensors`: LoRA
- `lm_head.pt`: base lm_head, for LoRA-off generation
- `config.json`, `processor_config.json`, `tokenizer*.json`, `chat_template.jinja`
- `requirements.txt`
- `scripts/`
  - `train_hydra_08b.py`: training entrypoint
  - `eval_hydra_08b_worker.py`: single-GPU ViDoRe/MTEB eval worker
  - `launch_eval_08b.sh`: shards the eval across GPUs
  - `test_gen_equivalence.py`: LoRA-off vs base bitwise weight audit
  - `bench_mode_switch_vram.py`: peak VRAM across operating modes
  - `plot_reports.py`: renders the plots shown above
  - `regenerate_readme_hf.py`: regenerates this model card from `results/`
  - `hf_sanity_check.py`: structural audit of this repo
- `demo/`: Gradio chat demo: upload a PDF, ask a question. Retrieval runs with the adapter on, generation runs with the adapter off, MaxSim heatmap toggle available. `cd demo && pip install -r requirements.txt && python app.py`.
- `results/`
  - `vidore/`: per-task MTEB scored JSONs (V1 + V2 + V3)
  - `generation_equivalence/`: `report.json` + `plot.png`
  - `mode_switch_vram/`: `report.json` + `plot.png`
  - `repo_sanity/report.json`
- `checkpoint-{500,528}/`: full resume state (optimizer + scheduler + RNG × 7 + training_args)
Citation
@misc{hydra08b,
title={Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model},
author={Athrael Soju},
year={2026},
url={https://huggingface.co/athrael-soju/HydraQwen3.5-0.8B},
}

