A lightweight multimodal vision-language model specialized for technical document retrieval.
Flantier-SmolVLM-2B-dse (Document Screenshot Embedding) is a 2B parameter vision-language model designed for efficient retrieval of technical documentation. It directly encodes document screenshots into embeddings, preserving all information including text, images, and layout without requiring separate content extraction.
pip install transformers accelerate pillow
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
# Load model and processor
processor = AutoProcessor.from_pretrained("racineai/Flantier-SmolVLM-2B-dse")
model = AutoModelForVision2Seq.from_pretrained(
    "racineai/Flantier-SmolVLM-2B-dse",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Load document image
document_image = Image.open("technical_document.jpg")
# Process for document embedding
doc_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    },
]
doc_prompt = processor.apply_chat_template(doc_messages, add_generation_prompt=True)
doc_inputs = processor(text=doc_prompt, images=[document_image], return_tensors="pt").to(model.device)
# Generate document embedding
with torch.no_grad():
    doc_outputs = model(**doc_inputs, output_hidden_states=True, return_dict=True)
    doc_embedding = doc_outputs.hidden_states[-1][:, -1]  # Last token embedding
    doc_embedding = torch.nn.functional.normalize(doc_embedding, p=2, dim=-1)
# Process query embedding
query = "What are the specifications of this component?"
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query}
        ]
    },
]
query_prompt = processor.apply_chat_template(query_messages, add_generation_prompt=True)
query_inputs = processor(text=query_prompt, return_tensors="pt").to(model.device)
# Generate query embedding
with torch.no_grad():
    query_outputs = model(**query_inputs, output_hidden_states=True, return_dict=True)
    query_embedding = query_outputs.hidden_states[-1][:, -1]  # Last token embedding
    query_embedding = torch.nn.functional.normalize(query_embedding, p=2, dim=-1)
# Calculate similarity
similarity = torch.nn.functional.cosine_similarity(query_embedding, doc_embedding)
print(f"Similarity score: {similarity.item():.4f}")
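The example above takes `hidden_states[-1][:, -1]`, which is the last token only when the sequence has no trailing padding. That holds for a single input, but in a batch of inputs with different lengths the final position of a shorter sequence may be a padding token. The following is a minimal sketch of mask-aware last-token pooling. The helper name `last_token_pool` is hypothetical and not part of this model's API, and the tensors are synthetic stand-ins for real model outputs.

```python
import torch


def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Return the L2-normalized hidden state of the last non-padding token per row.

    hidden:         (batch, seq_len, dim) final-layer hidden states
    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    seq_len = attention_mask.shape[1]
    positions = torch.arange(seq_len, device=attention_mask.device)
    # Multiplying by the mask zeroes out padded positions, so argmax finds the
    # index of the last real token with either left or right padding.
    last_idx = (attention_mask.long() * positions).argmax(dim=1)
    rows = torch.arange(hidden.shape[0], device=hidden.device)
    pooled = hidden[rows, last_idx]
    return torch.nn.functional.normalize(pooled, p=2, dim=-1)


# Synthetic demo: two sequences of length 4, the first one right-padded by one token.
hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0],
                     [1, 1, 1, 1]])
emb = last_token_pool(hidden, mask)
print(emb.shape)
```

With real inputs, and assuming the processor output includes an `attention_mask` entry, the call would look like `last_token_pool(doc_outputs.hidden_states[-1], doc_inputs["attention_mask"])`.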
This model was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This eliminates the need for content extraction preprocessing while preserving all visual and textual information in documents.
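Because queries and document screenshots are embedded into the same vector space, retrieval over a collection can be reduced to a nearest-neighbor search on those embeddings. The sketch below ranks several documents for one query by cosine similarity. The function `rank_documents` and the toy 3-dimensional vectors are illustrative assumptions; real embeddings would come from the model as in the usage example.

```python
import torch


def rank_documents(query_embedding: torch.Tensor,
                   doc_embeddings: torch.Tensor,
                   top_k: int = 3):
    """Return (scores, indices) of the top_k documents for a single query.

    query_embedding: (dim,) or (1, dim)
    doc_embeddings:  (num_docs, dim)
    """
    # Cast to float32 so scoring stays stable even if the model ran in bfloat16.
    q = torch.nn.functional.normalize(query_embedding.reshape(1, -1).float(), dim=-1)
    d = torch.nn.functional.normalize(doc_embeddings.float(), dim=-1)
    # On unit vectors, the dot product equals cosine similarity.
    scores = (q @ d.T).squeeze(0)
    k = min(top_k, d.shape[0])
    top = torch.topk(scores, k=k)
    return top.values, top.indices


# Toy stand-ins for real embeddings (hypothetical values).
doc_embeddings = torch.tensor([[1.0, 0.0, 0.0],
                               [0.0, 1.0, 0.0],
                               [0.6, 0.8, 0.0]])
query_embedding = torch.tensor([0.9, 0.1, 0.0])

scores, indices = rank_documents(query_embedding, doc_embeddings, top_k=2)
print(indices.tolist())  # [0, 2]
```

For large collections, the document embeddings would typically be computed once and stored, so that only the query needs to be embedded at search time.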
@misc{flantier-smolvlm-dse,
author = {racine.ai},
title = {Flantier-SmolVLM-2B-dse: A Lightweight Document Screenshot Embedding Model},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/racineai/Flantier-SmolVLM-2B-dse}
}
This model is released under the Apache 2.0 license.
Base model: HuggingFaceTB/SmolLM2-1.7B