---
library_name: transformers
tags:
- fp8
- vllm
- vision
- quantized
- compressed-tensors
- qwen3_vl
- embedding
- multimodal embedding
- feature-extraction
license: apache-2.0
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
base_model: Qwen/Qwen3-VL-Embedding-8B
pipeline_tag: feature-extraction
---

# Qwen3-VL-Embedding-8B-FP8

This is an **FP8-quantized** version of [Qwen/Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B), optimized for efficient inference with vLLM.

## Model Overview

| Attribute | Value |
|-----------|-------|
| Base Model | [Qwen/Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B) |
| Quantization | FP8 Dynamic (W8A8) |
| Original Size | ~16 GB (BF16) |
| Quantized Size | ~9 GB (FP8) |
| Memory Savings | ~45% |
| Embedding Dimension | 4096 |
| Supported Inputs | Text, Images, Videos, Multimodal |
| Context Length | 32K tokens |
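
The sizes in the table can be sanity-checked with simple arithmetic. The sketch below assumes a round 8 billion parameters (an assumption based on the "8B" in the model name, not an exact count). The reported ~9 GB sits above the all-FP8 lower bound because the vision encoder and embeddings are kept in BF16.

```python
# Back-of-the-envelope check of the sizes quoted in the table above.
GB = 1e9

def weight_size_gb(num_params, bytes_per_param):
    """Approximate checkpoint size in GB for a given storage precision."""
    return num_params * bytes_per_param / GB

num_params = 8e9  # assumption: "8B" taken as a round 8 billion parameters

bf16_gb = weight_size_gb(num_params, 2)  # BF16 stores 2 bytes per weight
fp8_gb = weight_size_gb(num_params, 1)   # FP8 stores 1 byte per weight

# Sizes reported in the table
reported_bf16_gb, reported_fp8_gb = 16, 9
savings = 1 - reported_fp8_gb / reported_bf16_gb

print(f"BF16 estimate: {bf16_gb:.0f} GB")
print(f"All-FP8 lower bound: {fp8_gb:.0f} GB")
print(f"Savings from reported sizes: {savings:.1%}")
```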

## Highlights

- **Multimodal Versatility**: Handles text, images, screenshots, and video inputs
- **Efficient Inference**: ~45% memory reduction with minimal accuracy loss
- **vLLM Compatible**: Works with vLLM's pooling runner for high-throughput embedding
- **No Calibration Required**: Uses FP8_DYNAMIC scheme (data-free quantization)

## Quantization Details

| Component | Precision | Notes |
|-----------|-----------|-------|
| Vision Encoder (ViT) | BF16 | Preserved for accuracy |
| LLM Decoder Layers | FP8 | Quantized for efficiency |
| Embeddings | BF16 | Preserved |

- **Scheme**: FP8_DYNAMIC
  - Weights: FP8_E4M3 (per-channel quantization)
  - Activations: Dynamic per-token quantization at runtime
- **Tool**: [llm-compressor](https://github.com/vllm-project/llm-compressor)
- **Calibration**: None required (data-free quantization)
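
To make the scheme above concrete, here is a minimal NumPy simulation of per-channel weight scaling and per-token activation scaling. It is a sketch under stated assumptions, not the kernels that vLLM or llm-compressor actually run: FP8 rounding is emulated in float64, and the function names are hypothetical.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value

def round_to_e4m3(x):
    """Round float values to the nearest value on the E4M3 grid (simulation)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # frexp gives |x| = m * 2**e with m in [0.5, 1), so floor(log2|x|) = e - 1
    _, e = np.frexp(np.abs(x))
    exponent = np.maximum(e - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)      # 3 mantissa bits -> 8 steps per binade
    return np.round(x / step) * step

def quantize_rows(t):
    """Scale each row so its largest magnitude maps to FP8_E4M3_MAX, then round."""
    row_max = np.maximum(np.abs(t).max(axis=1, keepdims=True), 1e-12)
    scale = row_max / FP8_E4M3_MAX
    return round_to_e4m3(t / scale), scale

def fp8_dynamic_linear(x, w):
    """x: (tokens, in_features), w: (out_features, in_features)."""
    xq, x_scale = quantize_rows(x)  # per-token scales, computed at runtime
    wq, w_scale = quantize_rows(w)  # per-output-channel scales, computed once
    return (xq @ wq.T) * x_scale * w_scale.T

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
w = rng.standard_normal((8, 64))
exact = x @ w.T
approx = fp8_dynamic_linear(x, w)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error: {rel_err:.3f}")
```

Because both sets of scales are derived from the tensors themselves, no calibration data is needed, which is why the scheme is described as data-free.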

## Hardware Requirements

- **GPU**: NVIDIA GPU with FP8 support (compute capability >= 8.9)
  - Blackwell: RTX 5090, RTX 5080
  - Ada Lovelace: RTX 4090, RTX 4080
  - Hopper: H100, H200
- **VRAM**: ~10 GB minimum for inference
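
A quick way to check the compute capability requirement above is sketched below. The helper names are hypothetical. The comparison uses `(major, minor)` tuples rather than floats so that two-digit major versions compare correctly. `torch.cuda.get_device_capability` is a standard PyTorch call, and the sketch returns `None` when PyTorch or CUDA is unavailable.

```python
FP8_MIN_CAPABILITY = (8, 9)  # from the requirement above

def supports_fp8(capability):
    """capability is a (major, minor) tuple, e.g. (8, 9) for Ada Lovelace."""
    return tuple(capability) >= FP8_MIN_CAPABILITY

def local_gpu_supports_fp8(device=0):
    """True/False for the local GPU, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return supports_fp8(torch.cuda.get_device_capability(device))

print(supports_fp8((8, 6)))  # an Ampere-class card below the threshold
print(supports_fp8((9, 0)))  # Hopper
print(local_gpu_supports_fp8())
```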

## Usage

### With vLLM (>=0.14.0) (Recommended)

```python
from vllm import LLM, EngineArgs
import numpy as np

# Initialize vLLM with pooling runner
engine_args = EngineArgs(
    model="RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    runner="pooling",
    dtype="bfloat16",
    trust_remote_code=True,
)
llm = LLM(**vars(engine_args))

# Prepare inputs
tokenizer = llm.get_tokenizer()

def format_input(text, instruction="Represent the user's input."):
    conversation = [
        {"role": "system", "content": [{"type": "text", "text": instruction}]},
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    prompt = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    return {"prompt": prompt}

# Get embeddings
inputs = [
    format_input("A woman playing with her dog on the beach."),
    format_input("Machine learning for image classification."),
]
outputs = llm.embed(inputs)

# Extract embeddings
embeddings = np.array([o.outputs.embedding for o in outputs])
print(f"Embeddings shape: {embeddings.shape}")  # (2, 4096)

# Cosine similarity via dot product (assumes the returned embeddings are L2-normalized)
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.4f}")
```
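
For retrieval, the pairwise similarity above generalizes to a query-by-document score matrix. The following is a minimal sketch with hypothetical helper names. It normalizes explicitly so it does not depend on the engine returning unit-length vectors, and it uses toy 3-dimensional vectors in place of real 4096-dimensional embeddings.

```python
import numpy as np

def cosine_similarity_matrix(queries, documents):
    """Return a (num_queries, num_documents) matrix of cosine similarities."""
    q = np.asarray(queries, dtype=np.float64)
    d = np.asarray(documents, dtype=np.float64)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return q @ d.T

def top_k(scores, k=1):
    """Indices of the k highest-scoring documents per query, best first."""
    return np.argsort(-scores, axis=1)[:, :k]

# Toy vectors standing in for real embeddings
queries = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 2.0]])
documents = np.array([[0.0, 1.0, 1.0], [3.0, 0.1, 0.0], [1.0, 1.0, 1.0]])

scores = cosine_similarity_matrix(queries, documents)
print(top_k(scores, k=2))
```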

### With vLLM (>=0.14.0) Server

```bash
# Start the server with the pooling runner
vllm serve RamManavalan/Qwen3-VL-Embedding-8B-FP8 --runner pooling

# Query via API
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Your text here", "model": "RamManavalan/Qwen3-VL-Embedding-8B-FP8"}'
```
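
The same request can be sent from Python using only the standard library. This is a sketch that assumes the server is running at `http://localhost:8000` and returns the OpenAI-style embeddings response, with a `data` list whose items carry `index` and `embedding`. The function names are hypothetical.

```python
import json
import urllib.request

def build_embedding_request(texts, model, base_url="http://localhost:8000"):
    """Build a POST request for the /v1/embeddings endpoint."""
    payload = json.dumps({"input": texts, "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embedding_response(body):
    """Return the embeddings ordered by their `index` field."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

def embed(texts, model, base_url="http://localhost:8000"):
    """Send the request and return one embedding per input text."""
    request = build_embedding_request(texts, model, base_url)
    with urllib.request.urlopen(request) as response:
        return parse_embedding_response(json.load(response))
```

With the server running, `embed(["Your text here"], "RamManavalan/Qwen3-VL-Embedding-8B-FP8")` would return a list containing one embedding vector.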

### With Transformers

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    trust_remote_code=True,
)

# Prepare input
messages = [
    {"role": "system", "content": [{"type": "text", "text": "Represent the user's input."}]},
    {"role": "user", "content": [{"type": "text", "text": "Your text here"}]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], return_tensors="pt", padding=True).to(model.device)

# Get embedding (last-token pooling)
with torch.no_grad():
    outputs = model.model(**inputs, output_hidden_states=True)
    # Index of the last non-padding token (assumes right padding or no padding)
    seq_len = inputs['attention_mask'].sum(dim=1) - 1
    embedding = outputs.last_hidden_state[0, seq_len[0]]
    embedding = torch.nn.functional.normalize(embedding, p=2, dim=-1)

print(f"Embedding shape: {embedding.shape}")  # torch.Size([4096])
```
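
The snippet above pools a single sequence. For batches, `attention_mask.sum(dim=1) - 1` points at the last real token only when sequences are right-padded. The sketch below uses NumPy as a stand-in for the torch tensors to show a batched last-token pooling that works for either padding side. The function name and toy shapes are illustrative assumptions.

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask):
    """hidden_states: (batch, seq, dim); attention_mask: (batch, seq) of 0/1."""
    mask = np.asarray(attention_mask)
    seq_len = mask.shape[1]
    # Position of the last 1 in each row: reverse the row, find the first 1
    last_idx = seq_len - 1 - np.argmax(mask[:, ::-1], axis=1)
    pooled = hidden_states[np.arange(mask.shape[0]), last_idx]
    # L2-normalize each pooled vector
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

hidden = np.arange(2 * 4 * 3, dtype=np.float64).reshape(2, 4, 3)
right_padded = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])
left_padded = np.array([[0, 1, 1, 1], [1, 1, 1, 1]])

print(last_token_pool(hidden, right_padded))
print(last_token_pool(hidden, left_padded))
```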

### Using the Helper Class

This repository includes a helper class for easier embedding extraction:

```python
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder

# Initialize
model = Qwen3VLEmbedder(model_name_or_path="RamManavalan/Qwen3-VL-Embedding-8B-FP8")

# Get embeddings for text, images, or multimodal inputs
inputs = [
    {"text": "A dog on the beach"},
    {"image": "path/to/image.jpg"},
    {"text": "What is in this image?", "image": "path/to/image.jpg"},
]
embeddings = model.process(inputs)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 4096)
```

## Benchmark Results

The base model achieves state-of-the-art performance on multimodal benchmarks:

| Benchmark | Score |
|-----------|-------|
| MMEB-V2 Overall | 77.9 |
| MMTEB Mean | 67.88 |

*These scores are for the BF16 base model. FP8 quantization typically preserves >95% of the original model's accuracy, but scores for the FP8 checkpoint are not reported here.*
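
One way to gauge how closely a quantized model tracks its base model on your own data is to embed the same inputs with both and compare the results. The sketch below is a hypothetical check, not a benchmark from the model authors. It uses synthetic arrays in place of real embeddings, and `top1_agreement` assumes L2-normalized inputs.

```python
import numpy as np

def paired_cosine(reference, candidate):
    """Cosine similarity between matching rows of two (n, dim) arrays."""
    a = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    b = candidate / np.linalg.norm(candidate, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def top1_agreement(ref_q, ref_d, cand_q, cand_d):
    """Fraction of queries whose best-scoring document matches across models."""
    ref_best = np.argmax(ref_q @ ref_d.T, axis=1)
    cand_best = np.argmax(cand_q @ cand_d.T, axis=1)
    return float(np.mean(ref_best == cand_best))

# Synthetic stand-ins: the "candidate" is the reference plus small noise
rng = np.random.default_rng(0)
ref = rng.standard_normal((5, 16))
cand = ref + 0.01 * rng.standard_normal((5, 16))

print(f"min paired cosine: {paired_cosine(ref, cand).min():.4f}")
```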

## Creation

This model was quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor):

```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the BF16 base model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-8B",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# FP8 quantization recipe (data-free)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        r"re:model\.visual\..*",  # Keep vision encoder in BF16
    ]
)

# Apply quantization
oneshot(model=model, recipe=recipe)

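# Save the compressed weights, then store the base model's processor alongside them
model.save_pretrained("Qwen3-VL-Embedding-8B-FP8", save_compressed=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-Embedding-8B", trust_remote_code=True)
processor.save_pretrained("Qwen3-VL-Embedding-8B-FP8")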
```
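
The `ignore` list in the recipe mixes an exact module name with a `re:`-prefixed regular expression. The sketch below is a simplified approximation of that matching, not llm-compressor's actual implementation. It shows which layers would stay in BF16, using hypothetical module names chosen for illustration.

```python
import re

def is_ignored(module_name, ignore):
    """Approximate matching: 're:' entries are regexes, others are exact names."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            return True
    return False

ignore = ["lm_head", r"re:model\.visual\..*"]

# Hypothetical module names, for illustration only
names = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.self_attn.q_proj",
]
for name in names:
    precision = "BF16 (ignored)" if is_ignored(name, ignore) else "FP8"
    print(f"{name}: {precision}")
```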

## Citation

If you use this model, please cite the original Qwen3-VL-Embedding paper:

```bibtex
@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2601.04720},
  year={2026}
}
```

## License

Apache 2.0 (same as base model)

## Acknowledgments

- [Qwen Team](https://github.com/QwenLM) for the original Qwen3-VL-Embedding model
- [vLLM Team](https://github.com/vllm-project/vllm) for the inference engine
- [Neural Magic](https://github.com/vllm-project/llm-compressor) for llm-compressor