jina-embeddings-v5-omni-small-clustering-GGUF: Clustering-Targeted Omni Embedding (Small) — GGUF
Average score vs. parameter count across image (MIEB-Lite), video (MMEB-V), and audio (MAEB) benchmarks — jina-v5-omni-nano and jina-v5-omni-small define the open-weight frontier (Table 1 in the ArXiv report).
Model Overview
GGUF + multimodal-projector build of jinaai/jina-embeddings-v5-omni-small-clustering for llama.cpp. Accepts text, images, video, and audio and produces 1024-dim embeddings in the same vector space as the torch reference and as jinaai/jina-embeddings-v5-text-small-clustering at the same task: index with text and query with any modality, no reindexing. For a more compact alternative, see jinaai/jina-embeddings-v5-omni-nano-clustering-GGUF.
This is the clustering-targeted variant of the jina-embeddings-v5-omni-small GGUF family. The umbrella with all GGUF variants and cross-repo benchmarks is jina-ai/jina-embeddings-v5-omni-gguf.
| Feature | Value |
|---|---|
| Parameters | ~1.56B (text + vision + audio towers) |
| Embedding Dimension | 1024 |
| Supported Tasks | clustering |
| Max Sequence Length | 32768 |
| Pooling Strategy | Last-token |
| Supported Inputs | text, image, video, audio |
| Supported File Types | images: .jpg, .jpeg, .png, .gif, .webp, .bmp, .tif, .tiff, .avif, .heic, .svg; video: .mp4, .avi, .mov, .mkv, .webm, .flv, .wmv; audio: .wav, .mp3, .flac, .ogg, .m4a, .opus; documents: .pdf |
| Matryoshka Dimensions | 32, 64, 128, 256, 512, 768, 1024 |
| Quantization | text: F16 + 13 imatrix-calibrated int-quant levels; mmprojs: F16 |
Cross-repo docs: benchmarks (NDCG@5 on NanoBEIR, tokens/sec, peak VRAM, file size), the full per-variant numerical-parity tables, and runtime caveats live in the v5-omni-gguf umbrella.
Via Elastic Inference Service
The fastest way to use v5-omni in production. Elastic Inference Service (EIS) provides managed embedding inference with built-in scaling, so you can generate embeddings directly within your Elastic deployment.
# Retrieve the configuration of the preconfigured omni-small inference endpoint
GET /_inference/embedding/.jina-embeddings-v5-omni-small
# Generate an embedding for a single piece of text using the predefined endpoint
POST _inference/embedding/.jina-embeddings-v5-omni-small
{
"input": [
"This is a test"
]
}
# Fuse a text description and an image into a single embedding via a multimodal content block
POST _inference/embedding/.jina-embeddings-v5-omni-small
{
"input": [
{
"content": [
{ "type": "text", "value": "A small blue square" },
{ "type": "image", "format": "base64", "value": "<BASE64_IMAGE_DATA>" }
]
}
]
}
# Create a custom endpoint that truncates omni-small embeddings to 32 dimensions
PUT _inference/embedding/jina-omni-small-32d
{
"service": "elastic",
"service_settings": {
"model_id": "jina-embeddings-v5-omni-small",
"dimensions": 32
}
}
See the Elastic Inference Service documentation for setup details.
Files in this repo
| File | Purpose |
|---|---|
| jina-embeddings-v5-omni-small-clustering-F16.gguf (and 13 *-Q*.gguf quants) | text-tower GGUF (with vocab + tokenizer) |
| jina-embeddings-v5-omni-small-clustering-vision-mmproj-F16.gguf | vision multimodal projector (Qwen3-VL ViT + merger) |
| jina-embeddings-v5-omni-small-clustering-audio-mmproj-F16.gguf | audio multimodal projector (Qwen2.5-Omni audio encoder + Linear) |
llama.cpp loads --mmproj to enable image / video / audio inputs on top
of the text GGUF. The two mmprojs are independent — load whichever
modality you need, or pass --mmproj twice to serve both from one
process (see "Selective modality loading" below).
Install llama.cpp (with multimodal patches)
This model relies on the Jina v5 omni patches (audio chunked attention,
qwen3vl video temporal-pair, encoder combined-decode, etc.) — they are
not yet upstream. Build from the feat-v5-omni fork:
git clone https://github.com/jina-ai/llama.cpp.git
cd llama.cpp
git checkout feat-v5-omni
cmake -B build && cmake --build build --config Release -j
For CUDA: pass -DGGML_CUDA=ON to the configure step.
H100 / Hopper note. On Hopper GPUs (H100, H200), set GGML_CUDA_DISABLE_GRAPHS=1 before launching llama-server. Without it, the CUDA-graph capture/replay path crashes with cudaMemcpyAsync … illegal instruction during embedding extraction. CPU, Metal, Vulkan, and pre-Hopper CUDA (e.g. L4, A100) are unaffected.
Quickstart — text via llama-embedding
./build/bin/llama-embedding \
-hf jinaai/jina-embeddings-v5-omni-small-clustering-GGUF:Q4_K_M \
--pooling last --embd-normalize 2 \
-p "A cute cat sitting on a mat."
The -hf shortcut downloads and caches the requested quant from this repo
on first use. Q4_K_M is the recommended CPU default; Q8_0 gives the highest
fidelity among the quantized levels; IQ2_*/IQ1_* are for very tight memory budgets.
No prefix convention. Clustering text is embedded verbatim — no Query: / Document: prefixes are needed (unlike the retrieval variant). Both sides of any pair go in unprefixed.
No custom pooling or padding code needed — --pooling last and --embedding are the only flags required; min_pixels / max_pixels / temporal_patch_size are baked into the GGUF metadata and the mmproj.
Quickstart — text + image via llama-server
Start the server with the vision mmproj:
./build/bin/llama-server \
-m jina-embeddings-v5-omni-small-clustering-Q4_K_M.gguf \
--mmproj jina-embeddings-v5-omni-small-clustering-vision-mmproj-F16.gguf \
--embedding --pooling last \
--host 127.0.0.1 --port 8080
POST to /embeddings with the v5-omni image prompt template:
import base64, requests
with open("photo.jpg", "rb") as f:
img_b64 = base64.b64encode(f.read()).decode()
# text query
q = requests.post("http://127.0.0.1:8080/embeddings", json={
"content": [{"prompt_string": "A cute cat sitting on a mat."}]
}).json()[0]["embedding"]
# image embedding (one base64-encoded image per <__media__> marker)
i = requests.post("http://127.0.0.1:8080/embeddings", json={
"content": [{
"prompt_string": "<__media__>",
"multimodal_data": [img_b64],
}]
}).json()[0]["embedding"]
The <__media__> placeholder is replaced server-side with the right
sequence of image tokens.
Quickstart — text + video via llama-server
Same vision mmproj, but use videopair_data to pass frame pairs
(temporal_patch_size=2, matching torch's 3D conv with kt=2):
import base64, imageio.v3 as iio, requests
frames = list(iio.imiter("clip.mp4")) # decode video → list of HxWx3 frames
def b64(arr):
import io, numpy as np
from PIL import Image
buf = io.BytesIO(); Image.fromarray(np.asarray(arr)).convert("RGB").save(buf, "PNG")
return base64.b64encode(buf.getvalue()).decode()
# group consecutive frames into pairs
pairs = [(b64(frames[i]), b64(frames[i+1])) for i in range(0, len(frames) - 1, 2)]
prompt = "<__media__>" * len(pairs) # one marker per logical (paired) frame
v = requests.post("http://127.0.0.1:8080/embeddings", json={
"content": [{"prompt_string": prompt, "videopair_data": pairs}]
}).json()[0]["embedding"]
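The snippet above pairs every consecutive frame, which can produce a very long prompt for longer clips. A minimal sketch of one way to cap the number of pairs by sampling frame indices evenly before pairing. The helper name sample_pair_indices and the max_pairs parameter are hypothetical, and whether subsampling matches the preprocessing of the torch reference is an assumption the source does not confirm:

```python
from typing import List, Tuple

def sample_pair_indices(n_frames: int, max_pairs: int) -> List[Tuple[int, int]]:
    """Pick up to `max_pairs` (i, j) frame-index pairs spread evenly over a clip.

    Hypothetical helper: the returned indices select which decoded frames to
    base64-encode and send as videopair_data entries.
    """
    if n_frames < 2 or max_pairs < 1:
        return []
    n_pairs = min(max_pairs, n_frames // 2)
    n_sel = 2 * n_pairs  # total frames to keep, always even
    # Evenly spaced indices from 0 to n_frames - 1 (integer arithmetic only).
    idx = [(k * (n_frames - 1)) // (n_sel - 1) for k in range(n_sel)]
    # Group the selected frames two at a time.
    return [(idx[k], idx[k + 1]) for k in range(0, n_sel, 2)]

print(sample_pair_indices(8, 2))  # [(0, 2), (4, 7)]
```

With the frames list from the snippet above, the pairs could then be built as [(b64(frames[i]), b64(frames[j])) for i, j in sample_pair_indices(len(frames), 8)], where 8 is an arbitrary illustrative cap.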
Quickstart — text + audio via llama-server
Start a server with the audio mmproj (or run a second instance on a different port if you already have a vision server up):
./build/bin/llama-server \
-m jina-embeddings-v5-omni-small-clustering-Q4_K_M.gguf \
--mmproj jina-embeddings-v5-omni-small-clustering-audio-mmproj-F16.gguf \
--embedding --pooling last \
-b 4096 -ub 4096 \
--host 127.0.0.1 --port 8081
The -b 4096 -ub 4096 flags raise the logical (-b) and physical (-ub) batch
sizes, since audio prompts can expand to ~750 tokens for a 30s clip.
import base64, requests
with open("speech.wav", "rb") as f:
a_b64 = base64.b64encode(f.read()).decode()
a = requests.post("http://127.0.0.1:8081/embeddings", json={
"content": [{
"prompt_string": "<__media__>",
"multimodal_data": [a_b64],
}]
}).json()[0]["embedding"]
WAV / MP3 / FLAC are accepted; audio is resampled internally to 16kHz mono. For an 11s clip the runtime emits 275 audio tokens; for a 30s clip, 750.
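Both figures above (275 tokens for 11s, 750 for 30s) work out to 25 audio tokens per second. A minimal sketch for budgeting -b / -ub, assuming that rate holds linearly for other clip lengths (the source states only these two points). The names estimate_audio_tokens and fits_in_batch are hypothetical helpers, not part of llama.cpp:

```python
import math

# Assumption: 25 tokens/s, inferred from the two quoted figures
# (275 tokens / 11 s and 750 tokens / 30 s). Not an official constant.
AUDIO_TOKENS_PER_SECOND = 25

def estimate_audio_tokens(duration_s: float) -> int:
    """Rough audio-token count for a clip of the given length in seconds."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return math.ceil(duration_s * AUDIO_TOKENS_PER_SECOND)

def fits_in_batch(duration_s: float, ubatch: int = 4096) -> bool:
    """True if the estimated audio tokens fit in the physical batch size.

    Ignores the few extra prompt tokens around the audio, so treat the
    result as an approximation.
    """
    return estimate_audio_tokens(duration_s) <= ubatch

print(estimate_audio_tokens(11))  # 275
print(estimate_audio_tokens(30))  # 750
```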
Selective modality loading (text / vision / audio / omni)
Mirrors the HF modality= argument. Pass at most one vision mmproj and
at most one audio mmproj — the fork accepts --mmproj more than once:
| modality= | flags |
|---|---|
| "text" | -m jina-embeddings-v5-omni-small-clustering-Q4_K_M.gguf |
| "vision" | -m jina-embeddings-v5-omni-small-clustering-Q4_K_M.gguf --mmproj jina-embeddings-v5-omni-small-clustering-vision-mmproj-F16.gguf |
| "audio" | -m jina-embeddings-v5-omni-small-clustering-Q4_K_M.gguf --mmproj jina-embeddings-v5-omni-small-clustering-audio-mmproj-F16.gguf |
| "omni" | -m jina-embeddings-v5-omni-small-clustering-Q4_K_M.gguf --mmproj jina-embeddings-v5-omni-small-clustering-vision-mmproj-F16.gguf --mmproj jina-embeddings-v5-omni-small-clustering-audio-mmproj-F16.gguf |
Combined invocation:
./build/bin/llama-server \
-m jina-embeddings-v5-omni-small-clustering-Q4_K_M.gguf \
--mmproj jina-embeddings-v5-omni-small-clustering-vision-mmproj-F16.gguf \
--mmproj jina-embeddings-v5-omni-small-clustering-audio-mmproj-F16.gguf \
--embedding --pooling last \
--host 127.0.0.1 --port 8080 \
-b 8192 -ub 8192
Vision and audio embeddings produced this way are bit-identical to the single-mmproj invocations — the encoder graph is the same regardless of whether the other modality's projector is also loaded.
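When the server is launched from a script, the modality-to-flags mapping can be kept in one place. A minimal sketch that only assembles the argument list from the flags already shown in this section. The function name server_argv and its defaults are hypothetical, and the file names assume the GGUFs sit in the working directory:

```python
from typing import Dict, List

TEXT_GGUF = "jina-embeddings-v5-omni-small-clustering-Q4_K_M.gguf"
VISION_MMPROJ = "jina-embeddings-v5-omni-small-clustering-vision-mmproj-F16.gguf"
AUDIO_MMPROJ = "jina-embeddings-v5-omni-small-clustering-audio-mmproj-F16.gguf"

# Which projector files each modality needs (at most one vision, one audio).
MMPROJS: Dict[str, List[str]] = {
    "text": [],
    "vision": [VISION_MMPROJ],
    "audio": [AUDIO_MMPROJ],
    "omni": [VISION_MMPROJ, AUDIO_MMPROJ],
}

def server_argv(modality: str, port: int = 8080,
                binary: str = "./build/bin/llama-server") -> List[str]:
    """Build the llama-server command line for one of the four modalities."""
    if modality not in MMPROJS:
        raise ValueError(f"modality must be one of {sorted(MMPROJS)}")
    argv = [binary, "-m", TEXT_GGUF]
    for mmproj in MMPROJS[modality]:
        argv += ["--mmproj", mmproj]  # the flag is repeated once per projector
    argv += ["--embedding", "--pooling", "last",
             "--host", "127.0.0.1", "--port", str(port)]
    return argv

print(" ".join(server_argv("omni")))
```

The resulting list can be handed to subprocess.Popen; on Hopper GPUs the GGML_CUDA_DISABLE_GRAPHS=1 variable would be added to the process environment as described in the install section.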
Matryoshka (truncating embeddings)
Any prefix of the output vector is itself a valid embedding once
L2-renormalized. Supported prefix dims: {32, 64, 128, 256, 512, 768, 1024}. Verified
end-to-end through the GGUF encode pipeline: prefix dims produce
vectors with cos-vs-torch matching the full vector to within
quantization noise (max prefix-vs-full drift: +0.0000 at F16,
+0.0008 at Q4_K_M for this small model).
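import numpy as np
full = np.asarray(v)  # v: a full 1024-dim embedding returned by the server
# Divide out of place: full[:256] is a view, so an in-place /= would also rescale `full`.
truncated = full[:256] / np.linalg.norm(full[:256])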
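The same idea can be wrapped in a small reusable function. This is a sketch under stated assumptions, not the authors' utility: truncate_embedding is a hypothetical name, the allowed dims are copied from the Matryoshka row of the feature table, and a random vector stands in for a real embedding:

```python
import numpy as np

# Prefix dims listed in the model card's Matryoshka row.
MATRYOSHKA_DIMS = (32, 64, 128, 256, 512, 768, 1024)

def truncate_embedding(vec, dim: int) -> np.ndarray:
    """Keep the first `dim` components and L2-renormalize (input is not modified)."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    full = np.asarray(vec, dtype=np.float64)
    if full.ndim != 1 or full.shape[0] < dim:
        raise ValueError("expected a 1-D embedding with at least `dim` components")
    prefix = full[:dim]
    norm = np.linalg.norm(prefix)
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return prefix / norm  # new array; `vec` keeps its original values

# Synthetic stand-in for a real 1024-dim embedding.
rng = np.random.default_rng(0)
fake = rng.standard_normal(1024)
small = truncate_embedding(fake, 128)
print(small.shape, float(np.linalg.norm(small)))
```

Truncate every vector in an index to the same dim; mixing prefix lengths makes similarity scores incomparable.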
Text quantization levels
F16 + 13 int-quant levels, imatrix-calibrated against a multilingual
text corpus (calibration_data_v5_rc.txt). Numbers below are min cos
vs torch fp32 across the 7-input reference set in
ref_small_clustering.json, bucketed by token length
(very_short = 2-4 tokens, short = 5-15, medium = 16-30):
| Level | very_short | short | medium |
|---|---|---|---|
| F16 | 0.9999 | 1.0000 | 1.0000 |
| Q8_0 | 0.9986 | 0.9998 | 0.9997 |
| Q6_K | 0.9951 | 0.9990 | 0.9984 |
| Q5_K_M | 0.9938 | 0.9985 | 0.9976 |
| Q5_K_S | 0.9910 | 0.9979 | 0.9971 |
| Q4_K_M | 0.9856 | 0.9960 | 0.9936 |
| IQ4_NL | 0.9833 | 0.9932 | 0.9897 |
| IQ4_XS | 0.9802 | 0.9924 | 0.9887 |
| Q3_K_M | 0.9686 | 0.9896 | 0.9858 |
| Q2_K | 0.8212 | 0.9575 | 0.9375 |
| IQ2_M | 0.8144 | 0.9532 | 0.9334 |
| IQ2_XXS | 0.7750 | 0.8553 | 0.8160 |
| IQ1_M | 0.3100 | 0.7767 | 0.6790 |
| IQ1_S | 0.5355 | 0.7312 | 0.3854 |
Recommendation: Q4_K_M is the production CPU default for small. Higher levels (Q5_K_M / Q6_K / Q8_0) are conservative choices when very short or multilingual inputs (titles, single-word queries) dominate. Q2_K and the IQ2_* / IQ1_* levels break down on tiny inputs (min cos below 0.83 in the very_short bucket), so use them only for memory-constrained testing.
The vision and audio mmprojs ship in F16 only — quantization beyond F16 on the projector tensors causes large parity loss and is not worth the disk savings.
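The "min cos vs torch" figure in the table is a worst-case cosine similarity over paired embeddings. A minimal sketch of that metric, assuming the quantized and reference embeddings are already available as arrays. The actual evaluation script and the layout of ref_small_clustering.json are not shown in this document, so the synthetic data below only illustrates the computation:

```python
import numpy as np

def cosine(a, b) -> float:
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def min_cosine(candidates, references) -> float:
    """Worst-case cosine over paired (candidate, reference) embeddings."""
    if len(candidates) != len(references) or len(candidates) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return min(cosine(c, r) for c, r in zip(candidates, references))

# Synthetic stand-ins: 7 reference vectors plus a small perturbation that
# plays the role of quantization error (not real model output).
rng = np.random.default_rng(0)
refs = rng.standard_normal((7, 1024))
quant = refs + 0.01 * rng.standard_normal((7, 1024))
print(round(min_cosine(quant, refs), 4))
```

Taking the minimum rather than the mean is what makes the very_short bucket so sensitive: a single badly reconstructed short input sets the reported number.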
Batching
llama-server's /embeddings endpoint accepts a list of inputs in the content array — one forward pass per element, returned as separate embeddings:
import requests
batch = requests.post("http://127.0.0.1:8080/embeddings", json={
"content": [
{"prompt_string": "A cute cat sitting on a mat."},
{"prompt_string": "A red sports car parked under a tree."},
]
}).json()
# batch[0]["embedding"], batch[1]["embedding"]
Multimodal inputs are forwarded per-sample (one pass per image / video / audio). Long text-only batches benefit most from -b 8192 -ub 8192. For high-throughput multimodal serving, prefer the vLLM path on the torch base model.
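For a large corpus it is usually more practical to send several moderately sized requests than one huge content array. A minimal sketch that splits a list of texts into request bodies of the shape shown above; batched_payloads is a hypothetical helper, and the default batch_size of 32 is an arbitrary illustrative value rather than a tuned recommendation:

```python
from typing import Dict, Iterator, List

def batched_payloads(texts: List[str], batch_size: int = 32) -> Iterator[Dict]:
    """Yield /embeddings request bodies with at most `batch_size` prompts each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Clustering inputs go in verbatim, with no Query:/Document: prefix.
        yield {"content": [{"prompt_string": t} for t in chunk]}

docs = [f"document {i}" for i in range(5)]
payloads = list(batched_payloads(docs, batch_size=2))
print(len(payloads))  # 3 request bodies, holding 2, 2 and 1 prompts
```

Each payload can then be posted with requests.post(url, json=payload), and the per-request results concatenated in order.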
Multimodal parity vs torch (cos ≥ 0.99 numerical bar)
Verified on the same fork build that produces this repo:
| Modality | small-clustering |
|---|---|
| Text | 7/7 inputs ≥ 0.999 |
| Image (car) | 0.9981 |
| Image (cat) | 0.9991 |
| Audio (JFK 11s) | 0.9999 |
| PDF (2-page fused) | 0.9997 |
| Video (4-frame, 512²) | 0.9965 |
Cross-variant comparison: see the v5-omni-gguf umbrella.
Compatibility
Same vector space as:
- jinaai/jina-embeddings-v5-text-small-clustering: text-only
- jinaai/jina-embeddings-v5-omni-small-clustering: multimodal (transformers / ST / vLLM)
Notes
- Last-token pooling is used throughout (matches the torch reference).
- For best parity, prefer Q4_K_M or higher for the text quantization; the mmprojs ship F16 only.
- The <image> token is shared between image and video inputs; video uses image_grid_thw=[T,H,W] with T=2 (the Qwen3-VL ViT's Conv3d patch_embed handles the temporal dimension). The GGUF videopair_data API is identical for image and video paths.
License
CC BY-NC 4.0. For commercial use, contact us.
Model tree for jinaai/jina-embeddings-v5-omni-small-clustering-GGUF
Base model
jinaai/jina-embeddings-v5-omni-small