Instructions to use jinaai/jina-vlm with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use jinaai/jina-vlm with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="jinaai/jina-vlm", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("jinaai/jina-vlm", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use jinaai/jina-vlm with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "jinaai/jina-vlm"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "jinaai/jina-vlm",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
Use Docker
docker model run hf.co/jinaai/jina-vlm
- SGLang
How to use jinaai/jina-vlm with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "jinaai/jina-vlm" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "jinaai/jina-vlm",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "jinaai/jina-vlm" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "jinaai/jina-vlm",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe this image in one sentence." },
          { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } }
        ]
      }
    ]
  }'
- Docker Model Runner
How to use jinaai/jina-vlm with Docker Model Runner:
docker model run hf.co/jinaai/jina-vlm
Georgios Mastrapas committed
Commit: ddfa80b
Parent(s): bd51ae6
chore: update arch graph
Files changed:
- README.md +2 -2
- assets/jvlm_architecture.png +2 -2
README.md
CHANGED
@@ -54,7 +54,7 @@ inference: false

 [Blog](https://jina.ai/news/jina-vlm-small-multilingual-vision-language-model/) | API | AWS | Azure | GCP | [Arxiv](https://arxiv.org/abs/2512.04032)

-`jina-vlm` is a 2.4B parameter vision-language model that achieves state-of-the-art multilingual
+`jina-vlm` is a token-efficient 2.4B parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and makes use of image tiling and attention-pooling for token-efficient processing of arbitrary-resolution images.

 

@@ -993,4 +993,4 @@ If you find `jina-vlm` useful in your research, please cite our [technical repor

 ## License

-`jina-vlm` is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to [contact us](https://jina.ai/contact-sales/).
+`jina-vlm` is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to [contact us](https://jina.ai/contact-sales/).
assets/jvlm_architecture.png
CHANGED
Stored with Git LFS; the binary image diff is not shown.
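
The updated README description mentions image tiling and attention-pooling as the way the model keeps the visual token count low for arbitrary-resolution images. The sketch below only illustrates the general idea of tiling and how pooling reduces the token count; it does not implement attention pooling. The tile size (384), patch size (16), and pooling factor (2) are illustrative assumptions, not values taken from the `jina-vlm` repository, and the helper names `tile_boxes` and `visual_tokens` are hypothetical.

```python
import math


def tile_boxes(width, height, tile=384):
    """Split a width x height image into a grid of tile x tile crops.

    Edge tiles are clipped to the image bounds.
    Returns a list of (left, top, right, bottom) boxes in row-major order.
    The tile size is an illustrative assumption, not the model's setting.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, top = c * tile, r * tile
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


def visual_tokens(n_tiles, tile=384, patch=16, pool=2):
    """Estimate the visual token count when each tile's patch grid
    is reduced by pooling over pool x pool neighbourhoods.

    patch and pool are illustrative assumptions as well.
    """
    patches_per_side = tile // patch
    pooled_per_side = math.ceil(patches_per_side / pool)
    return n_tiles * pooled_per_side ** 2


boxes = tile_boxes(1000, 700)
print(len(boxes))                         # number of tiles
print(visual_tokens(len(boxes)))          # tokens with 2x2 pooling
print(visual_tokens(len(boxes), pool=1))  # tokens without pooling
```

Under these assumed settings, 2x2 pooling cuts the token count per tile by a factor of four. Real pipelines typically also resize tiles and may add a global thumbnail of the whole image, which this sketch leaves out.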