Image-Text-to-Text
Transformers
GGUF
English
conversational
langdaohlb commited on
Commit
3a9b205
·
verified ·
1 Parent(s): c116eeb

Update README

Browse files
Files changed (1) hide show
  1. README.md +54 -1
README.md CHANGED
@@ -1,3 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
  **ZwZ-4B-GGUF**
2
 
3
  This repository provides GGUF-format weights for [ZwZ-4B](https://huggingface.co/inclusionAI/ZwZ-4B), split into two components:
@@ -7,4 +19,45 @@ This repository provides GGUF-format weights for [ZwZ-4B](https://huggingface.co
7
 
8
  These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, supporting inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and even perform custom quantization starting from the FP16 weights.
9
 
10
- Enjoy running this multimodal model on your personal device! 🚀
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ base_model:
3
+ - Qwen/Qwen3-VL-4B-Instruct
4
+ datasets:
5
+ - inclusionAI/ZwZ-RL-VQA
6
+ - inclusionAI/ZoomBench
7
+ language:
8
+ - en
9
+ license: apache-2.0
10
+ library_name: transformers
11
+ pipeline_tag: image-text-to-text
12
+ ---
13
  **ZwZ-4B-GGUF**
14
 
15
  This repository provides GGUF-format weights for [ZwZ-4B](https://huggingface.co/inclusionAI/ZwZ-4B), split into two components:
 
19
 
20
  These files are compatible with llama.cpp, Ollama, and other GGUF-based tools, supporting inference on CPU, NVIDIA GPU (CUDA), Apple Silicon (Metal), Intel GPUs (SYCL), and more. You can mix precision levels for the language and vision components based on your hardware and performance needs, and even perform custom quantization starting from the FP16 weights.
21
 
22
+ Enjoy running this multimodal model on your personal device! 🚀
23
+
24
+
25
+ ## How to Use
26
+
27
+ To use these models with `llama.cpp`, please ensure you are using the **latest version**—either by [building from source](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) or downloading the most recent [release](https://github.com/ggml-org/llama.cpp/releases/tag/b6907) according to the devices.
28
+
29
+ You can run inference via the command line or through a web-based chat interface.
30
+
31
+ ### CLI Inference (`llama-mtmd-cli`)
32
+
33
+ For example, to run ZwZ-4B with an Q8_0 vision encoder and Q8_0 quantized LLM:
34
+ ```bash
35
+ llama-mtmd-cli \
36
+ -m path/to/ZwZ-4B-Q8_0.gguf \
37
+ --mmproj mmproj-ZwZ-4B-Q8_0.gguf\
38
+ --image test.jpeg \
39
+ -p "What is the publisher name of the newspaper?" \
40
+ --temp 1.0 --top-k 20 --top-p 0.95 -n 1024
41
+ ```
42
+ ### Web Chat (using `llama-server`)
43
+ To serve ZwZ-4B via an OpenAI-compatible API with a web UI:
44
+ ```bash
45
+ llama-server \
46
+ -m path/to/ZwZ-4B-Q8_0.gguf \
47
+ --mmproj mmproj-ZwZ-4B-Q8_0.gguf
48
+ ```
49
+
50
+ ## Citation
51
+
52
+ ```bibtex
53
+ @article{wei2026zooming,
54
+ title={Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception},
55
+ author={Wei, Lai and He, Liangbo and Lan, Jun and Dong, Lingzhong and Cai, Yutong and Li, Siyuan and Zhu, Huijia and Wang, Weiqiang and Kong, Linghe and Wang, Yue and Zhang, Zhuosheng and Huang, Weiran},
56
+ journal={arXiv preprint arXiv:2602.11858},
57
+ year={2026}
58
+ }
59
+ ```
60
+
61
+ ## License
62
+
63
+ This model follows the license of Apache 2.0 License.