vladmandic committed on
Commit ca908e6 · verified · 1 Parent(s): 3e0cf6a

Update README.md

  1. README.md +4 -297
README.md CHANGED
@@ -1,302 +1,9 @@
1
  ---
2
- language:
3
- - en
4
- pipeline_tag: image-to-image
5
- tags:
6
- - image-editing
7
- - text-guided-editing
8
- - diffusion
9
- - sana
10
- - qwen-vl
11
- - multimodal
12
  base_model:
13
- - Efficient-Large-Model/SANA1.5_1.6B_1024px
14
- - Qwen/Qwen3-VL-2B-Instruct
15
  library_name: diffusers
16
  ---
17
 
18
- # VIBE: Visual Instruction Based Editor
19
-
20
- <div align="center">
21
- <img src="VIBE.png" width="800" alt="VIBE"/>
22
- </div>
23
-
24
- <p style="text-align: center;">
25
- <div align="center">
26
- </div>
27
- <p align="center">
28
- <a href="https://riko0.github.io/VIBE"> 🌐 Project Page </a> |
29
- <a href="https://arxiv.org/abs/2601.02242"> 📜 Paper on arXiv </a> |
30
- <a href="https://github.com/ai-forever/vibe"> Github </a> |
31
- <a href="https://huggingface.co/spaces/iitolstykh/VIBE-Image-Edit-DEMO">🤗 Space | </a>
32
- <a href="https://huggingface.co/iitolstykh/VIBE-Image-Edit-DistilledCFG">🤗 VIBE-Image-Edit-DistilledCFG | </a>
33
- </p>
34
-
35
- **VIBE** is a powerful open-source framework for text-guided image editing. It leverages the efficiency of the [Sana1.5-1.6B](https://github.com/NVlabs/Sana) diffusion model and the visual understanding capabilities of [Qwen3-VL-2B-Instruct](https://github.com/QwenLM/Qwen3-VL) to provide **exceptionally fast** and high-quality, instruction-based image manipulation.
36
-
37
- We also provide a faster, **CFG-distilled** version of this model available at [VIBE-Image-Edit-DistilledCFG](https://huggingface.co/iitolstykh/VIBE-Image-Edit-DistilledCFG).
38
-
39
- ## Model Details
40
-
41
- - **Name:** VIBE
42
- - **Task:** Text-Guided Image Editing
43
- - **Architecture:**
44
- - **Diffusion Backbone:** Sana1.5 (1.6B parameters) with Linear Attention.
45
- - **Condition Encoder:** Qwen3-VL (2B parameters) for multimodal understanding.
46
- - **Framework:** Built on `diffusers` and `transformers`.
47
- - **Model precision**: torch.bfloat16 (BF16)
48
- - **Model resolution**: This model is designed to edit images up to 2048px, with multi-scale height and width.
49
-
50
- ## Features
51
-
52
- - **Text-Guided Editing:** Edit images using natural language instructions (e.g., "Add a cat on the sofa").
53
- - **Compact & Efficient:** Combines a 1.6B parameter diffusion model with a 2B parameter encoder for a lightweight footprint.
54
- - **High-Speed Inference:** Utilizes Sana1.5's linear attention mechanism for rapid generation.
55
- - **Multimodal Understanding:** Qwen3-VL ensures strong alignment between visual content and text instructions.
56
- - **Text-to-Image** support.
57
-
58
-
59
- # Inference Requirements
60
-
61
- - `vibe` library
62
- ```bash
63
- pip install git+https://github.com/ai-forever/VIBE
64
- ```
65
- - requirements for `vibe` library:
66
- ```bash
67
- pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
68
- ```
69
-
70
- # Quick start
71
-
72
- ```python
73
- from PIL import Image
74
- import requests
75
- from io import BytesIO
76
- from huggingface_hub import snapshot_download
77
-
78
- from vibe.editor import ImageEditor
79
-
80
- # Download model
81
- model_path = snapshot_download(
82
- repo_id="iitolstykh/VIBE-Image-Edit",
83
- repo_type="model",
84
- )
85
-
86
- # Load model
87
- editor = ImageEditor(
88
- checkpoint_path=model_path,
89
- image_guidance_scale=1.2,
90
- guidance_scale=4.5,
91
- num_inference_steps=20,
92
- device="cuda:0",
93
- )
94
-
95
- # Download test image
96
- resp = requests.get('https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg')
97
- image = Image.open(BytesIO(resp.content))
98
-
99
- # Generate edited image
100
- edited_image = editor.generate_edited_image(
101
- instruction="let this case swim in the river",
102
- conditioning_image=image,
103
- num_images_per_prompt=1,
104
- )[0]
105
-
106
- edited_image.save("edited_image.jpg", quality=100)
107
- ```
108
-
109
- ## T2I Examples
110
-
111
- <details open>
112
- <summary>(<b>Seed:</b> 234) <b>Prompt:</b> View through the clouds at Earth from a plane</summary>
113
-
114
- ![Image 1](images/other/1.png)
115
-
116
- </details>
117
-
118
- <details open>
119
- <summary>(<b>Seed:</b> 2) <b>Prompt:</b> Medieval castle at sunset surrounded by dense forest and mist</summary>
120
-
121
- ![Image 7](images/other/4.png)
122
-
123
- </details>
124
-
125
- <details open>
126
- <summary>(<b>Seed:</b> 666) <b>Prompt:</b> Portrait of an old wise man with a long white beard surrounded by books and candles</summary>
127
-
128
- ![Image 4](images/other/8.png)
129
-
130
- </details>
131
-
132
- <details>
133
- <summary>(<b>Seed:</b> 9513) <b>Prompt:</b> Night urban street with wet asphalt reflections and neon signs</summary>
134
-
135
- ![Image 5](images/other/9.png)
136
-
137
- </details>
138
-
139
- <details>
140
- <summary>(<b>Seed:</b> 142) <b>Prompt:</b> Futuristic sports car racing in the desert</summary>
141
-
142
- ![Image 2](images/other/10.png)
143
-
144
- </details>
145
-
146
- <details>
147
- <summary>(<b>Seed:</b> 1325) <b>Prompt:</b> Pirate boat in ocean</summary>
148
-
149
- ![Image 3](images/other/2.png)
150
-
151
- </details>
152
-
153
- <details>
154
- <summary>(<b>Seed:</b> 4241) <b>Prompt:</b> Davy Jones portrait</summary>
155
-
156
- ![Image 6](images/other/3.png)
157
-
158
- </details>
159
-
160
- <details>
161
- <summary>(<b>Seed:</b> 142) <b>Prompt:</b> Epic cosmic scene with a huge space station and distant stars</summary>
162
-
163
- ![Image 8](images/other/5.png)
164
-
165
- </details>
166
-
167
- <details>
168
- <summary>(<b>Seed:</b> 42) <b>Prompt:</b> Cherry blossom park in spring with petals falling to the ground</summary>
169
-
170
- ![Image 9](images/other/6.png)
171
-
172
- </details>
173
-
174
-
175
- ## Comparison with SANA1.5_1.6B_1024px
176
-
177
- **Prompt:** Generate an interior of a rustic cabin workshop during winter evening. The viewpoint is from the doorway, showing a workbench with tools, wood shavings on the floor, and a cast-iron stove glowing softly. Place shelves with jars of nails, coils of rope, and folded blankets. Through a small window, show snow falling and pine trees in the twilight. Add warm lamplight creating soft gradients and a gentle vignette. Include a person in a thick sweater sanding a wooden object at the bench, but keep the person small in frame
178
-
179
- <div style="display: flex; gap: 24px; justify-content: center; align-items: flex-start;">
180
- <div style="text-align: center; flex: 1; min-width: 0;">
181
- <img src="images/vibe/image_3.png" alt="VIBE" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
182
- <div>VIBE (Seed: 4411)</div>
183
- </div>
184
- <div style="text-align: center; flex: 1; min-width: 0;">
185
- <img src="images/sana/image_3.png" alt="SANA1.5_1.6B_1024px" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
186
- <div>SANA1.5_1.6B_1024px (Seed: 1521)</div>
187
- </div>
188
- </div>
189
-
190
-
191
- ---
192
-
193
- **Prompt:** Generate an ancient jungle temple ruin partially covered in moss and vines, with a waterfall cascading nearby into a shallow pool. Show broken stone steps, carved patterns that are abstract, and damp surfaces with realistic moss detail. Add mist, shafts of sunlight through leaves, and small floating insects. Include a human explorer in the mid-ground, small in frame, wearing a backpack. Lush, cinematic realism.
194
-
195
- <div style="display: flex; gap: 24px; justify-content: center; align-items: flex-start;">
196
- <div style="text-align: center; flex: 1; min-width: 0;">
197
- <img src="images/vibe/image_4.png" alt="VIBE" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
198
- <div>VIBE (Seed: 1995)</div>
199
- </div>
200
- <div style="text-align: center; flex: 1; min-width: 0;">
201
- <img src="images/sana/image_4.png" alt="SANA1.5_1.6B_1024px" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
202
- <div>SANA1.5_1.6B_1024px (Seed: 9842)</div>
203
- </div>
204
- </div>
205
-
206
- ---
207
-
208
- **Prompt:** Create a science-fiction interior of a space greenhouse module with hydroponic racks, glowing grow lights, and condensation on transparent walls. Plants include leafy greens and flowering specimens. Tools and tablets have UI elements. Add soft floating dust or microgravity droplets. Clean, detailed, plausible sci-fi aesthetic.
209
-
210
- <div style="display: flex; gap: 24px; justify-content: center; align-items: flex-start;">
211
- <div style="text-align: center; flex: 1; min-width: 0;">
212
- <img src="images/vibe/image_5.png" alt="VIBE" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
213
- <div>VIBE (Seed: 2203)</div>
214
- </div>
215
- <div style="text-align: center; flex: 1; min-width: 0;">
216
- <img src="images/sana/image_5.png" alt="SANA1.5_1.6B_1024px" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
217
- <div>SANA1.5_1.6B_1024px (Seed: 143)</div>
218
- </div>
219
- </div>
220
-
221
- ---
222
-
223
- **Prompt:** Beautiful tropical beach with guinea pig swimming in the water and human drinking wine
224
-
225
- <div style="display: flex; gap: 24px; justify-content: center; align-items: flex-start;">
226
- <div style="text-align: center; flex: 1; min-width: 0;">
227
- <img src="images/vibe/image_6.png" alt="VIBE" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
228
- <div>VIBE (Seed: 132142)</div>
229
- </div>
230
- <div style="text-align: center; flex: 1; min-width: 0;">
231
- <img src="images/sana/image_6.png" alt="SANA1.5_1.6B_1024px" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
232
- <div>SANA1.5_1.6B_1024px (Seed: 132142)</div>
233
- </div>
234
- </div>
235
-
236
- ---
237
-
238
- **Prompt:** Create a cinematic, rainy night scene in a narrow backstreet of an old downtown area. The camera is at street level, slightly tilted upward, emphasizing wet cobblestones reflecting neon-like colored lights without readable text. Show a small ramen stall with steam rising from pots, hanging paper lanterns that are blank or patterned (no letters), and a couple of stools under a simple awning. Add puddles, scattered trash like crumpled paper, and subtle mist. Include a passerby in the mid-ground seen from behind wearing a hooded jacket and carrying an umbrella, face not visible. Use a moody color palette of deep blues and warm oranges, with soft bokeh highlights and realistic rain streaks
239
-
240
- <div style="display: flex; gap: 24px; justify-content: center; align-items: flex-start;">
241
- <div style="text-align: center; flex: 1; min-width: 0;">
242
- <img src="images/vibe/image_2.png" alt="VIBE" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
243
- <div>VIBE (Seed: 1003)</div>
244
- </div>
245
- <div style="text-align: center; flex: 1; min-width: 0;">
246
- <img src="images/sana/image_2.png" alt="SANA1.5_1.6B_1024px" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
247
- <div>SANA1.5_1.6B_1024px (Seed: 3114)</div>
248
- </div>
249
- </div>
250
-
251
- ---
252
-
253
- **Prompt:** Depict a volcanic lava field at twilight with cooled black rock, glowing cracks of magma in the distance, and heat shimmer. The sky is darkening with faint stars emerging. Add thin smoke plumes and red-orange reflections on nearby rocks. Cinematic realism, dramatic contrast
254
-
255
- <div style="display: flex; gap: 24px; justify-content: center; align-items: flex-start;">
256
- <div style="text-align: center; flex: 1; min-width: 0;">
257
- <img src="images/vibe/image_7.png" alt="VIBE" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
258
- <div>VIBE (Seed: 1520)</div>
259
- </div>
260
- <div style="text-align: center; flex: 1; min-width: 0;">
261
- <img src="images/sana/image_7.png" alt="SANA1.5_1.6B_1024px" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
262
- <div>SANA1.5_1.6B_1024px (Seed: 1267)</div>
263
- </div>
264
- </div>
265
-
266
- ---
267
-
268
- **Prompt:** Portrait from back of a young woman dressed in Victorian attire standing in an ancient library filled with mirrors and stained glass windows, softly illuminated by sunlight streaming through
269
-
270
- <div style="display: flex; gap: 24px; justify-content: center; align-items: flex-start;">
271
- <div style="text-align: center; flex: 1; min-width: 0;">
272
- <img src="images/vibe/image_1.png" alt="VIBE" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
273
- <div>VIBE (Seed: 4152)</div>
274
- </div>
275
- <div style="text-align: center; flex: 1; min-width: 0;">
276
- <img src="images/sana/image_1.png" alt="SANA1.5_1.6B_1024px" style="width: 100%; max-width: 500px; height: auto; display: block; margin: 0 auto 8px auto;">
277
- <div>SANA1.5_1.6B_1024px (Seed: 6742)</div>
278
- </div>
279
- </div>
280
-
281
-
282
-
283
- ## License
284
-
285
- This project is built upon SANA. Please refer to the original SANA license for usage terms:
286
- [SANA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers/blob/main/LICENSE.txt)
287
-
288
- ## Citation
289
-
290
- If you use this model in your research or applications, please acknowledge the original projects:
291
-
292
- - [SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer](https://github.com/NVlabs/Sana)
293
- - [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
294
-
295
- ```bibtex
296
- @misc{vibe2026,
297
- Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
298
- Title = {VIBE: Visual Instruction Based Editor},
299
- Year = {2026},
300
- Eprint = {arXiv:2601.02242},
301
- }
302
- ```
 
1
  ---
2
+ license: other
3
  base_model:
4
+ - iitolstykh/VIBE-Image-Edit
5
+ pipeline_tag: image-to-image
6
  library_name: diffusers
7
  ---
8
 
9
+ Modified copy of [iitolstykh/VIBE-Image-Edit](https://huggingface.co/iitolstykh/VIBE-Image-Edit) that avoids unnecessary references to custom code and allows clean usage in SD.Next.