Boogu-AI committed on
Commit
9a26829
·
verified ·
1 Parent(s): a1c0f0f

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +236 -0
README.md ADDED
@@ -0,0 +1,236 @@
1
+ <p align="center">
2
+ <img src="assets/boogu-logo-title.svg" alt="Boogu-Image-0.1" width="420" />
3
+ </p>
4
+
5
+ <h3 align="center">Boosting Open-Source Unified Multimodal Understanding and Generation</h3>
6
+
7
+ <div align="center">
8
+
9
+ <img src="assets/boogu-infinity-teaser.png" alt="Boogu-Image-0.1 Teaser" width="100%" />
10
+
11
+
12
+
13
+ <!-- ============== Badges ============== -->
14
+ <!-- [![arXiv](https://img.shields.io/badge/arXiv-{{ paper_id }}-b31b1b.svg?logo=arxiv&logoColor=white)](https://arxiv.org/abs/{{ paper_id }}) -->
15
+ [![Project Page](https://img.shields.io/badge/🌐-Project%20Page-blue)](https://boogu.org)
16
+ [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow)](https://huggingface.co/Boogu)
17
+ [![GitHub](https://img.shields.io/badge/GitHub-Repo-181717?logo=github&logoColor=white)](https://github.com/boogu-project/Boogu-Image)
18
+ [![Paper](https://img.shields.io/badge/📄-Technical%20Report%20(Coming%20Soon)-lightgrey)]()
19
+ <!-- [![ModelScope](https://img.shields.io/badge/🤖-ModelScope-624aff)]({{ modelscope_url }}) -->
20
+ [![Demo-Base](https://img.shields.io/badge/🎨-Demo%20Base-ff69b4)](http://demo-base.boogu.org/)
21
+ [![Demo-Edit](https://img.shields.io/badge/🖌️-Demo%20Edit-ff8c00)](http://demo-edit.boogu.org/)
22
+ [![Demo-Turbo](https://img.shields.io/badge/⚡-Demo%20Turbo-9b59b6)](http://demo-turbo.boogu.org/)
23
+ [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
24
+
25
+
26
+ Welcome to the official repository for **Boogu-Image-0.1**!
27
+
28
+ English | [中文](./README_CN.md)
29
+
30
+ </div>
31
+
32
+ ---
33
+
34
+ ## 📖 Introduction
35
+
36
+ **Boogu-Image-0.1** is a highly competitive **Apache-2.0 open-source unified image generation and editing model family**. It includes **Base**, **Turbo**, **Edit**, and other variants that provide stable, practical capabilities for high-quality text-to-image generation, fast generation, image editing, and Chinese-English text rendering. Closed-source multimodal understanding and generation systems such as Nano Banana Pro and GPT-Image-2 owe their remarkable performance not to a single model but to a highly unified suite of system capabilities. Even so, with training compute that is extremely limited compared with those systems, we find that systematically improving a model's understanding ability, data quality, and training pipeline can still significantly improve image generation and editing performance. For example, our training data is roughly one order of magnitude smaller than that of the strong open-source model Qwen-Image. We hope our empirical study and open-source release will help advance the open-source ecosystem for multimodal generation and understanding.
37
+
38
+ This repository provides checkpoints and inference code for **Boogu-Image-0.1**.
39
+
40
+ ## 🏆 Boogu Arena
41
+
42
+ Since we could not evaluate on LM Arena directly, we built **Boogu Arena**, an LM Arena-style preference evaluation. We use an LLM to generate diverse user personas, then ask each persona to produce image generation prompts, resulting in **1K+ test prompts** that we will release publicly for community reproduction. The ELO leaderboard below spans leading closed- and open-source systems.
43
+
44
+ <!-- <p align="center">
45
+ <img src="assets/ci_chart.svg" alt="Boogu Arena ELO Leaderboard" width="100%" />
46
+ </p> -->
47
+ <p align="center">
48
+ <img src="assets/boogu-arena-chart.svg" alt="Boogu Arena ELO Leaderboard" width="100%" />
49
+ </p>
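The README does not state how the leaderboard ratings are computed from pairwise preferences. As a rough illustration only, the sketch below shows the classic online Elo update for a single comparison; the function names and the `k=32` step size are assumptions, and the actual Boogu Arena procedure may differ (for example, a Bradley-Terry fit over all votes).

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k is an assumed step size, not a value taken from this project.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    print(update_elo(1000.0, 1000.0, 1.0))
```

Two equally rated systems each have an expected score of 0.5, so a single win moves the winner up and the loser down by the same amount.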
50
+
51
+ ## ✨ Highlights
52
+
53
+ - 📸 **Beautiful and Precise Photography** — Accurately understands photography prompts and generates high-quality images with natural lighting, coherent composition, and faithful details, preserving coherent subject, background, and spatial relationships even in complex real-world scenes
54
+ - 📝 **Diverse and Stable Text Rendering** — Supports a wide range of text-heavy designs — posters, stamps, documents, interfaces, brand guides, and handwritten boards — with readable structure, stable typography, and robust bilingual (Chinese/English) rendering across diverse layouts
55
+ - 🎨 **Diverse and Beautiful Stylization** — Handles stylized generation across miniature 3D scenes, Chinese-inspired gilded aesthetics, shining fantasy visuals, anime portraits, and mythic character art — not just style transfer, but stable, attractive, and prompt-aware creative generation
56
+ - 📊 **Competitive General Performance** — Demonstrates competitive performance across many scenarios and benchmarks, with the Boogu-Image-0.1 family ranking among the very top of evaluated open- and closed-source systems in Boogu Arena
57
+
58
+ > 📖 For the full set of practical lessons and an honest account of current limitations, see [Responsible AI & Limitations](#-responsible-ai--limitations) below.
59
+
60
+ ## 📣 News
61
+
62
+ - **2026-06-16** 🔥 **Boogu-Image-0.1-Base (Text-to-Image) is released!** The core text-to-image foundation model. Try the [online demo](http://demo-base.boogu.org/).
63
+ - **2026-06-16** 🎨 **Boogu-Image-0.1-Edit (Image-to-Image) is released!** Image editing and transformation capabilities now available. Try the [online demo](http://demo-edit.boogu.org/).
64
+ - **2026-06-16** 🚀 **Boogu-Image-0.1-Turbo is released!** Four-step distilled variant for fast inference and photorealistic generation. Try the [online demo](http://demo-turbo.boogu.org/).
65
+ <!-- - **[{{ 2026-06-DD }}]** 📄 **Technical report is released!** Read our findings on [arXiv](https://arxiv.org/abs/{{ paper_id }}). -->
66
+
67
+ ## 📥 Model Zoo
68
+
69
+ | Model | Params | Training | Steps | CFG | Task | Hugging Face | Demo |
70
+ | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
71
+ | **Boogu-Image-0.1-Base** | 10B | Joint Training | 25~50 | 2.0~5.0<br>(e.g., 4.0) | T2I | [![HF](https://img.shields.io/badge/%F0%9F%A4%97-Checkpoint-yellow)](https://huggingface.co/Boogu/Boogu-Image-0.1-Base) | [![Demo](https://img.shields.io/badge/🎨-Demo-ff69b4)](http://demo-base.boogu.org/) |
72
+ | **Boogu-Image-0.1-Edit** | 10B | Joint Training | 25~50 | 2.0~5.0<br>(e.g., 5.0) | TI2I | [![HF](https://img.shields.io/badge/%F0%9F%A4%97-Checkpoint-yellow)](https://huggingface.co/Boogu/Boogu-Image-0.1-Edit) | [![Demo](https://img.shields.io/badge/🖌️-Demo-ff8c00)](http://demo-edit.boogu.org/) |
73
+ | **Boogu-Image-0.1-Turbo** | 10B | + Decoupled DMD | 4 | 0.0 | T2I | [![HF](https://img.shields.io/badge/%F0%9F%A4%97-Checkpoint-yellow)](https://huggingface.co/Boogu/Boogu-Image-0.1-Turbo) | [![Demo](https://img.shields.io/badge/⚡-Demo-9b59b6)](http://demo-turbo.boogu.org/) |
74
+
75
+ - **Boogu-Image-0.1-Base**: Foundation model with strong **diversity** and **controllability** — ideal for **fine-tuning** and downstream development. Mainly intended for **ultra-dense text rendering**; for photorealism, Turbo is usually the better default.
76
+ - **Boogu-Image-0.1-Edit**: Image editing and transformation variant.
77
+ - **Boogu-Image-0.1-Turbo**: Distilled variant with the **same parameter count**, typically requiring only **3~4 steps**. Focuses on **high-quality generation** and photorealism while preserving bilingual text rendering and prompt adherence.
78
+
79
+ ## 🛠️ Installation
80
+
81
+ > **Tested environment:** Python 3.10 · CUDA 12.6 · PyTorch 2.7.1
82
+
83
+ ```bash
84
+ # Use a brand new conda environment
85
+ conda create -y -n boogu python=3.10
86
+ conda activate boogu
87
+ # Install necessary dependencies
88
+ # PyTorch up to 2.11.0 with CUDA up to 12.8 is supported
89
+ # Check `requirements/<torch>_<cuda>.txt`
90
+ pip install -r requirements/torch2.7-cu126.txt
91
+ pip install -e .
92
+ python utils/get_flash_attn.py
93
+ ```
94
+
95
+ Alternatively, run the quick-start script:
96
+
97
+ ```bash
98
+ bash quick_start.sh
99
+ conda activate boogu
100
+ ```
101
+
102
+ ### Download Checkpoints
103
+ Download the model weights into a local `models/` directory before running inference. We recommend using the official Hugging Face CLI:
104
+
105
+ ```bash
106
+ pip install -U "huggingface_hub[cli]"
107
+
108
+ # Download to ./models/<model-name>
109
+ huggingface-cli download Boogu/Boogu-Image-0.1-Base --local-dir models/Boogu-Image-0.1-Base
110
+ huggingface-cli download Boogu/Boogu-Image-0.1-Turbo --local-dir models/Boogu-Image-0.1-Turbo
111
+ huggingface-cli download Boogu/Boogu-Image-0.1-Edit --local-dir models/Boogu-Image-0.1-Edit
112
+ ```
113
+
114
+
115
+
116
+ Example layout after download:
117
+ ```
118
+ models/
119
+ └── Boogu-Image-0.1-Base/
120
+ ├── model_index.json
121
+ ├── mllm
122
+ ├── processor
123
+ ├── scheduler
124
+ ├── transformer
125
+ └── vae
126
+ ```
127
+
128
+ Then point inference to the local path, e.g. `--pretrained_pipeline_name_or_path models/Boogu-Image-0.1-Base` (the flag used by the Quick Start command below).
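A quick sanity check before running inference can catch an incomplete download. The sketch below only looks for the entries named in the example layout above and assumes that everything except `model_index.json` is a directory; the helper name `missing_checkpoint_entries` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Entries copied from the example layout above.
EXPECTED_FILES = ("model_index.json",)
# Assumption: these entries are sub-directories of the checkpoint.
EXPECTED_DIRS = ("mllm", "processor", "scheduler", "transformer", "vae")


def missing_checkpoint_entries(model_dir):
    """Return the expected entries that are absent from model_dir (hypothetical helper)."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    gaps = missing_checkpoint_entries("models/Boogu-Image-0.1-Base")
    print("checkpoint looks complete" if not gaps else f"missing: {gaps}")
```

An empty result only means the top-level entries exist; it does not verify the weights inside them.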
129
+
130
+ ### Flash Attention
131
+
132
+ This repository provides `utils/get_flash_attn.py` to automatically install a compatible `flash-attn` wheel for your environment.
133
+
134
+ Requirements:
135
+ - Python and PyTorch with CUDA already installed
136
+ - Linux x86_64
137
+
138
+ ```bash
139
+ # Auto: detect environment, download a prebuilt wheel, fallback to source build
140
+ python utils/get_flash_attn.py
141
+
142
+ # Force source compilation
143
+ python utils/get_flash_attn.py --build
144
+ ```
145
+
146
+ The script first searches [`mjun0812/flash-attention-prebuild-wheels`](https://github.com/mjun0812/flash-attention-prebuild-wheels), then tries official [`Dao-AILab/flash-attention`](https://github.com/Dao-AILab/flash-attention) release wheels with both cxx11abi variants, and finally falls back to source compilation via `pip install flash-attn --no-build-isolation`.
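After the script finishes, it can be useful to confirm that the package is importable from the active environment. The sketch below is a minimal check using only the standard library; `is_installed` is a hypothetical helper, and `flash_attn` is the import name of the `flash-attn` package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("flash_attn available:", is_installed("flash_attn"))
```

This only confirms that the module can be located; it does not check that the installed wheel matches your CUDA and PyTorch versions.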
147
+
148
+
149
+ ## 🚀 Quick Start
150
+
151
+ ### PyTorch Native T2I Inference
152
+
153
+ ```bash
154
+ export device="cuda:0" # Required
155
+
156
+ # Prompt enhancement is powered by an instruction reasoner, also called the rewriter.
157
+ # We provide two ways to use it:
158
+ #
159
+ # 1. Standalone external rewriter:
160
+ # See utils/t2i_external_prompt_rewriter.py. This is a purely external example with no
161
+ # advanced memory management, so it requires enough GPU memory.
162
+ # python utils/t2i_external_prompt_rewriter.py --prompt "draw a cat" --model /path/to/Qwen3-VL-32B-Instruct --lang en
163
+ #
164
+ # 2. Pipeline-integrated rewriter:
165
+ # See the scripts under `demo_scripts` whose names contain "reasoning".
166
+ # For example: demo_scripts/demo_t2i_local_reasoning.sh
167
+ # This mode supports more flexible memory management. Set the generation and
168
+ # rewriter devices manually, then pass them to inference.py:
169
+ # export device="cuda:0"
170
+ # export rewriter_device="cuda:1"
171
+ # python inference.py --device $device --rewriter_device $rewriter_device ...
172
+ # For more details, see INFERENCE_GUIDE.md.
173
+
174
+ python inference.py \
175
+ --pretrained_pipeline_name_or_path "models/Boogu-Image-0.1-Base" \
176
+ --instruction "一幅国风琉金风格的山水画作,展现了桂林山水在金光普照下的壮丽景象。远山层叠,江水如镜,山峰边缘勾勒着发光的金色线条。画面采用石青石绿岩彩与鎏金质感相结合,局部有厚涂油画笔触,空中飘浮着金色粒子,营造出梦幻朦胧而又磅礴大气的意境。" \
177
+ --num_inference_steps 50 \
178
+ --height 1024 --width 1024 \
179
+ --text_guidance_scale 4.0 \
180
+ --output_image_path "outputs/test_base/out_1.png" \
181
+ --device "$device"
182
+ ```
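When scripting many runs, it can help to assemble the command programmatically. The sketch below rebuilds the command above as an argument list and swaps in the steps and CFG values from the Model Zoo table. The helper `build_t2i_command` and the `VARIANT_DEFAULTS` mapping are hypothetical, and reusing the same flags for Turbo is an assumption; check INFERENCE_GUIDE.md for the supported options.

```python
import shlex

# Steps and guidance follow the Model Zoo table; Base uses the values from the
# command above. Assumption: Turbo accepts the same inference.py flags.
VARIANT_DEFAULTS = {
    "Base": {"steps": 50, "guidance": 4.0},
    "Turbo": {"steps": 4, "guidance": 0.0},
}


def build_t2i_command(variant, instruction, output_path,
                      device="cuda:0", height=1024, width=1024):
    """Return the inference.py argument list for a T2I variant (hypothetical helper)."""
    defaults = VARIANT_DEFAULTS[variant]
    return [
        "python", "inference.py",
        "--pretrained_pipeline_name_or_path", f"models/Boogu-Image-0.1-{variant}",
        "--instruction", instruction,
        "--num_inference_steps", str(defaults["steps"]),
        "--height", str(height), "--width", str(width),
        "--text_guidance_scale", str(defaults["guidance"]),
        "--output_image_path", output_path,
        "--device", device,
    ]


if __name__ == "__main__":
    cmd = build_t2i_command("Turbo", "draw a cat", "outputs/test_turbo/out_1.png")
    # Print a shell-quoted command; the list can also be passed to subprocess.run.
    print(shlex.join(cmd))
```

Building a list rather than a string avoids quoting problems with long prompts that contain spaces or punctuation.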
183
+
184
+ ### Hardware Notes
185
+
186
+ > 📖 For full CLI options, device setup, offload strategies, caching acceleration, Torch Compile, FP8, and batch inference details, see [**INFERENCE_GUIDE.md**](./INFERENCE_GUIDE.md).
187
+ > Torch Compile note: `--enable_torch_compile` can occasionally produce all-black outputs on some GPUs/models. If that happens, disable it first.
188
+
189
+ | VRAM | Recommended Config (T2I 1K) | Recommended Config (T2I 2K) |
190
+ |------|-----------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
191
+ | 12GB | Unquantized: `--enable_sequential_cpu_offload_flag`<br>Quantized: `--enable_model_cpu_offload_flag --use_fp8_weights` | Unquantized: `--enable_sequential_cpu_offload_flag`<br>Quantized: `--enable_group_offload_flag --use_fp8_weights` |
192
+ | 16GB | Unquantized: `--enable_sequential_cpu_offload_flag`<br>Quantized: `--enable_model_cpu_offload_flag --use_fp8_weights` | Unquantized: `--enable_sequential_cpu_offload_flag`<br>Quantized: `--enable_model_cpu_offload_flag --use_fp8_weights` |
193
+ | 24GB | Unquantized: `--enable_model_cpu_offload_flag`<br>Quantized: `--use_fp8_weights` | `--enable_model_cpu_offload_flag` |
194
+ | 32GB | Unquantized: `--enable_model_cpu_offload_flag`<br>Quantized: `--use_fp8_weights` | Unquantized: `--enable_model_cpu_offload_flag`<br>Quantized: `--use_fp8_weights` |
195
+ | 40GB | Base Model | Unquantized: `--enable_model_cpu_offload_flag`<br>Quantized: `--use_fp8_weights` |
196
+ | 80GB | Base Model | Base Model |
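The table can also be expressed as a small lookup, which may be convenient in launcher scripts. The sketch below covers only the "T2I 1K" column. The helper `recommended_flags` is hypothetical, "Base Model" is read as "no extra flags", and picking the largest tier that fits a given VRAM size is an assumption for values between rows.

```python
# Flags copied from the "T2I 1K" column of the table above.
# Each tier is (min_vram_gb, unquantized_flags, quantized_flags).
# An empty list stands for "Base Model" (assumed to mean no extra flags).
T2I_1K_TIERS = [
    (12, ["--enable_sequential_cpu_offload_flag"],
         ["--enable_model_cpu_offload_flag", "--use_fp8_weights"]),
    (16, ["--enable_sequential_cpu_offload_flag"],
         ["--enable_model_cpu_offload_flag", "--use_fp8_weights"]),
    (24, ["--enable_model_cpu_offload_flag"], ["--use_fp8_weights"]),
    (32, ["--enable_model_cpu_offload_flag"], ["--use_fp8_weights"]),
    (40, [], []),
    (80, [], []),
]


def recommended_flags(vram_gb, quantized=False):
    """Return the table's 1K flags for the largest tier not exceeding vram_gb."""
    chosen = None
    for tier_gb, plain, fp8 in T2I_1K_TIERS:
        if vram_gb >= tier_gb:
            chosen = fp8 if quantized else plain
    if chosen is None:
        raise ValueError("the table does not cover GPUs with less than 12GB of VRAM")
    return list(chosen)


if __name__ == "__main__":
    print(recommended_flags(24, quantized=True))
```

The returned list can be appended directly to an `inference.py` argument list.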
197
+
198
+ ## ⚠️ Responsible AI & Limitations
199
+
200
+ **Boogu-Image-0.1** is released for **research purposes** and is not intended for production deployment without additional safeguards. We took responsible-AI considerations into account during data curation, training, and evaluation; however, the model may still produce outputs that are inaccurate, biased, or otherwise inappropriate.
201
+
202
+ ### Known Limitations
203
+
204
+ **🌍 World Knowledge Gap**
205
+ - For tasks requiring rich common sense, domain knowledge, real brands or people, famous landmarks, celebrities, products, or complex contextual understanding, Boogu still has a clear gap from strong closed-source systems
206
+ - This capability is extraordinarily expensive to measure; even Arena-style evaluation struggles to assess it fully, so existing benchmarks barely quantify this dimension and the real gap is likely larger than measured scores suggest
207
+
208
+ **🖼️ Image-to-Image Consistency & In-Context Scenarios**
209
+ - For editing tasks requiring strict preservation of the input subject, identity, layout, or fine details, Boogu's image-to-image consistency is still not stable enough
210
+ - Because our image-to-image capability focuses more on photography and text-generation applications, Boogu still trails **Seedream 5.0** and **Nano Banana Pro** in some in-context generation scenarios
211
+
212
+ **📝 Text Rendering Stability**
213
+ - Boogu can handle many Chinese and English text scenarios, but long text, dense typography, small fonts, and complex design layouts can still produce typos, missing characters, or layout drift
214
+ - Text rendering is currently focused on Chinese and English; other languages are not specifically optimized and may degrade noticeably
215
+
216
+ **🦴 Body Structure in Complex Poses**
217
+ - In multi-person interaction, occlusion, exaggerated motion, or unusual viewpoints, hands, limbs, and body structure may still become unnatural or inconsistent
218
+
219
+ **👤 Small Faces & Small Limbs**
220
+ - Because we use the open-source **FLUX.1 VAE**, reconstruction loss is relatively large, so details such as small faces, small limbs, eyes, and text may still show artifacts or instability
221
+
222
+ **📦 Limited Release Scope**
223
+ - Due to resource constraints, engineering complexity, and release boundaries, we are not able to open-source every training and system detail
224
+ - The current open-source release aims to balance reproducibility, usability, and sustainable maintenance while providing a reliable starting point for community research and improvement
225
+
226
+ Downstream users are responsible for applying content moderation, validation, and compliance checks appropriate to their use case.
227
+
228
+
229
+ ## 🙏 Acknowledgements
230
+
231
+ Closed-source systems such as [GPT-Image](https://openai.com/index/introducing-chatgpt-images-2-0/), [Nano Banana](https://gemini.google/overview/image-generation/), and the [Seedream](https://seed.bytedance.com/en/seedream5_0_lite) series helped us understand the frontier capabilities and practical boundaries of unified understanding-and-generation systems. We thank the [Qwen-Image](https://github.com/QwenLM/Qwen-Image), [Z-Image](https://github.com/Tongyi-MAI/Z-Image), [OmniGen2](https://github.com/VectorSpaceLab/OmniGen2), [FLUX](https://github.com/black-forest-labs/flux), and broader open-source communities for the foundations they provide, and [DeepSeek](https://www.deepseek.com) for strong open-source understanding models that support open-source unified multimodal systems.
232
+
233
+
234
+ ## 📄 License
235
+
236
+ This project is released under the [Apache-2.0 License](LICENSE).