asats committed on
Commit d448ff0 · verified · 1 Parent(s): 5ce9242

Update README with comprehensive documentation

  1. README.md +200 -36
README.md CHANGED
@@ -1,68 +1,232 @@
- # Thumbnail VLM - Janus-Pro-7B Fine-tuned
-
- A Vision-Language Model for **professional thumbnail generation** that accepts flexible multimodal inputs.
 
  ## 🎯 Capabilities

  | Input Mode | Description | Example |
  |---|---|---|
- | **Text → Thumbnail** | Generate thumbnail from text description | "Epic gaming video about Minecraft" → 🖼️ |
  | **Image → Thumbnail** | Generate thumbnail from reference image | 📷 → 🖼️ |
- | **Text + Image → Thumbnail** | Generate thumbnail from both text and image | "Make it a cooking thumbnail" + 📷 → 🖼️ |
 
  ## 🏗️ Architecture

- - **Base Model:** [deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B)
- - **Architecture:** DeepSeek-LLM-7B + SigLIP (understanding) + VQ-16 (generation)
  - **Training Method:** Full SFT following [Janus-4o recipe](https://arxiv.org/abs/2506.18095)
- - **Training Data:** PosterCraft/Poster100K + synthetic thumbnail prompts (~10K samples)
- - **Image Resolution:** 384×384 (576 VQ tokens, codebook=16384)
 
- ## 📊 Training Details

- | Parameter | Value |
- |---|---|
- | Learning Rate | 5e-6 |
- | Epochs | 3 |
- | Effective Batch Size | 16 |
- | Optimizer | AdamW (β₁=0.9, β₂=0.95) |
- | CFG Prompt Masking | 10% |
- | Precision | bfloat16 |
 
  ## 🚀 Quick Start

  ```python
  import torch
  from transformers import AutoModelForCausalLM
  from janus.models import MultiModalityCausalLM, VLChatProcessor

- # Install Janus first: pip install -e . (from https://github.com/deepseek-ai/Janus)
-
  model_path = "asats/thumbnail-vlm-janus-pro"
  processor = VLChatProcessor.from_pretrained(model_path)
  model = AutoModelForCausalLM.from_pretrained(
      model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
  ).cuda().eval()

- # Generate thumbnail from text
- prompt = "Professional tech review thumbnail with iPhone 16, dramatic lighting, text 'BEST PHONE 2025'"
- # ... (see inference_janus.py for full generation code)
  ```
 
- ## 📚 Citation

- ```bibtex
- @misc{thumbnail-vlm-2025,
-   title={Thumbnail VLM: Fine-tuned Janus-Pro-7B for Thumbnail Generation},
-   year={2025},
-   base_model={deepseek-ai/Janus-Pro-7B},
-   dataset={PosterCraft/Poster100K},
- }
  ```

- ## 🔗 Related

- - [Janus-Pro Paper](https://arxiv.org/abs/2501.17811)
- - [Janus-4o Paper](https://arxiv.org/abs/2506.18095)
- - [PosterCraft Dataset](https://huggingface.co/datasets/PosterCraft/Poster100K)
- - [ShareGPT-4o-Image](https://huggingface.co/datasets/FreedomIntelligence/ShareGPT-4o-Image)
 
+ ---
+ base_model:
+ - deepseek-ai/Janus-Pro-7B
+ datasets:
+ - PosterCraft/Poster100K
+ - FreedomIntelligence/ShareGPT-4o-Image
+ language:
+ - en
+ library_name: transformers
+ license: mit
+ pipeline_tag: any-to-any
+ tags:
+ - text-to-image
+ - image-to-image
+ - text-and-image-to-image
+ - multimodal
+ - unified-model
+ - thumbnail-generation
+ - vlm
+ ---
+
+ # 🎨 Thumbnail VLM - Janus-Pro-7B for Thumbnail Generation
+
+ A **Vision-Language Model** fine-tuned for professional thumbnail generation. Accepts flexible multimodal inputs (text, image, or both) and always outputs a thumbnail image.
 
  ## 🎯 Capabilities

  | Input Mode | Description | Example |
  |---|---|---|
+ | **Text → Thumbnail** | Generate thumbnail from text description | `"Epic gaming video about Minecraft"` → 🖼️ |
  | **Image → Thumbnail** | Generate thumbnail from reference image | 📷 → 🖼️ |
+ | **Text + Image → Thumbnail** | Generate thumbnail from both | `"Make a cooking thumbnail"` + 📷 → 🖼️ |
 
  ## 🏗️ Architecture

+ ```
+ ┌─────────────────────────────────────────────────┐
+ │            Janus-Pro-7B Architecture            │
+ ├─────────────────────────────────────────────────┤
+ │                                                 │
+ │  Input Text  ──→ Tokenizer ──→ ┐                │
+ │                                ├──→ DeepSeek-LLM│
+ │  Input Image ──→ SigLIP    ──→ ┘  (7B, 30 layers│
+ │                                    4096-dim)    │
+ │                                                 │
+ │  DeepSeek-LLM ──→ gen_head ──→ VQ Logits        │
+ │                   (4096→16384)                  │
+ │                                                 │
+ │  VQ Tokens ──→ VQ-16 Decoder ──→ Output Image   │
+ │  (16384 codebook,                (384×384)      │
+ │   576 tokens/img)                               │
+ └─────────────────────────────────────────────────┘
+ ```
+
+ - **Base Model:** [deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B) (7.4B params)
+ - **Understanding Encoder:** SigLIP-Large (384×384, 576 tokens)
+ - **Generation Tokenizer:** VQ-16 (codebook=16384, 576 discrete tokens per image)
  - **Training Method:** Full SFT following [Janus-4o recipe](https://arxiv.org/abs/2506.18095)
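
The resolution and token counts quoted above are linked by simple arithmetic: a 384×384 image split into a 24×24 grid gives 576 tokens. The sketch below illustrates this under the assumption that the "16" in VQ-16 is the per-side downsampling factor; `vq_token_grid` is a hypothetical helper for illustration, not part of the Janus library.

```python
# Hypothetical helper relating image resolution to the VQ token grid.
# Assumption: the VQ-16 tokenizer downsamples each side by a factor of 16.
def vq_token_grid(image_size: int = 384, downsample: int = 16) -> tuple:
    """Return (tokens_per_side, tokens_per_image) for a square image."""
    if image_size % downsample != 0:
        raise ValueError("image_size must be a multiple of downsample")
    side = image_size // downsample
    return side, side * side


side, total = vq_token_grid()
print(side, total)  # 24 576
```

The 24×24 grid matches the `[1, 8, 24, 24]` shape that the Quick Start code passes to the VQ decoder.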
 
 
+ ## 📊 Training Recipe

+ | Parameter | Value | Source |
+ |---|---|---|
+ | Base model | `deepseek-ai/Janus-Pro-7B` | Janus-4o paper |
+ | Learning Rate | 5e-6 | Janus-4o §3.3 |
+ | Epochs | 3 | Janus-4o §3.3 |
+ | Effective Batch Size | 16 (1×16 grad accum) | Adapted from paper's 128 |
+ | Optimizer | AdamW (β₁=0.9, β₂=0.95) | Janus-4o |
+ | CFG Prompt Masking | 10% | Janus-4o §3.1 |
+ | Precision | bfloat16 | Model default |
+ | Image Resolution | 384×384 | Architecture constraint |
+ | Frozen | SigLIP + VQ Tokenizer | Efficiency |
+ | Trainable | LLM + gen_head + aligners | ~6.5B params |
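
CFG prompt masking drops the text condition for a fraction of training samples so that the model also learns an unconditional distribution, which classifier-free guidance needs at inference time. The sketch below shows one way a 10% masking rate could be applied. It assumes masking means replacing the prompt with an empty string; the actual training script may instead overwrite token ids with padding. `mask_prompt_for_cfg` is a hypothetical name.

```python
import random


# Hypothetical illustration of CFG prompt masking (not the repository's code).
# Assumption: a masked sample simply gets an empty prompt.
def mask_prompt_for_cfg(prompt: str, drop_prob: float = 0.10, rng=None) -> str:
    """Return an empty prompt with probability drop_prob, else the prompt unchanged."""
    if rng is None:
        rng = random
    return "" if rng.random() < drop_prob else prompt


rng = random.Random(0)  # seeded so the example is reproducible
prompts = [f"thumbnail prompt {i}" for i in range(10_000)]
masked = [mask_prompt_for_cfg(p, rng=rng) for p in prompts]
drop_rate = sum(m == "" for m in masked) / len(masked)
print(f"fraction of prompts masked: {drop_rate:.3f}")
```

Over many samples the masked fraction should land close to the configured 10%.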
+
+ ### Training Data
+
+ | Dataset | Samples | Type |
+ |---|---|---|
+ | [PosterCraft/Poster100K](https://huggingface.co/datasets/PosterCraft/Poster100K) | 8,000 | Movie/TV posters (T2I) |
+ | Synthetic thumbnail prompts | 2,000 | YouTube-style prompts (T2I) |
+ | **Total** | **~10,000** | |
 
  ## 🚀 Quick Start

+ ### Installation
+
+ ```bash
+ # Install Janus library
+ git clone https://github.com/deepseek-ai/Janus.git
+ cd Janus && pip install -e .
+
+ # Install other dependencies
+ pip install torch transformers Pillow numpy
+ ```
+
+ ### Text → Thumbnail
+
  ```python
  import torch
+ import numpy as np
+ import PIL.Image
  from transformers import AutoModelForCausalLM
  from janus.models import MultiModalityCausalLM, VLChatProcessor

  model_path = "asats/thumbnail-vlm-janus-pro"
  processor = VLChatProcessor.from_pretrained(model_path)
  model = AutoModelForCausalLM.from_pretrained(
      model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
  ).cuda().eval()

+ # Generate thumbnail
+ prompt = "Professional tech review thumbnail: iPhone 16 with dramatic lighting, text 'BEST PHONE 2025'"
+ conversation = [
+     {"role": "<|User|>", "content": prompt},
+     {"role": "<|Assistant|>", "content": ""},
+ ]
+ sft_format = processor.apply_sft_template_for_multi_turn_prompts(
+     conversations=conversation, sft_format=processor.sft_format, system_prompt=""
+ )
+ prompt_text = sft_format + processor.image_start_tag
+
+ with torch.inference_mode():
+     input_ids = torch.LongTensor(processor.tokenizer.encode(prompt_text))
+     tokens = torch.zeros((2, len(input_ids)), dtype=torch.int).cuda()
+     tokens[0] = input_ids  # conditional
+     tokens[1] = input_ids; tokens[1, 1:-1] = processor.pad_id  # unconditional
+
+     inputs_embeds = model.language_model.get_input_embeddings()(tokens)
+     generated = torch.zeros((1, 576), dtype=torch.int).cuda()
+
+     past_kv = None
+     for t in range(576):
+         outputs = model.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=past_kv)
+         past_kv = outputs.past_key_values
+         logits = model.gen_head(outputs.last_hidden_state[:, -1, :])
+         guided = logits[1:2] + 5.0 * (logits[0:1] - logits[1:2])
+         next_tok = torch.multinomial(torch.softmax(guided, -1), 1)
+         generated[:, t] = next_tok.squeeze(-1)
+         img_emb = model.prepare_gen_img_embeds(torch.cat([next_tok, next_tok], 0).squeeze(-1))
+         inputs_embeds = img_emb.unsqueeze(1)
+
+     dec = model.gen_vision_model.decode_code(generated, shape=[1, 8, 24, 24])
+     img = np.clip((dec.float().cpu().numpy().transpose(0,2,3,1) + 1) / 2 * 255, 0, 255).astype(np.uint8)
+     PIL.Image.fromarray(img[0]).save("thumbnail.png")
+ ```
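
The `guided = ...` line in the loop above is the classifier-free guidance step: it starts from the unconditional logits and moves toward the conditional ones, scaled by a guidance weight (5.0 in that code). The pure-Python sketch below isolates that arithmetic on a toy three-entry vocabulary. The function names and toy values are illustrative assumptions, not part of the Janus API.

```python
import math


# Toy illustration of the classifier-free guidance combination.
def cfg_combine(cond, uncond, weight=5.0):
    """Per-entry guided logits: uncond + weight * (cond - uncond)."""
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


cond = [2.0, 1.0, 0.0]    # logits given the prompt
uncond = [1.0, 1.0, 1.0]  # logits with the prompt padded out
guided = cfg_combine(cond, uncond)
print(guided)  # [6.0, 1.0, -4.0]
print(softmax(guided))
```

A weight of 1.0 reproduces the conditional logits and 0.0 reproduces the unconditional ones. Larger weights sharpen the distribution toward tokens that the prompt makes more likely.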
+
+ ### Image → Thumbnail
+
+ ```bash
+ # Uses the model's understanding to caption the image, then generates a thumbnail
+ python scripts/inference_janus.py --mode image --input_image photo.jpg
  ```

+ ### Text + Image → Thumbnail

+ ```bash
+ # Uses both the text instruction and the reference image
+ python scripts/inference_janus.py --mode both \
+     --prompt "Create a cooking video thumbnail with text 'EASY RECIPE'" \
+     --input_image food_photo.jpg
  ```
 
+ ## 🔧 Training from Scratch
+
+ ### Option 1: HuggingFace Jobs (Recommended)
+
+ ```python
+ # Launch via HF Jobs API
+ from huggingface_hub import HfApi
+ api = HfApi()
+
+ # Requires: a100-large hardware, 8h timeout
+ # Dependencies: torch, transformers, datasets, Pillow, numpy, tqdm,
+ #   trackio, accelerate, janus @ git+https://github.com/deepseek-ai/Janus.git
+ ```
+
+ ### Option 2: Local Training
+
+ ```bash
+ # Clone repo and install
+ git clone https://github.com/deepseek-ai/Janus.git && cd Janus && pip install -e .
+ pip install torch transformers datasets Pillow numpy tqdm trackio accelerate
+
+ # Run training (needs ~40GB VRAM, A100 recommended)
+ python run_training.py
+ ```
+
+ ### Option 3: Alternative - OmniGen LoRA (Lower VRAM)
+
+ For a lighter approach using OmniGen-v1 (3.8B params, LoRA fine-tuning on a single 24GB GPU):
+
+ ```bash
+ pip install OmniGen accelerate peft
+ accelerate launch train_omnigen.py \
+     --model_name_or_path Shitao/OmniGen-v1 \
+     --json_file train.jsonl \
+     --image_path ./images \
+     --use_lora --lora_rank 8 \
+     --lr 1e-3 --epochs 3
+ ```
+
+ ## 📁 Repository Structure
+
+ ```
+ ├── README.md                    # This file
+ ├── scripts/
+ │   ├── run_training.py          # End-to-end training pipeline (data prep + train + eval)
+ │   ├── inference_janus.py       # Inference for all 3 input modes
+ │   ├── train_janus.py           # Modular Janus training script
+ │   ├── train_omnigen.py         # Alternative OmniGen LoRA training
+ │   └── prepare_data.py          # Data preparation utilities
+ ```
+
+ ## 📈 Training Data Sources
+
+ | Dataset | Size | Content | Format |
+ |---|---|---|---|
+ | [PosterCraft/Poster100K](https://hf.co/datasets/PosterCraft/Poster100K) | 93K | Movie/TV posters | image + rich caption |
+ | [ShareGPT-4o-Image](https://hf.co/datasets/FreedomIntelligence/ShareGPT-4o-Image) | 91K | GPT-4o synthetic pairs | prompt + image |
+ | [CSU-JPG/TextAtlas5M](https://hf.co/datasets/CSU-JPG/TextAtlas5M) | 5M+ | Text-in-image data | image + annotation |
+ | [fantasyfish/laion-art](https://hf.co/datasets/fantasyfish/laion-art) | 20K | High-aesthetic images | image + text |
+
+ ## 📚 References
+
+ - **Janus-Pro:** [arxiv:2501.17811](https://arxiv.org/abs/2501.17811) - Unified understanding and generation
+ - **Janus-4o:** [arxiv:2506.18095](https://arxiv.org/abs/2506.18095) - ShareGPT-4o-Image fine-tuning recipe
+ - **OmniGen:** [arxiv:2409.11340](https://arxiv.org/abs/2409.11340) - Unified image generation (alternative)
+ - **PosterCraft:** [arxiv:2506.10741](https://arxiv.org/abs/2506.10741) - Poster dataset and generation
+
+ ## ⚖️ License

+ MIT (code) + [DeepSeek Model License](https://github.com/deepseek-ai/Janus/blob/main/LICENSE) (model weights)