@@ -1,68 +1,232 @@
-# Thumbnail VLM - Janus-Pro-7B Fine-tuned
-A Vision-Language Model for **professional thumbnail generation** that accepts flexible multimodal inputs.
 ## 🎯 Capabilities
 | Input Mode | Description | Example |
 |---|---|---|
-| **Text → Thumbnail** | Generate thumbnail from text description | "Epic gaming video about Minecraft" → 🖼️ |
 | **Image → Thumbnail** | Generate thumbnail from reference image | 📷 → 🖼️ |
-| **Text + Image → Thumbnail** | Generate thumbnail from both text and image | "Make it a cooking thumbnail" + 📷 → 🖼️ |
 ## 🏗️ Architecture
-- **Base Model:** [deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B)
-- **Architecture:** DeepSeek-LLM-7B + SigLIP (understanding) + VQ-16 (generation)
 - **Training Method:** Full SFT following [Janus-4o recipe](https://arxiv.org/abs/2506.18095)
-- **Training Data:** PosterCraft/Poster100K + synthetic thumbnail prompts (~10K samples)
-- **Image Resolution:** 384×384 (576 VQ tokens, codebook=16384)
-## 📊 Training Details
-| Parameter | Value |
-|---|---|
-| Learning Rate | 5e-6 |
-| Epochs | 3 |
-| Effective Batch Size | 16 |
-| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
-| CFG Prompt Masking | 10% |
-| Precision | bfloat16 |
 ## 🚀 Quick Start
 ```python
 import torch
 from transformers import AutoModelForCausalLM
 from janus.models import MultiModalityCausalLM, VLChatProcessor
-# Install Janus first: pip install -e . (from https://github.com/deepseek-ai/Janus)
 model_path = "asats/thumbnail-vlm-janus-pro"
 processor = VLChatProcessor.from_pretrained(model_path)
 model = AutoModelForCausalLM.from_pretrained(
     model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
 ).cuda().eval()
-# Generate thumbnail from text
-prompt = "Professional tech review thumbnail with iPhone 16, dramatic lighting, text 'BEST PHONE 2025'"
-# ... (see inference_janus.py for full generation code)
 ```
-## 📚 Citation
-```bibtex
-@misc{thumbnail-vlm-2025,
-  title={Thumbnail VLM: Fine-tuned Janus-Pro-7B for Thumbnail Generation},
-  year={2025},
-  base_model={deepseek-ai/Janus-Pro-7B},
-  dataset={PosterCraft/Poster100K},
-}
 ```
-## 🔗 Related
-- [Janus-Pro Paper](https://arxiv.org/abs/2501.17811)
-- [Janus-4o Paper](https://arxiv.org/abs/2506.18095)
-- [PosterCraft Dataset](https://huggingface.co/datasets/PosterCraft/Poster100K)
-- [ShareGPT-4o-Image](https://huggingface.co/datasets/FreedomIntelligence/ShareGPT-4o-Image)

+---
+base_model:
+- deepseek-ai/Janus-Pro-7B
+datasets:
+- PosterCraft/Poster100K
+- FreedomIntelligence/ShareGPT-4o-Image
+language:
+- en
+library_name: transformers
+license: mit
+pipeline_tag: any-to-any
+tags:
+- text-to-image
+- image-to-image
+- text-and-image-to-image
+- multimodal
+- unified-model
+- thumbnail-generation
+- vlm
+---
+# 🎨 Thumbnail VLM — Janus-Pro-7B for Thumbnail Generation
+A **Vision-Language Model** fine-tuned for professional thumbnail generation. Accepts flexible multimodal inputs (text, image, or both) and always outputs a thumbnail image.
 ## 🎯 Capabilities
 | Input Mode | Description | Example |
 |---|---|---|
+| **Text → Thumbnail** | Generate thumbnail from text description | `"Epic gaming video about Minecraft"` → 🖼️ |
 | **Image → Thumbnail** | Generate thumbnail from reference image | 📷 → 🖼️ |
+| **Text + Image → Thumbnail** | Generate thumbnail from both | `"Make a cooking thumbnail"` + 📷 → 🖼️ |
 ## 🏗️ Architecture
+```
+┌─────────────────────────────────────────────────┐
+│              Janus-Pro-7B Architecture           │
+├─────────────────────────────────────────────────┤
+│                                                  │
+│  Input Text ──→ Tokenizer ──→ ┐                 │
+│                                ├──→ DeepSeek-LLM │
+│  Input Image ──→ SigLIP ──→  ┘    (7B, 30 layers│
+│                                     4096-dim)    │
+│                                                  │
+│  DeepSeek-LLM ──→ gen_head ──→ VQ Logits        │
+│                    (4096→16384)                   │
+│                                                  │
+│  VQ Tokens ──→ VQ-16 Decoder ──→ Output Image   │
+│                (16384 codebook,   (384×384)      │
+│                 576 tokens/img)                   │
+└─────────────────────────────────────────────────┘
+```
+- **Base Model:** [deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B) (7.4B params)
+- **Understanding Encoder:** SigLIP-Large (384×384, 576 tokens)
+- **Generation Tokenizer:** VQ-16 (codebook=16384, 576 discrete tokens per image)
 - **Training Method:** Full SFT following [Janus-4o recipe](https://arxiv.org/abs/2506.18095)
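The token counts above can be checked with a little arithmetic. The sketch below assumes that the "16" in VQ-16 is a 16× spatial downsampling factor (each token covers a 16×16 pixel patch); the helper name `vq_grid` is hypothetical and not part of the Janus library.

```python
# Assumption: "VQ-16" means a 16x spatial downsampling factor.
IMAGE_SIZE = 384       # pixels per side, from the model card
DOWNSAMPLE = 16        # assumed pixels per token along each axis
CODEBOOK_SIZE = 16384  # entries in the VQ codebook, from the model card


def vq_grid(image_size: int = IMAGE_SIZE, downsample: int = DOWNSAMPLE) -> tuple[int, int]:
    """Return (tokens per side, total tokens) for a square image."""
    side = image_size // downsample
    return side, side * side


side, num_tokens = vq_grid()
# 16384 is a power of two, so bit_length() - 1 gives log2 exactly.
bits_per_token = CODEBOOK_SIZE.bit_length() - 1
print(side, num_tokens, bits_per_token)  # 24 576 14
```

This matches the `shape=[1, 8, 24, 24]` grid passed to `decode_code` in the Quick Start and the 576-step sampling loop.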
+## 📊 Training Recipe
+| Parameter | Value | Source / Notes |
+|---|---|---|
+| Base model | `deepseek-ai/Janus-Pro-7B` | Janus-4o paper |
+| Learning Rate | 5e-6 | Janus-4o §3.3 |
+| Epochs | 3 | Janus-4o §3.3 |
+| Effective Batch Size | 16 (batch size 1 × 16 gradient-accumulation steps) | Adapted from paper's 128 |
+| Optimizer | AdamW (β₁=0.9, β₂=0.95) | Janus-4o |
+| CFG Prompt Masking | 10% | Janus-4o §3.1 |
+| Precision | bfloat16 | Model default |
+| Image Resolution | 384×384 | Architecture constraint |
+| Frozen | SigLIP + VQ Tokenizer | Efficiency |
+| Trainable | LLM + gen_head + aligners | ~6.5B params |
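The 10% CFG prompt masking row can be illustrated with a minimal sketch. It assumes the masking mirrors the unconditional branch used at inference time, where every prompt token except the first and last is replaced by the pad id. The function name `mask_prompt_for_cfg` and the toy token ids are hypothetical; the actual training script may implement this differently.

```python
import random


def mask_prompt_for_cfg(prompt_ids, pad_id, drop_prob=0.10, rng=random):
    """With probability drop_prob, replace the inner prompt tokens with pad_id.

    The first and last tokens are kept, which is an assumption modelled on
    the unconditional row built at inference time.
    """
    if len(prompt_ids) > 2 and rng.random() < drop_prob:
        return [prompt_ids[0]] + [pad_id] * (len(prompt_ids) - 2) + [prompt_ids[-1]]
    return list(prompt_ids)


rng = random.Random(0)            # fixed seed so the estimate is reproducible
prompt = [100, 11, 12, 13, 200]   # toy token ids, purely illustrative
trials = 10_000
masked = sum(mask_prompt_for_cfg(prompt, pad_id=0, rng=rng) != prompt for _ in range(trials))
print(f"masked fraction: {masked / trials:.3f}")  # close to 0.10
```

Training on a small fraction of masked prompts is what lets the model produce the unconditional logits that classifier-free guidance needs at sampling time.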
+### Training Data
+| Dataset | Samples | Type |
+|---|---|---|
+| [PosterCraft/Poster100K](https://huggingface.co/datasets/PosterCraft/Poster100K) | 8,000 | Movie/TV posters (T2I) |
+| Synthetic thumbnail prompts | 2,000 | YouTube-style prompts (T2I) |
+| **Total** | **~10,000** | |
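Combining the sample counts with the recipe above gives a rough size for the training run. This is a back-of-the-envelope sketch that assumes a single device and that no samples are dropped or filtered during data preparation.

```python
import math

num_samples = 8_000 + 2_000   # Poster100K subset + synthetic prompts
per_device_batch = 1          # from the recipe table
grad_accum_steps = 16         # from the recipe table
epochs = 3

effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)  # 16 625 1875
```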
 ## 🚀 Quick Start
+### Installation
+```bash
+# Install Janus library
+git clone https://github.com/deepseek-ai/Janus.git
+cd Janus && pip install -e .
+# Install other dependencies
+pip install torch transformers Pillow numpy
+```
+### Text → Thumbnail
 ```python
 import torch
+import numpy as np
+import PIL.Image
 from transformers import AutoModelForCausalLM
 from janus.models import MultiModalityCausalLM, VLChatProcessor
 model_path = "asats/thumbnail-vlm-janus-pro"
 processor = VLChatProcessor.from_pretrained(model_path)
 model = AutoModelForCausalLM.from_pretrained(
     model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
 ).cuda().eval()
+# Generate thumbnail
+prompt = "Professional tech review thumbnail: iPhone 16 with dramatic lighting, text 'BEST PHONE 2025'"
+conversation = [
+    {"role": "<|User|>", "content": prompt},
+    {"role": "<|Assistant|>", "content": ""},
+]
+sft_format = processor.apply_sft_template_for_multi_turn_prompts(
+    conversations=conversation, sft_format=processor.sft_format, system_prompt=""
+)
+prompt_text = sft_format + processor.image_start_tag
+with torch.inference_mode():
+    input_ids = torch.LongTensor(processor.tokenizer.encode(prompt_text))
+    tokens = torch.zeros((2, len(input_ids)), dtype=torch.int).cuda()
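+    tokens[0] = input_ids  # conditional row: the full prompt
+    tokens[1] = input_ids  # unconditional row for classifier-free guidance
+    tokens[1, 1:-1] = processor.pad_id  # pad everything except the first and last token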
+    inputs_embeds = model.language_model.get_input_embeddings()(tokens)
+    generated = torch.zeros((1, 576), dtype=torch.int).cuda()
+    past_kv = None
+    for t in range(576):
+        outputs = model.language_model.model(inputs_embeds=inputs_embeds, use_cache=True, past_key_values=past_kv)
+        past_kv = outputs.past_key_values
+        logits = model.gen_head(outputs.last_hidden_state[:, -1, :])
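+        # Classifier-free guidance (scale 5.0): push the unconditional logits toward the conditional ones
+        guided = logits[1:2] + 5.0 * (logits[0:1] - logits[1:2])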
+        next_tok = torch.multinomial(torch.softmax(guided, -1), 1)
+        generated[:, t] = next_tok.squeeze(-1)
+        img_emb = model.prepare_gen_img_embeds(torch.cat([next_tok, next_tok], 0).squeeze(-1))
+        inputs_embeds = img_emb.unsqueeze(1)
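+    # Decode the 576 VQ tokens (a 24×24 grid) back to pixels
+    dec = model.gen_vision_model.decode_code(generated, shape=[1, 8, 24, 24])
+    # Map the decoder output from [-1, 1] to uint8 [0, 255] in height × width × channel layout
+    img = np.clip((dec.float().cpu().numpy().transpose(0, 2, 3, 1) + 1) / 2 * 255, 0, 255).astype(np.uint8)
+    PIL.Image.fromarray(img[0]).save("thumbnail.png")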
+```
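The guidance step in the sampling loop is easier to see on a toy example. The sketch below applies the same formula, `uncond + scale * (cond - uncond)`, to a three-entry vocabulary in plain Python; the logit values are made up for illustration and the helper names are hypothetical.

```python
import math


def cfg_logits(cond, uncond, scale=5.0):
    """Classifier-free guidance: extrapolate from uncond toward cond."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


cond = [2.0, 1.0, 0.0]     # toy logits given the prompt
uncond = [1.0, 1.0, 1.0]   # toy logits given the padded prompt
guided = cfg_logits(cond, uncond)
print(guided)  # [6.0, 1.0, -4.0]
print(softmax(cond)[0] < softmax(guided)[0])  # True
```

A larger scale sharpens the distribution toward tokens the prompt favours, which tends to trade sample diversity for prompt adherence.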
+### Image → Thumbnail
+```bash
+# Captions the reference image with the model's understanding branch, then generates a thumbnail from that caption
+python scripts/inference_janus.py --mode image --input_image photo.jpg
 ```
+### Text + Image → Thumbnail
+```bash
+# Uses both text instruction and reference image
+python scripts/inference_janus.py --mode both \
+    --prompt "Create a cooking video thumbnail with text 'EASY RECIPE'" \
+    --input_image food_photo.jpg
 ```
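The image-conditioned modes are described as captioning the reference image first and then generating from the result. A rough sketch of how the two conversations might be assembled is shown below. It assumes the script follows the upstream Janus conventions (the `<image_placeholder>` tag and an `images` key in the user turn); the helper names and the question wording are hypothetical, and `scripts/inference_janus.py` may do this differently.

```python
def build_caption_request(image_path, instruction=None):
    """Stage 1: ask the understanding branch to describe the reference image."""
    question = "Describe this image in detail for a video thumbnail."
    if instruction:
        question += f" Keep this instruction in mind: {instruction}"
    return [
        {
            "role": "<|User|>",
            "content": f"<image_placeholder>\n{question}",
            "images": [image_path],
        },
        {"role": "<|Assistant|>", "content": ""},
    ]


def build_generation_request(caption, instruction=None):
    """Stage 2: turn the caption (and optional instruction) into a T2I prompt."""
    prompt = caption if not instruction else f"{instruction}. Reference: {caption}"
    return [
        {"role": "<|User|>", "content": prompt},
        {"role": "<|Assistant|>", "content": ""},
    ]


stage1 = build_caption_request("food_photo.jpg", "Create a cooking video thumbnail")
stage2 = build_generation_request("A bowl of ramen on a wooden table", "Create a cooking video thumbnail")
print(stage1[0]["content"])
print(stage2[0]["content"])
```

The second conversation has the same shape as the one in the Text → Thumbnail example, so it can be passed through the same template and sampling loop.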
+## 🔧 Training from Scratch
+### Option 1: HuggingFace Jobs (Recommended)
+```python
+# Launch via HF Jobs API
+from huggingface_hub import HfApi
+api = HfApi()
+# Requires: a100-large hardware, 8h timeout
+# Dependencies: torch, transformers, datasets, Pillow, numpy, tqdm,
+#               trackio, accelerate, janus @ git+https://github.com/deepseek-ai/Janus.git
+```
+### Option 2: Local Training
+```bash
+# Install the Janus library (the subshell keeps you in the current directory)
+git clone https://github.com/deepseek-ai/Janus.git && (cd Janus && pip install -e .)
+pip install torch transformers datasets Pillow numpy tqdm trackio accelerate
+# Run training from the root of this repository (needs ~40GB VRAM, A100 recommended)
+python scripts/run_training.py
+```
+### Option 3: Alternative — OmniGen LoRA (Lower VRAM)
+For a lighter approach, fine-tune OmniGen-v1 (3.8B params) with LoRA on a single 24GB GPU:
+```bash
+pip install OmniGen accelerate peft
+accelerate launch scripts/train_omnigen.py \
+    --model_name_or_path Shitao/OmniGen-v1 \
+    --json_file train.jsonl \
+    --image_path ./images \
+    --use_lora --lora_rank 8 \
+    --lr 1e-3 --epochs 3
+```
+## 📁 Repository Structure
+```
+├── README.md                    # This file
+├── scripts/
+│   ├── run_training.py          # End-to-end training pipeline (data prep + train + eval)
+│   ├── inference_janus.py       # Inference for all 3 input modes
+│   ├── train_janus.py           # Modular Janus training script
+│   ├── train_omnigen.py         # Alternative OmniGen LoRA training
+│   └── prepare_data.py          # Data preparation utilities
+```
+## 📈 Training Data Sources
+| Dataset | Size | Content | Format |
+|---|---|---|---|
+| [PosterCraft/Poster100K](https://hf.co/datasets/PosterCraft/Poster100K) | 93K | Movie/TV posters | image + rich caption |
+| [ShareGPT-4o-Image](https://hf.co/datasets/FreedomIntelligence/ShareGPT-4o-Image) | 91K | GPT-4o synthetic pairs | prompt + image |
+| [CSU-JPG/TextAtlas5M](https://hf.co/datasets/CSU-JPG/TextAtlas5M) | 5M+ | Text-in-image data | image + annotation |
+| [fantasyfish/laion-art](https://hf.co/datasets/fantasyfish/laion-art) | 20K | High-aesthetic images | image + text |
+## 📚 References
+- **Janus-Pro:** [arxiv:2501.17811](https://arxiv.org/abs/2501.17811) — Unified understanding and generation
+- **Janus-4o:** [arxiv:2506.18095](https://arxiv.org/abs/2506.18095) — ShareGPT-4o-Image fine-tuning recipe
+- **OmniGen:** [arxiv:2409.11340](https://arxiv.org/abs/2409.11340) — Unified image generation (alternative)
+- **PosterCraft:** [arxiv:2506.10741](https://arxiv.org/abs/2506.10741) — Poster dataset and generation
+## ⚖️ License
+MIT (code) + [DeepSeek Model License](https://github.com/deepseek-ai/Janus/blob/main/LICENSE) (model weights)