---
base_model:
- deepseek-ai/Janus-Pro-7B
datasets:
- PosterCraft/Poster100K
- FreedomIntelligence/ShareGPT-4o-Image
language:
- en
library_name: transformers
license: mit
pipeline_tag: any-to-any
tags:
- text-to-image
- image-to-image
- text-and-image-to-image
- multimodal
- unified-model
- thumbnail-generation
- vlm
---

# 🎨 Thumbnail VLM — Janus-Pro-7B for Thumbnail Generation

A **Vision-Language Model** fine-tuned from Janus-Pro-7B for professional thumbnail generation. It accepts text, an image, or both as input, and always outputs a 384×384 thumbnail image.

## 🎯 Capabilities

| Input Mode | Description | Example |
|---|---|---|
| **Text → Thumbnail** | Generate thumbnail from text description | `"Epic gaming video about Minecraft"` → 🖼️ |
| **Image → Thumbnail** | Generate thumbnail from reference image | 📷 → 🖼️ |
| **Text + Image → Thumbnail** | Generate thumbnail from both | `"Make a cooking thumbnail"` + 📷 → 🖼️ |
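
The three modes differ only in which inputs are present. A minimal sketch of that dispatch, assuming a hypothetical `resolve_mode` helper (not part of this repository) whose return values mirror the `--mode` values used with `scripts/inference_janus.py` further down; `"text"` is an assumed name for the text-only mode:

```python
from typing import Optional


def resolve_mode(prompt: Optional[str] = None, image_path: Optional[str] = None) -> str:
    """Pick an input mode from whichever inputs were supplied (hypothetical helper)."""
    has_text = bool(prompt and prompt.strip())
    has_image = bool(image_path)
    if has_text and has_image:
        return "both"   # Text + Image → Thumbnail
    if has_image:
        return "image"  # Image → Thumbnail
    if has_text:
        return "text"   # Text → Thumbnail (assumed mode name)
    raise ValueError("Provide a text prompt, a reference image, or both.")


print(resolve_mode(prompt="Epic gaming video about Minecraft"))
print(resolve_mode(image_path="photo.jpg"))
print(resolve_mode(prompt="Make a cooking thumbnail", image_path="photo.jpg"))
```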

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────┐
│              Janus-Pro-7B Architecture           │
├─────────────────────────────────────────────────┤
│                                                  │
│  Input Text ──→ Tokenizer ──→ ┐                 │
│                                ├──→ DeepSeek-LLM │
│  Input Image ──→ SigLIP ──→  ┘    (7B, 30 layers│
│                                     4096-dim)    │
│                                                  │
│  DeepSeek-LLM ──→ gen_head ──→ VQ Logits        │
│                    (4096→16384)                   │
│                                                  │
│  VQ Tokens ──→ VQ-16 Decoder ──→ Output Image   │
│                (16384 codebook,   (384×384)      │
│                 576 tokens/img)                   │
└─────────────────────────────────────────────────┘
```

- **Base Model:** [deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B) (7.4B params)
- **Understanding Encoder:** SigLIP-Large (384×384, 576 tokens)
- **Generation Tokenizer:** VQ-16 (codebook=16384, 576 discrete tokens per image)
- **Training Method:** Full SFT following [Janus-4o recipe](https://arxiv.org/abs/2506.18095)
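
The token counts above follow from the image and patch sizes. A quick arithmetic check, assuming a 16-pixel patch size for SigLIP and a 16× downsampling factor for VQ-16 (the constant and function names are illustrative, not part of the Janus API):

```python
# Illustrative arithmetic only; the constants are assumptions taken from the list above.
IMAGE_SIZE = 384        # input and output resolution in pixels
PATCH_SIZE = 16         # assumed SigLIP patch size and VQ-16 downsampling factor
CODEBOOK_SIZE = 16384   # number of discrete VQ codes, also the gen_head output width


def tokens_per_image(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Number of tokens in the square grid that covers one image."""
    side = image_size // patch_size
    return side * side


grid_side = IMAGE_SIZE // PATCH_SIZE
print(grid_side, tokens_per_image())  # 24 576
```

This is why the generation loop in the Quick Start samples 576 tokens and decodes them as a 24×24 grid.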

## 📊 Training Recipe

| Parameter | Value | Source / Notes |
|---|---|---|
| Base model | `deepseek-ai/Janus-Pro-7B` | Janus-4o paper |
| Learning Rate | 5e-6 | Janus-4o §3.3 |
| Epochs | 3 | Janus-4o §3.3 |
| Effective Batch Size | 16 (batch size 1 × 16 gradient accumulation steps) | Adapted from paper's 128 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) | Janus-4o |
| CFG Prompt Masking | 10% | Janus-4o §3.1 |
| Precision | bfloat16 | Model default |
| Image Resolution | 384×384 | Architecture constraint |
| Frozen | SigLIP + VQ Tokenizer | Efficiency |
| Trainable | LLM + gen_head + aligners | ~6.5B params |
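
CFG prompt masking exposes the model to an unconditional branch during training so that classifier-free guidance works at inference. A minimal sketch of the 10% masking, assuming the prompt is already tokenized and that masking mirrors the unconditional prompt used at inference (inner tokens replaced by the pad ID). `mask_prompt_for_cfg` is a hypothetical helper, not the code in `scripts/train_janus.py`:

```python
import random


def mask_prompt_for_cfg(prompt_ids, pad_id, drop_prob=0.1, rng=random):
    """With probability drop_prob, replace the inner prompt tokens with pad_id.

    The first and last tokens are kept, which matches how the unconditional
    row is built in the Quick Start inference code.
    """
    if len(prompt_ids) > 2 and rng.random() < drop_prob:
        return [prompt_ids[0]] + [pad_id] * (len(prompt_ids) - 2) + [prompt_ids[-1]]
    return list(prompt_ids)


rng = random.Random(0)
prompt = [1, 42, 43, 44, 2]  # hypothetical token IDs
n_trials = 10_000
n_masked = sum(mask_prompt_for_cfg(prompt, pad_id=0, rng=rng) != prompt for _ in range(n_trials))
print(n_masked / n_trials)  # fraction of masked prompts, close to 0.1
```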

### Training Data

| Dataset | Samples | Type |
|---|---|---|
| [PosterCraft/Poster100K](https://huggingface.co/datasets/PosterCraft/Poster100K) | 8,000 | Movie/TV posters (T2I) |
| Synthetic thumbnail prompts | 2,000 | YouTube-style prompts (T2I) |
| **Total** | **~10,000** | |

## 🚀 Quick Start

### Installation

```bash
# Install Janus library
git clone https://github.com/deepseek-ai/Janus.git
cd Janus && pip install -e .

# Install other dependencies
pip install torch transformers Pillow numpy
```

### Text → Thumbnail

```python
import torch
import numpy as np
import PIL.Image
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor

model_path = "asats/thumbnail-vlm-janus-pro"
processor = VLChatProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()

# Generate thumbnail
prompt = "Professional tech review thumbnail: iPhone 16 with dramatic lighting, text 'BEST PHONE 2025'"
conversation = [
    {"role": "<|User|>", "content": prompt},
    {"role": "<|Assistant|>", "content": ""},
]
sft_format = processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation, sft_format=processor.sft_format, system_prompt=""
)
prompt_text = sft_format + processor.image_start_tag

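cfg_weight = 5.0        # classifier-free guidance strength
num_image_tokens = 576  # 24×24 VQ tokens for one 384×384 image

with torch.inference_mode():
    input_ids = torch.LongTensor(processor.tokenizer.encode(prompt_text))
    # Row 0 holds the conditional prompt. Row 1 holds the unconditional prompt:
    # every token except the first and last is replaced by padding.
    tokens = torch.zeros((2, len(input_ids)), dtype=torch.int).cuda()
    tokens[0] = input_ids
    tokens[1] = input_ids
    tokens[1, 1:-1] = processor.pad_id

    inputs_embeds = model.language_model.get_input_embeddings()(tokens)
    generated = torch.zeros((1, num_image_tokens), dtype=torch.int).cuda()

    past_kv = None
    for t in range(num_image_tokens):
        outputs = model.language_model.model(
            inputs_embeds=inputs_embeds, use_cache=True, past_key_values=past_kv
        )
        past_kv = outputs.past_key_values
        logits = model.gen_head(outputs.last_hidden_state[:, -1, :])
        # Classifier-free guidance: extrapolate from the unconditional logits toward the conditional ones
        guided = logits[1:2] + cfg_weight * (logits[0:1] - logits[1:2])
        next_tok = torch.multinomial(torch.softmax(guided, dim=-1), num_samples=1)
        generated[:, t] = next_tok.squeeze(-1)
        # Feed the sampled token back into both the conditional and unconditional rows
        img_emb = model.prepare_gen_img_embeds(torch.cat([next_tok, next_tok], 0).squeeze(-1))
        inputs_embeds = img_emb.unsqueeze(1)

    # Decode the VQ tokens; shape = [batch, code channels, grid height, grid width]
    dec = model.gen_vision_model.decode_code(generated, shape=[1, 8, 24, 24])
    # Convert NCHW to NHWC and map pixel values from [-1, 1] to [0, 255]
    img = np.clip((dec.float().cpu().numpy().transpose(0, 2, 3, 1) + 1) / 2 * 255, 0, 255).astype(np.uint8)
    PIL.Image.fromarray(img[0]).save("thumbnail.png")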
```
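
The guidance step in the sampling loop is easier to see on small numbers. A dependency-free sketch of the same formula, `uncond + w * (cond - uncond)`, using the guidance weight of 5.0 from the loop; the three-entry logits are made up for illustration:

```python
import math


def guide(cond_logits, uncond_logits, cfg_weight=5.0):
    """Classifier-free guidance: move from the unconditional logits toward the conditional ones."""
    return [u + cfg_weight * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


cond = [2.0, 1.0, 0.0]    # toy logits for the conditional prompt
uncond = [1.0, 1.0, 1.0]  # toy logits for the padded (unconditional) prompt

guided = guide(cond, uncond)
print(guided)  # [6.0, 1.0, -4.0]
print(softmax(cond)[0] < softmax(guided)[0])  # True: guidance sharpens the preferred token
```

A weight of 1.0 returns the conditional logits unchanged; larger weights follow the prompt more strongly at the cost of sample diversity.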

### Image → Thumbnail

```bash
# Captions the reference image with the model's understanding branch, then generates a thumbnail from that caption
python scripts/inference_janus.py --mode image --input_image photo.jpg
```
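
The script itself is not reproduced here. As a rough sketch of what the captioning step could look like, the code below follows the multimodal-understanding example from the upstream Janus README. The helper names, the default question, and `max_new_tokens=128` are assumptions, not the behavior of `scripts/inference_janus.py`:

```python
def build_caption_conversation(image_path, question="Describe this image for a video thumbnail."):
    """Build a single-turn understanding conversation in the Janus chat format."""
    return [
        {
            "role": "<|User|>",
            "content": f"<image_placeholder>\n{question}",
            "images": [image_path],
        },
        {"role": "<|Assistant|>", "content": ""},
    ]


def caption_image(model, processor, image_path, max_new_tokens=128):
    """Caption an image with the understanding branch (assumed workflow, not the script's code)."""
    # Imported lazily so that the conversation helper works without these packages installed.
    import torch
    from janus.utils.io import load_pil_images  # helper from the upstream Janus repo

    conversation = build_caption_conversation(image_path)
    pil_images = load_pil_images(conversation)
    inputs = processor(
        conversations=conversation, images=pil_images, force_batchify=True
    ).to(model.device)
    tokenizer = processor.tokenizer
    with torch.inference_mode():
        # Run the SigLIP encoder and merge image features into the text embeddings
        inputs_embeds = model.prepare_inputs_embeds(**inputs)
        outputs = model.language_model.generate(
            inputs_embeds=inputs_embeds,
            attention_mask=inputs.attention_mask,
            pad_token_id=tokenizer.eos_token_id,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            use_cache=True,
        )
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
```

With `model` and `processor` loaded as in the Text → Thumbnail example, `caption_image(model, processor, "photo.jpg")` would return a caption that can be used as `prompt` there.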

### Text + Image → Thumbnail

```bash
# Uses both text instruction and reference image
python scripts/inference_janus.py --mode both \
    --prompt "Create a cooking video thumbnail with text 'EASY RECIPE'" \
    --input_image food_photo.jpg
```

## 🔧 Training from Scratch

### Option 1: HuggingFace Jobs (Recommended)

```python
# Launch via HF Jobs API
from huggingface_hub import HfApi
api = HfApi()

# Requires: a100-large hardware, 8h timeout
# Dependencies: torch, transformers, datasets, Pillow, numpy, tqdm, 
#               trackio, accelerate, janus @ git+https://github.com/deepseek-ai/Janus.git
```

### Option 2: Local Training

```bash
# Install the Janus library and the training dependencies
git clone https://github.com/deepseek-ai/Janus.git
pip install -e ./Janus
pip install torch transformers datasets Pillow numpy tqdm trackio accelerate

# Run training from the root of this repository (needs ~40GB VRAM, A100 recommended)
python scripts/run_training.py
```

### Option 3: Alternative — OmniGen LoRA (Lower VRAM)

For a lighter approach, use OmniGen-v1 (3.8B params) with LoRA fine-tuning on a single 24GB GPU:

```bash
pip install OmniGen accelerate peft
accelerate launch scripts/train_omnigen.py \
    --model_name_or_path Shitao/OmniGen-v1 \
    --json_file train.jsonl \
    --image_path ./images \
    --use_lora --lora_rank 8 \
    --lr 1e-3 --epochs 3
```
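
The `--json_file` argument expects one training record per line. A minimal sketch of building such a file, assuming the record layout described in OmniGen's fine-tuning documentation (`instruction`, `input_images`, `output_image`, with image paths relative to `--image_path`). The field names should be checked against the OmniGen version you install, and the file names and instructions below are placeholders:

```python
import json


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records) + "\n"


# Placeholder records; field names are an assumption based on OmniGen's fine-tuning docs.
records = [
    {
        "instruction": "Movie poster, dramatic lighting, bold title text",
        "input_images": [],                 # text-to-image sample: no reference images
        "output_image": "poster_0001.png",  # assumed to be relative to --image_path
    },
    {
        "instruction": "YouTube thumbnail for a cooking video with text 'EASY RECIPE'",
        "input_images": [],
        "output_image": "thumb_0001.png",
    },
]

jsonl_text = to_jsonl(records)
print(jsonl_text, end="")
```

Writing `jsonl_text` to `train.jsonl` gives the file passed to `--json_file` above.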

## 📁 Repository Structure

```
├── README.md                    # This file
├── scripts/
│   ├── run_training.py          # End-to-end training pipeline (data prep + train + eval)
│   ├── inference_janus.py       # Inference for all 3 input modes
│   ├── train_janus.py           # Modular Janus training script
│   ├── train_omnigen.py         # Alternative OmniGen LoRA training
│   └── prepare_data.py          # Data preparation utilities
```

## 📈 Training Data Sources

| Dataset | Size | Content | Format |
|---|---|---|---|
| [PosterCraft/Poster100K](https://hf.co/datasets/PosterCraft/Poster100K) | 93K | Movie/TV posters | image + rich caption |
| [ShareGPT-4o-Image](https://hf.co/datasets/FreedomIntelligence/ShareGPT-4o-Image) | 91K | GPT-4o synthetic pairs | prompt + image |
| [CSU-JPG/TextAtlas5M](https://hf.co/datasets/CSU-JPG/TextAtlas5M) | 5M+ | Text-in-image data | image + annotation |
| [fantasyfish/laion-art](https://hf.co/datasets/fantasyfish/laion-art) | 20K | High-aesthetic images | image + text |

## 📚 References

- **Janus-Pro:** [arxiv:2501.17811](https://arxiv.org/abs/2501.17811) — Unified understanding and generation
- **Janus-4o:** [arxiv:2506.18095](https://arxiv.org/abs/2506.18095) — ShareGPT-4o-Image fine-tuning recipe
- **OmniGen:** [arxiv:2409.11340](https://arxiv.org/abs/2409.11340) — Unified image generation (alternative)
- **PosterCraft:** [arxiv:2506.10741](https://arxiv.org/abs/2506.10741) — Poster dataset and generation

## ⚖️ License

MIT (code) + [DeepSeek Model License](https://github.com/deepseek-ai/Janus/blob/main/LICENSE-MODEL) (model weights)