---
base_model:
- deepseek-ai/Janus-Pro-7B
datasets:
- PosterCraft/Poster100K
- FreedomIntelligence/ShareGPT-4o-Image
language:
- en
library_name: transformers
license: mit
pipeline_tag: any-to-any
tags:
- text-to-image
- image-to-image
- text-and-image-to-image
- multimodal
- unified-model
- thumbnail-generation
- vlm
---

# 🎨 Thumbnail VLM: Janus-Pro-7B for Thumbnail Generation

A **Vision-Language Model** fine-tuned for professional thumbnail generation. Accepts flexible multimodal inputs (text, image, or both) and always outputs a thumbnail image.

## 🎯 Capabilities

| Input Mode | Description | Example |
|---|---|---|
| **Text → Thumbnail** | Generate thumbnail from text description | `"Epic gaming video about Minecraft"` → 🖼️ |
| **Image → Thumbnail** | Generate thumbnail from reference image | 📷 → 🖼️ |
| **Text + Image → Thumbnail** | Generate thumbnail from both | `"Make a cooking thumbnail"` + 📷 → 🖼️ |

## 🏗️ Architecture

```
┌───────────────────────────────────────────────────┐
│            Janus-Pro-7B Architecture              │
├───────────────────────────────────────────────────┤
│                                                   │
│  Input Text  ──→ Tokenizer ──→ ┐                  │
│                                ├──→ DeepSeek-LLM  │
│  Input Image ──→ SigLIP    ──→ ┘    (7B, 30 layers│
│                                     4096-dim)     │
│                                                   │
│  DeepSeek-LLM ──→ gen_head ──→ VQ Logits          │
│                   (4096→16384)                    │
│                                                   │
│  VQ Tokens ──→ VQ-16 Decoder ──→ Output Image     │
│                (16384 codebook,  (384×384)        │
│                 576 tokens/img)                   │
└───────────────────────────────────────────────────┘
```

- **Base Model:** [deepseek-ai/Janus-Pro-7B](https://huggingface.co/deepseek-ai/Janus-Pro-7B) (7.4B params)
- **Understanding Encoder:** SigLIP-Large (384×384,
576 tokens)
- **Generation Tokenizer:** VQ-16 (codebook=16384, 576 discrete tokens per image)
- **Training Method:** Full SFT following [Janus-4o recipe](https://arxiv.org/abs/2506.18095)

## 📊 Training Recipe

| Parameter | Value | Source |
|---|---|---|
| Base model | `deepseek-ai/Janus-Pro-7B` | Janus-4o paper |
| Learning Rate | 5e-6 | Janus-4o §3.3 |
| Epochs | 3 | Janus-4o §3.3 |
| Effective Batch Size | 16 (1×16 grad accum) | Adapted from paper's 128 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) | Janus-4o |
| CFG Prompt Masking | 10% | Janus-4o §3.1 |
| Precision | bfloat16 | Model default |
| Image Resolution | 384×384 | Architecture constraint |
| Frozen | SigLIP + VQ Tokenizer | Efficiency |
| Trainable | LLM + gen_head + aligners | ~6.5B params |

### Training Data

| Dataset | Samples | Type |
|---|---|---|
| [PosterCraft/Poster100K](https://huggingface.co/datasets/PosterCraft/Poster100K) | 8,000 | Movie/TV posters (T2I) |
| Synthetic thumbnail prompts | 2,000 | YouTube-style prompts (T2I) |
| **Total** | **~10,000** | |

## 🚀 Quick Start

### Installation

```bash
# Install Janus library
git clone https://github.com/deepseek-ai/Janus.git
cd Janus && pip install -e .
# Install other dependencies
pip install torch transformers Pillow numpy
```

### Text → Thumbnail

```python
import torch
import numpy as np
import PIL.Image
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor

model_path = "asats/thumbnail-vlm-janus-pro"
processor = VLChatProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()

# Generate thumbnail
prompt = "Professional tech review thumbnail: iPhone 16 with dramatic lighting, text 'BEST PHONE 2025'"
conversation = [
    {"role": "<|User|>", "content": prompt},
    {"role": "<|Assistant|>", "content": ""},
]
sft_format = processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation, sft_format=processor.sft_format, system_prompt=""
)
# The image start tag switches the model into image-token generation
prompt_text = sft_format + processor.image_start_tag

with torch.inference_mode():
    input_ids = torch.LongTensor(processor.tokenizer.encode(prompt_text))
    # Two rows for classifier-free guidance (CFG)
    tokens = torch.zeros((2, len(input_ids)), dtype=torch.int).cuda()
    tokens[0] = input_ids  # conditional
    tokens[1] = input_ids
    tokens[1, 1:-1] = processor.pad_id  # unconditional
    inputs_embeds = model.language_model.get_input_embeddings()(tokens)

    # Sample 576 VQ tokens (24×24 grid) one at a time, reusing the KV cache
    generated = torch.zeros((1, 576), dtype=torch.int).cuda()
    past_kv = None
    for t in range(576):
        outputs = model.language_model.model(
            inputs_embeds=inputs_embeds, use_cache=True, past_key_values=past_kv
        )
        past_kv = outputs.past_key_values
        logits = model.gen_head(outputs.last_hidden_state[:, -1, :])
        # CFG with guidance scale 5.0: uncond + scale * (cond - uncond)
        guided = logits[1:2] + 5.0 * (logits[0:1] - logits[1:2])
        next_tok = torch.multinomial(torch.softmax(guided, -1), 1)
        generated[:, t] = next_tok.squeeze(-1)
        # Feed the sampled token back to both the conditional and unconditional rows
        img_emb = model.prepare_gen_img_embeds(torch.cat([next_tok, next_tok], 0).squeeze(-1))
        inputs_embeds = img_emb.unsqueeze(1)

    # Decode the token grid to pixels and map from [-1, 1] to [0, 255]
    dec = model.gen_vision_model.decode_code(generated, shape=[1, 8, 24, 24])
    img = np.clip((dec.float().cpu().numpy().transpose(0, 2, 3, 1) + 1) / 2 * 255, 0,
                  255).astype(np.uint8)
PIL.Image.fromarray(img[0]).save("thumbnail.png")
```

### Image → Thumbnail

```bash
# Uses the model's image understanding to caption the input, then generates from that caption
python scripts/inference_janus.py --mode image --input_image photo.jpg
```

### Text + Image → Thumbnail

```bash
# Uses both text instruction and reference image
python scripts/inference_janus.py --mode both \
    --prompt "Create a cooking video thumbnail with text 'EASY RECIPE'" \
    --input_image food_photo.jpg
```

## 🔧 Training from Scratch

### Option 1: HuggingFace Jobs (Recommended)

```python
# Launch via HF Jobs API
from huggingface_hub import HfApi

api = HfApi()
# Requires: a100-large hardware, 8h timeout
# Dependencies: torch, transformers, datasets, Pillow, numpy, tqdm,
#   trackio, accelerate, janus @ git+https://github.com/deepseek-ai/Janus.git
```

### Option 2: Local Training

```bash
# Install the Janus library, then return to this repository's root
git clone https://github.com/deepseek-ai/Janus.git && cd Janus && pip install -e . && cd ..
pip install torch transformers datasets Pillow numpy tqdm trackio accelerate

# Run training (needs ~40GB VRAM, A100 recommended)
python scripts/run_training.py
```

### Option 3: OmniGen LoRA Alternative (Lower VRAM)

For a lighter approach using OmniGen-v1 (3.8B params, LoRA fine-tuning on a single 24GB GPU):

```bash
pip install OmniGen accelerate peft
accelerate launch scripts/train_omnigen.py \
    --model_name_or_path Shitao/OmniGen-v1 \
    --json_file train.jsonl \
    --image_path ./images \
    --use_lora --lora_rank 8 \
    --lr 1e-3 --epochs 3
```

## 📁 Repository Structure

```
├── README.md                  # This file
├── scripts/
│   ├── run_training.py        # End-to-end training pipeline (data prep + train + eval)
│   ├── inference_janus.py     # Inference for all 3 input modes
│   ├── train_janus.py         # Modular Janus training script
│   ├── train_omnigen.py       # Alternative OmniGen LoRA training
│   └── prepare_data.py        # Data preparation utilities
```

## 📈 Training Data Sources

| Dataset | Size | Content | Format |
|---|---|---|---|
| [PosterCraft/Poster100K](https://hf.co/datasets/PosterCraft/Poster100K) | 93K | Movie/TV posters | image + rich caption |
| [ShareGPT-4o-Image](https://hf.co/datasets/FreedomIntelligence/ShareGPT-4o-Image) | 91K | GPT-4o synthetic pairs | prompt + image |
| [CSU-JPG/TextAtlas5M](https://hf.co/datasets/CSU-JPG/TextAtlas5M) | 5M+ | Text-in-image data | image + annotation |
| [fantasyfish/laion-art](https://hf.co/datasets/fantasyfish/laion-art) | 20K | High-aesthetic images | image + text |

## 📚 References

- **Janus-Pro:** [arXiv:2501.17811](https://arxiv.org/abs/2501.17811), unified understanding and generation
- **Janus-4o:** [arXiv:2506.18095](https://arxiv.org/abs/2506.18095), ShareGPT-4o-Image fine-tuning recipe
- **OmniGen:** [arXiv:2409.11340](https://arxiv.org/abs/2409.11340), unified image generation (alternative)
- **PosterCraft:** [arXiv:2506.10741](https://arxiv.org/abs/2506.10741), poster dataset and generation

## ⚖️ License

MIT (code) + [DeepSeek Model License](https://github.com/deepseek-ai/Janus/blob/main/LICENSE) (model weights)
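
## 🧮 Appendix: Guidance Arithmetic Sketch

The Quick Start loop combines conditional and unconditional logits with a guidance scale of 5.0 and samples 576 tokens per image. The following is a minimal, dependency-free sketch of that arithmetic on a toy three-entry "codebook", not the repository's implementation. The helper names (`tokens_per_image`, `cfg_combine`, `softmax`) are hypothetical, and the patch size of 16 is an assumption inferred from the 24×24 token grid used for 384×384 images.

```python
import math

# Values taken from this README; PATCH_SIZE is an assumption (384 / 24 = 16).
IMG_SIZE = 384
PATCH_SIZE = 16
CFG_WEIGHT = 5.0  # guidance scale used in the Quick Start loop


def tokens_per_image(img_size=IMG_SIZE, patch_size=PATCH_SIZE):
    """Number of discrete VQ tokens for a square image."""
    side = img_size // patch_size
    return side * side


def cfg_combine(cond, uncond, weight=CFG_WEIGHT):
    """Classifier-free guidance per logit: uncond + weight * (cond - uncond)."""
    return [u + weight * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy logits; the real gen_head emits 16384 logits per step.
    cond = [2.0, 1.0, 0.0]
    uncond = [1.0, 1.0, 1.0]
    guided = cfg_combine(cond, uncond)
    print(tokens_per_image())  # 576
    print(guided)              # [6.0, 1.0, -4.0]
    print(softmax(guided))     # sampling distribution after guidance
```

Guidance widens the gap between entries the prompt favours and those it does not: a one-point logit difference becomes five points before the softmax, so sampling concentrates on prompt-consistent tokens.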