---
license: apache-2.0
base_model:
  - zai-org/CogVideoX-2b
language:
  - en
tags:
  - video-to-video
  - subtitle-removal
  - lora
  - cogvideox
  - diffusers
pipeline_tag: text-to-video
---

# CogVideoX-2b CLEAR LoRA — Subtitle Removal (Supplementary)

This repository releases **LoRA + expanded input-projection weights** for **video-to-video subtitle removal** on top of **[zai-org/CogVideoX-2b](https://huggingface.co/zai-org/CogVideoX-2b)**.

> **Disclaimer:** This is a **supplementary** experiment from the CLEAR project. The main paper results use **Wan2.1-Control**; this CogVideoX-2b variant is **not** expected to match that baseline. It is shared for **reproducibility and comparison**.


## Architecture change (high level)

CogVideoX-2b is originally a **text-to-video** model. To condition it on the subtitled input video, the input channels of the patch-embedding convolution are doubled:

- **Before:** `patch_embed.proj`: Conv2d(16 → 1920, …)
- **After:** `patch_embed.proj`: Conv2d(32 → 1920, …)  
  - First 16 channels: noisy latent (inherits pretrained weights)  
  - Last 16 channels: subtitle-video latent (new channels, trained)

At inference, the **noisy latent** and the **subtitle latent** are concatenated along the channel dimension before entering the transformer, matching the training setup.
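
The channel expansion and concatenation described above can be sketched in PyTorch. This is a minimal illustration, not the project's training code: the helper `expand_conv_in_channels` is hypothetical, the kernel size and stride of the toy layer are placeholders, the `[batch, channels, height, width]` latent layout is assumed, and zero-initializing the new channels is an assumption (the card only says they are new and trained).

```python
import torch
import torch.nn as nn


def expand_conv_in_channels(conv: nn.Conv2d, new_in_channels: int) -> nn.Conv2d:
    """Return a copy of `conv` that accepts `new_in_channels` input channels.

    The original input channels keep their pretrained weights. The extra
    channels start at zero, which is an assumption of this sketch.
    """
    old_in = conv.in_channels
    if new_in_channels < old_in:
        raise ValueError("new_in_channels must be >= conv.in_channels")

    expanded = nn.Conv2d(
        new_in_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        expanded.weight.zero_()                      # new channels: zeros (assumed init)
        expanded.weight[:, :old_in] = conv.weight    # first channels: pretrained weights
        if conv.bias is not None:
            expanded.bias.copy_(conv.bias)
    return expanded


# Toy stand-in for `patch_embed.proj`; kernel size and stride are illustrative only.
torch.manual_seed(0)
proj = nn.Conv2d(16, 1920, kernel_size=2, stride=2)
proj_expanded = expand_conv_in_channels(proj, 32)

# Per-frame latents in an assumed [batch, channels, height, width] layout.
noisy_latent = torch.randn(1, 16, 8, 8)
subtitle_latent = torch.randn(1, 16, 8, 8)

# Concatenate along the channel axis: 16 + 16 = 32 input channels.
model_input = torch.cat([noisy_latent, subtitle_latent], dim=1)

with torch.no_grad():
    out_original = proj(noisy_latent)
    out_expanded = proj_expanded(model_input)
```

Under the zero-initialization assumption, `out_expanded` matches `out_original` up to floating-point error before any training, so fine-tuning would start from the pretrained behavior and learn to use the subtitle channels gradually. If the latents carry a separate frame axis, the channel axis passed to `torch.cat` changes accordingly.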

## Intended use

- **Research:** subtitle removal / video inpainting with diffusion.
- **Not** intended for high-stakes applications or for producing misleading content; users are responsible for complying with applicable law and platform policies.

## How to use

1. Download **CogVideoX-2b** from `zai-org/CogVideoX-2b`.
2. Save `cogvideox_2b_CLEAR_lora_checkpoint.pt` to a local path.
3. Run inference with the provided script (example):

```bash
export MODEL_PATH="/path/to/CogVideoX-2b"
export CHECKPOINT="/path/to/cogvideox_2b_CLEAR_lora_checkpoint.pt"

bash scripts/inference_cogvideox_2b.sh \
  --input_video /path/to/video_with_subtitles.mp4 \
  --output_dir ./output