| --- |
| base_model: |
| - Lightricks/LTX-2.3 |
| base_model_relation: adapter |
| license: other |
| license_name: ltx-2-community-license |
| license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE |
| language: |
| - en |
| tags: |
| - ltx-video |
| - ic-lora |
| - ltx-2.3 |
| - video-to-video |
| - reference-sheet |
| - character-consistency |
| pipeline_tag: video-to-video |
| extra_gated_description: >- |
| By clicking "Agree and Access" you acknowledge the [Privacy |
| Policy](https://static.lightricks.com/legal/Privacy%20Policy%20-%20LTX%20Platform.pdf) |
| and consent to receive offers and updates. You can unsubscribe at any time. |
| extra_gated_button_content: Agree and Access |
| --- |
| |
| # LTX-2.3 22B IC-LoRA Reference Sheet Control |
|
|
| This is an **IC-LoRA** trained on top of **LTX-2.3-22B**. It conditions video generation on a *reference sheet* (a single composite image inventorying the characters, props, and location of a scene) so that generated videos keep those elements visually consistent. |
|
|
| It is based on the [LTX-2.3](https://huggingface.co/Lightricks/LTX-2.3) foundation model. |
|
|
| ## Model Files |
|
|
| `ltx-2.3-22b-ic-lora-ingredients-0.9.safetensors` |
|
|
| ## Model Details |
|
|
| - **Base Model:** LTX-2.3-22B (dev) |
| - **Training Type:** IC-LoRA (in-context LoRA) |
| - **Control Type:** Reference-sheet conditioning: character / prop / location identity carried into the generated video |
| - **Reference Downscale Factor:** 1 (the reference is provided at the same resolution as the output) |
| - **Pipeline details:** The reference sheet is supplied as a *static video* (the still sheet looped to the output's length and frame rate). The model is trained with a `video_to_video` strategy over reference latents; no extra color/space transforms are applied at inference. |
|
|
| ## Intended Use & Out-of-Scope |
|
|
| **Intended use:** Generating short video clips that stay faithful to a supplied reference sheet, keeping recurring characters (face and costume), handled props, and the set/location consistent with the sheet while following an action described in the prompt. |
|
|
| **Out of scope:** This is not a general text-to-video model; it expects a reference sheet as conditioning. It was trained at a single resolution / length bucket (768×448, 121 frames, 24 fps); other resolutions, much longer clips, or use without a reference sheet are out of distribution. It does not reproduce identities that are absent from the supplied sheet. |
|
|
| ## Control Signal Requirements |
|
|
| - **Control signal type:** Reference sheet: a single composite image with one clean panel per distinct visual element (each character as a face close-up + body turnaround, each prop as a product-style render, and one clean location panel), laid out on a black background with no text. |
| - **Expected input:** A static video built from the reference sheet, looped to match the output clip's length and frame rate, at the output resolution (downscale factor 1). |
| - **Preprocessing:** Author the reference sheet with the element-driven reference-sheet generator, then loop the still into a static video. Frame count must be ≥ 121 so the reference-encoding / 121-frame read bucket is satisfied; all targets in training were ≥ 121 frames. |
| - **Alignment:** The reference video should match the output resolution and frame rate; its frame count must be at least the output length and never fewer than 121. |
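The alignment rules above can be sketched as a small planning helper. This is a minimal illustration in plain Python, not part of the LTX-2 tooling: `ClipSpec`, `reference_spec_for`, `loop_still`, and `check_reference` are hypothetical names, and the still frame is treated as an opaque object, so encoding it to an actual video file is left to whatever tool you already use.

```python
from dataclasses import dataclass

# Minimum reference length stated in this card.
MIN_REFERENCE_FRAMES = 121


@dataclass(frozen=True)
class ClipSpec:
    """Hypothetical container for a clip's shape; not an LTX-2 type."""
    width: int
    height: int
    num_frames: int
    fps: float


def reference_spec_for(output: ClipSpec) -> ClipSpec:
    """Spec the static reference video should have for a given output clip.

    Same resolution and frame rate as the output (downscale factor 1),
    and at least as many frames as the output, never fewer than 121.
    """
    frames = max(output.num_frames, MIN_REFERENCE_FRAMES)
    return ClipSpec(output.width, output.height, frames, output.fps)


def loop_still(still_frame, num_frames: int) -> list:
    """Repeat one frame object num_frames times to form a static clip."""
    return [still_frame] * num_frames


def check_reference(reference: ClipSpec, output: ClipSpec) -> list[str]:
    """Return human-readable problems; an empty list means the specs align."""
    problems = []
    if (reference.width, reference.height) != (output.width, output.height):
        problems.append("resolution differs from output (downscale factor is 1)")
    if reference.fps != output.fps:
        problems.append("frame rate differs from output")
    if reference.num_frames < max(output.num_frames, MIN_REFERENCE_FRAMES):
        problems.append("reference has too few frames")
    return problems
```

For the trained bucket, `reference_spec_for(ClipSpec(768, 448, 121, 24))` returns an identical spec, and `check_reference` on that pair returns an empty list.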
|
|
| ## How It Works |
|
|
| The prompt is split into two labeled parts, matching how the model was trained: |
|
|
| ``` |
| Reference sheet: <description of the panels in the sheet: characters, props, location> |
| |
| Generated video: <description of the action / shot you want generated> |
| ``` |
|
|
| At inference the reference sheet (as a static video) supplies the *what things look like*, and the `Generated video:` portion of the prompt supplies the *what happens*. The model reads the reference latents in-context and renders a new clip whose characters, props, and setting match the sheet. |
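As a minimal sketch of assembling the two-part prompt described above, the helper below joins the two labeled sections with a blank line, as in the template. The function name `build_prompt` and its parameters are hypothetical and are not part of any LTX-2 API.

```python
def build_prompt(sheet_description: str, video_description: str) -> str:
    """Join the two labeled prompt parts in the order shown in the template.

    sheet_description: what the panels on the reference sheet contain.
    video_description: the action or shot to generate.
    """
    sheet = sheet_description.strip()
    video = video_description.strip()
    if not sheet or not video:
        raise ValueError("both prompt parts must be non-empty")
    # A blank line separates the two sections, matching the template.
    return f"Reference sheet: {sheet}\n\nGenerated video: {video}"


if __name__ == "__main__":
    print(build_prompt(
        "A woman in a red coat (face close-up and turnaround), "
        "a brass lantern, a stone courtyard.",
        "She lifts the lantern and walks across the courtyard.",
    ))
```

Keeping the labels fixed in one place avoids accidental drift from the training format when prompts are generated programmatically.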
|
|
| ## Usage |
|
|
| ### ComfyUI |
|
|
| 1. Copy the LoRA weights into `models/loras`. |
| 2. Load the **LTX-2.3-22B** base model and add the LoRA file listed under Model Files (`ltx-2.3-22b-ic-lora-ingredients-0.9.safetensors`) as the LoRA. |
| 3. Start at strength `1.0` and adjust to taste; the Recommended Settings section lists `1.4`. |
| 4. Use an IC-LoRA / reference workflow from the [LTX-2 ComfyUI repository](https://github.com/Lightricks/ComfyUI-LTXVideo/), which already wires the reference (control) input. Connect the reference-sheet static video as the control/reference input; a generic LoRA loader that ignores the reference path will not apply the conditioning. See the [IC-LoRA docs](https://docs.ltx.video/open-source-model/usage-guides/ic-lo-ra). |
|
|
| ## Recommended Settings |
|
|
| - **LoRA strength / weight:** 1.4 |
| - **Inference steps:** 30 |
| - **Guidance scale:** 4.0 |
| - **Resolution & frames:** 768×448, 121 frames, 24 fps (the trained bucket; best results here) |
| - **Prompting:** Use the two-part `Reference sheet: … / Generated video: …` structure above. The `Reference sheet:` text should describe the panels present; the `Generated video:` text drives the action. |
| - **Negative prompt (suggested):** `worst quality, inconsistent motion, blurry, jittery, distorted` |
| - **Spatiotemporal guidance:** Validation used STG (mode `stg_v`, block 29, scale 1.0), which can help motion stability. |
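For scripted runs, the values above can be collected in one place. The sketch below copies the numbers from this list into a plain dictionary; the key names and the `with_overrides` helper are hypothetical and do not correspond to any particular pipeline's argument names, so map them to whatever your workflow expects.

```python
# Values copied from the Recommended Settings list; key names are hypothetical.
RECOMMENDED = {
    "lora_strength": 1.4,
    "num_inference_steps": 30,
    "guidance_scale": 4.0,
    "width": 768,
    "height": 448,
    "num_frames": 121,
    "fps": 24,
    "negative_prompt": "worst quality, inconsistent motion, blurry, jittery, distorted",
    "stg": {"mode": "stg_v", "block": 29, "scale": 1.0},
}

# Settings that define the single bucket the LoRA was trained on.
TRAINED_BUCKET_KEYS = ("width", "height", "num_frames", "fps")


def with_overrides(**overrides):
    """Return (settings, off_bucket_keys) after applying overrides.

    off_bucket_keys lists the bucket settings that differ from the trained
    values, which this card describes as out of distribution.
    """
    unknown = set(overrides) - set(RECOMMENDED)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    settings = {**RECOMMENDED, **overrides}
    off_bucket = [k for k in TRAINED_BUCKET_KEYS if settings[k] != RECOMMENDED[k]]
    return settings, off_bucket
```

Calling `with_overrides(num_frames=241)` keeps every other value and reports `num_frames` as off-bucket, which is a cheap reminder that longer clips fall outside what the LoRA was trained on.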
|
|
| ## References |
|
|
| - **Code:** [GitHub Repository](https://github.com/Lightricks/LTX-2) |
| - **IC-LoRA docs:** [docs.ltx.video: IC-LoRA usage guide](https://docs.ltx.video/open-source-model/usage-guides/ic-lo-ra) |
|
|
| ## Tips & Troubleshooting |
|
|
| - **Bigger panels carry over better:** The more space an element takes up in the reference image, the more faithfully it carries over into the generated video. Give important characters/props larger, more prominent panels rather than small or crowded ones. |
| - **Identity drift:** If a character's face or costume drifts, make sure the reference sheet has a clean, front-facing close-up and full turnaround for that character, and that its panel isn't cluttered or text-laden. |
| - **Element not appearing:** The model only reproduces elements present on the sheet, so add a dedicated panel for any prop/character you need to persist, and describe it in the `Reference sheet:` portion of the prompt. |
| - **Reference too short:** The reference static video must be ≥ 121 frames; shorter references break the reference-encoding bucket. |
|
|
| ## Dataset |
|
|
| The model was trained using a proprietary dataset of video clips paired with generated reference sheets. |
|
|
| ## Training |
|
|
| - **Technique:** IC-LoRA (rank 128, alpha 128, dropout 0.0) on the DiT transformer: `attn1`/`attn2` q/k/v/out projections and the feed-forward layers. |
| - **Hyperparameters:** bf16 mixed precision, AdamW-8bit, gradient checkpointing, batch size 1, gradient accumulation 1, max grad norm 1.0, seed 42. Learning rate: 1.3e-4 with a linear scheduler for the first 6,000 steps, then a constant 1.3e-5 for the continuation to step 12,000. |
| - **Strategy:** `video_to_video` over reference latents, `first_frame_conditioning_p` 0.0, reference downscale factor 1. |
| - **Steps:** 12,000 (recommended checkpoint: step 12,000). |
| - **Infrastructure:** LTX-2 Community Trainer, 8× GPU DDP. |
|
|
| ## License |
|
|
| See the [LTX-2 Community License](https://github.com/Lightricks/LTX-2/blob/main/LICENSE) for full terms. |
|
|
| ## Acknowledgments |
|
|
| - Base model by **Lightricks** |
| - Training infrastructure: **LTX-2 Community Trainer** |
|
|