metadata
base_model:
  - Lightricks/LTX-2.3
base_model_relation: adapter
license: other
license_name: ltx-2-community-license
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
language:
  - en
tags:
  - ltx-video
  - ic-lora
  - ltx-2.3
  - video-to-video
  - reference-sheet
  - character-consistency
pipeline_tag: video-to-video
extra_gated_description: >-
  By clicking "Agree and Access" you acknowledge the [Privacy
  Policy](https://static.lightricks.com/legal/Privacy%20Policy%20-%20LTX%20Platform.pdf) 
  and consent to receive offers and updates. You can unsubscribe at any time.
extra_gated_button_content: Agree and Access

LTX-2.3 22B IC-LoRA Reference Sheet Control

This is an IC-LoRA trained on top of LTX-2.3-22B. It conditions video generation on a reference sheet (a single composite image inventorying the characters, props, and location of a scene) so that generated videos keep those elements visually consistent.

It is based on the LTX-2.3 foundation model.

Model Files

ltx-2.3-22b-ic-lora-ingredients-0.9.safetensors

Model Details

  • Base Model: LTX-2.3-22B (dev)
  • Training Type: IC-LoRA (in-context LoRA)
  • Control Type: Reference-sheet conditioning (character / prop / location identity carried into the generated video)
  • Reference Downscale Factor: 1 (the reference is provided at the same resolution as the output)
  • Pipeline details: The reference sheet is supplied as a static video (the still sheet looped to the output's length and frame rate). The model is trained with a video_to_video strategy over reference latents; no extra color/space transforms are applied at inference.

Intended Use & Out-of-Scope

Intended use: Generating short video clips that stay faithful to a supplied reference sheet, keeping recurring characters (face and costume), handled props, and the set/location consistent with the sheet while following an action described in the prompt.

Out of scope: This is not a general text-to-video model; it expects a reference sheet as conditioning. It was trained at a single resolution / length bucket (768×448, 121 frames, 24 fps); other resolutions, much longer clips, or use without a reference sheet are out of distribution. It does not reproduce identities that are absent from the supplied sheet.

Control Signal Requirements

  • Control signal type: Reference sheet, i.e. a single composite image with one clean panel per distinct visual element (each character as a face close-up + body turnaround, each prop as a product-style render, and one clean location panel), laid out on a black background with no text.
  • Expected input: A static video built from the reference sheet, looped to match the output clip's length and frame rate, at the output resolution (downscale factor 1).
  • Preprocessing: Author the reference sheet with the element-driven reference-sheet generator, then loop the still into a static video. The frame count must be ≥ 121 so that the reference-encoding / 121-frame read bucket is satisfied; all targets in training were ≥ 121 frames.
  • Alignment: The reference video should match the output resolution and frame rate; its frame count must be at least the output length, and never fewer than 121 frames.
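The length and alignment rules above can be sketched in a few lines. This is a minimal sketch, not the official preprocessing: `reference_frame_count` and `ffmpeg_loop_command` are hypothetical helper names, the file paths are placeholders, and using ffmpeg with libx264 to loop the still is an assumption. Any tool that produces a static video at the output resolution and frame rate should serve.

```python
MIN_REFERENCE_FRAMES = 121  # minimum reference length stated in this card


def reference_frame_count(output_frames: int) -> int:
    """Reference length: at least the output length, never below 121."""
    if output_frames < 1:
        raise ValueError("output_frames must be positive")
    return max(output_frames, MIN_REFERENCE_FRAMES)


def ffmpeg_loop_command(sheet_png, out_mp4, output_frames=121,
                        width=768, height=448, fps=24):
    """Build (but do not run) an ffmpeg command that loops a still image.

    Assumption: ffmpeg with libx264 is available. The scale filter
    stretches the sheet to width x height, so author the sheet at the
    output aspect ratio to avoid distortion.
    """
    n = reference_frame_count(output_frames)
    return [
        "ffmpeg", "-y",
        "-loop", "1",              # repeat the single input image
        "-framerate", str(fps),    # input (and hence output) frame rate
        "-i", sheet_png,
        "-frames:v", str(n),       # stop after n video frames
        "-vf", f"scale={width}:{height}",
        "-pix_fmt", "yuv420p",
        "-c:v", "libx264",
        out_mp4,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. For the trained bucket, `ffmpeg_loop_command("sheet.png", "reference.mp4")` yields a 121-frame, 24 fps, 768×448 reference.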

How It Works

The prompt is split into two labeled parts, matching how the model was trained:

Reference sheet: <description of the panels in the sheet: characters, props, location>

Generated video: <description of the action / shot you want generated>

At inference, the reference sheet (as a static video) supplies what things look like, and the Generated video: portion of the prompt supplies what happens. The model reads the reference latents in-context and renders a new clip whose characters, props, and setting match the sheet.
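A small helper can keep the two labels consistent across prompts. This is a sketch under assumptions: `build_prompt` is a hypothetical name, and separating the two parts with a blank line mirrors the layout shown above rather than a documented requirement.

```python
def build_prompt(sheet_description: str, video_description: str) -> str:
    """Join the two labeled parts in the order used in this card."""
    sheet = sheet_description.strip()
    video = video_description.strip()
    if not sheet or not video:
        raise ValueError("both parts of the prompt are required")
    # Assumption: a blank line between the parts, as in the template above.
    return f"Reference sheet: {sheet}\n\nGenerated video: {video}"
```

For example, `build_prompt("a woman in a red coat, a brass lantern, a stone courtyard", "she lifts the lantern and walks across the courtyard")` describes the panels first and the action second; the descriptions here are made up for illustration.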

Usage

🔌 ComfyUI

  1. Copy the LoRA weights into models/loras.
  2. Load the LTX-2.3-22B base model and add the LoRA weights listed under Model Files (ltx-2.3-22b-ic-lora-ingredients-0.9.safetensors) as the LoRA.
  3. Start at strength 1.0 and adjust to taste.
  4. Use an IC-LoRA / reference workflow from the LTX-2 ComfyUI repository, which already wires the reference (control) input. Connect the reference-sheet static video as the control/reference input; a generic LoRA loader that ignores the reference path will not apply the conditioning. See the IC-LoRA docs.

Recommended Settings

  • LoRA strength / weight: 1.4
  • Inference steps: 30
  • Guidance scale: 4.0
  • Resolution & frames: 768×448, 121 frames, 24 fps (the trained bucket; best results here)
  • Prompting: Use the two-part Reference sheet: … / Generated video: … structure above. The Reference sheet: text should describe the panels present; the Generated video: text drives the action. Suggested negative prompt: worst quality, inconsistent motion, blurry, jittery, distorted. Validation used spatiotemporal guidance (STG, mode stg_v, block 29, scale 1.0), which can help motion stability.
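The settings listed above can be collected in one place and used to flag requests that leave the trained bucket. This is a sketch only: `RecommendedSettings` and `bucket_mismatches` are hypothetical names, not part of any LTX-2 API, and how the values are passed to a pipeline depends on the workflow in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecommendedSettings:
    """Values copied from the Recommended Settings list in this card."""
    lora_strength: float = 1.4
    inference_steps: int = 30
    guidance_scale: float = 4.0
    width: int = 768
    height: int = 448
    num_frames: int = 121
    fps: int = 24
    negative_prompt: str = (
        "worst quality, inconsistent motion, blurry, jittery, distorted"
    )


def bucket_mismatches(width, height, num_frames, fps,
                      ref=RecommendedSettings()):
    """Return the names of the fields that differ from the trained bucket."""
    requested = {"width": width, "height": height,
                 "num_frames": num_frames, "fps": fps}
    return [name for name, value in requested.items()
            if value != getattr(ref, name)]
```

An empty list means the request sits in the trained bucket; anything else names the dimensions that are out of distribution.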

Tips & Troubleshooting

  • Bigger panels carry over better: The more space an element takes up in the reference image, the more faithfully it carries over into the generated video. Give important characters/props larger, more prominent panels rather than small or crowded ones.
  • Identity drift: If a character's face or costume drifts, make sure the reference sheet has a clean, front-facing close-up and full turnaround for that character, and that its panel isn't cluttered or text-laden.
  • Element not appearing: The model only reproduces elements present on the sheet. Add a dedicated panel for any prop/character you need to persist, and describe it in the Reference sheet: portion of the prompt.
  • Reference too short: The reference static video must be ≥ 121 frames; shorter references break the reference-encoding bucket.

Dataset

The model was trained using a proprietary dataset of video clips paired with generated reference sheets.

Training

  • Technique: IC-LoRA (rank 128, alpha 128, dropout 0.0) on the DiT transformer: attn1/attn2 q/k/v/out projections and the feed-forward layers.
  • Hyperparameters: bf16 mixed precision, AdamW-8bit, gradient checkpointing, batch size 1, gradient accumulation 1, max grad norm 1.0, seed 42. Learning rate: 1.3e-4 (linear scheduler) for the first 6,000 steps, then a low constant 1.3e-5 for the continuation to 12,000.
  • Strategy: video_to_video over reference latents, first_frame_conditioning_p 0.0, reference downscale factor 1.
  • Steps: 12,000 (recommended checkpoint: step 12,000).
  • Infrastructure: LTX-2 Community Trainer, 8× GPU DDP.
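A few quantities follow from the numbers above by simple arithmetic. The sketch below rests on labeled assumptions: that the LoRA update is scaled by alpha / rank (the convention of the original LoRA formulation), that the stated batch size is per GPU, and that steps are counted from zero. The trainer's actual behavior may differ, and `lr_phase` is a hypothetical helper.

```python
rank, alpha = 128, 128
lora_scale = alpha / rank  # assumption: standard alpha / rank scaling

per_device_batch = 1       # assumption: "batch size 1" is per GPU
grad_accum = 1
num_gpus = 8
effective_batch = per_device_batch * grad_accum * num_gpus

total_steps = 12_000
samples_seen = total_steps * effective_batch


def lr_phase(step: int):
    """Return (starting or constant LR, scheduler) for a zero-based step."""
    if not 0 <= step < total_steps:
        raise ValueError("step outside the 12,000-step run")
    if step < 6_000:
        return (1.3e-4, "linear")   # first phase, linear scheduler
    return (1.3e-5, "constant")     # continuation phase


print(lora_scale, effective_batch, samples_seen)
```

Under these assumptions the LoRA scale is 1.0, the effective batch is 8 clips per optimizer step, and the run covers 96,000 clip presentations in total.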

License

See the LTX-2-community-license for full terms.

Acknowledgments

  • Base model by Lightricks
  • Training infrastructure: LTX-2 Community Trainer