Wuji Writing VAM TI2V-5B 30L 0522 0819
This repository contains the latest saved checkpoint from the run:
wuji_writing_vam_ti2v5b_30L_0522_0819
It is a joint VAM checkpoint for human-reference-conditioned robot brush-writing. The checkpoint contains both fine-tuned Wan2.2-TI2V-5B video DiT weights and action_dit.* action-stream weights.
Given a human egocentric writing reference video, a current robot 3-view anchor frame, a text prompt, and the current 54-D robot proprioceptive state, the model predicts:
- a short robot 3-view video rollout, and
- a sequence of 54-D robot action targets.
Files
step-12500.safetensors    latest checkpoint from the run
model_config.json         architecture and preprocessing contract
training_config.yaml      training configuration snapshot
action_stats.npy          action normalization stats from teleop actions
training_log_node0.txt    training log snapshot
training_log_node1.txt    training log snapshot
README.md                 this model card
Checkpoint SHA256:
5c929ca1c0870f2cf402a56bcf31b40789ee78bf08436e03dfcb941190f2b09d
Checkpoint size:
11541393789 bytes
Training Setup
run directory: src/vam/models/train/wuji_writing_vam_ti2v5b_30L
checkpoint: step-12500.safetensors
wandb run name: wuji_writing_vam_ti2v5b_30L_0522_0819
wandb run: https://wandb.ai/wuji-tech/wuji_writing/runs/a9ejrkhi
backbone: Wan2.2-TI2V-5B
trainable: video DiT + ActionMoT
mask variant: v2
full ref: true
bridge exclude ref: true
action_dim: 54
proprio_dim: 54
resolution: 384 x 320
raw frames: 33
target frames: 9
action horizon: 32
reference frames: 69
The Wan2.2-TI2V-5B base model is not included. Obtain it separately and place it under:
models/Wan-AI/Wan2.2-TI2V-5B/
with the Wan2.2 TI2V diffusion shards, text encoder, VAE, and tokenizer files expected by this codebase.
Data Contract
The target robot video is a 3-view grid:
top: observation.images.stereo_left
bottom: observation.images.cam_left_wrist | observation.images.cam_right_wrist
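The layout above can be sketched as pixel boxes inside the grid frame. This is only an illustration: it assumes both rows have equal height and that the bottom row is split evenly between the two wrist cameras, neither of which is stated here. The function name `grid_boxes` is hypothetical, and which of 384 and 320 is the width is also an assumption in the example call.

```python
def grid_boxes(width, height):
    """Return {camera_key: (left, top, right, bottom)} boxes for the 3-view grid.

    Assumption: rows have equal height and the bottom row is split evenly.
    """
    mid_y = height // 2
    mid_x = width // 2
    return {
        "observation.images.stereo_left": (0, 0, width, mid_y),
        "observation.images.cam_left_wrist": (0, mid_y, mid_x, height),
        "observation.images.cam_right_wrist": (mid_x, mid_y, width, height),
    }


# Example call, treating 384 as the width and 320 as the height (assumption).
boxes = grid_boxes(384, 320)
```

The authoritative layout is whatever model_config.json and the preprocessing code in this repository define.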
The human reference video is single-view egocentric:
observation.images.head
Both are resized to 384 x 320 with resize_mode=stretch, so the source aspect ratio is not preserved. The reference video is temporally subsampled by a factor of 10 (//10), then uniformly clipped or gray-padded to 69 frames.
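The reference-frame selection can be sketched at the level of frame indices. This is a sketch under stated assumptions, not the repository's implementation: it reads "//10" as keeping every 10th frame, reads "uniformly clipped" as picking evenly spaced frames, and appends padding at the end. The name `reference_frame_indices` is hypothetical.

```python
REF_FRAMES = 69  # "reference frames" in the training setup
STRIDE = 10      # assumed meaning of "//10"
PAD = None       # stands in for a gray padding frame


def reference_frame_indices(num_frames, stride=STRIDE, target=REF_FRAMES):
    """Return `target` source-frame indices; PAD marks a gray padding frame."""
    kept = list(range(0, num_frames, stride))
    if len(kept) >= target:
        # Evenly spaced picks that keep the first and last subsampled frame.
        last = len(kept) - 1
        return [kept[i * last // (target - 1)] for i in range(target)]
    # Too short: keep everything and pad at the end (assumed padding position).
    return kept + [PAD] * (target - len(kept))
```

A 1000-frame reference yields 100 subsampled frames, from which 69 are picked. A 300-frame reference yields 30 frames followed by 39 padding entries.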
The target and reference are paired by task_index digit using deterministic one-to-one rank matching within each digit. They are not frame-aligned or geometrically aligned.
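One way to picture the pairing rule is the sketch below. It assumes that rank means position after sorting episode ids within a digit, and that surplus episodes on either side stay unpaired. Both are assumptions, and `pair_by_digit_rank` is a hypothetical name rather than a function from this codebase.

```python
from collections import defaultdict


def pair_by_digit_rank(teleop, ego_ref):
    """Pair episodes one-to-one by rank within each digit.

    teleop, ego_ref: iterables of (episode_id, digit) tuples.
    Returns {teleop_episode_id: ego_ref_episode_id}.
    """
    def group(episodes):
        by_digit = defaultdict(list)
        for episode_id, digit in episodes:
            by_digit[digit].append(episode_id)
        # Sorting fixes the rank, which makes the pairing deterministic.
        return {d: sorted(ids) for d, ids in by_digit.items()}

    teleop_by_digit, ref_by_digit = group(teleop), group(ego_ref)
    pairs = {}
    for digit, teleop_ids in teleop_by_digit.items():
        ref_ids = ref_by_digit.get(digit, [])
        # zip stops at the shorter list, so surplus episodes stay unpaired (assumption).
        pairs.update(zip(teleop_ids, ref_ids))
    return pairs
```

The pairing only guarantees that both videos show the same digit. It carries no temporal or spatial correspondence between them.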
Dataset
Training used the local wuji-writing bundle:
teleop episodes: 487
ego_ref episodes: 896
train episodes: 439
val episodes: 48
digits: 0-9
The included action_stats.npy was computed over 438,986 teleop action rows and contains:
{
"mean": float32[54],
"std": float32[54],
"count": 438986,
}
Validation Snapshot
Last validation lines observed for this checkpoint:
[val step 12500] 20/39326 samples: loss=0.278802, loss_video=0.146057, loss_action=0.132745
[val step 12500] task=write_zero action_MSE=3.898803 action_MAE=0.801526 video_MSE=2325.59 PSNR=16.28 SSIM=0.5810 LPIPS=0.2252
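These numbers are a training-time validation snapshot, not a full deployment evaluation. The first line reports 20 of 39326 samples, and its loss is the sum of the two components (0.146057 + 0.132745 = 0.278802).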
Important Notes
This checkpoint is not a standalone Wan model. It must be loaded with the matching VAM/Wan2.2-TI2V code path in this repository and the separate Wan2.2-TI2V-5B base weights.
The robot target is a 3-view grid while the human reference is a single egocentric view. This is intentional in the current writing setup: the reference is used as high-level task/trajectory conditioning, not as a frame-level aligned video-editing source.