Wuji Writing VAM TI2V-5B 30L 0522 0819

This repository contains the latest saved checkpoint from the run:

wuji_writing_vam_ti2v5b_30L_0522_0819

It is a joint VAM checkpoint for human-reference-conditioned robot brush-writing. The checkpoint contains both fine-tuned Wan2.2-TI2V-5B video DiT weights and action_dit.* action-stream weights.
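
A quick way to check that both weight groups are present is to list the tensor names and split them by prefix. The sketch below is not part of this repository: `split_keys_by_prefix` and `list_checkpoint_keys` are hypothetical helpers. It assumes the `safetensors` package is installed and that every action-stream tensor name starts with `action_dit.`, as described above.

```python
from collections import Counter


def split_keys_by_prefix(keys, action_prefix="action_dit."):
    """Count tensor names in the action stream vs. everything else."""
    counts = Counter()
    for key in keys:
        group = "action" if key.startswith(action_prefix) else "other"
        counts[group] += 1
    return dict(counts)


def list_checkpoint_keys(path):
    """Read tensor names without materializing the tensors."""
    # Imported lazily so split_keys_by_prefix works without safetensors.
    from safetensors import safe_open

    with safe_open(path, framework="pt", device="cpu") as f:
        return list(f.keys())
```

Calling `split_keys_by_prefix(list_checkpoint_keys("step-12500.safetensors"))` should report a non-zero count in both groups if the file matches the description.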

Given a human egocentric writing reference video, a current robot 3-view anchor frame, a text prompt, and the current 54-D robot proprioceptive state, the model predicts:

  • a short robot 3-view video rollout, and
  • a sequence of 54-D robot action targets.

Files

step-12500.safetensors     latest checkpoint from the run
model_config.json          architecture and preprocessing contract
training_config.yaml       training configuration snapshot
action_stats.npy           action normalization stats from teleop actions
training_log_node0.txt     training log snapshot
training_log_node1.txt     training log snapshot
README.md                  this model card

Checkpoint SHA256:

5c929ca1c0870f2cf402a56bcf31b40789ee78bf08436e03dfcb941190f2b09d

Checkpoint size:

11541393789 bytes
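
The hash and size above can be checked after download. This is a minimal sketch using only the Python standard library; `verify_checkpoint` is a hypothetical helper name, not a function from this codebase.

```python
import hashlib
import os

EXPECTED_SHA256 = "5c929ca1c0870f2cf402a56bcf31b40789ee78bf08436e03dfcb941190f2b09d"
EXPECTED_SIZE = 11541393789  # bytes


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so an 11 GB file never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_sha256=EXPECTED_SHA256,
                      expected_size=EXPECTED_SIZE):
    """Return True only if both the size and the SHA256 digest match."""
    # The size check is cheap and catches truncated downloads early.
    if os.path.getsize(path) != expected_size:
        return False
    return sha256_of_file(path) == expected_sha256
```

For example, `verify_checkpoint("step-12500.safetensors")` returns False for a truncated or corrupted download.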

Training Setup

run directory:       src/vam/models/train/wuji_writing_vam_ti2v5b_30L
checkpoint:          step-12500.safetensors
wandb run name:      wuji_writing_vam_ti2v5b_30L_0522_0819
wandb run:           https://wandb.ai/wuji-tech/wuji_writing/runs/a9ejrkhi
backbone:            Wan2.2-TI2V-5B
trainable:           video DiT + ActionMoT
mask variant:        v2
full ref:            true
bridge exclude ref:  true
action_dim:          54
proprio_dim:         54
resolution:          384 x 320
raw frames:          33
target frames:       9
action horizon:      32
reference frames:    69
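
The frame counts above are consistent with a causal video VAE that keeps the first frame and compresses the remaining frames 4x in time, which maps 33 raw frames to 9 temporal positions. The sketch below works through that arithmetic. Two things are assumptions, not statements from this card: that "target frames" counts latent frames, and that the temporal stride is 4.

```python
def latent_frames(raw_frames, temporal_stride=4):
    """Temporal latent length for a VAE that encodes frame 0 on its own
    and then one latent per `temporal_stride` raw frames (assumed layout)."""
    if raw_frames < 1:
        raise ValueError("need at least one frame")
    return (raw_frames - 1) // temporal_stride + 1


# 33 raw frames -> 9, matching the "target frames" entry above.
print(latent_frames(33))
# 69 reference frames -> 18 under the same assumption.
print(latent_frames(69))
```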

The Wan2.2-TI2V-5B base model is not included. Obtain it separately and place it under:

models/Wan-AI/Wan2.2-TI2V-5B/

with the Wan2.2 TI2V diffusion shards, text encoder, VAE, and tokenizer files expected by this codebase.

Data Contract

The target robot video is a 3-view grid:

top:    observation.images.stereo_left
bottom: observation.images.cam_left_wrist | observation.images.cam_right_wrist

The human reference video is single-view egocentric:

observation.images.head

Both are resized to 384 x 320 with resize_mode=stretch. The reference video is temporally subsampled by //10 (one frame kept per 10 source frames), then uniformly clipped or gray-padded to a fixed length of 69 frames.
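
The reference-video length handling can be sketched as follows. This is an illustration, not the repository's preprocessing code: `prepare_reference` is a hypothetical name, "uniformly clipped" is interpreted as picking evenly spaced frames, padding is assumed to be appended at the end, and the gray value of 128 is an assumption.

```python
import numpy as np


def prepare_reference(frames, stride=10, num_frames=69, gray_value=128):
    """frames: uint8 array of shape (T, H, W, C). Returns (num_frames, H, W, C)."""
    frames = frames[::stride]  # keep every `stride`-th frame
    t = frames.shape[0]
    if t >= num_frames:
        # Too long: keep evenly spaced frames (assumed meaning of "uniformly clipped").
        idx = np.linspace(0, t - 1, num_frames).round().astype(int)
        return frames[idx]
    # Too short: append constant gray frames (assumed padding position and value).
    pad_shape = (num_frames - t,) + frames.shape[1:]
    pad = np.full(pad_shape, gray_value, dtype=frames.dtype)
    return np.concatenate([frames, pad], axis=0)
```

Either branch returns exactly 69 frames, so downstream code can rely on a fixed reference length.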

The target and reference are paired by task_index digit using deterministic one-to-one rank matching within each digit. They are not frame-aligned or geometrically aligned.
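
One possible reading of the pairing rule is sketched below: within each digit, sort both sides and pair the i-th target with the i-th reference. The sort key (episode id) and the handling of surplus episodes (left unpaired) are assumptions; `pair_by_digit_rank` is a hypothetical helper, not the repository's implementation.

```python
from collections import defaultdict


def pair_by_digit_rank(targets, references):
    """targets, references: iterables of (episode_id, digit).
    Returns {target_episode_id: reference_episode_id}."""

    def group(episodes):
        by_digit = defaultdict(list)
        for episode_id, digit in episodes:
            by_digit[digit].append(episode_id)
        # Sorting makes the rank, and therefore the pairing, deterministic.
        return {d: sorted(ids) for d, ids in by_digit.items()}

    tgt, ref = group(targets), group(references)
    pairs = {}
    for digit in sorted(tgt):
        # zip stops at the shorter side, so surplus episodes stay unpaired.
        for t_id, r_id in zip(tgt[digit], ref.get(digit, [])):
            pairs[t_id] = r_id
    return pairs
```

Because the match depends only on digit and rank, the paired clips share a task label but nothing guarantees that their timing or geometry line up.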

Dataset

Training used the local wuji-writing bundle:

teleop episodes: 487
ego_ref episodes: 896
train episodes: 439
val episodes:   48
digits:         0-9

The included action_stats.npy was computed over 438,986 teleop action rows and contains:

{
    "mean": float32[54],
    "std": float32[54],
    "count": 438986,
}
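
A minimal sketch of loading these stats and applying them is shown below. It assumes the file is a Python dict saved with `np.save` (hence `allow_pickle=True`, which should only be used on files you trust) and that normalization is a per-dimension z-score; the `eps` term and the helper names are illustrative assumptions.

```python
import numpy as np


def load_action_stats(path):
    # A dict passed to np.save is stored as a pickled 0-d object array,
    # so .item() recovers the original dict.
    stats = np.load(path, allow_pickle=True).item()
    mean = np.asarray(stats["mean"], dtype=np.float32)
    std = np.asarray(stats["std"], dtype=np.float32)
    return mean, std


def normalize(actions, mean, std, eps=1e-6):
    """actions: (..., 54) raw action targets -> z-scored actions."""
    return (actions - mean) / (std + eps)


def denormalize(normed, mean, std, eps=1e-6):
    """Inverse of normalize, mapping model outputs back to robot units."""
    return normed * (std + eps) + mean
```

Model outputs would be passed through `denormalize` before being sent to the robot, provided training used the same convention.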

Validation Snapshot

Last validation lines logged for this checkpoint:

[val step 12500] 20/39326 samples: loss=0.278802, loss_video=0.146057, loss_action=0.132745
[val step 12500] task=write_zero action_MSE=3.898803 action_MAE=0.801526 video_MSE=2325.59 PSNR=16.28 SSIM=0.5810 LPIPS=0.2252

These numbers are a training-time validation snapshot, not a full deployment evaluation.
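
In the first logged line, loss equals loss_video + loss_action to the printed precision. The sketch below parses the `key=value` pairs and checks that sum. `parse_metrics` is a hypothetical helper, and the log format is inferred only from the two lines shown above; whether the training code uses an unweighted sum in general is not stated here.

```python
import re

# Matches key=number pairs such as loss=0.278802 or PSNR=16.28.
_PAIR = re.compile(r"(\w+)=([-+]?\d+(?:\.\d+)?)")


def parse_metrics(line):
    """Extract numeric key=value pairs from one validation log line."""
    return {key: float(value) for key, value in _PAIR.findall(line)}


line = ("[val step 12500] 20/39326 samples: loss=0.278802, "
        "loss_video=0.146057, loss_action=0.132745")
metrics = parse_metrics(line)

# The total matches the sum of its two components in this snapshot.
assert abs(metrics["loss"] - (metrics["loss_video"] + metrics["loss_action"])) < 1e-6
```

Non-numeric fields such as `task=write_zero` are skipped by the pattern.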

Important Notes

This checkpoint is not a standalone Wan model. It must be loaded with the matching VAM/Wan2.2-TI2V code path in this repository and the separate Wan2.2-TI2V-5B base weights.

The robot target is a 3-view grid while the human reference is a single egocentric view. This is intentional in the current writing setup: the reference is used as high-level task/trajectory conditioning, not as a frame-level aligned video-editing source.
