---
license: apache-2.0
language:
- en
- zh
tags:
- motion-generation
- vision-language
- robotics
- qwen
- dual-stream
datasets:
- MotionVLA-Dataset
---

# MotionVLA

**MotionVLA** is an end-to-end vision-language-action model for humanoid motion generation. It combines a **Qwen3.5** autoregressive backbone (conditioned on a scene image and a text instruction) with **DSFT (Dual-Stream Frequency-domain Tokenizer)**, which decouples low-frequency pose semantics from high-frequency physical dynamics.

## Repository Contents

This HuggingFace repository contains:

| Path | Description |
|------|-------------|
| `tokenizer/` | DSFT tokenizer checkpoints |
| `tokenizer/base/` | Base stream BPE tokenizer (4096 vocab, 201-dim DCT) |
| `tokenizer/phys/` | Phys stream BPE tokenizer (4096 vocab, 75-dim DCT) |
| `dataset/` | Dataset index files (motion_path → relative paths) |

**Motion data files** (`.pt`) and **images** are stored in the companion dataset repo:
`[your-hf-username]/MotionVLA-Dataset`

## Tokenizer Design

The DSFT tokenizer decomposes 276-dim ViMoGen motion into two streams:

```
276-dim motion (T frames)
    ↓ split by dimension
Base (201-dim): body_pose_6d + joints + root_orient + root_trans   ← low-freq semantic
Phys (75-dim):  joints_vel + root_vel + root_trans_vel             ← high-freq dynamics
    ↓ DCT along time axis, keep top K coefficients
    ↓ BPE encoding
Base tokens: ~477/sequence  (K=5,  vocab=4096)
Phys tokens: ~40/sequence   (K=15, vocab=4096)
```

Each motion sample is laid out as a unified autoregressive sequence:

```
[ M_BOS, b_1, ..., b_N, M_SEP, p_1, ..., p_M, M_EOS ]
```

where `b_i` are Base tokens and `p_j` are Phys tokens. A phase-aware logit mask enforces the order `BASE → SEP → PHYS → EOS` at inference, so semantic pose structure is generated before high-frequency physical dynamics.

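The phase ordering described above can be illustrated with a small NumPy sketch. This is not the repository's implementation: `allowed_mask` and `apply_phase_mask` are hypothetical helpers, the ID ranges are taken from the Token Vocabulary table in this README, and the sketch assumes that the motion tokens sit at the end of the vocabulary, that only motion tokens are permitted while a motion sequence is being generated, and that each stream must contain at least one token before `M_SEP` or `M_EOS` may follow.

```python
import numpy as np

# ID ranges copied from the Token Vocabulary table (inclusive bounds).
BASE_START, BASE_END = 248320, 252415
PHYS_START, PHYS_END = 252416, 256511
M_BOS, M_SEP, M_EOS = 256512, 256513, 256514
# Assumption: motion tokens are the last IDs in the extended vocabulary.
VOCAB_SIZE = M_EOS + 1


def allowed_mask(generated):
    """Return a boolean mask of token IDs permitted as the next token.

    `generated` is the list of motion-token IDs emitted so far.
    Hypothetical helper; the real mask may differ in its details.
    """
    mask = np.zeros(VOCAB_SIZE, dtype=bool)
    if not generated:
        mask[M_BOS] = True                       # sequence must open with M_BOS
    elif generated[-1] == M_EOS:
        pass                                     # finished: nothing is allowed
    elif M_SEP in generated:
        mask[PHYS_START:PHYS_END + 1] = True     # PHYS phase
        if generated[-1] != M_SEP:
            mask[M_EOS] = True                   # EOS only after >= 1 Phys token
    else:
        mask[BASE_START:BASE_END + 1] = True     # BASE phase
        if generated[-1] != M_BOS:
            mask[M_SEP] = True                   # SEP only after >= 1 Base token
    return mask


def apply_phase_mask(logits, generated):
    """Set the logits of disallowed tokens to -inf before sampling."""
    return np.where(allowed_mask(generated), logits, -np.inf)


# Usage: mask random logits right after M_BOS and pick the best allowed token.
rng = np.random.default_rng(0)
next_id = int(np.argmax(apply_phase_mask(rng.normal(size=VOCAB_SIZE), [M_BOS])))
```

Because every disallowed logit becomes `-inf`, both greedy decoding and softmax sampling can only select tokens from the current phase, which is what keeps Base tokens ahead of Phys tokens in the generated sequence.
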
## Token Vocabulary

The Qwen3.5 backbone vocabulary is extended with motion tokens (used in the ms-swift training pipeline):

| Token type | ID range | Count |
|------------|----------|-------|
| Base motion tokens | 248320 – 252415 | 4096 |
| Phys motion tokens | 252416 – 256511 | 4096 |
| MOTION_BOS | 256512 | 1 |
| MOTION_SEP | 256513 | 1 |
| MOTION_EOS | 256514 | 1 |

## Usage

```python
from tokenizer.ds_fast_tokenizer import DSFTTokenizer
import numpy as np

# Load tokenizer
tok = DSFTTokenizer.load("tokenizer/checkpoints")

# Encode 276-dim motion
motion = np.load("motion.npy")  # shape: (T, 276)
result = tok.encode(motion)
# result["base_tokens"]: list of int (BPE IDs for base stream)
# result["phys_tokens"]: list of int (BPE IDs for phys stream)
# result["T"]: number of frames

# Decode back
base_recon, phys_recon = tok.decode(
    result["base_tokens"], result["phys_tokens"], result["T"])
# base_recon: (T, 201), phys_recon: (T, 75)
```

## Code

Training code and model architecture: [GitHub](https://github.com/AIGeeksGroup/MotionVLA)

## Citation

```bibtex
@article{motionvla2026,
  title={MotionVLA: Vision-Language-Action Model for Humanoid Motion},
  author={Zhang, Nonghai and Zhai, Siyu and Zhang, Zeyu and Tang, Hao},
  year={2026}
}
```