HY-Motion T2M 1.0
HY-Motion T2M 1.0 is Tencent Hunyuan's open-source text-to-motion model, integrated into the hftrainer
Model Zoo as a first-class HyMotionT2MBundle / HyMotionT2MPipeline.
The release ships two variants that share representation, text encoder, training recipe, and inference protocol, and differ only in MMDiT size.
| Property | Value |
|---|---|
| Task | Text-to-Motion (T2M) |
| Bundle / Pipeline | HyMotionT2MBundle / HyMotionT2MPipeline |
| Processed HF artifact | full, lite |
| Architecture | HunyuanMotionMMDiT flow matching, Euler ODE, 50 steps, CFG scale 5.0 |
| Native representation | 201-dim HY-Motion feature at 30 fps; hftrainer renders/scores from the motion_135 slice |
| Text encoder | Qwen3-8B token context + CLIP-L sentence embedding, frozen and stored in the hftrainer artifact |
| Original weights | tencent/HY-Motion-1.0, mirrored locally under checkpoints/HY-Motion-1.0/ |
Weights
Self-contained hftrainer artifacts are stored locally and reload with
HyMotionT2MBundle.from_pretrained. They include the motion transformer,
classifier-free null embeddings, 201-dim Mean/Std stats, and frozen Qwen3-8B /
CLIP-L text encoder directories. The artifact writes model_index.json
metadata for hftrainer discovery, but it is not a native diffusers
DiffusionPipeline repo.
| Variant | Local artifact | Processed Hugging Face artifact | Contents |
|---|---|---|---|
| HY-Motion T2M 1.0 | checkpoints/hymotion_t2m/1.0b | ZeyuLing/hftrainer-hymotion-t2m-1.0 | motion_transformer.safetensors, hymotion_t2m_config.json, model_index.json, Mean.npy, Std.npy, text_encoder/llm/, text_encoder/sentence/ |
| HY-Motion T2M 1.0-Lite | checkpoints/hymotion_t2m/0.46b | ZeyuLing/hftrainer-hymotion-t2m-1.0-lite | same layout |
Use a local artifact:
from hftrainer.pipelines.motion.hymotion_t2m_pipeline import HyMotionT2MPipeline
pipe = HyMotionT2MPipeline.from_pretrained(
    "checkpoints/hymotion_t2m/1.0b",
    device="cuda",
    text_dtype="bf16",
    num_steps=50,
    text_guidance_scale=5.0,
    should_apply_smoothing=True,
)
out = pipe({"caption": ["a person walks forward."], "num_frames": [196]})
rot6d = out["rot6d"]    # (B, T, 22, 6)
transl = out["transl"]  # (B, T, 3)
For the HF artifact:
from hftrainer.pipelines.motion.hymotion_t2m_pipeline import HyMotionT2MPipeline
pipe = HyMotionT2MPipeline.from_pretrained(
    "ZeyuLing/hftrainer-hymotion-t2m-1.0",
    device="cuda",
    text_dtype="bf16",
    num_steps=50,
    text_guidance_scale=5.0,
    should_apply_smoothing=True,
)
out = pipe({"caption": ["a person walks forward."], "num_frames": [196]})
rot6d = out["rot6d"]    # (B, T, 22, 6)
transl = out["transl"]  # (B, T, 3)
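The rot6d output holds six numbers per joint, and 22 joints × 6 plus the 3 translation values gives 135, which is consistent with the motion_135 name. To feed a renderer or a kinematics library, each 6-vector usually has to be expanded into a 3×3 rotation matrix. The sketch below shows the common Gram-Schmidt construction for the continuous 6D rotation representation on a single joint. It assumes the six values are two 3-vectors stored back to back and used as the first two matrix columns; the layout hftrainer actually uses is not stated here, so verify it against the library before relying on this. The function name rot6d_to_matrix is hypothetical.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return [c / norm for c in v]


def _cross(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Right-handed cross product of two 3-vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rot6d_to_matrix(d6: Sequence[float]) -> List[List[float]]:
    """Expand one 6D rotation into a 3x3 matrix (row-major nested lists).

    ASSUMPTION: d6 = [a1 (3 values), a2 (3 values)] and the orthonormalized
    vectors become the first two COLUMNS of the matrix.
    """
    if len(d6) != 6:
        raise ValueError("expected exactly six values")
    a1, a2 = list(d6[:3]), list(d6[3:])
    b1 = _normalize(a1)
    # Remove the component of a2 along b1, then normalize (Gram-Schmidt).
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    # The third axis completes a right-handed frame.
    b3 = _cross(b1, b2)
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


# Identity rotation: the first two columns are the x and y axes.
print(rot6d_to_matrix([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))
```

For a full (B, T, 22, 6) tensor the same steps would be applied over the last axis with vectorized operations instead of a per-joint Python loop.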
The same config can be reconstructed without loading weights:
cfg_bundle = HyMotionT2MBundle.from_config(
    "checkpoints/hymotion_t2m/1.0b/hymotion_t2m_config.json"
)
assert not cfg_bundle.text_encoder_requires_external_weights()
Artifacts are produced with:
python3 scripts/eval/convert_hymotion_checkpoint.py \
    --out_dir checkpoints/hymotion_t2m/1.0b --variant 1.0b --verify
python3 scripts/eval/convert_hymotion_checkpoint.py \
    --config configs/hymotion_t2m/hymotion_t2m_201dim_046b.py \
    --ckpt checkpoints/HY-Motion-1.0/HY-Motion-1.0-Lite/latest.ckpt \
    --out_dir checkpoints/hymotion_t2m/0.46b --variant 0.46b --verify
Variants
| | HY-Motion T2M 1.0 | HY-Motion T2M 1.0-Lite |
|---|---|---|
| feat_dim | 1280 | 1024 |
| num_layers | 27 | 18 |
| num_heads | 20 | 16 |
| input_dim / output_dim | 201 / 201 | 201 / 201 |
| config | configs/hymotion_t2m/hymotion_t2m_201dim_full.py | configs/hymotion_t2m/hymotion_t2m_201dim_046b.py |
| upstream checkpoint | checkpoints/HY-Motion-1.0/HY-Motion-1.0/latest.ckpt | checkpoints/HY-Motion-1.0/HY-Motion-1.0-Lite/latest.ckpt |
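As a quick sanity check on the numbers above, the two variants can be written down as plain config records. The sketch below is illustrative only: the MMDiTVariant class and its head_dim property are hypothetical and are not part of hftrainer. The dictionary keys reuse the --variant values from the conversion commands. Dividing feat_dim by num_heads shows that both variants use the same per-head width, so the size difference comes from width and depth rather than head size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMDiTVariant:
    """Hypothetical record mirroring the rows of the Variants table."""

    name: str
    feat_dim: int
    num_layers: int
    num_heads: int
    input_dim: int = 201
    output_dim: int = 201

    @property
    def head_dim(self) -> int:
        """Per-head channel width; feat_dim must split evenly across heads."""
        if self.feat_dim % self.num_heads != 0:
            raise ValueError("feat_dim is not divisible by num_heads")
        return self.feat_dim // self.num_heads


VARIANTS = {
    "1.0b": MMDiTVariant("HY-Motion T2M 1.0", feat_dim=1280, num_layers=27, num_heads=20),
    "0.46b": MMDiTVariant("HY-Motion T2M 1.0-Lite", feat_dim=1024, num_layers=18, num_heads=16),
}

for key, variant in VARIANTS.items():
    print(f"{key}: {variant.num_layers} layers, head_dim={variant.head_dim}")
```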
Evaluation Protocol
Published Model-Zoo metrics use the official HY-Motion inference path:
- CFG scale 5.0
- 50 Euler ODE steps
- MMDiT / ODE / null embeddings / Mean-Std in fp32
- text encoder in bf16, with text features upcast to fp32 before MMDiT
- decode smoothing enabled: SLERP on rot6d and Savitzky-Golay on root translation
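
The sampling part of this protocol (fixed-step Euler integration with classifier-free guidance) can be sketched in a few lines. This is a minimal illustration, not the HY-Motion implementation: it assumes the flow runs from noise at t = 0 to data at t = 1, that guidance uses the common form v_uncond + scale × (v_cond − v_uncond), and it replaces the MMDiT with a hypothetical toy_velocity function. All function names here are made up for the example.

```python
from typing import Callable, List, Sequence

# velocity_fn(x, t, conditional) -> velocity; conditional=False stands for the
# prediction made with the null (unconditional) text embedding.
VelocityFn = Callable[[Sequence[float], float, bool], List[float]]


def cfg_velocity(velocity_fn: VelocityFn, x: Sequence[float], t: float,
                 guidance_scale: float) -> List[float]:
    """Blend conditional and unconditional velocities (assumed CFG form)."""
    v_cond = velocity_fn(x, t, True)
    v_uncond = velocity_fn(x, t, False)
    return [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity_fn: VelocityFn, x0: Sequence[float],
                 num_steps: int = 50, guidance_scale: float = 5.0) -> List[float]:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # left endpoint of the current step
        v = cfg_velocity(velocity_fn, x, t, guidance_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Sequence[float], t: float, conditional: bool) -> List[float]:
    """Hypothetical stand-in for the model: straight-line flow to a fixed target."""
    target = [1.0, -2.0] if conditional else [0.0, 0.0]
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


# With guidance_scale=1.0 the sampler follows the conditional flow only.
print(euler_sample(toy_velocity, [0.3, 0.7], num_steps=50, guidance_scale=1.0))
```

In the toy setup a scale above 1 extrapolates past the conditional target, which illustrates why the guidance scale has to match the published protocol for metrics to be comparable.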
HY-Motion outputs SMPL motion_135; for MotionStreamer comparison it is encoded
to MotionStreamer-272 and scored with MotionStreamer272Evaluator. HumanML3D-263
cross-eval converts the same indexed MS272 predictions and paired MS272 GT clips
through motion272_to_hml263, then scores with HumanML263Evaluator.
MotionStreamer-272 Evaluator
HY-Motion T2M 1.0 smooth full HumanML3D test run:
outputs/evaluation/hymotion_h3d272/metrics_smooth.json.
| Metric | HY-Motion T2M 1.0 | MS272 GT/Real |
|---|---|---|
| FID ↓ | 16.021 | 0.000 |
| R-Precision Top-1 ↑ | 0.737 | 0.706 |
| R-Precision Top-2 ↑ | 0.881 | 0.857 |
| R-Precision Top-3 ↑ | 0.929 | 0.911 |
| MM-Dist ↓ | 14.789 | 15.007 |
| Diversity → | 27.187 | 27.367 |
HY-Motion T2M 1.0-Lite MS272 metrics are pending; the expected output file is
outputs/evaluation/hymotion_h3d272/metrics_lite_smooth.json.
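For readers unfamiliar with the R-Precision rows, the sketch below shows how Top-k retrieval accuracy is typically computed from a text-to-motion distance matrix. It is a generic illustration under stated assumptions, not the code inside MotionStreamer272Evaluator: it assumes the matched motion for text i sits at column i, that smaller distance means a better match, and that ties are resolved in favor of the ground-truth pair. The function name r_precision is hypothetical.

```python
from typing import List, Sequence


def r_precision(dist: Sequence[Sequence[float]], top_k: int = 3) -> List[float]:
    """Return [Top-1, ..., Top-k] retrieval accuracy.

    dist[i][j] is the distance between text i and motion j within one
    retrieval batch; the ground-truth pair lies on the diagonal.
    """
    n = len(dist)
    if n == 0 or any(len(row) != n for row in dist):
        raise ValueError("dist must be a non-empty square matrix")
    hits = [0] * top_k
    for i, row in enumerate(dist):
        # Rank of the matched motion = number of motions strictly closer.
        rank = sum(1 for d in row if d < row[i])
        # A rank of r counts as a hit for every Top-k with k > r.
        for k in range(rank, top_k):
            hits[k] += 1
    return [h / n for h in hits]


# Toy batch of three text/motion pairs.
toy = [
    [0.1, 0.5, 0.9],  # matched motion is the closest
    [0.4, 0.3, 0.2],  # one other motion is closer than the match
    [0.2, 0.9, 0.5],  # one other motion is closer than the match
]
print(r_precision(toy, top_k=3))
```

In the usual protocol these accuracies are averaged over many small retrieval batches and over repeated runs; the batch size used by the evaluator is not stated here.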
HumanML3D-263 Cross-Eval
HY-Motion T2M 1.0 smooth cross-eval:
outputs/evaluation/hymotion_h3d272/metrics_smooth_h3d263.json.
This is not a native HumanML3D-263 generation run. It converts the indexed MotionStreamer-272 predictions and their paired MS272 GT clips to HML263 and scores the aligned population.
| Metric | HY-Motion T2M 1.0 | Converted GT/Real |
|---|---|---|
| FID ↓ | 0.103 | 0.000 |
| R-Precision Top-1 ↑ | 0.561 | 0.522 |
| R-Precision Top-2 ↑ | 0.761 | 0.725 |
| R-Precision Top-3 ↑ | 0.853 | 0.823 |
| MM-Dist ↓ | 2.532 | 2.691 |
| Diversity → | 10.031 | 9.876 |
Run details: n_samples = 7340, n_repeats = 20, caption_selection = first,
drop_last = true.
HY-Motion T2M 1.0-Lite HML263 metrics are pending; the expected output file is
outputs/evaluation/hymotion_h3d272/metrics_lite_smooth_h3d263.json.
Reproduce the full-variant cross-eval:
python3 scripts/eval/eval_272dir_h3d263.py \
    --pred_dir outputs/evaluation/hymotion_h3d272/hy_272_smooth \
    --out_json outputs/evaluation/hymotion_h3d272/metrics_smooth_h3d263.json \
    --with_fid --workers 16 --caption_selection first
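Once the command finishes, the metrics file can be inspected with a few lines of Python. The sketch below deliberately makes no assumption about key names: it only assumes the file holds a JSON object and keeps its top-level numeric entries, so nested structures (if the script writes any) are ignored. The helper name load_scalar_metrics is hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Union


def load_scalar_metrics(path: Union[str, Path]) -> Dict[str, float]:
    """Read a metrics JSON file and keep only top-level numeric values."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {
        key: float(value)
        for key, value in data.items()
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


if __name__ == "__main__":
    metrics_path = Path("outputs/evaluation/hymotion_h3d272/metrics_smooth_h3d263.json")
    if metrics_path.exists():
        for name, value in sorted(load_scalar_metrics(metrics_path).items()):
            print(f"{name}: {value:.3f}")
```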
Implementation Notes
- HyMotionT2MBundle.save_pretrained writes a self-contained hftrainer artifact with motion_transformer.safetensors, null CFG embeddings, Mean/Std, text_encoder/llm/, text_encoder/sentence/, and model_index.json metadata. Passing include_text_encoder=False is a legacy lightweight export mode and is not used for Model-Zoo publishing.
- HyMotionT2MBundle.from_pretrained accepts a local path or HF Hub id and keeps text encoder loading lazy; new artifacts resolve Qwen3-8B and CLIP-L from the artifact-local text_encoder/ directories.
- HyMotionT2MBundle.from_config accepts either the raw bundle config or the saved hymotion_t2m_config.json, matching the hftrainer ModelBundle API.
- Raw/no-smoothing outputs are diagnostic only. The previous bf16-ODE / wrong-CFG metrics are deprecated and must not be used for Model-Zoo reporting.
- The h3d272 suffix in scripts and output paths is historical; the evaluator space is MotionStreamer-272 on the HumanML3D test split.