How to use MartaYang007/ONE-SHOT-14B with Diffusers:
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image, export_to_video

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("MartaYang007/ONE-SHOT-14B", dtype=torch.bfloat16, device_map="cuda")
pipe.to("cuda")

prompt = "A man with short gray hair plays a red electric guitar."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/guitar-man.png"
)

output = pipe(image=image, prompt=prompt).frames[0]
export_to_video(output, "output.mp4")
ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration
Official inference code for ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration.
ONE-SHOT is a parameter-efficient framework for controllable human-environment video synthesis. It supports independent control over subject identity, human motion, scene context, and camera trajectory while preserving persistent identity and stable interactions in long generations.
Abstract
Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subjects and scenes remains a critical challenge. We introduce ONE-SHOT, a parameter-efficient framework built upon pre-trained VFMs that achieves high-fidelity synthesis of human-environment videos with independent control over subject appearance, human dynamics, spatial environments, and camera trajectories. By optimizing only a sparse set of parameters, it achieves precise control while preserving responsiveness to textual instructions. A canonical-space motion injection mechanism mitigates conditioning competition between rigid human priors and text prompts. By anchoring static and dynamic context, ONE-SHOT ensures persistent subject identity and stable human-environment interactions across minute-scale generations. Extensive experiments demonstrate that ONE-SHOT significantly outperforms existing methods in structural control and creative diversity.
Overview
News
- 2026/05/31: The ONE-SHOT-14B diffusers checkpoint can be downloaded from Hugging Face.
- 2026/04/01: The ONE-SHOT paper is available on arXiv.
- 2026/04/01: ONE-SHOT project materials are available on the project page.
Metrics
Quick Start
Installation
Recommended environment:
- Linux
- NVIDIA GPU
- CUDA 12.1 compatible driver
- Python 3.12
- ffmpeg with libx264 support
git clone https://github.com/MartaYang/ONE-SHOT-code.git
cd ONE-SHOT-code
bash install.sh
conda activate oneshot
The installer creates the oneshot conda environment, installs a GPL-enabled ffmpeg, installs PyTorch 2.5.1+cu121, installs the vendored PyTorch3D wheel, and then installs requirements.txt.
Manual installation commands
conda create -n oneshot python=3.12 -y
conda activate oneshot
conda install 'ffmpeg=*=*gpl*' -y
pip install --index-url https://download.pytorch.org/whl/cu121 \
torch==2.5.1+cu121 torchvision==0.20.1+cu121 torchaudio==2.5.1+cu121
pip install Preprocessing/third_party/wheels/pytorch3d-0.7.9-cp312-cp312-manylinux_2_31_x86_64.whl
pip install -r requirements.txt
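Since the recommended environment calls for an ffmpeg build with libx264 support, it can help to confirm that before running inference. The following is a minimal sketch, not part of the repository: the helper names `has_libx264` and `ffmpeg_supports_libx264` are hypothetical, and the check assumes the usual `ffmpeg -encoders` listing in which the encoder name is the second whitespace-separated field of each row.

```python
import shutil
import subprocess


def has_libx264(encoders_output: str) -> bool:
    """Return True if an `ffmpeg -encoders` listing contains the libx264 encoder."""
    for line in encoders_output.splitlines():
        fields = line.split()
        # Encoder rows look like: " V....D libx264   libx264 H.264 / AVC ..."
        # (assumed layout: flags first, encoder name second).
        if len(fields) >= 2 and fields[1] == "libx264":
            return True
    return False


def ffmpeg_supports_libx264() -> bool:
    """Query the ffmpeg found on PATH; False if ffmpeg is absent."""
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        return False
    result = subprocess.run(
        [ffmpeg, "-hide_banner", "-encoders"],
        capture_output=True,
        text=True,
        check=False,
    )
    return has_libx264(result.stdout)
```

Calling `ffmpeg_supports_libx264()` inside the activated `oneshot` environment should return `True` if the GPL-enabled ffmpeg was installed as described above.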
Checkpoints
Download the released checkpoint and set ONESHOT_MODEL_DIR:
hf download MartaYang007/ONE-SHOT-14B \
--local-dir pretrained_models/ONESHOT-14B-diffusers
export ONESHOT_MODEL_DIR=pretrained_models/ONESHOT-14B-diffusers
A recommended local checkpoint layout is:
pretrained_models/
└── ONESHOT-14B-diffusers/
    ├── transformer/
    ├── vae/
    ├── text_encoder/
    ├── tokenizer/
    ├── scheduler/
    ├── model_index.json
    ├── preprocess/
    │   ├── human3r.pth
    │   ├── DA3NESTED-GIANT-LARGE-1.1/
    │   ├── smpl_models/
    │   │   └── smplx/
    │   │       └── SMPLX_NEUTRAL.npz   # YOU MUST DOWNLOAD THIS (see below)
    │   └── torch_hub/
    └── demo/
SMPL-X body model (manual download)
The SMPL-X license prohibits third-party redistribution, so SMPLX_NEUTRAL.npz
is not included in the Hugging Face checkpoint.
Register at https://smpl-x.is.tue.mpg.de/ and accept the model license.
From the Downloads page, fetch the latest SMPL-X (NPZ format) release.
Place the file at:
$ONESHOT_MODEL_DIR/preprocess/smpl_models/smplx/SMPLX_NEUTRAL.npz
If the file is missing at runtime, the pipeline fails fast with an error pointing back to this section.
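To catch a missing SMPL-X file or an incomplete download before launching the pipeline, the recommended layout can be checked up front. This is a rough sketch under the assumption that the checkpoint follows the layout listed in this section; `EXPECTED_ENTRIES` and `missing_entries` are hypothetical names and do not come from the repository.

```python
import os
from pathlib import Path

# Paths relative to ONESHOT_MODEL_DIR, copied from the recommended layout above.
EXPECTED_ENTRIES = [
    "transformer",
    "vae",
    "text_encoder",
    "tokenizer",
    "scheduler",
    "model_index.json",
    "preprocess/human3r.pth",
    "preprocess/DA3NESTED-GIANT-LARGE-1.1",
    "preprocess/smpl_models/smplx/SMPLX_NEUTRAL.npz",  # manual download
    "preprocess/torch_hub",
    "demo",
]


def missing_entries(model_dir):
    """Return the expected entries that do not exist under model_dir."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]


def report(model_dir=None):
    """Print missing entries for model_dir (defaults to $ONESHOT_MODEL_DIR)."""
    model_dir = model_dir or os.environ.get("ONESHOT_MODEL_DIR", "")
    for rel in missing_entries(model_dir):
        print(f"missing: {rel}")
```

Running `report()` after `export ONESHOT_MODEL_DIR=...` prints nothing when every listed entry is present.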
Inference
The end-to-end entrypoint is:
bash scripts/run_pipeline.sh <id_swap|motion_swap|scene_swap> [task arguments]
Supported tasks:
| Task | Required inputs | Description |
|---|---|---|
| id_swap | source video, identity profile video or identity profile images, prompt | Replace the actor identity while preserving the source motion and scene. |
| motion_swap | source video, motion video, prompt | Apply a new motion to the original subject and scene. |
| scene_swap | source video, scene video, prompt | Place the person into a different environment. |
Example commands:
ID swap
bash scripts/run_pipeline.sh id_swap \
--video_path "$ONESHOT_MODEL_DIR/demo/walkinforest.mp4" \
--id_profile_video "$ONESHOT_MODEL_DIR/demo/WillSmith.mp4" \
--prompt "A sunlit forest trail with dense green trees and soft natural light filtering through the leaves. Will Smith, wearing a black suit, walks steadily along the forest path while holding a wooden walking stick, looking slightly upward as he moves forward."
# --id_profile_video: recommended. Use a video with multi-angle coverage of the target person
# (front + side + other angles) for best identity fidelity.
# Alternative: --id_profile_dir <dir> with 4 images named exactly:
# ref1.png (front view) ref2.png (back / 3/4 view) ref3.png (side view) face.png (face crop)
--id_profile_video is recommended when the identity reference contains multi-angle coverage. Alternatively, provide --id_profile_dir <dir> with ref1.png, ref2.png, ref3.png, and face.png.
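Because `--id_profile_dir` expects four images with exact file names, a small check can flag a misnamed file before a run. The sketch below is illustrative only; `missing_profile_images` is a hypothetical helper, and it checks file names only, not image content or viewing angle.

```python
from pathlib import Path

# File names required by --id_profile_dir, as listed above.
REQUIRED_PROFILE_IMAGES = ("ref1.png", "ref2.png", "ref3.png", "face.png")


def missing_profile_images(profile_dir):
    """Return the required image names that are absent from profile_dir."""
    root = Path(profile_dir)
    return [name for name in REQUIRED_PROFILE_IMAGES if not (root / name).is_file()]
```

An empty list means the directory contains all four expected files.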
Motion swap
bash scripts/run_pipeline.sh motion_swap \
--video_path "$ONESHOT_MODEL_DIR/demo/museum4_human.mp4" \
--motion_video_path "$ONESHOT_MODEL_DIR/demo/taiji.mp4" \
--prompt "An indoor space resembling the interior of a museum. A man in a suit is performing tai chi movements."
# to also swap identity: add --id_profile_video $ONESHOT_MODEL_DIR/demo/WillSmith.mp4
# Note: update the prompt accordingly (e.g., gender, name) to match the new identity and avoid conflicts with the video content.
To also swap identity, add --id_profile_video "$ONESHOT_MODEL_DIR/demo/WillSmith.mp4" and update the prompt so the identity description does not conflict with the video content.
Scene swap
bash scripts/run_pipeline.sh scene_swap \
--video_path "$ONESHOT_MODEL_DIR/demo/palace_human.mp4" \
--scene_video_path "$ONESHOT_MODEL_DIR/demo/museum4_scene.mp4" \
--id_profile_video "$ONESHOT_MODEL_DIR/demo/WillSmith.mp4" \
--prompt "An indoor space resembling the interior of a museum. Will Smith is walking, wearing a black suit."
# --scene_video_path: must be a pure background video (no human subjects).
# Strongly recommended to choose a background video whose depth-of-field
# matches the person in --video_path for best generation quality.
# This example also swaps identity via --id_profile_video; omit that flag to keep the original identity.
# Note: update the prompt accordingly (e.g., gender, name) to match the identity and avoid conflicts with the video content.
To preserve the original identity, omit --id_profile_video and update the prompt accordingly.
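When launching many runs from Python, the three example commands can be assembled programmatically. The following sketch assumes only the flag names that appear in the examples above (`--video_path`, `--id_profile_video`, `--id_profile_dir`, `--motion_video_path`, `--scene_video_path`, `--prompt`); `REQUIRED_FLAGS` and `build_command` are hypothetical names, and the required-flag rules are inferred from the task table rather than taken from `scripts/run_pipeline.sh`.

```python
import shlex

# For each task, at least one of the listed flags must be given
# (inferred from the task table and example commands).
REQUIRED_FLAGS = {
    "id_swap": ("id_profile_video", "id_profile_dir"),
    "motion_swap": ("motion_video_path",),
    "scene_swap": ("scene_video_path",),
}


def build_command(task, video_path, prompt, **flags):
    """Return the run_pipeline.sh invocation as an argument list."""
    if task not in REQUIRED_FLAGS:
        raise ValueError(f"unknown task {task!r}")
    alternatives = REQUIRED_FLAGS[task]
    if not any(name in flags for name in alternatives):
        needed = ", ".join("--" + name for name in alternatives)
        raise ValueError(f"{task} needs one of: {needed}")
    command = ["bash", "scripts/run_pipeline.sh", task, "--video_path", str(video_path)]
    for name, value in flags.items():
        command += [f"--{name}", str(value)]
    command += ["--prompt", prompt]
    return command


command = build_command(
    "motion_swap",
    "demo/museum4_human.mp4",
    "A man in a suit is performing tai chi movements.",
    motion_video_path="demo/taiji.mp4",
)
print(shlex.join(command))  # shell-quoted form, handy for logging
```

Passing the list to `subprocess.run(command, check=True)` from the repository root avoids shell-quoting problems with long prompts.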
Generated videos are saved to:
exp/<scheduler>_<task>_<timestamp>/<save_name>.mp4
For example:
exp/lcm_id_swap_20260527_221947/ID_WillSmith-SMPLX_clip_000-006-Scene_C01_gen_xxx_ourGen81.mp4
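To locate the newest result, the run directory name can be split back into its parts. This is a sketch under stated assumptions: the timestamp format `YYYYMMDD_HHMMSS` is inferred from the single example above, and `parse_run_dir` and `latest_videos` are hypothetical helpers, not repository functions.

```python
import re
from datetime import datetime
from pathlib import Path

# <scheduler>_<task>_<timestamp>; the task names come from the task table.
RUN_DIR_PATTERN = re.compile(
    r"^(?P<scheduler>.+)_(?P<task>id_swap|motion_swap|scene_swap)"
    r"_(?P<timestamp>\d{8}_\d{6})$"
)


def parse_run_dir(name):
    """Split a run directory name into scheduler, task, and timestamp."""
    match = RUN_DIR_PATTERN.match(name)
    if match is None:
        return None
    return {
        "scheduler": match["scheduler"],
        "task": match["task"],
        "timestamp": datetime.strptime(match["timestamp"], "%Y%m%d_%H%M%S"),
    }


def latest_videos(exp_root="exp"):
    """Return the .mp4 files of the most recent run under exp_root."""
    root = Path(exp_root)
    if not root.is_dir():
        return []
    runs = []
    for path in root.iterdir():
        info = parse_run_dir(path.name) if path.is_dir() else None
        if info is not None:
            runs.append((info["timestamp"], path))
    if not runs:
        return []
    _, newest = max(runs, key=lambda item: item[0])
    return sorted(newest.glob("*.mp4"))
```

Directories that do not match the pattern are skipped, so unrelated folders under `exp/` do not affect the result.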
Available Models
| Model | Status | Link |
|---|---|---|
| ONE-SHOT-14B | Available | MartaYang007/ONE-SHOT-14B |
Notes
- scripts/run_pipeline.sh builds a task CSV, selects available GPUs with the most free memory, and launches scripts/inference_short.sh.
- Set ONESHOT_ENV if your conda environment name is not oneshot.
- The demo videos are included in the checkpoint download, but using your own videos is recommended for real experiments.
- The prompt should match any swapped identity, motion, or scene to avoid conflicts between text and visual conditions.
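As an illustration of what "GPUs with the most free memory" can mean in practice, the sketch below ranks devices using `nvidia-smi`'s CSV query output. It is not the logic of `scripts/run_pipeline.sh`, which is not reproduced here; `rank_gpus_by_free_memory` and `query_free_memory` are hypothetical helpers.

```python
import subprocess


def rank_gpus_by_free_memory(csv_text, count=1):
    """Return up to `count` GPU indices, most free memory first.

    csv_text holds lines of "index, free_mib" as produced by
    nvidia-smi with --format=csv,noheader,nounits.
    """
    gpus = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, free_mib = (field.strip() for field in line.split(","))
        gpus.append((int(free_mib), int(index)))
    # Sort by free memory descending, then by index for stable ties.
    gpus.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in gpus[:count]]


def query_free_memory():
    """Ask nvidia-smi for per-GPU free memory (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The returned indices could then be joined into `CUDA_VISIBLE_DEVICES` for a manually launched process.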
TODO
- Multi-GPU inference with FSDP-based model sharding and sequence parallelism.
- ComfyUI support for node-based experimentation and demos.
- Fully compositional generation with explicit identity, motion, scene, and position controls.
Repository Layout
ONE-SHOT-code/
├── install.sh                      # one-shot conda environment installer
├── requirements.txt
├── tools/
│   ├── inference_short.py          # short-video inference entrypoint
│   └── inference_long.py           # long-video inference with scene memory
├── scripts/
│   ├── run_pipeline.sh             # preprocessing + inference pipeline
│   ├── inference_short.sh
│   └── inference_long.sh
├── Preprocessing/
│   ├── preprocess_video.py         # per-video preprocessing
│   ├── build_csv.py                # task CSV builder
│   ├── make_scene_masked_dilate.py
│   └── third_party/                # vendored DUSt3R, Human3R, CroCo, and utilities
├── oneshot_diffusers/              # ONE-SHOT diffusers overrides
│   ├── transformer_wan_oneshot.py
│   ├── pipeline_wan_oneshot.py
│   └── oneshot_util.py
├── utils/                          # ODE solvers and video I/O helpers
└── datasets/                       # DWPose drawing and data utilities
Acknowledgement
This project builds on:
Citation
@misc{yang2026oneshot,
title={ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration},
author={Fengyuan Yang and Luying Huang and Jiazhi Guan and Quanwei Yang and Dongwei Pan and Jianglin Fu and Haocheng Feng and Wei He and Kaisiyuan Wang and Hang Zhou and Angela Yao},
year={2026},
eprint={2604.01043},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.01043}
}


