A unified framework for camera-controlled video generation that accepts video, pose, or text, all describing the same camera trajectory, and maps them into a shared motion embedding space.
- [2026-06] Our paper has been accepted to ECCV 2026.
- [2026-04] Code, checkpoints, and the Motion Triplet Dataset are released.
Existing camera-control methods are typically restricted to a single input modality — pose-conditioned methods require precise geometric trajectories, reference-video methods lack explicit control, and text-based methods struggle with temporal consistency. TriMotion addresses all three limitations in one framework.
Key contributions
- Unified Motion Embedding Space — aligns video, pose, and text in a shared representation via contrastive learning, temporal synchronization, and geometric fidelity regularization.
- Motion Triplet Dataset — 136K synchronized (video, pose, text) triplets built on top of the MultiCamVideo Dataset with LLM-generated, geometry-grounded captions.
- Latent Motion Consistency — a Motion Embedding Predictor that enforces trajectory fidelity directly in latent space, avoiding costly pixel-space decoding.
TriMotion is built on top of Wan2.1 and supports both I2V and V2V camera-controlled generation.
git clone https://github.com/seunghyuns98/TriMotion.git
cd TriMotion
conda create -n TriMotion python=3.10
conda activate TriMotion
pip install -r requirements.txt
# Install Flash Attention
pip install flash_attn==2.8.3 --no-build-isolation
# or install the prebuilt wheel instead
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp310-cp310-linux_x86_64.whl

We use Wan2.1-T2V-1.3B as the diffusion backbone, along with an additional CLIP checkpoint required for the I2V branch.
# 1) Wan2.1-T2V-1.3B (T5 text encoder, VAE, DiT)
hf download Wan-AI/Wan2.1-T2V-1.3B \
--local-dir checkpoint/Wan2.1-T2V-1.3B
# 2) CLIP image encoder (open-clip-xlm-roberta-large-vit-huge-14)
hf download DeepBeepMeep/Wan2.1 \
models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth \
--revision 8bee6e003d1d9d31ecb2c75b643e57fa74fb2ad5 \
--local-dir ./checkpoint/Wan2.1-T2V-1.3B

After downloading, the checkpoint/ directory should look like:
checkpoint/
└── Wan2.1-T2V-1.3B/
    ├── models_t5_umt5-xxl-enc-bf16.pth
    ├── Wan2.1_VAE.pth
    ├── diffusion_pytorch_model.safetensors
    └── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
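Before running inference, the layout above can be checked with a few lines of Python. This is a minimal sketch and not part of the repository: the helper name missing_checkpoint_files is hypothetical, and the file list is simply copied from the tree above.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_FILES = [
    "models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1_VAE.pth",
    "diffusion_pytorch_model.safetensors",
    "models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth",
]


def missing_checkpoint_files(root="checkpoint/Wan2.1-T2V-1.3B"):
    """Return the expected file names that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All Wan2.1 checkpoint files found.")
```

An empty result means the two hf download commands placed everything where the demo scripts are expected to look.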
You also need the TriMotion-specific checkpoints, available from our Google Drive folder. This folder also includes the I2V model from CamCloneMaster-Wan2.1, which we use as the initialization for our DiT fine-tuning. Download the entire folder with gdown:
pip install gdown
# Download the whole TriMotion checkpoint folder into ./checkpoint/trimotion/
gdown --folder https://drive.google.com/drive/folders/1tQznlZwoSTFRzDhgmikVCGAbDiO6YAVs \
-O ./checkpoint/trimotion

💡 If gdown hits a quota error for large files, re-run the same command; partially downloaded files will be resumed. For very large files you may need gdown --fuzzy <file-url> on individual items.
| File | Description |
|---|---|
| embedding_space.ckpt | Unified Motion Embedding Space: video / text / pose encoders |
| vae_projection.ckpt | Motion Embedding Predictor: latent → motion embedding |
| trimotion.ckpt | Wan2.1 fine-tuned DiT: camera-controlled I2V / V2V |
| i2v_baseline.ckpt | I2V initialization from CamCloneMaster-Wan2.1, used as the starting point for DiT fine-tuning |
| aggregator.ckpt | VGGT Aggregator weights extracted from VGGT; initializes the video motion encoder (required for both training and inference) |
demo.py runs a single-example inference. You must provide a source video (--content_video), a scene prompt (--prompt), and at least one camera reference among --ref_video / --ref_text / --ref_pose.
| Flag | Accepts | Example |
|---|---|---|
| --ref_video | .mp4 / any decord-readable video | --ref_video path/to/ref.mp4 |
| --ref_text | raw string or .txt file path | --ref_text path/to/ref.txt |
| --ref_pose | .json (per-cam extrinsics) / .npy / .pt | --ref_pose path/to/ref.json |
- --prompt also accepts either a raw string or a .txt file path.
- If you use the files under examples/src_videos, choose the reference video and prompt with the same number (the filename without its extension).
- --mode: i2v or v2v (default v2v).
- --first_frame: optional image for the I2V first frame. If omitted, the first frame of --content_video is used.
- --dit_num_frames, --dit_height, --dit_width: output shape (default 81 / 384 / 672).
- --cfg_scale, --num_inference_steps, --seed: standard diffusion controls.
- --output_name: output filename stem (default generated.mp4). The ref-modality tag is appended automatically, e.g. generated_video.mp4, generated_text.mp4.
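The input rules above (text arguments given as a raw string or a .txt path, at least one camera reference, and a modality tag appended to the output name) can be illustrated as follows. This is a hypothetical sketch of the described behaviour, not the code in demo.py; all three function names are invented for illustration.

```python
from pathlib import Path


def resolve_text_arg(value):
    """Return file contents if `value` is an existing .txt path, else the raw string."""
    path = Path(value)
    if path.suffix == ".txt" and path.is_file():
        return path.read_text(encoding="utf-8").strip()
    return value


def provided_references(ref_video=None, ref_text=None, ref_pose=None):
    """Return the names of the supplied reference modalities; require at least one."""
    refs = {"video": ref_video, "text": ref_text, "pose": ref_pose}
    given = [name for name, value in refs.items() if value is not None]
    if not given:
        raise ValueError("Provide at least one of --ref_video / --ref_text / --ref_pose")
    return given


def tagged_output_name(output_name, modality):
    """Append the reference-modality tag to the filename stem."""
    path = Path(output_name)
    suffix = path.suffix or ".mp4"  # assumption: fall back to .mp4 if no extension is given
    return f"{path.stem}_{modality}{suffix}"


print(tagged_output_name("generated.mp4", "video"))
```

With the default output name and a reference video, the sketch yields generated_video.mp4, matching the naming example in the list above.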
Reference video only:
python demo.py \
--content_video examples/src_videos/1.mp4 \
--prompt examples/prompt/1.txt \
--ref_video examples/ref_videos/1.mp4

Reference text only:
python demo.py \
--content_video examples/src_videos/1.mp4 \
--prompt examples/prompt/1.txt \
--ref_text examples/ref_texts/cam04.txt

Reference pose only (JSON extrinsics):
python demo.py \
--content_video examples/src_videos/1.mp4 \
--prompt examples/prompt/1.txt \
--ref_pose examples/ref_poses/cam02.json

demo_multimodal.py combines exactly two reference modalities from --ref_video / --ref_text / --ref_pose by fusing their motion embeddings. Two fusion modes are supported:
- --type interpolation: linearly blends the two embeddings: target = scale · e₀ + (1 − scale) · e₁. Set the blend with --scale (0.0–1.0).
- --type sequential: concatenates the two motion sequences in time to form a compound trajectory. Use --order {video,text,pose} to pick which provided modality goes first; the other one goes second.
python demo_multimodal.py \
--content_video examples/src_videos/2.mp4 \
--prompt examples/prompt/2.txt \
--ref_pose examples/ref_poses/cam01.json \
--ref_text examples/ref_texts/cam05.txt \
--type sequential \
--order pose \
--output_dir ./results/multimodal \
--mode v2v

Interpolate between a reference pose and a reference text (50/50 blend):
python demo_multimodal.py \
--content_video examples/src_videos/2.mp4 \
--prompt examples/prompt/2.txt \
--ref_pose examples/ref_poses/cam01.json \
--ref_text examples/ref_texts/cam05.txt \
--type interpolation \
--scale 0.5 \
--output_dir ./results/multimodal

Sequential composition (pose motion first, then text motion):
python demo_multimodal.py \
--content_video examples/src_videos/2.mp4 \
--prompt examples/prompt/2.txt \
--ref_pose examples/ref_poses/cam01.json \
--ref_text examples/ref_texts/cam05.txt \
--type sequential \
--order pose \
--output_dir ./results/multimodal

All other flags (--mode, --first_frame, --dit_num_frames/height/width, --cfg_scale, --num_inference_steps, --seed, --output_name) behave the same as in demo.py.
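The two fusion modes can be pictured with toy embeddings. The sketch below assumes each motion embedding is a sequence of per-frame vectors (a list of T lists of length D) and that both references share the same T and D for interpolation. The actual tensor shapes, and any resampling demo_multimodal.py may apply after concatenation, are not specified here, so the function names and the first argument are illustrative only.

```python
def interpolate(e0, e1, scale):
    """Blend two embeddings frame by frame: scale * e0 + (1 - scale) * e1."""
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must lie in [0.0, 1.0]")
    if len(e0) != len(e1):
        raise ValueError("embeddings must have the same number of frames")
    return [
        [scale * a + (1.0 - scale) * b for a, b in zip(f0, f1)]
        for f0, f1 in zip(e0, e1)
    ]


def sequential(refs, first):
    """Concatenate two motion sequences in time, with modality `first` leading."""
    if len(refs) != 2 or first not in refs:
        raise ValueError("expected exactly two modalities, one of them `first`")
    (second,) = [name for name in refs if name != first]
    return refs[first] + refs[second]


pose_emb = [[1.0, 0.0], [1.0, 0.0]]  # toy embedding: 2 frames, 2 dims
text_emb = [[0.0, 1.0], [0.0, 1.0]]

print(interpolate(pose_emb, text_emb, scale=0.5))
print(len(sequential({"pose": pose_emb, "text": text_emb}, first="pose")))
```

Note that scale weights the first embedding, so scale = 1.0 reproduces e₀ and scale = 0.0 reproduces e₁.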
We release the Motion Triplet Dataset, built upon the MultiCamVideo Dataset (136K videos, 13.6K scenes, 40 Unreal Engine 5 environments) by adding geometry-grounded motion descriptions.
Preparation
- Download the MultiCamVideo Dataset into the MotionTriplet-Dataset/ directory.
- Download our Motion Descriptions from Google Drive.
- Run the preparation script:
python preparing_dataset.py

Directory structure
MotionTriplet-Dataset/
├── train/
│   └── f00/
│       └── scene1/
│           ├── cameras/
│           │   ├── camera_extrinsics.json
│           │   ├── text_description_long.json
│           │   └── text_description_short.json
│           ├── videos/
│           │   ├── cam01.mp4
│           │   ├── ...
│           │   └── cam10.mp4
│           ├── text/
│           │   └── text_description.json
│           └── merged_conditions.json
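Given this layout, each scene contributes one triplet per camera video. The following sketch walks a split and pairs every video with the pose and caption files of its scene. It assumes the directory names shown above; the helper name iter_triplet_paths is hypothetical, and the JSON files are only located, not parsed, because their internal schema is not described here.

```python
from pathlib import Path


def iter_triplet_paths(dataset_root="MotionTriplet-Dataset", split="train"):
    """Yield (video, extrinsics, long_caption) paths for every camera video."""
    split_dir = Path(dataset_root) / split
    # Layout assumed from the tree above: <split>/<folder>/<scene>/{cameras,videos}/
    for scene_dir in sorted(split_dir.glob("*/*")):
        cameras = scene_dir / "cameras"
        extrinsics = cameras / "camera_extrinsics.json"
        caption = cameras / "text_description_long.json"
        if not (extrinsics.is_file() and caption.is_file()):
            continue  # skip scenes that have not been prepared yet
        for video in sorted((scene_dir / "videos").glob("cam*.mp4")):
            yield video, extrinsics, caption


if __name__ == "__main__":
    triplets = list(iter_triplet_paths())
    print(f"Found {len(triplets)} (video, pose, text) triplets")
```

Running it after the preparation script gives a quick count to compare against the expected dataset size.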
Trains motion encoders for all three modalities with a composite loss: global InfoNCE alignment, temporal synchronization, and geometric fidelity regularization.
python train_embedding_space.py \
--output_path ./checkpoint/embedding_space

💡 Common knobs: --batch_size (default 24), --learning_rate (default 1e-4), --max_epochs (default 100), --training_strategy (deepspeed_stage_1|2|3), and --resume_ckpt_path to continue from a checkpoint.
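For readers unfamiliar with the global alignment term, the sketch below computes a symmetric InfoNCE loss over a small batch of paired embeddings in plain Python. It is a generic illustration of the technique, not the training code in train_embedding_space.py: the temperature value, the symmetric averaging, and the function names are assumptions, and the temporal synchronization and geometric fidelity terms are omitted.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _directional_nce(queries, keys, temperature):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


def info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a = [_normalize(v) for v in emb_a]
    b = [_normalize(v) for v in emb_b]
    return 0.5 * (_directional_nce(a, b, temperature) + _directional_nce(b, a, temperature))


video = [[1.0, 0.0], [0.0, 1.0]]
pose_aligned = [[0.9, 0.1], [0.1, 0.9]]
pose_swapped = [[0.1, 0.9], [0.9, 0.1]]
# Correctly paired embeddings should yield the lower loss.
print(info_nce(video, pose_aligned) < info_nce(video, pose_swapped))
```

Each embedding is pulled toward its own pair and pushed away from the other items in the batch, which is why larger batches give the loss more negatives to contrast against.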
Trains the predictor (3D convolutions + temporal Transformer) to estimate motion embeddings from VAE latents, using a dual-granularity cosine similarity loss (global + frame-wise).
python train_motion_embedding_projector.py \
--cam_ckpt_path PATH_TO_YOUR_CAM_ENCODER_WEIGHT \
--output_path ./checkpoint/motion_embedding_projector

💡 Common knobs: --batch_size (default 8), --learning_rate (default 1e-4), --max_epochs (default 10), --training_strategy, --resume_ckpt_path.
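The dual-granularity objective can be sketched in plain Python as a frame-wise cosine term plus a global cosine term. This is an interpretation under stated assumptions: the global embedding is taken to be the mean over frames and the two terms are weighted equally, neither of which is specified here. The function names are illustrative.

```python
import math


def cosine(u, v, eps=1e-8):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def mean_over_frames(seq):
    return [sum(col) / len(seq) for col in zip(*seq)]


def dual_granularity_loss(pred, target, w_global=1.0, w_frame=1.0):
    """pred / target: per-frame embeddings, each a list of T vectors."""
    frame_term = sum(1.0 - cosine(p, t) for p, t in zip(pred, target)) / len(pred)
    global_term = 1.0 - cosine(mean_over_frames(pred), mean_over_frames(target))
    return w_global * global_term + w_frame * frame_term


target = [[1.0, 0.0], [0.0, 1.0]]
print(dual_granularity_loss(target, target))        # matching sequences: close to 0
print(dual_granularity_loss(target[::-1], target))  # same frames in the wrong order
```

The second call shows why both granularities matter in this toy setup: the pooled embeddings are identical, so only the frame-wise term penalizes the reversed order.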
Precompute and cache embeddings before training:
python latent_preprocess.py \
--cam_encoder_ckpt_path PATH_TO_YOUR_CAM_ENCODER_WEIGHT

💡 You may also tune --num_frames (default 21), --height / --width (default 224 / 448), --batch_size (default 32), and --dataloader_num_workers (default 8) to match your hardware.
Fine-tunes WAN-Video with motion embedding conditioning via block-specific projection MLPs. Jointly trains I2V and V2V with equal probability per iteration.
python train_trimotion.py \
--vae_projector_ckpt_path PATH_TO_YOUR_VAE_PROJECTOR_WEIGHT \
--output_path ./checkpoint/tri_motion

💡 Common knobs: --batch_size (default 4), --accumulate_grad_batches (default 4), --learning_rate (default 1e-4), --max_epochs (default 10), --num_frames / --height / --width (default 81 / 384 / 672), --training_strategy (default deepspeed_stage_2), --resume_ckpt_path.
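Joint training with equal probability amounts to drawing the task independently at every iteration. A minimal sketch, assuming a simple Bernoulli draw; the function name and the seeding are illustrative and not taken from train_trimotion.py.

```python
import random


def sample_training_mode(rng):
    """Pick the conditioning task for one iteration with probability 0.5 each."""
    return "i2v" if rng.random() < 0.5 else "v2v"


rng = random.Random(0)  # fixed seed only to make the demo repeatable
counts = {"i2v": 0, "v2v": 0}
for _ in range(10_000):
    counts[sample_training_mode(rng)] += 1
print(counts)  # the two counts should be roughly equal
```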
Geometry-grounded camera motion captions are produced with Qwen3. Shared logic lives in caption_generator.py; use the scripts below for each workflow.
| File | Purpose |
|---|---|
| caption_generator.py | Pose parsing, motion analysis, prompts, Qwen inference (imported by the scripts) |
| caption_motion_triplet.py | Batch captions for the Motion Triplet Dataset |
| caption_pose.py | Long captions for individual pose files (e.g., RealEstate10K) |
Install the extra dependency (not required for inference/training above):
pip install scipy

After placing MultiCamVideo under MotionTriplet-Dataset/, run caption generation. Outputs are written next to each scene as cameras/text_description_long.json and cameras/text_description_short.json. Both camera_extrinsics.json and the merged file merged_conditions.json are supported.
# Long + short captions for the train split (default dataset path: ./MotionTriplet-Dataset)
python caption_motion_triplet.py --mode train
# Validation split, long captions only
python caption_motion_triplet.py --mode val --caption-type long

| Flag | Description |
|---|---|
| --dataset-path | MotionTriplet-Dataset root (default: ./MotionTriplet-Dataset) |
| --mode | train or val |
| --caption-type | long, short, or both (default both) |
| --model-name | Hugging Face Qwen3 id (default Qwen/Qwen3-4B-Instruct-2507) |
Pass one or two pose files directly. Generates long captions only. Supports .txt trajectories (RealEstate10K-style extrinsics) and .json files in the same format as examples/ref_poses/.
# One RealEstate10K pose file
python caption_pose.py \
--pose /path/to/realestate10k/poses/00001.txt \
--output-dir ./cam_captions
# Try two example poses from this repo
python caption_pose.py \
--pose examples/ref_poses/cam01.json examples/ref_poses/cam02.json \
--output-dir ./cam_captions

| Flag | Description |
|---|---|
| --pose | One or more pose paths (.txt or .json) |
| --output-dir | Output directory (default ./cam_captions; files named {pose}_long.txt) |
| --frame-interval / --num-frames | Sampling for .txt poses (default 4 / 21) |
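To make the .txt sampling concrete, the sketch below parses RealEstate10K-style pose lines and subsamples them. It assumes the commonly distributed layout (a URL header line, then 19 whitespace-separated columns per frame: timestamp, four intrinsics, two unused values, and a row-major 3x4 extrinsic matrix) and assumes that --frame-interval means "keep every n-th pose" with --num-frames as an upper bound. Neither assumption is confirmed by caption_pose.py, and the function names are hypothetical.

```python
def parse_re10k_line(line):
    """Split one pose line into (timestamp, intrinsics, 3x4 extrinsic rows)."""
    values = line.split()
    if len(values) != 19:
        raise ValueError(f"expected 19 columns, got {len(values)}")
    timestamp = int(values[0])
    fx, fy, cx, cy = (float(v) for v in values[1:5])
    flat = [float(v) for v in values[7:19]]  # columns 5-6 are unused here
    extrinsic = [flat[0:4], flat[4:8], flat[8:12]]
    return timestamp, (fx, fy, cx, cy), extrinsic


def sample_poses(lines, frame_interval=4, num_frames=21):
    """Keep every `frame_interval`-th pose, up to `num_frames` poses."""
    pose_lines = [row for row in lines if len(row.split()) == 19]  # skips the URL header
    return [parse_re10k_line(row) for row in pose_lines[::frame_interval][:num_frames]]


# Synthetic trajectory: a header line followed by 100 identity poses.
identity = "1 0 0 0 0 1 0 0 0 0 1 0"
lines = ["https://example.com/video"] + [
    f"{t * 1000} 0.5 0.9 0.5 0.5 0 0 {identity}" for t in range(100)
]
sampled = sample_poses(lines)
print(len(sampled), sampled[1][0])
```

Under these assumptions the defaults (4 / 21) cover the first 81 source frames of a trajectory.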
- WAN-Video — diffusion backbone
- VGGT — video motion encoder
- ReCamMaster — Multi-Cam Video Dataset
- CamCloneMaster — I2V initialization for DiT fine-tuning
- Qwen3 — geometry-grounded caption generation
- Hugging Face Transformers — T5 text encoder
If you find TriMotion useful in your research, please cite our paper:
@inproceedings{shin2026trimotion,
title={TriMotion: Modality-Agnostic Camera Control for Video Generation},
author={Shin, Seunghyun and Song, Jifei and Jeon, Wooseok and Jeon, Hae-Gon and Deng, Jiankang},
booktitle={European Conference on Computer Vision (ECCV)},
year={2026}
}