BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Yiming Wang1,2, Qihang Zhang3*, Shengqu Cai2*, Tong Wu2†, Jan Ackermann2†,
Zhengfei Kuang2†, Yang Zheng2†, Frano Rajič1†, Siyu Tang1, Gordon Wetzstein2

1ETH Zurich    2Stanford University    3CUHK

*, † Equal contribution

TL;DR Time- and camera-controlled 4D video generation. Given a single input video where camera motion is entangled with linear world time, our method synthesizes new videos that enable decoupled control over world time and camera pose.
Our method enables bullet-time effects (i.e., freely moving the camera while freezing or slowing scene dynamics) for a wide range of real-world videos.

Method Overview

Overview of our 4D-controllable Diffusion Transformer.

4D-controllable Video Diffusion. Given a conditional input video, our diffusion model generates new videos under 4D control, specified by a world-time signal and a camera trajectory. These two signals are injected into the Diffusion Transformer through complementary modulation pathways. Time control is provided by a time-aware positional encoding RoPEt, injected into the attention layers, together with an adaptive LayerNorm whose MLPt predicts affine scale and shift parameters that modulate intermediate activations. Camera control is introduced analogously through a camera-aware positional encoding RoPEc and an accompanying modulation MLP. The outputs of RoPEt and RoPEc are fused into a unified 4D positional encoding that conditions the attention layers. Together, these mechanisms form a 4D-controllable DiT block that jointly steers temporal evolution and camera motion during generation. We train the model on a curated, fully controllable 4D synthetic dataset in which time and camera factors vary independently, providing explicit supervision for disentangling the two controls.
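The sketch below illustrates how these two modulation pathways might be wired inside a single DiT block, assuming a PyTorch implementation. All module and argument names (TimeCameraDiTBlock, rope_rotate, mlp_t, mlp_c, the per-token world-time and flattened camera-pose inputs, and the reduction of the camera pose to a scalar position for the rotary encoding) are illustrative assumptions, not our released code; only the high-level structure, fusing RoPEt and RoPEc into one positional encoding and driving the adaptive LayerNorm from both signals, follows the overview above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope_rotate(x, pos, base=10000.0):
    """Apply a 1D rotary positional encoding to x given per-token positions.
    x: (B, N, D) with D even; pos: (B, N) scalar positions (e.g. world time)."""
    d = x.shape[-1] // 2
    freqs = base ** (-torch.arange(d, device=x.device, dtype=x.dtype) / d)  # (d,)
    angles = pos.unsqueeze(-1) * freqs                                      # (B, N, d)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :d], x[..., d:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class TimeCameraDiTBlock(nn.Module):
    """One DiT block modulated by decoupled world-time and camera-pose signals."""

    def __init__(self, dim, heads, cam_dim=12):
        super().__init__()
        self.heads = heads
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # MLP_t / MLP_c: predict per-token scale and shift from each control signal.
        self.mlp_t = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, 2 * dim))
        self.mlp_c = nn.Sequential(nn.Linear(cam_dim, dim), nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, x, world_time, camera):
        # x: (B, N, dim) video tokens; world_time: (B, N) world time per token;
        # camera: (B, N, cam_dim) flattened camera pose per token (e.g. a 3x4 extrinsic).
        B, N, dim = x.shape
        scale_t, shift_t = self.mlp_t(world_time.unsqueeze(-1)).chunk(2, dim=-1)
        scale_c, shift_c = self.mlp_c(camera).chunk(2, dim=-1)

        # Adaptive LayerNorm: modulate the normalized activations with both controls.
        h = self.norm1(x) * (1 + scale_t + scale_c) + shift_t + shift_c

        # Fused 4D positional encoding: half of the channels carry RoPE_t (world time),
        # the other half RoPE_c (here a scalar summary of the camera pose, for brevity).
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        cam_pos = camera.mean(dim=-1)
        half = dim // 2

        def fuse(z):
            return torch.cat([rope_rotate(z[..., :half], world_time),
                              rope_rotate(z[..., half:], cam_pos)], dim=-1)

        q, k = fuse(q), fuse(k)

        # Multi-head attention over the modulated, position-encoded tokens.
        q, k, v = (z.view(B, N, self.heads, -1).transpose(1, 2) for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.proj(attn.transpose(1, 2).reshape(B, N, dim))

        # Standard DiT feed-forward after the second norm.
        return x + self.ffn(self.norm2(x))


# Example: 2 videos, 16 tokens each, with independent time and camera controls.
block = TimeCameraDiTBlock(dim=64, heads=4)
x = torch.randn(2, 16, 64)
t = torch.linspace(0, 1, 16).expand(2, 16)   # slowed world time...
cam = torch.randn(2, 16, 12)                 # ...while the camera moves freely
out = block(x, t, cam)                       # (2, 16, 64)
```

The point of routing each signal through both pathways, as described in the overview, is that RoPE acts at the attention level on relative positions while the adaptive LayerNorm acts per token at the channel level; the two give the block complementary handles on time and camera, which the independently varying synthetic training data then encourages it to keep disentangled.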

🎥 Results

Diversity of Camera Motions

Input Videos from ReCamMaster

Diversity of Time Control

Input videos from ReCamMaster and Wan2.2.

BibTeX

TODO