4D-controllable Video Diffusion.
Given a conditioning input video, our diffusion model generates new videos under 4D control, specified by world time and camera trajectory. These two signals are injected into the Diffusion Transformer (DiT) through complementary modulation pathways.
Time control is provided through a time-aware positional encoding, RoPE_t, injected into the attention layers, together with an adaptive LayerNorm whose MLP_t predicts the affine scale and shift parameters used to modulate intermediate activations.
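For concreteness, the following is a minimal PyTorch-style sketch of the time-conditioned adaptive LayerNorm pathway; the module names, dimensions, and the two-layer form of MLP_t are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class AdaLNTime(nn.Module):
    """Adaptive LayerNorm modulated by a world-time embedding (illustrative sketch)."""
    def __init__(self, hidden_dim: int, time_emb_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # MLP_t: maps the time embedding to per-channel affine scale and shift
        self.mlp_t = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_emb_dim, 2 * hidden_dim),
        )

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (B, N, hidden_dim) token activations; t_emb: (B, time_emb_dim)
        scale, shift = self.mlp_t(t_emb).chunk(2, dim=-1)
        # Modulate the normalized activations with the predicted affine parameters
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

if __name__ == "__main__":
    block = AdaLNTime(hidden_dim=64, time_emb_dim=32)
    x = torch.randn(2, 16, 64)       # 16 tokens per sample
    t_emb = torch.randn(2, 32)       # world-time embedding
    print(block(x, t_emb).shape)     # torch.Size([2, 16, 64])
```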
Camera control is introduced analogously through a camera-aware positional encoding, RoPE_c, and an accompanying modulation MLP. The outputs of RoPE_t and RoPE_c are fused into a unified 4D positional encoding that conditions the attention layers.
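One simple way to realize such a fusion is to devote part of the rotary channels to the time index and the rest to the camera index, as in the sketch below; the half-and-half channel split and the use of a scalar camera-pose index are assumptions made for illustration, not the exact fusion rule of the model.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angle table: (N,) positions -> (N, dim // 2) angles."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x (..., N, dim) by the given (N, dim // 2) angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def fused_4d_rope(q: torch.Tensor, t_idx: torch.Tensor, cam_idx: torch.Tensor) -> torch.Tensor:
    """Apply RoPE_t to the first half of head channels and RoPE_c to the second half."""
    d = q.shape[-1]
    d_t = d // 2
    q_t = apply_rope(q[..., :d_t], rope_angles(t_idx, d_t))    # time-aware part
    q_c = apply_rope(q[..., d_t:], rope_angles(cam_idx, d - d_t))  # camera-aware part
    return torch.cat([q_t, q_c], dim=-1)

if __name__ == "__main__":
    q = torch.randn(2, 8, 16, 64)                  # (batch, heads, tokens, head_dim)
    t_idx = torch.arange(16)                       # world-time index per token
    cam_idx = torch.arange(16)                     # camera-pose index per token
    print(fused_4d_rope(q, t_idx, cam_idx).shape)  # torch.Size([2, 8, 16, 64])
```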
Together, these mechanisms form a 4D-controllable DiT block capable of jointly steering
temporal evolution and camera motion during generation. We train our model on a curated,
fully controllable 4D synthetic dataset in which time and camera factors vary independently,
providing explicit supervision for disentangling the two controls.
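The sketch below shows, under the same caveats, how the two pathways could be combined in a single 4D-controllable DiT block: one modulation MLP predicts scale, shift, and gate parameters for both sub-layers from a joint time-and-camera conditioning vector. The fused RoPE inside attention is omitted here for brevity, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class FourDControlDiTBlock(nn.Module):
    """A DiT-style block modulated by a fused time + camera conditioning vector (sketch)."""
    def __init__(self, hidden_dim: int, cond_dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )
        # Predicts scale/shift/gate for both sub-layers from the joint conditioning
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * hidden_dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, hidden_dim); cond: (B, cond_dim) fused time + camera embedding
        s1, b1, g1, s2, b2, g2 = self.modulation(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

if __name__ == "__main__":
    block = FourDControlDiTBlock(hidden_dim=64, cond_dim=128, num_heads=4)
    x = torch.randn(2, 16, 64)
    cond = torch.randn(2, 128)      # concatenated time and camera embeddings
    print(block(x, cond).shape)     # torch.Size([2, 16, 64])
```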