We develop a video synthesis framework that supports a mixture of user controls of varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts.
The framework supports population-level sampling through variational inference (see Method). Given a set of user inputs (text, images, camera poses, and simulation trajectories for objects of interest), running the framework yields a population of samples (4 shown below).
"A playful brown-and-white dog crouches excitedly on a patterned rug in a cozy living room, watching a shiny red balloon bounce gently up and down on the carpet. Camera moves right."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1-4]
"A majestic seagull strides through a shallow puddle as other gulls mill about in the blurred background. A delicate transparent glass sphere falls from above, striking the water and creating a crisp splash of droplets. Camera moves left."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1-4]
"A close-up of a vibrant red flower slowly blooming, its petals unfolding gracefully in the sunlight. A honeybee hovers delicately, buzzing around the petals. Camera moves right."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1-4]
"A playful puppy energetically jumps up and down beside a stone fountain. Crystal-clear streams of water arc gracefully from the fountain into the basin below, sparkling in the bright daylight. Camera orbits right."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1-4]
"A man sits calmly in front of a softly crackling campfire beside his camper van, bathed in the warm evening glow. Suddenly, the flames roar to life—flaring high with a burst of sparks and smoke. The man's eyes widen in shock as he starts to be panicked. Camera moves left."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1-4]
"A playful dog romps around on a cozy couch in a softly lit living room. The light dims-curtains slowly glide closed. Camera is fixed."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1-4]
Accepting object simulation trajectories as input controls allows precise control over the output content. Each example below can be viewed with different simulation trajectory inputs.
"A playful puppy energetically jumps up and down beside a stone fountain. Crystal-clear streams of water arc gracefully from the fountain into the basin below, sparkling in the bright daylight. Camera orbits right."
[Video panels: Input Text, Input Image, Input Camera Poses]
"A playful dog romps around on a cozy couch in a softly lit living room. The light dims-curtains slowly glide closed. Camera is fixed."
[Video panels: Input Text, Input Image, Input Camera Poses]
"In a dimly lit billiards hall, a focused woman leans in with precision, her eyes locked on the cue ball. She is using the pool cue to hit the white ball and the white ball is then bouncing the black ball. Camera moves right."
[Video panels: Input Text, Input Image, Input Camera Poses]
"A playful brown-and-white dog crouches excitedly on a patterned rug in a cozy living room, watching a shiny red balloon bounce gently up and down on the carpet. Camera moves right."
[Video panels: Input Text, Input Image, Input Camera Poses]
Below are comparisons with baselines, including monolithic models (video generation models and novel view synthesis models) and compositional methods. Monolithic models typically do not account for all user inputs, while existing compositional methods tend to produce inconsistent scenes.
"A close-up shot of a delicate white porcelain cup filled with creamy latte art, cradled gently in both hands. A perfectly clear, photorealistic ice sphere drops from above, striking the surface with a crisp splash. Camera spirals."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video]
"In a dimly lit billiards hall, a focused woman leans in with precision, her eyes locked on the cue ball. She is using the pool cue to hit the white ball and the white ball is then bouncing the black ball. Camera moves right."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video]
"A playful dog romps around on a cozy couch in a softly lit living room. The light dims-curtains slowly glide closed. Camera moves right."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video]
"A playful brown-and-white dog crouches excitedly on a patterned rug in a cozy living room, watching a shiny red balloon bounce gently up and down on the carpet. Camera moves right."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video]
"A playful puppy energetically jumps up and down beside a stone fountain. Crystal-clear streams of water arc gracefully from the fountain into the basin below, sparkling in the bright daylight. Camera orbits right."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video]
"An energetic golden retriever bounds across a lush green tennis court, chasing after a bright yellow tennis ball with focused determination. The ball bounces high as the dog leaps to catch it. Camera moves right."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video]
"A sleek black cat crouches low on a wooden floor, eyes locked on a small red ball rolling towards it. The cat pounces with lightning speed, batting the ball with its paw. Camera moves left."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video]
The same technique can be extended to other applications, such as generating videos longer than the standard output length of existing diffusion/flow-based video models.
"A playful dog romps around on a cozy couch in a softly lit living room. The light dims-curtains slowly glide closed and reopen. Camera is fixed."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Generated Long Video]
"A close-up of a vibrant red flower slowly blooming, its petals unfolding gracefully in the sunlight. A honeybee hovers delicately, buzzing around the petals. Camera is fixed and then spirals."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Generated Long Video]
"A playful puppy energetically jumps up and down beside a stone fountain. Crystal-clear streams of water arc gracefully from the fountain into the basin below, sparkling in the bright daylight. Camera orbits right and then orbits left."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Generated Long Video]
"A playful brown-and-white dog crouches excitedly on a patterned rug in a cozy living room, watching a shiny red balloon bounce gently up and down on the carpet. Camera moves right and then left."
[Video panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Generated Long Video]
Another application is to insert objects into videos as shown below.
[Video panels: four pairs of Input Video and Edited Video]
The task is cast as variational inference that approximates a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, we break the problem down into step-wise KL divergence minimization over an annealed sequence of distributions. We further propose a conditional factorization technique with 3D-aware conditionals that reduces the number of modes in the solution space to circumvent local optima, improving scene consistency compared to baselines. A toy illustration is shown here.
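Separately from the toy illustration on this page, the step-wise scheme can be sketched numerically in one dimension. The following is a minimal sketch under strong assumptions, not the authors' implementation: two hypothetical Gaussian "experts" (`EXPERTS`) stand in for the video backbones, the composed distribution is taken to be their product, the annealed sequence is assumed to be a geometric interpolation from a broad Gaussian `PRIOR`, and the variational family is a single Gaussian fitted by plain gradient descent on the closed-form KL divergence. All names and constants are illustrative, and the conditional factorization with 3D-aware conditionals is not modeled.

```python
import math

# Hypothetical 1D "experts" standing in for the video backbones: (mean, std).
EXPERTS = [(-1.0, 1.0), (2.0, 0.5)]
# Broad starting distribution p_0 (an assumption of this sketch): (mean, std).
PRIOR = (0.0, 3.0)


def annealed_target(beta):
    """Mean and precision of p_beta, proportional to p_0^(1-beta) * (prod_i p_i)^beta.

    Every factor is Gaussian, so p_beta is Gaussian as well.
    """
    prec = (1.0 - beta) / PRIOR[1] ** 2
    prec_mean = (1.0 - beta) * PRIOR[0] / PRIOR[1] ** 2
    for mu, sigma in EXPERTS:
        prec += beta / sigma ** 2
        prec_mean += beta * mu / sigma ** 2
    return prec_mean / prec, prec


def kl_to_target(m, log_s, mu, prec):
    """Closed-form KL(N(m, s^2) || N(mu, 1/prec)) with s = exp(log_s)."""
    var_ratio = prec * math.exp(2.0 * log_s)
    return 0.5 * (var_ratio + prec * (m - mu) ** 2 - 1.0 - math.log(var_ratio))


def fit(num_levels=20, steps_per_level=50, lr=0.1):
    """Minimize the KL level by level, warm-starting each level from the last."""
    m, log_s = PRIOR[0], math.log(PRIOR[1])  # q starts equal to p_0
    for level in range(num_levels):
        beta = level / (num_levels - 1)  # anneal beta from 0 to 1
        mu, prec = annealed_target(beta)
        for _ in range(steps_per_level):
            grad_m = prec * (m - mu)                         # dKL/dm
            grad_log_s = prec * math.exp(2.0 * log_s) - 1.0  # dKL/dlog_s
            m -= lr * grad_m
            log_s -= lr * grad_log_s
    return m, math.exp(log_s), kl_to_target(m, log_s, mu, prec)


if __name__ == "__main__":
    m, s, kl = fit()
    print(f"q = N({m:.3f}, {s:.3f}^2), final KL = {kl:.2e}")
```

In this Gaussian setting the composed target has a closed form (the precision-weighted combination of the experts), so the fitted mean and standard deviation can be checked directly against it. The real problem is high-dimensional and multimodal, which is where annealing and the reduction of modes described above matter; this sketch only shows the mechanics of warm-started, level-by-level KL minimization.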
@article{duan2025controllable,
  title={Controllable Video Synthesis via Variational Inference},
  author={Duan, Haoyi and Zhang, Yunzhi and Du, Yilun and Wu, Jiajun},
  journal={arXiv preprint arXiv:2510.07670},
  year={2025}
}