Ctrl-VI: Controllable Video Synthesis via Variational Inference

Haoyi Duan¹*

Yunzhi Zhang¹*

Yilun Du²

Jiajun Wu¹

¹Stanford University

²Harvard University

*Equal Contribution

arXiv

Code (Coming Soon)

We develop a video synthesis framework to support a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts.

Video Synthesis

The framework supports population-level sampling through variational inference (see Method). Given a set of user inputs (text, images, camera poses, and simulation trajectories for objects of interest), it yields a population of samples; four are shown for each example below.

"A playful brown-and-white dog crouches excitedly on a patterned rug in a cozy living room, watching a shiny red balloon bounce gently up and down on the carpet. Camera moves right."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1–4

"A majestic seagull strides through a shallow puddle as other gulls mill about in the blurred background. A delicate transparent glass sphere falls from above, striking the water and creating a crisp splash of droplets. Camera moves left."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1–4

"A close-up of a vibrant red flower slowly blooming, its petals unfolding gracefully in the sunlight. A honeybee hovers delicately, buzzing around the petals. Camera moves right."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1–4

"A playful puppy energetically jumps up and down beside a stone fountain. Crystal-clear streams of water arc gracefully from the fountain into the basin below, sparkling in the bright daylight. Camera orbits right."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1–4

"A man sits calmly in front of a softly crackling campfire beside his camper van, bathed in the warm evening glow. Suddenly, the flames roar to life—flaring high with a burst of sparks and smoke. The man's eyes widen in shock as he starts to be panicked. Camera moves left."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1–4

"A playful dog romps around on a cozy couch in a softly lit living room. The light dims-curtains slowly glide closed. Camera is fixed."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Particles 1–4

Supporting object simulation trajectories as input controls allows precise control over the output content. The examples below show outputs for different simulation trajectory inputs.

Panels: Input Text, Input Image, Input Camera Poses

"A playful dog romps around on a cozy couch in a softly lit living room. The light dims-curtains slowly glide closed. Camera is fixed."

Panels: Input Text, Input Image, Input Camera Poses

"In a dimly lit billiards hall, a focused woman leans in with precision, her eyes locked on the cue ball. She is using the pool cue to hit the white ball and the white ball is then bouncing the black ball. Camera moves right."

Panels: Input Text, Input Image, Input Camera Poses

"A playful brown-and-white dog crouches excitedly on a patterned rug in a cozy living room, watching a shiny red balloon bounce gently up and down on the carpet. Camera moves right."

Panels: Input Text, Input Image, Input Camera Poses

Comparisons

Below are comparisons with baselines, including monolithic models (video generation models and novel view synthesis models) and compositional methods. Monolithic models typically do not account for all user inputs, while existing compositional methods tend to produce inconsistent scenes.

"A close-up shot of a delicate white porcelain cup filled with creamy latte art, cradled gently in both hands. A perfectly clear, photorealistic ice sphere drops from above, striking the surface with a crisp splash. Camera spirals."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video

"A playful dog romps around on a cozy couch in a softly lit living room. The light dims-curtains slowly glide closed. Camera moves right."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video

"A playful brown-and-white dog crouches excitedly on a patterned rug in a cozy living room, watching a shiny red balloon bounce gently up and down on the carpet. Camera moves right."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video

"An energetic golden retriever bounds across a lush green tennis court, chasing after a bright yellow tennis ball with focused determination. The ball bounces high as the dog leaps to catch it. Camera moves right."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video

"A sleek black cat crouches low on a wooden floor, eyes locked on a small red ball rolling towards it. The cat pounces with lightning speed, batting the ball with its paw. Camera pulls out."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Ours, Image-to-Video

Extended Applications

The same technique extends to other applications, such as generating videos longer than the standard output length of existing diffusion- or flow-based video models.

"A playful dog romps around on a cozy couch in a softly lit living room. The light dims-curtains slowly glide closed and reopen. Camera is fixed."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Generated Long Video

"A close-up of a vibrant red flower slowly blooming, its petals unfolding gracefully in the sunlight. A honeybee hovers delicately, buzzing around the petals. Camera is fixed and then spirals."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Generated Long Video

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Generated Long Video

"A playful brown-and-white dog crouches excitedly on a patterned rug in a cozy living room, watching a shiny red balloon bounce gently up and down on the carpet. Camera moves right and then left."

Panels: Input Text, Input Image, Input Camera Poses, Input Simulation, Generated Long Video

Another application is inserting objects into videos, as shown below.

Panels (four examples): Input Video, Edited Video

Method

The task is cast as variational inference that approximates a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the optimization challenge, the problem is broken down into step-wise KL divergence minimization over an annealed sequence of distributions. A conditional factorization technique with 3D-aware conditionals is further proposed; it reduces the number of modes in the solution space to circumvent local optima, improving scene consistency compared to baselines. A toy illustration is shown here.
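The idea of step-wise KL minimization over an annealed sequence can be sketched on a one-dimensional toy problem. The code below is an illustrative assumption, not the authors' implementation and not the toy figure from this page: two Gaussian "experts" stand in for the video backbones, their product stands in for the composed distribution, the annealed targets are the product raised to a power beta (a tempering schedule chosen here for convenience), and the variational family is a single Gaussian fitted by gradient descent on the closed-form KL. All function names and hyperparameters are hypothetical.

```python
import math


def product_of_gaussians(m1, s1, m2, s2):
    """Mean and variance of the normalized product N(m1, s1^2) * N(m2, s2^2)."""
    precision = 1.0 / s1**2 + 1.0 / s2**2
    mean = (m1 / s1**2 + m2 / s2**2) / precision
    return mean, 1.0 / precision


def kl_gaussians(mu, var, m, v):
    """Closed-form KL( N(mu, var) || N(m, v) )."""
    return 0.5 * (math.log(v / var) + (var + (mu - m) ** 2) / v - 1.0)


def annealed_kl_fit(target_mean, target_var, betas, steps=200, lr=0.1):
    """Fit q = N(mu, exp(2 * rho)) to a sequence of tempered targets.

    For each beta, the target is the composed Gaussian raised to the power
    beta, i.e. N(target_mean, target_var / beta). Each level minimizes
    KL(q || target) by gradient descent, warm-started from the previous level.
    """
    mu, rho = 0.0, 0.0  # arbitrary initialization (assumption)
    history = []
    for beta in betas:
        v = target_var / beta  # variance of the tempered target
        for _ in range(steps):
            var = math.exp(2.0 * rho)
            mu -= lr * (mu - target_mean) / v  # dKL/dmu
            rho -= lr * (var / v - 1.0)        # dKL/drho
        history.append((beta, mu, math.exp(2.0 * rho)))
    return mu, math.exp(2.0 * rho), history


if __name__ == "__main__":
    # Two hypothetical experts: N(0, 1) and N(2, 1).
    target_mean, target_var = product_of_gaussians(0.0, 1.0, 2.0, 1.0)
    betas = [0.1, 0.325, 0.55, 0.775, 1.0]  # flat-to-sharp schedule (assumption)
    mu, var, history = annealed_kl_fit(target_mean, target_var, betas)
    for beta, level_mu, level_var in history:
        print(f"beta={beta:.3f}  mu={level_mu:.4f}  var={level_var:.4f}")
    print("final KL to composed target:", kl_gaussians(mu, var, target_mean, target_var))
```

In this Gaussian toy the composed target has mean 1.0 and variance 0.5, and the objective has a single optimum, so annealing is not strictly needed; it only shows the mechanics of solving a sequence of easier KL problems with warm starts. The multimodal setting that motivates the conditional factorization described above is not modeled here.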

BibTeX

@article{duan2025controllable,
    title={Controllable Video Synthesis via Variational Inference},
    author={Duan, Haoyi and Zhang, Yunzhi and Du, Yilun and Wu, Jiajun},
    journal={arXiv preprint arXiv:2510.07670},
    year={2025}
}