InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
We address the task of synthesizing human-object interactions in image space. Given an input image and a motion cue (e.g., a sequence of hand masks), InterDyn generates plausible videos that realistically capture human motion and its interaction with objects in the scene. This work demonstrates how foundation models can serve as neural physics engines, generating realistic scene dynamics without the need for explicit physical simulation.
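To make the interface concrete, the sketch below shows how such a model might be invoked: one conditioning image plus a per-frame motion cue (here, binary hand masks) drives the generation of a video. This is a minimal illustration only; the class and method names (`InterDynModel`, `generate`) and the tensor shapes are assumptions, not the released InterDyn API.

```python
import torch

# Hypothetical interface sketch -- NOT the official InterDyn code.
# One conditioning image plus a sequence of T hand masks controls a T-frame video.
class InterDynModel(torch.nn.Module):
    """Placeholder wrapper around a video diffusion backbone (e.g., SVD)."""

    def generate(self, image: torch.Tensor, hand_masks: torch.Tensor) -> torch.Tensor:
        # image:      (3, H, W)    -- the initial scene
        # hand_masks: (T, 1, H, W) -- motion cue: hand location in each target frame
        # returns:    (T, 3, H, W) -- synthesized interaction video
        T = hand_masks.shape[0]
        # A real model would run conditional diffusion sampling here;
        # this stub just repeats the input frame so the sketch runs end to end.
        return image.unsqueeze(0).repeat(T, 1, 1, 1)

model = InterDynModel()
image = torch.rand(3, 256, 256)                        # input image
masks = (torch.rand(14, 1, 256, 256) > 0.5).float()    # 14 hand masks as the motion cue
video = model.generate(image, masks)                   # (14, 3, 256, 256)
print(video.shape)
```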
Object collision events
We start our investigation by probing the ability of SVD to generate interactive dynamics on the synthetic CLEVRER dataset. We examine whether the model can predict the motion of uncontrolled objects in the scene given the motion of objects entering it, for instance, how an object would move when struck by another.
Compared to ground truth physics-simulated renderings, InterDyn synthesizes dynamics that are physically plausible without explicit knowledge of the objects’ mass or stiffness.
InterDyn can also generate force propagation dynamics, where uncontrolled objects interact with each other.
InterDyn can generate realistic motions under counterfactual scenarios, where different interaction configurations lead to different causal effects—much like a physics simulator.
Hand-Object Interaction
We continue by evaluating InterDyn in the more complex, real-world scenarios offered by the Something-Something-v2 dataset. Originally proposed for human action recognition and video understanding, this dataset provides 220,847 videos of humans performing basic actions with everyday objects.
Sudhakar et al. recently proposed CosHand, a controllable image-to-image model based on Stable Diffusion that infers state transitions of an object. We compare InterDyn with two CosHand variants: a frame-by-frame approach and an auto-regressive approach.
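To make the two baselines concrete, the sketch below contrasts one reasonable reading of them: the frame-by-frame variant conditions every prediction on the original first frame, while the auto-regressive variant feeds each prediction back in as the conditioning frame for the next step. The function `predict_next` is a hypothetical stand-in for a single CosHand-style image-to-image step (conditioned on a frame and the next hand mask); it is not the authors' actual code, and the shapes are assumptions.

```python
import torch

def predict_next(frame: torch.Tensor, hand_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for one CosHand-style image-to-image step:
    given a conditioning frame and the next hand mask, predict the next frame."""
    # Placeholder so the sketch runs; a real model would do diffusion sampling here.
    return frame * (1 - hand_mask) + hand_mask

def rollout_frame_by_frame(first_frame, hand_masks):
    # Every prediction is conditioned on the *original* first frame,
    # so errors do not accumulate but long-range object state can drift.
    return [predict_next(first_frame, m) for m in hand_masks]

def rollout_autoregressive(first_frame, hand_masks):
    # Each prediction becomes the conditioning frame for the next step,
    # so object state propagates but errors can compound over time.
    frames, current = [], first_frame
    for m in hand_masks:
        current = predict_next(current, m)
        frames.append(current)
    return frames

first = torch.rand(3, 256, 256)
masks = (torch.rand(14, 1, 256, 256) > 0.5).float()
video_fbf = rollout_frame_by_frame(first, masks)   # 14 frames, each from the first frame
video_ar = rollout_autoregressive(first, masks)    # 14 frames, chained predictions
```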