
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models

Rick Akkerman1 2* Haiwen Feng1* Michael J. Black1 Dimitrios Tzionas2 Victoria Fernández Abrevaya1
1 Max Planck Institute for Intelligent Systems    2 University of Amsterdam
ArXiv Coming soon
Teaser: Given an image and a driving motion in the form of a mask sequence, InterDyn generates scene dynamics that are physically plausible without 3D reconstruction or physics simulation. We investigate the generated interactive dynamics both in simulated settings and in in-the-wild human-object interaction (HOI) scenarios.

Summary

We address the task of synthesizing human-object interactions in image space. Given an input image and a motion cue (e.g., a sequence of hand masks), InterDyn generates plausible videos that realistically capture human motion and its interaction with objects in the scene. This work demonstrates how foundation models can serve as neural physics engines, generating realistic scene dynamics without the need for explicit physical simulation.
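To make the interface concrete, the following is a minimal sketch of how such a conditioning call could look. All names here (`generate_interaction_video`, `model.sample`) are hypothetical placeholders for illustration, not the released InterDyn code.

```python
# Hypothetical sketch of the conditioning interface: an input image plus a
# per-frame driving mask sequence are fed to a video diffusion model.
# `model.sample` is a placeholder, not the actual InterDyn API.
import torch

def generate_interaction_video(model, image, driving_masks, num_steps=25):
    """image:         (3, H, W) RGB input frame
    driving_masks: (T, 1, H, W) binary masks encoding the driving motion
    returns:       (T, 3, H, W) generated video frames
    """
    T = driving_masks.shape[0]
    # Repeat the input image along time and stack it with the mask channel,
    # so every frame sees both the static scene and its motion cue.
    image_seq = image.unsqueeze(0).expand(T, -1, -1, -1)       # (T, 3, H, W)
    condition = torch.cat([image_seq, driving_masks], dim=1)   # (T, 4, H, W)
    return model.sample(condition, num_inference_steps=num_steps)
```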

Object collision events

We start our investigation by probing the ability of Stable Video Diffusion (SVD) to generate interactive dynamics on the synthetic CLEVRER dataset. We examine whether the model can produce plausible movements for the uncontrolled objects in the scene, given the motion of objects entering it; for instance, predicting how an object would move if struck by another.
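The probing setup can be sketched as an assumed preprocessing step (illustrative, not the authors' exact pipeline): only the entering, controlled object is turned into a driving mask video, while all remaining objects must be animated by the model.

```python
# Assumed preprocessing for the CLEVRER probe: keep only the controlled
# object's pixels as the driving signal; the other objects are left for
# the video model to animate.
import numpy as np

def build_driving_masks(instance_maps, controlled_id):
    """instance_maps: (T, H, W) integer segmentation maps, one id per object.
    Returns (T, 1, H, W) float32 masks of the controlled object only."""
    driving = (instance_maps == controlled_id).astype(np.float32)
    return driving[:, None]                      # add a channel axis

# Toy example with random instance maps standing in for real annotations:
toy_maps = np.random.randint(0, 4, size=(16, 128, 128))
masks = build_driving_masks(toy_maps, controlled_id=1)
print(masks.shape)                               # (16, 1, 128, 128)
```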

Single-object force propagation

Compared to ground truth physics-simulated renderings, InterDyn synthesizes dynamics that are physically plausible without explicit knowledge of the objects’ mass or stiffness.

Video comparison: Input · Ours · Ground Truth
Multi-object force propagation

InterDyn can also generate force propagation dynamics, where uncontrolled objects interact with each other.

Video comparison: Input · Ours · Ground Truth
Counterfactual Dynamics

InterDyn can generate realistic motions under counterfactual scenarios, where different interaction configurations lead to different causal effects—much like a physics simulator.
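As a rough illustration (reusing the hypothetical helper sketched above), the counterfactual setting amounts to pairing one and the same input frame with different driving mask sequences and comparing the resulting futures. The toy driving signals below are placeholders, not real CLEVRER data.

```python
# Illustrative counterfactual setup: the same initial frame combined with
# different driving mask sequences should lead to different causal outcomes.
# `generate_interaction_video` and `model` are the placeholders from above.
import torch

def moving_square_masks(T=14, H=128, W=128, start_col=0, velocity=6):
    """Toy driving signal: a square that enters the frame and moves right."""
    masks = torch.zeros(T, 1, H, W)
    for t in range(T):
        c = start_col + velocity * t
        masks[t, 0, H // 2 - 8 : H // 2 + 8, c : c + 16] = 1.0
    return masks

image = torch.rand(3, 128, 128)                  # shared input frame (toy data)
scenarios = {
    "enters_slow": moving_square_masks(velocity=3),
    "enters_fast": moving_square_masks(velocity=9),
    "stays_still": moving_square_masks(velocity=0),
}
# futures = {name: generate_interaction_video(model, image, masks)
#            for name, masks in scenarios.items()}
```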

Video comparison: Input · Futures

Hand-Object Interaction

We continue by evaluating InterDyn in the more complex, real-world scenarios offered by the Something-Something-v2 dataset. Originally proposed for human action recognition and video understanding, this dataset provides 220,847 videos of humans performing basic actions with everyday objects.

Results
Video comparison: Input · Ours
Baseline comparisons

Sudhakar et al. recently proposed CosHand, a controllable image-to-image model based on Stable Diffusion that infers state transitions of an object. We compare InterDyn with two CosHand variants: a frame-by-frame approach and an auto-regressive approach.
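One plausible reading of the two rollout strategies is sketched below. The `predict_next` interface is an assumption standing in for a single-frame image-to-image predictor conditioned on hand masks; it is not CosHand's actual API.

```python
# Sketch of the two CosHand-style rollout strategies (assumed interfaces):
# `predict_next(frame, mask)` stands for one image-to-image prediction step.

def rollout_frame_by_frame(predict_next, first_frame, hand_masks):
    """Each output frame is predicted independently from the initial frame,
    so frames cannot drift but also share no generated temporal context."""
    return [predict_next(first_frame, mask) for mask in hand_masks]

def rollout_autoregressive(predict_next, first_frame, hand_masks):
    """Each output frame is predicted from the previously generated frame,
    carrying context forward at the risk of accumulating errors."""
    frames, current = [], first_frame
    for mask in hand_masks:
        current = predict_next(current, mask)
        frames.append(current)
    return frames

# InterDyn, by contrast, generates all frames jointly with a video
# diffusion model rather than chaining single-frame predictions.
```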

Video comparison: Input · Ours · CosHand (frame-by-frame) · CosHand (auto-regressive)
© 2024 Max-Planck-Gesellschaft · Imprint · Privacy Policy