ServiceNow Research

VIM: Variational Independent Modules for Video Prediction


We introduce VIM (Variational Independent Modules), a variational inference model for sequential data that learns and infers latent representations as a set of objects and discovers modular causal mechanisms over these objects. These mechanisms, which we call modules, are independently parametrized, define the stochastic transitions of entities, and are shared across entities. At each time step, our model infers, from a low-level input sequence, a high-level sequence of categorical latent variables that select which transition module to apply to which high-level object. We evaluate this model on video prediction tasks where the goal is to predict multi-modal future events given previous observations. We demonstrate empirically that VIM can model 2D visual sequences in an interpretable way and can identify the underlying dynamically instantiated mechanisms of the generation process. We additionally show that the learnt modules can be composed at test time to generalize to out-of-distribution observations.
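
The module-selection idea in the abstract can be illustrated with a toy sketch. This is not the VIM implementation: the additive Gaussian transitions, the fixed selection probabilities, and all names (`make_module`, `sample_categorical`, `step`) are assumptions made for illustration. In the model described above, the selection variables are inferred from the input sequence rather than given.

```python
import random

def make_module(shift, noise_std):
    """Return a stochastic transition: state -> state + shift + Gaussian noise.

    Hypothetical stand-in for an independently parametrized module.
    """
    def transition(state, rng):
        return [s + d + rng.gauss(0.0, noise_std) for s, d in zip(state, shift)]
    return transition

def sample_categorical(probs, rng):
    """Draw an index from a categorical distribution given its probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def step(objects, modules, selection_probs, rng):
    """Advance every object by one time step.

    For each object, a categorical variable picks one of the shared modules,
    and the chosen module is applied to that object's state.
    """
    choices, next_objects = [], []
    for state, probs in zip(objects, selection_probs):
        k = sample_categorical(probs, rng)
        choices.append(k)
        next_objects.append(modules[k](state, rng))
    return next_objects, choices

rng = random.Random(0)
# Two modules shared by all objects: "move right" and "move down" (assumed).
modules = [make_module((1.0, 0.0), 0.1), make_module((0.0, -1.0), 0.1)]
# Two objects, each represented by a 2D state.
objects = [[0.0, 0.0], [5.0, 5.0]]
# Fixed selection probabilities per object (inferred by the model in VIM).
selection_probs = [[0.9, 0.1], [0.2, 0.8]]
next_objects, choices = step(objects, modules, selection_probs, rng)
```

Because the modules are shared across objects and chosen per object at each step, the same small set of transitions can be recombined in ways not seen during training, which is the kind of composition the abstract refers to.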

Causal Learning and Reasoning (CLeaR)
Rim Assouel
Visiting Researcher at Low Data Learning located at Montreal, QC, Canada.

Yoshua Bengio
Research Advisor at Human Decision Support located at Montreal, QC, Canada.