Keywords: Program Synthesis, World Models, Large Language Models
Abstract: Estimating the world model that describes the dynamics of a real-world process is of immense importance for anticipating and preparing for future outcomes, and finds widespread use in applications such as video surveillance, robotics, and autonomous driving. The task entails synthesizing plausible visual futures given a few frames of a video, which set the visual context for the synthesis. Towards this end, we propose ProgGen, which performs video frame prediction by synthesizing computer programs that represent the dynamics of the video through a set of neuro-symbolic, human-interpretable states (one per frame), leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen uses an LLM/VLM to synthesize computer programs that: (i) estimate the states of the video given the visual context (i.e., the frames); (ii) predict the states at future time steps by estimating the transition dynamics; (iii) render the predicted states as RGB frames. Empirical evaluations show that our proposed method outperforms competing techniques at video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counterfactual reasoning and editability, attesting to its effectiveness and generalizability.
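The three-stage pipeline in the abstract can be sketched as follows. This is a minimal, hypothetical illustration only: in the actual method each stage's program is synthesized by an LLM/VLM, whereas here the stages (and all function names, the toy 1-D "frames", and the constant-velocity dynamics) are hand-written stand-ins invented for this example.

```python
# Hypothetical sketch of a ProgGen-style pipeline on a toy 1-D moving object.
# In the real method, the bodies of these functions would be LLM/VLM-synthesized
# programs; here they are stubbed by hand.

def estimate_states(frames):
    # Stage (i): map each frame to an interpretable state
    # (here: the object's x-position, taken as the brightest pixel's index).
    return [{"x": frame.index(max(frame))} for frame in frames]

def fit_transition(states):
    # Stage (ii): estimate transition dynamics from the observed states
    # (here: constant velocity from the last two states).
    v = states[-1]["x"] - states[-2]["x"]
    def step(state):
        return {"x": state["x"] + v}
    return step

def render(state, width):
    # Stage (iii): render a predicted state back into a frame
    # (here: a 1-D intensity row with a single bright pixel).
    return [255 if i == state["x"] else 0 for i in range(width)]

def predict_future_frames(frames, horizon):
    # Full pipeline: frames -> states -> transition model -> future frames.
    states = estimate_states(frames)
    step = fit_transition(states)
    state, width = states[-1], len(frames[0])
    out = []
    for _ in range(horizon):
        state = step(state)
        out.append(render(state, width))
    return out
```

For instance, given two frames whose bright pixel sits at positions 1 and 2, the sketch infers a velocity of +1 and renders future frames with the pixel at positions 3, 4, and so on.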
Submission Number: 47