Deep Learning has made historic progress in recent years, producing systems that rival — and in some cases exceed — human performance on tasks such as recognizing objects in still images. Despite this progress, enabling computers to understand both the spatial and temporal aspects of video remains an unsolved problem. The reason is sheer complexity: while a photo captures a single static moment, a video depicts a narrative unfolding over time. Video is also time-consuming to annotate manually and computationally expensive to store and process.
The main obstacle preventing neural networks from reasoning more deeply about complex scenes and situations is their lack of common sense knowledge about the physical world. Video data contains a wealth of fine-grained information about the world, since it shows how objects behave by virtue of their properties. For example, videos implicitly encode physical concepts such as three-dimensional geometry, material properties, object permanence, affordances, and gravity. While we humans grasp these concepts intuitively, a detailed understanding of the physical world is still largely missing from current applications in artificial intelligence (AI) and robotics.