Game engines are powerful tools in computer graphics, but their power comes at the immense cost of their development. In this work, we present a framework to train game-engine-like neural models solely from monocular annotated videos. The result, a Learnable Game Engine (LGE), maintains the state of the scene and of the objects and agents in it, and enables rendering the environment from a controllable viewpoint. Similarly to a game engine, it models the logic of the game and the underlying rules of physics, making it possible for a user to play the game by specifying both high- and low-level action sequences. Most captivatingly, our LGE unlocks the director's mode, where the game is played by plotting behind the scenes: specifying high-level actions and goals for the agents in the form of language and desired states. This requires learning "game AI", encapsulated by our animation model, to navigate the scene using high-level constraints, play against an adversary, and devise a strategy to win a point. The key to learning such game AI is a large and diverse text corpus, collected in this work, that describes detailed actions in a game and is used to train our animation model. To render the resulting state of the environment and its agents, our synthesis model uses a compositional NeRF representation. To foster future research, we present newly collected, annotated, and calibrated large-scale Tennis and Minecraft datasets. Our method significantly outperforms existing neural video game simulators in terms of rendering quality, and our LGEs unlock applications beyond the capabilities of the current state of the art. Our framework, data, and models are publicly available.
We propose Learnable Game Engines (LGEs), a framework for learning games from videos. Like a game engine, our learned model can synthesize scenes with explicit control over the pose of articulated objects, their position, the camera, and the style.
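As a sketch of what these controls could look like programmatically, the snippet below exposes each controllable quantity as an explicit input to the renderer. All names (`AgentState`, `Camera`, `SynthesisModel`) are hypothetical illustrations, not our released API.

```python
# Minimal sketch of the controllable synthesis interface described above.
# All names here are illustrative assumptions, not the released API.
from dataclasses import dataclass
from typing import List

@dataclass
class AgentState:
    position: List[float]          # 3D location of the agent in the scene
    pose: List[float]              # articulated joint rotations of the agent
    style: int = 0                 # index of a learned appearance code

@dataclass
class Camera:
    extrinsics: List[List[float]]  # 3x4 world-to-camera matrix
    focal_length: float = 50.0

class SynthesisModel:
    def render(self, agents: List[AgentState], camera: Camera):
        """Pose, position, camera and style are explicit inputs, so each
        can be edited independently before rendering a frame."""
        raise NotImplementedError   # stands in for the actual renderer
```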
Similarly to a game engine, our model learns a representation of physics and of the game logic. A player can control the generated sequence by issuing actions. To unlock a high degree of expressiveness and enable fine-grained action control, we use text actions.
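To make this concrete, here is a minimal sketch of playing by issuing one text action per agent and frame; `LearnedGameModel` and its methods are assumptions for illustration, not the actual interface.

```python
# Illustrative only: driving a learned game model with per-agent text
# actions. The class and its methods are hypothetical placeholders.
from typing import Dict

class LearnedGameModel:
    def initial_state(self) -> dict:
        return {"frame": 0}                      # placeholder scene state

    def step(self, state: dict, actions: Dict[str, str]) -> dict:
        # The animation model would predict the next scene state from the
        # current state and a fine-grained text action for each agent.
        return {"frame": state["frame"] + 1}

engine = LearnedGameModel()
state = engine.initial_state()
for _ in range(4):
    state = engine.step(state, {
        "bottom_player": "moves to the right and hits the ball with a backhand",
        "top_player": "runs towards the net",
    })
```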
This knowledge of physics and game logic enables our model to perform complex reasoning about the scene and unlocks the "director's mode", in which the user specifies high-level constraints or objectives and the model generates complex action sequences that satisfy them.
As a simple example, our model can be given an initial and a final state and generate the entire trajectory in between:
If a further conditioning action is given in the middle of the video, the model generates a completion that also satisfies the additional constraint, changing the path of the player accordingly:
More conditioning actions can be inserted at different times to specify multiple waypoints:
The model can also be asked to perform complex reasoning, such as deciding which actions to take to win a point. To do so, the user can condition the top player with the instruction that it should "not catch the ball" at the end of the sequence:
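One way to read all of the examples above is as sparse conditioning over a sequence: the user pins down a few states or text actions at chosen timesteps, and the model fills in everything left unconstrained. The sketch below encodes that pattern; the data layout is an assumption for illustration, not our actual input format.

```python
# Illustrative encoding of director's-mode conditioning as a sparse set of
# constraints over a sequence; the layout is an assumption, not the actual
# input format of the model.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Constraint:
    frame: int                    # timestep the constraint applies to
    agent: str                    # which agent it concerns
    state: Optional[dict] = None  # pin an exact state (e.g. start or end)
    action: Optional[str] = None  # or a text action acting as a waypoint

constraints = [
    Constraint(frame=0,  agent="bottom_player", state={"position": [0.0, 0.0, -11.0]}),
    Constraint(frame=50, agent="bottom_player", action="moves to the left corner"),
    Constraint(frame=99, agent="bottom_player", state={"position": [3.0, 0.0, -12.0]}),
    Constraint(frame=99, agent="top_player",    action="does not catch the ball"),
]
# The animation model would generate a 100-frame sequence that satisfies
# every pinned state and action while leaving unconstrained frames free.
```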
To enable learning game engines, we build two datasets with camera calibration, 3D player poses, 3D ball localization, and fine-grained text actions for each player and each frame:
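To give a sense of the shape of these annotations, here is a hedged sketch of a single per-frame record; the field names and layout are hypothetical and do not reflect the released file format.

```python
# Hypothetical per-frame annotation record for the Tennis and Minecraft
# datasets; field names and shapes are illustrative, not the release format.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class FrameAnnotation:
    camera: List[List[float]]             # calibrated 3x4 camera matrix
    player_poses: Dict[str, List[float]]  # 3D joint coordinates per player
    ball_position: List[float]            # 3D ball location
    actions: Dict[str, str]               # fine-grained text action per player

frame = FrameAnnotation(
    camera=[[1.0, 0.0, 0.0, 0.0]] * 3,    # placeholder values
    player_poses={"bottom_player": [0.0] * 51, "top_player": [0.0] * 51},
    ball_position=[0.0, 1.2, 6.5],
    actions={
        "bottom_player": "hits the ball with a forehand",
        "top_player": "waits near the baseline",
    },
)
```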