To build Learnable Game Engines, we contribute two annotated monocular video datasets that we will make publicly available. Unlike existing text-video datasets, which provide a single, generic caption per video or captions only weakly aligned to the video content, our datasets feature a textual action annotation for each player and each frame that describes in detail, using technical terms, what the player is doing.
The first dataset comprises 15 hours of video annotated with camera parameters, 3D skeletons, 3D ball positions, and manually-annotated actions for each frame and player.
The second dataset comprises 1 hour of video annotated with camera parameters, 3D skeletons, and actions for each frame. These annotations are produced automatically by a Minecraft plugin that we will also make publicly available.
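To make the annotation layout concrete, a per-frame record combining camera, per-player 3D skeletons, per-player action text, and an optional ball position could be sketched as follows. This is purely illustrative: the field names, joint count, and camera parameterization are our own assumptions, not the datasets' actual release format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FrameAnnotation:
    """Hypothetical schema for one annotated frame (illustrative only)."""
    frame_index: int
    camera: List[float]                      # e.g. a flattened 3x4 projection matrix (assumption)
    skeletons: Dict[str, List[List[float]]]  # player id -> list of [x, y, z] joint positions
    actions: Dict[str, str]                  # player id -> free-text action for this frame
    ball: Optional[List[float]] = None       # [x, y, z] ball position; absent in the Minecraft data


# Example record: one player, a 15-joint skeleton, and a detailed action string.
ann = FrameAnnotation(
    frame_index=0,
    camera=[0.0] * 12,
    skeletons={"player1": [[0.0, 1.7, 0.0]] * 15},
    actions={"player1": "hits a backhand slice towards the far corner"},
    ball=[1.2, 0.9, -3.4],
)
```

Keeping actions in a per-player dictionary, rather than a single caption per frame, reflects the key property of the datasets: every player has its own fine-grained action description at every frame.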