Results for the task of predicting all video frames starting from the first frame and using text actions for each frame.