Questions Regarding Auto-Regressive Video Generation for Robot Manipulation #29

fangqi-Zhu · 2025-01-09T15:18:12Z

The paper mentions that, when predicting videos from the Bridge dataset, the model is conditioned on the current frame and the action, with the goal of predicting the next frame.

I would like to clarify two points. First, is only the single current frame used for prediction, and would incorporating multiple historical frames affect performance?
Second, is the entire generated video produced purely by autoregressive rollout in the WFM, where only the first frame of the episode and the full action trajectory are provided, or is the ground-truth current frame supplied at each timestep? I ask because the generated video does not seem to become progressively blurrier, which is really impressive.
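
To make the two settings in the second question concrete, here is a minimal sketch of what I mean. It is only an illustration under my own assumptions: `predict_next`, `toy_model`, and the list-of-floats `Frame` type are hypothetical stand-ins, not the actual WFM interface or anything taken from the paper or the repository.

```python
from typing import Callable, List, Sequence

Frame = List[float]   # hypothetical stand-in for an image tensor
Action = float        # hypothetical stand-in for a robot action
Model = Callable[[Frame, Action], Frame]


def rollout(predict_next: Model, first_frame: Frame,
            actions: Sequence[Action]) -> List[Frame]:
    """Pure rollout: each predicted frame is fed back as the next input."""
    frames = [first_frame]
    for action in actions:
        frames.append(predict_next(frames[-1], action))
    return frames[1:]  # predictions only, without the conditioning frame


def per_step_ground_truth(predict_next: Model, gt_frames: Sequence[Frame],
                          actions: Sequence[Action]) -> List[Frame]:
    """Per-step prediction: every step is conditioned on a ground-truth frame."""
    return [predict_next(f, a) for f, a in zip(gt_frames, actions)]


def toy_model(frame: Frame, action: Action) -> Frame:
    """Toy predictor (assumption): true dynamics are x + action,
    and the model adds a constant error of 0.5 per step."""
    return [x + action + 0.5 for x in frame]


actions = [1.0, 1.0, 1.0]
gt_frames = [[0.0], [1.0], [2.0], [3.0]]  # ground truth under x + action

rolled = rollout(toy_model, gt_frames[0], actions)
stepped = per_step_ground_truth(toy_model, gt_frames[:-1], actions)

# Absolute error against the ground-truth next frames
rollout_err = [abs(p[0] - g[0]) for p, g in zip(rolled, gt_frames[1:])]
stepped_err = [abs(p[0] - g[0]) for p, g in zip(stepped, gt_frames[1:])]
```

In this toy setup the rollout error grows with every step, while the per-step error stays constant. That is why I would expect a pure rollout to degrade visibly over a long episode, and why I am asking which of the two settings the released videos correspond to.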

Thank you for your outstanding work!

mharrim added the "question" label (Further information is requested) on Jan 24, 2025

