
Questions Regarding Auto-Regressive Video Generation for Robot Manipulation #29

Open
fangqi-Zhu opened this issue Jan 9, 2025 · 0 comments
Labels
question Further information is requested

Comments

@fangqi-Zhu

The paper mentions that when predicting videos on the Bridge dataset, the model is conditioned on the current frame and the action, with the goal of predicting the next frame.

I would like to clarify: is only a single current frame used for prediction, or would incorporating multiple historical frames affect performance?
Additionally, is the entire generated video produced purely via autoregressive rollout in the WFM, where only the first frame of the episode and the full action trajectory are provided, or is the ground-truth current frame given at each timestep? I ask because the generated video doesn't seem to gradually become blurry, which is really impressive.
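To make the distinction concrete, here is a minimal sketch of the two conditioning schemes being contrasted. `predict_next_frame` is a hypothetical stand-in for the WFM's one-step prediction (here just a pixel shift), not the paper's actual model:

```python
import numpy as np

def predict_next_frame(frame, action):
    # Hypothetical stand-in for the world model's one-step prediction;
    # a real WFM would be a learned network. Here we simply shift rows.
    return np.roll(frame, shift=int(action), axis=0)

def rollout(first_frame, actions):
    """Pure autoregressive rollout: only the first frame is ground truth;
    each prediction is fed back in as the next conditioning frame."""
    frames = [first_frame]
    frame = first_frame
    for a in actions:
        frame = predict_next_frame(frame, a)
        frames.append(frame)
    return frames

def teacher_forced(gt_frames, actions):
    """Teacher-forced generation: each step conditions on the ground-truth
    current frame instead of the model's own previous prediction, so
    errors cannot accumulate across timesteps."""
    return [predict_next_frame(f, a) for f, a in zip(gt_frames[:-1], actions)]
```

In the rollout case any per-step error compounds over the trajectory, which is why long rollouts often blur; under teacher forcing each prediction starts from a clean frame.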

Thank you for your outstanding work!

@mharrim mharrim added the question Further information is requested label Jan 24, 2025