The paper mentions that, when predicting videos from the Bridge dataset, the model is conditioned on the current frame and the action, and the goal is to predict the next frame.
I would like to clarify: is only a single current frame used for prediction, or would incorporating multiple historical frames have an impact on performance?
Additionally, is the entire generated video produced purely by autoregressive rollout in the WFM, where only the first frame of the episode and the full action trajectory are provided, or is the ground-truth current frame supplied at each timestep? I ask because the generated video doesn't seem to gradually become blurry, which is really impressive.
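To make the two settings I am asking about concrete, here is a minimal sketch. It is only my own illustration, not code from the paper or repository: `predict_next`, `rollout`, `teacher_forced`, and the toy model with a constant bias are all hypothetical stand-ins, and frames are plain lists of numbers rather than images.

```python
from typing import Callable, List, Sequence

Frame = List[int]   # hypothetical stand-in for an image
Action = int        # hypothetical stand-in for a robot action
Model = Callable[[Frame, Action], Frame]


def rollout(predict_next: Model, first_frame: Frame,
            actions: Sequence[Action]) -> List[Frame]:
    """Autoregressive setting: only the first frame is given, and every
    predicted frame is fed back in as the next conditioning frame."""
    frames = [first_frame]
    for action in actions:
        frames.append(predict_next(frames[-1], action))
    return frames


def teacher_forced(predict_next: Model, gt_frames: Sequence[Frame],
                   actions: Sequence[Action]) -> List[Frame]:
    """Ground-truth setting: each step is conditioned on the real current
    frame, so per-step errors cannot accumulate."""
    if len(gt_frames) < len(actions):
        raise ValueError("need one ground-truth frame per action")
    preds = [gt_frames[0]]
    for t, action in enumerate(actions):
        preds.append(predict_next(gt_frames[t], action))
    return preds


def toy_model(frame: Frame, action: Action) -> Frame:
    # Toy dynamics: the true next frame is frame + action; the model adds
    # a constant error of 1 to mimic an imperfect predictor.
    return [x + action + 1 for x in frame]


actions = [1, 2, 3]
gt = [[0], [1], [3], [6]]                  # true frames under frame + action
ar = rollout(toy_model, gt[0], actions)
tf = teacher_forced(toy_model, gt, actions)

ar_err = [abs(p[0] - g[0]) for p, g in zip(ar, gt)]
tf_err = [abs(p[0] - g[0]) for p, g in zip(tf, gt)]
print("rollout error per step:       ", ar_err)
print("teacher-forced error per step:", tf_err)
```

In this toy case the rollout error grows with every step while the teacher-forced error stays constant, which is why I would have expected pure rollout to degrade visibly over a long episode.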
Thank you for your outstanding work!