[WIP] Fix FSDP saving error #593

Closed
wants to merge 1 commit into from

@zhisbug zhisbug commented Apr 25, 2023

zhisbug (Collaborator, Author) commented Apr 25, 2023

pending test by @ZYHowell

zhisbug requested a review from ZYHowell on April 25, 2023, 08:20
merrymercy force-pushed the main branch 2 times, most recently from d2e3961 to be29f26, on April 30, 2023, 01:53
merrymercy (Member) commented

@ZYHowell @zhisbug Any updates, or should we close this?

alanxmay (Contributor) commented May 13, 2023

@merrymercy I can help with the test, since I ran into the same problem before. I will post the results later.


update

Try this PR with 4*A100(80G), training is ok, OOM when saving.

I might dig into this later.

alanxmay (Contributor) commented May 15, 2023

@merrymercy @zhisbug I tried several different settings using the FSDP API; all of them failed when saving the model.

However, based on this comment, I finally managed to save the model with Python 3.10 and torch==2.0 by changing line 309 of /python3.10/site-packages/torch/distributed/fsdp/_state_dict_utils.py from

    state_dict[fqn] = state_dict[fqn].clone().detach()

to

    state_dict[fqn] = state_dict[fqn].cpu().clone().detach()

Test machine: 4*A100(80G).
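
The idea behind the one-line change above can be sketched outside of PyTorch's internals. This is only a minimal illustration, not the code in `_state_dict_utils.py` and not the fix planned for the follow-up PR. The helper name `offload_state_dict_to_cpu` is hypothetical, and the helper assumes each value behaves like a `torch.Tensor` (it provides `detach()`, `cpu()` and `clone()`). The point it shows is the ordering: moving a tensor to the CPU before cloning it means the copy is allocated in host memory instead of on the GPU.

```python
from typing import Any, Dict, Mapping


def offload_state_dict_to_cpu(state_dict: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of `state_dict` whose tensors live in host memory.

    Hypothetical helper for illustration. Each value is assumed to be
    tensor-like, i.e. to provide detach(), cpu() and clone().
    """
    cpu_state: Dict[str, Any] = {}
    for name, tensor in state_dict.items():
        # Call cpu() before clone(): the clone is then made from the CPU
        # tensor, so no extra GPU buffer is allocated for the copy.
        cpu_state[name] = tensor.detach().cpu().clone()
    return cpu_state
```

With PyTorch installed, the helper could be called on any dictionary of tensors, for example `offload_state_dict_to_cpu({"w": torch.ones(2)})`. Whether applying it to an already gathered state dict avoids the OOM reported here is untested; the workaround above moves each entry inside FSDP's own state-dict hook, before the full dictionary is assembled.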

zhisbug (Collaborator, Author) commented May 15, 2023

@alanxmay This is just a workaround, although most of our users have indeed relied on it.

zhisbug (Collaborator, Author) commented May 15, 2023

Closing this PR; I am going to open a new PR with the fix.

zhisbug closed this on May 15, 2023
merrymercy deleted the hao-fix-fsdp branch on May 15, 2023, 13:34