pytorch-fsdp-multi-node

See also my other repository, which implements training in the Parameter Server paradigm with TensorFlow.

Training ResNet50 with the Fully Sharded Data Parallel (FSDP) approach on a Kubernetes cluster with the help of Kubeflow.

The Landmarks 2020 dataset is used by default, but you can switch to MNIST by using the commented-out lines.

ResNet is used by default, but you can switch to the small "Net" model by uncommenting the corresponding lines.

Validation and inference are not ready yet.

There is a LandmarkDataset class that inherits from torch.utils.data.Dataset. This class has a cache feature: after applying "transform" to an image, it can save the result to a cache of pickle files and later read it back from that cache.
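The caching idea can be sketched roughly as follows. This is a minimal illustration and not the repository's LandmarkDataset: the class name `CachedDataset`, the one-file-per-index layout, and the `slow_transform` helper are all assumptions. To stay dependency-free the sketch does not subclass torch.utils.data.Dataset, although it exposes the same `__len__` and `__getitem__` methods.

```python
import pickle
import tempfile
from pathlib import Path


class CachedDataset:
    """Hypothetical map-style dataset that caches transformed samples as pickle files."""

    def __init__(self, samples, transform, cache_dir):
        self.samples = samples        # list of (raw_item, label) pairs
        self.transform = transform    # potentially expensive preprocessing
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        cache_file = self.cache_dir / f"{index}.pkl"
        if cache_file.exists():
            # Cache hit: load the stored result instead of re-running the transform.
            with cache_file.open("rb") as f:
                return pickle.load(f)
        raw, label = self.samples[index]
        item = (self.transform(raw), label)
        # Cache miss: store the transformed sample for later reads.
        with cache_file.open("wb") as f:
            pickle.dump(item, f)
        return item


calls = []


def slow_transform(x):
    """Stand-in for an image transform; records each call so caching is visible."""
    calls.append(x)
    return x * 2


with tempfile.TemporaryDirectory() as tmp:
    ds = CachedDataset([(1, "a"), (2, "b")], slow_transform, tmp)
    first = ds[0]    # runs the transform and writes 0.pkl
    second = ds[0]   # served from the pickle cache
```

The transform runs only on the first access to an index, which pays off when the same images are read every epoch. A cache of this kind has to be cleared whenever the transform changes, because stale pickles would otherwise be returned.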

If you uncomment the last line in main-dist-fsdp.py, you will be able to test the code on a single machine.
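What that last line does is not shown here. As a rough sketch of one common way to run a distributed script on a single machine, assuming it uses the standard torch.distributed `env://` rendezvous, the variables that a cluster launcher would normally provide can be filled in with single-process defaults. The helper names `single_machine_env` and `init_single_process` are hypothetical and not taken from the repository.

```python
import os


def single_machine_env(environ=None):
    """Return a copy of the environment with env:// rendezvous defaults for one process."""
    env = dict(os.environ if environ is None else environ)
    env.setdefault("MASTER_ADDR", "127.0.0.1")  # rendezvous on localhost
    env.setdefault("MASTER_PORT", "29500")      # any free port works
    env.setdefault("RANK", "0")                 # this process is the only rank
    env.setdefault("WORLD_SIZE", "1")           # a single process in total
    return env


def init_single_process():
    """Initialise a one-process group; torch is imported lazily on purpose."""
    import torch.distributed as dist

    os.environ.update(single_machine_env())
    # "gloo" works on CPU-only machines; "nccl" is the usual choice with GPUs.
    dist.init_process_group(backend="gloo")
    return dist.get_rank(), dist.get_world_size()
```

Values already present in the environment are left untouched, so the same code path still works when a launcher sets them.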

files

torch.yaml - Kubernetes YAML manifest.
a.sh - runs main-dist-fsdp.py and captures its stdout and stderr so that the logs can be collected
main-dist-fsdp.py - main file; uses datasetmetareader.py and policies/
main-local.py - simple PyTorch training with validation and inference; uses datasetmetareader.py
datasetmetareader.py - functions for preparing the Landmarks 2020 dataset
policies/ - helper classes for the FSDP auto_wrap_policy parameter
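To illustrate what an auto_wrap_policy is, here is a minimal sketch of a size-based policy written as a plain callable. It is not the content of policies/: the function `wrap_if_large`, the threshold of 20,000 parameters, and the `shard` helper are assumptions. The argument names `module`, `recurse` and `nonwrapped_numel` follow what recent PyTorch releases pass to a callable policy; older releases may use a different name for the third argument. PyTorch also ships a ready-made `size_based_auto_wrap_policy` in torch.distributed.fsdp.wrap, which additionally handles forced-leaf and excluded module types.

```python
import functools


def wrap_if_large(module, recurse, nonwrapped_numel, min_num_params):
    """Hypothetical size-based policy.

    With recurse=True the result tells FSDP whether to keep descending into
    the children of `module`; with recurse=False it tells FSDP whether to wrap
    `module` itself. Both decisions use the same size threshold here.
    """
    return nonwrapped_numel >= min_num_params


# Bind the threshold so that FSDP only has to supply the first three arguments.
policy = functools.partial(wrap_if_large, min_num_params=20_000)


def shard(model):
    """Wrap `model` with FSDP; requires an initialised process group."""
    # Lazy import so that the policy above can be inspected without torch.
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    return FSDP(model, auto_wrap_policy=policy)
```

A lower threshold produces more, smaller FSDP units, which reduces peak memory per unit at the cost of more communication calls.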

Anoncheg1/pytorch-fsdp-multi-node
