farctl is a simple CLI tool that lets machine learning engineers deploy MPIJobs to a Kubernetes cluster without Kubernetes-specific knowledge or manual deployment of YAML files. It is modeled on this project; I reinvented the wheel again for learning purposes.
- A Kubernetes cluster. You can use kind to get a local cluster.
- kubectl: see "Install Tools" in the Kubernetes documentation
- helm: see "Installing Helm" in the Helm documentation
- Go: see "Download and install" in the Go documentation
- MPI Operator installed in your Kubernetes cluster
- Gang scheduler (optional)
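Before installing, it can help to confirm that the command-line prerequisites are on your PATH. The following is a minimal sketch and not part of farctl: the function name missing_tools is made up for illustration, and it only checks that the binaries exist, not that the cluster, the MPI Operator, or the gang scheduler are set up.

```shell
# Hypothetical helper (not part of farctl): print each named tool
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Example: check the CLI prerequisites from the list above.
# missing_tools kubectl helm go
```

If the example call prints nothing, all three tools were found.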
go install github.com/FFFFFaraway/farctl@latest
- Write your deep learning code using Horovod. For an example, see here.
- (Optional) Upload the code to a publicly accessible platform such as GitHub or GitLab.
- Submit the job, for example:
farctl submit test -i https://github.com/FFFFFaraway/sample-python-train.git -c "python generate_data.py" -c "python main.py" --gang -n 2
- test is the name of the submitted MPIJob.
- -i specifies the Git URL to clone, or a local directory path (default: .).
- -c specifies a command to run as the entry point. Repeat -c to run multiple commands.
- --gang enables the gang scheduler, which requires a separate installation.
- -n specifies the number of workers to create.
- Other options are listed by running:
farctl submit -h
- Another example uses a local code directory. It first copies . (the current directory) recursively to all worker pods:
farctl submit local-test -c "echo ab" -c "echo cd"
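The submit flags described above can be assembled by a small wrapper when the same job is submitted repeatedly. This is a sketch under assumptions rather than anything shipped with farctl: the function name submit_cmd and its argument order are hypothetical, it leaves out --gang, and it prints the command for review instead of running it. The quoting is naive and will break if a command itself contains double quotes.

```shell
# Hypothetical wrapper (not part of farctl): print a farctl submit
# command line built from the -i, -c and -n flags described above.
# Usage: submit_cmd NAME SOURCE WORKERS COMMAND...
# The ( ) body runs in a subshell so the variables stay local.
submit_cmd() (
  name=$1
  src=$2
  workers=$3
  shift 3
  line="farctl submit $name -i $src"
  # Each remaining argument becomes one -c entry point, in order.
  for cmd in "$@"; do
    line="$line -c \"$cmd\""
  done
  line="$line -n $workers"
  printf '%s\n' "$line"
)

# Example:
# submit_cmd test https://github.com/FFFFFaraway/sample-python-train.git 2 \
#   "python generate_data.py" "python main.py"
```

Printing rather than executing keeps the sketch safe to try without a cluster; copy the printed line into the shell once it looks right.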
farctl list
We can monitor the status of MPIJobs in its output:
Namespace  Name  ReadyWorkers/Total  LauncherStatus  Age
farctl     test  1/2                 WaitingWorkers  1m17s

Namespace  Name  ReadyWorkers/Total  LauncherStatus  Age
farctl     test  2/2                 Running         2m3s
When the LauncherStatus becomes Running, we can access the log of the MPIJob:
farctl log test
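Waiting for the launcher and then fetching the log can be scripted. The sketch below is not part of farctl: the function names launcher_status and wait_and_log are hypothetical, the 5-second polling interval is arbitrary, and the parsing assumes farctl list prints the five whitespace-separated columns shown in the sample output above. It loops forever if the job never reaches Running, so a real script would want a timeout.

```shell
# Hypothetical helper (not part of farctl): read `farctl list` output
# on stdin and print the LauncherStatus column (4th field) of every
# row whose Name column (2nd field) equals the given job name.
launcher_status() {
  awk -v job="$1" '$2 == job { print $4 }'
}

# Hypothetical helper: poll `farctl list` until the job's launcher
# reports Running, then print the job log.
wait_and_log() {
  until [ "$(farctl list | launcher_status "$1")" = "Running" ]; do
    sleep 5
  done
  farctl log "$1"
}

# Example:
# wait_and_log test
```

Keeping the parsing in its own function makes it possible to check against saved farctl list output without a running cluster.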
farctl get test
farctl delete test