A CLI tool for deployment of machine learning MPI training jobs on Kubernetes

FFFFFaraway/farctl


What is it

farctl is a simple CLI tool that lets machine learning engineers deploy an MPIJob to a Kubernetes cluster without Kubernetes knowledge and without deploying YAML files by hand. It is modeled on this project; I reinvented the wheel, once again, for learning purposes.

How to install

Requirements

Installation

go install github.com/FFFFFaraway/farctl@latest
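If the farctl command is not found after installation, the Go binary directory is probably missing from PATH. The helper below is a minimal sketch: go_bin_dir is a hypothetical name, not part of farctl, and it relies only on the standard Go rule that go install writes to GOBIN when set, otherwise to the bin directory of the first GOPATH entry.

```shell
# go_bin_dir GOBIN GOPATH
# Hypothetical helper: prints the directory where `go install` places binaries.
# GOBIN wins when it is non-empty; otherwise the first GOPATH entry plus /bin.
go_bin_dir() {
  gobin=$1
  gopath=$2
  if [ -n "$gobin" ]; then
    printf '%s\n' "$gobin"
  else
    # GOPATH may be a colon-separated list; keep only the first entry.
    printf '%s\n' "${gopath%%:*}/bin"
  fi
}

# Typical use (requires the Go toolchain, so it is left commented out here):
# export PATH="$PATH:$(go_bin_dir "$(go env GOBIN)" "$(go env GOPATH)")"
```

Adding the export line to your shell profile should make farctl available in new terminals.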

How to use

Submit MPIJob

  1. Write your deep learning code using Horovod. For example, see here

  2. (Optional) Upload the code to a publicly available platform such as GitHub or GitLab

  3. Submit the job, for example:

    farctl submit test -i https://github.com/FFFFFaraway/sample-python-train.git -c "python generate_data.py" -c "python main.py" --gang -n 2
    • test is the name of the submitted MPIJob
    • -i specifies the git clone URL or a local directory path (default: . )
    • -c specifies a command to run as the entry point. Repeat -c to run multiple commands
    • --gang enables the gang scheduler, which requires an additional installation
    • -n specifies the number of workers to create
    • Other options are listed by farctl submit -h
  4. Another example uses a local code directory. farctl first copies . (the current directory) recursively to all worker pods:

    farctl submit local-test -c "echo ab" -c "echo cd"
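When a job has several entry-point commands, it can help to assemble the repeated -c flags in a script and review the command line before running it. The sketch below only prints the command and never calls farctl. submit_cmd is a hypothetical helper, and it uses only the options shown in the examples above (-i, -c, -n).

```shell
# submit_cmd NAME SOURCE WORKERS COMMAND...
# Hypothetical helper: prints a `farctl submit` command line with one -c flag
# per COMMAND argument. It does not execute anything.
# Assumption: the commands themselves contain no double quotes.
submit_cmd() {
  name=$1
  source_url=$2
  workers=$3
  shift 3
  cmd="farctl submit $name -i $source_url"
  for entry in "$@"; do
    cmd="$cmd -c \"$entry\""
  done
  printf '%s\n' "$cmd -n $workers"
}

submit_cmd test https://github.com/FFFFFaraway/sample-python-train.git 2 \
  "python generate_data.py" "python main.py"
```

Once the printed line looks right, paste it into the terminal to submit the job.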

List MPIJob

farctl list

We can monitor the status of MPIJobs in the output:

Namespace  Name  ReadyWorkers/Total  LauncherStatus  Age
farctl     test  1/2                 WaitingWorkers  1m17s
Namespace  Name  ReadyWorkers/Total  LauncherStatus  Age
farctl     test  2/2                 Running         2m3s
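For scripting, the table can be reduced to the fields of a single job. The sketch below assumes the whitespace-separated column layout shown above (Namespace, Name, ReadyWorkers/Total, LauncherStatus, Age). job_summary is a hypothetical helper, not a farctl subcommand.

```shell
# job_summary NAME
# Hypothetical helper: reads `farctl list` output on stdin and prints
# "<ready> <total> <status>" for the last row whose Name column equals NAME.
# Exits non-zero when no such row exists.
job_summary() {
  awk -v n="$1" '
    $2 == n { split($3, w, "/"); out = w[1] " " w[2] " " $4 }
    END { if (out != "") print out; else exit 1 }
  '
}

# Typical use (needs farctl and a cluster, so it is left commented out here):
# farctl list | job_summary test
```

Reading from stdin keeps the parser independent of the cluster, so it can be tried on saved output first.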

Get MPIJob Log

When the LauncherStatus becomes Running, we can access the log of the MPIJob:

farctl log test
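Waiting for the Running status can be automated with a small polling loop. This is a sketch under assumptions: wait_and_log is a hypothetical helper, the default interval and retry limit are arbitrary, and the status is read from the fourth column of the farctl list output as shown in the sample table.

```shell
# wait_and_log NAME [INTERVAL_SECONDS] [MAX_TRIES]
# Hypothetical helper: polls `farctl list` until the job's LauncherStatus is
# Running, then prints its log. Returns 1 if the limit is reached first.
wait_and_log() {
  name=$1
  interval=${2:-5}
  max_tries=${3:-60}
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    # Column 2 is the job name and column 4 the launcher status (assumed layout).
    status=$(farctl list | awk -v n="$name" '$2 == n { s = $4 } END { print s }')
    if [ "$status" = "Running" ]; then
      farctl log "$name"
      return
    fi
    tries=$((tries + 1))
    sleep "$interval"
  done
  echo "timed out waiting for $name (last status: ${status:-unknown})" >&2
  return 1
}

# Typical use: wait_and_log test
```

The retry limit keeps the loop from running forever if the job never reaches Running, for example when workers cannot be scheduled.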

Get applied MPIJob Configuration

farctl get test

Delete MPIJob

farctl delete test
