BatchAI Workshop

Batch AI is a managed service that takes care of cluster management and the scheduling, scaling, and monitoring of AI jobs for data scientists. It runs on top of virtual machine scale sets and Docker.

Batch AI can run training jobs in Docker containers or directly on the compute nodes.

Batch AI

  • Cluster
  • Jobs
  • Azure File Share - stdout and stderr logs; may also contain Python scripts
  • Azure Blob Storage - Python scripts, data


YOLO

You Only Look Once (YOLO) is a real-time object detection system. We will run YOLOv3 on a single image with Batch AI. If you would like to run YOLO without a cluster, you can follow the steps on the YOLO site.

Make the project

git clone https://github.com/pjreddie/darknet
cd darknet
make

Download the weights

wget https://pjreddie.com/media/files/yolov3.weights

Run YOLO

./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg

YOLOv3 should output something like:

  ...
  104 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256  1.595 BFLOPs
  105 conv    255  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 255  0.353 BFLOPs
  106 detection
Loading weights from yolov3.weights...Done!
data/dog.jpg: Predicted in 24.016015 seconds.
dog: 99%
truck: 92%
bicycle: 99%
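
To use the detections in a script instead of reading them off the terminal, the label lines can be filtered out of the output. This is a minimal sketch, assuming each detection is printed on its own line as `label: NN%`, as in the sample above. The function name `parse_detections` is hypothetical and not part of darknet.

```shell
# Read darknet output on stdin and print "label<TAB>confidence" per detection.
# Assumption: detections look like "dog: 99%"; all other lines are ignored.
parse_detections() {
  awk -F': ' '/^[A-Za-z][A-Za-z ]*: [0-9]+%$/ { sub(/%$/, "", $2); print $1 "\t" $2 }'
}

# Usage (assumes darknet was built and the weights downloaded as shown above):
# ./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg | parse_detections
```

The pattern only accepts lines that start with letters, so the `data/dog.jpg: Predicted in ...` timing line and the layer table are skipped.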

Parallelizing Batch AI jobs

  • The parallelization strategy is defined by your Python training and test scripts, not by Batch AI.

For example,

  • CNTK uses a synchronous data-parallel training strategy
  • TensorFlow uses an asynchronous model-parallel training strategy

Note

  • Make sure .sh scripts have LF line endings; use dos2unix to fix them
  • For faster communication between nodes, use Intel MPI on VMs that have InfiniBand
  • The default quota for NC24r VMs (which work with Intel MPI and InfiniBand) is 1 core in any subscription, so request quota increases early
  • There is no way to reset the SSH key for nodes
  • Do not put CMD in the Dockerfile used by Batch AI. Since the container runs in detached mode, it will exit on CMD
  • Error messages from within the container are not very descriptive
  • Clusters take a long time to provision and deallocate
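
The line-ending note can be checked before uploading anything. The following is a sketch only: the helper names `find_crlf_scripts` and `to_lf` are hypothetical, and the `sed` fallback is an assumption for machines where dos2unix is not installed.

```shell
# Print the .sh files under a directory (default: current) that contain CR characters.
find_crlf_scripts() {
  dir="${1:-.}"
  cr=$(printf '\r')
  find "$dir" -type f -name '*.sh' -exec grep -l "$cr" {} +
}

# Convert one file to LF endings in place. Uses dos2unix when available;
# otherwise strips trailing carriage returns with sed via a temporary file.
to_lf() {
  if command -v dos2unix >/dev/null 2>&1; then
    dos2unix "$1"
  else
    cr=$(printf '\r')
    sed "s/${cr}\$//" "$1" > "$1.tmp" && cat "$1.tmp" > "$1" && rm "$1.tmp"
  fi
}

# Usage:
# find_crlf_scripts . | while read -r f; do to_lf "$f"; done
```

The fallback writes back with `cat` rather than `mv` so the script keeps its original permissions, including the executable bit.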

Resources

Configure Azure CLI to use Batch AI

Set default subscription

az account set -s <subscription id>
az account list -o table

Create resource group

az group create -n <rg name> -l eastus

Create a storage account

az storage account create \
  -n <storage account name> \
  --sku Standard_LRS \
  -l eastus \
  -g <rg name>

Get storage account key

az storage account keys list \
  -n <storage account name> \
  -g <rg name> \
  --query "[0].value"
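
The later storage commands all need this key, so it can be convenient to capture it in a shell variable. A small sketch, assuming the first key is the one you want; the function name `get_storage_key` is hypothetical.

```shell
# Print the first key of a storage account as plain text.
# Arguments: storage account name, resource group name.
# -o tsv drops the JSON quotes so the value can be passed straight to --account-key.
get_storage_key() {
  az storage account keys list -n "$1" -g "$2" --query "[0].value" -o tsv
}

# Usage (placeholders as in the commands above):
# STORAGE_KEY=$(get_storage_key <storage account name> <rg name>)
```

With the key in `STORAGE_KEY`, the following commands can use `--account-key "$STORAGE_KEY"` instead of a pasted value.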

Create a file share

az storage share create \
  -n <share name> \
  --account-name <storage account name> \
  --account-key <storage account key>

Create a directory in your file share to hold Python scripts

az storage directory create \
  -s <share name> \
  -n yolo \
  --account-name <storage account name> \
  --account-key <storage account key>

Upload Python scripts to the file share

az storage file upload \
  -s <share name> \
  --source <python script> \
  -p yolo \
  --account-name <storage account name> \
  --account-key <storage account key>
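
The upload command above handles one file at a time. When a job needs several scripts, a loop over the same command is one option. This is a sketch under the assumption that all scripts are `.py` files in one local directory; `upload_scripts` is a hypothetical helper name.

```shell
# Upload every .py file in a local directory to the yolo directory of the share.
# Arguments: local directory, share name, storage account name, storage account key.
upload_scripts() {
  for f in "$1"/*.py; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when no .py files exist
    az storage file upload -s "$2" --source "$f" -p yolo \
      --account-name "$3" --account-key "$4"
  done
}

# Usage (placeholders as in the commands above):
# upload_scripts ./scripts <share name> <storage account name> <storage account key>
```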

Create cluster

Create a cluster.json

The configuration parameters are defined by ClusterCreateParameters in the Batch AI Swagger docs.

Create cluster with cluster.json config

az batchai cluster create \
  -n <cluster name> \
  -l eastus \
  -g <rg name> \
  -c cluster.json

Create cluster without cluster.json config

az batchai cluster create \
  -n <cluster name> \
  -g <rg name> \
  -l eastus \
  --storage-account-name <storage account name> \
  --storage-account-key <storage account key> \
  -i UbuntuDSVM \
  -s Standard_NC6 \
  --min 2 \
  --max 2 \
  --afs-name <share name> \
  --afs-mount-path external \
  -u $USER \
  -k ~/.ssh/id_rsa.pub \
  -p <password>

View Cluster Status

az batchai cluster show \
  -n <cluster name> \
  -g <rg name> \
  -o table

Create a job

Create job.json

az batchai job create \
  -g <rg name> \
  -l eastus \
  -n <job name> \
  -r <cluster name> \
  -c job.json

Monitor the job

az batchai job show \
  -n <job name> \
  -g <rg name> \
  -o table
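
Rather than rerunning the show command by hand, the job state can be polled until it finishes. This is a sketch, not a definitive implementation. It assumes the job JSON exposes an `executionState` field that ends in `succeeded` or `failed`, so check the field name against the JSON output of `az batchai job show` for your CLI version. The function name `wait_for_job` is hypothetical.

```shell
# Poll a job until it reaches a terminal state, printing each observed state.
# Arguments: job name, resource group, optional poll interval in seconds (default 30).
# Assumption: executionState becomes "succeeded" or "failed" when the job ends.
# Returns 0 on success, 1 on failure, 2 if the az call itself fails.
wait_for_job() {
  while :; do
    state=$(az batchai job show -n "$1" -g "$2" --query executionState -o tsv) || return 2
    echo "$state"
    case "$state" in
      succeeded) return 0 ;;
      failed) return 1 ;;
    esac
    sleep "${3:-30}"
  done
}

# Usage (placeholders as in the commands above):
# wait_for_job <job name> <rg name> 60
```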

Stream job file output

az batchai job stream-file \
  -j <job name> \
  -n stdout.txt \
  -d stdouterr \
  -g <rg name>

List the IP address and port of each node in the cluster

az batchai cluster list-nodes \
  -n <cluster name> \
  -g <rg name>

SSH into the VM

ssh <ip> -p <port>

$AZ_BATCHAI_MOUNT_ROOT is an environment variable set by Batch AI for each job; its value depends on the image used to create the nodes. For example, on Ubuntu-based images it is /mnt/batch/tasks/shared/LS_root/mounts. You can cd to this directory to view the Python scripts and logs.
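
As a sketch of how that path is put together on a node: this assumes the file share was mounted with `--afs-mount-path external` and that the scripts were uploaded to the `yolo` directory, as in the commands earlier in this README. The helper name `scripts_dir` is hypothetical, and the fallback path is the Ubuntu value quoted above.

```shell
# Print the directory where the uploaded yolo scripts should appear on a node.
# Falls back to the Ubuntu default when AZ_BATCHAI_MOUNT_ROOT is not set,
# for example in an SSH session outside of a job.
scripts_dir() {
  echo "${AZ_BATCHAI_MOUNT_ROOT:-/mnt/batch/tasks/shared/LS_root/mounts}/external/yolo"
}

# Usage on a node:
# ls -l "$(scripts_dir)"
```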