From 03d37d3639333ba5f802662bf23e8549689e8abe Mon Sep 17 00:00:00 2001 From: mjconnor Date: Wed, 12 Jul 2023 08:30:15 -0700 Subject: [PATCH 1/6] distributed training tutorial --- configs/distributed-training/aws.yaml | 8 + configs/distributed-training/gce.yaml | 8 + templates/distributed-training/README.md | 65 +++++++ .../distributed-training/cluster_env.yaml | 20 +++ templates/distributed-training/pytorch.py | 159 ++++++++++++++++++ templates/distributed-training/tensorflow.py | 132 +++++++++++++++ 6 files changed, 392 insertions(+) create mode 100644 configs/distributed-training/aws.yaml create mode 100644 configs/distributed-training/gce.yaml create mode 100644 templates/distributed-training/README.md create mode 100644 templates/distributed-training/cluster_env.yaml create mode 100644 templates/distributed-training/pytorch.py create mode 100644 templates/distributed-training/tensorflow.py diff --git a/configs/distributed-training/aws.yaml b/configs/distributed-training/aws.yaml new file mode 100644 index 000000000..0ccfc874d --- /dev/null +++ b/configs/distributed-training/aws.yaml @@ -0,0 +1,8 @@ +head_node_type: + name: head_node_type + instance_type: m4.2xlarge +worker_node_types: +- name: gpu_worker + instance_type: g4dn.xlarge + min_workers: 1 + max_workers: 3 diff --git a/configs/distributed-training/gce.yaml b/configs/distributed-training/gce.yaml new file mode 100644 index 000000000..7fa075313 --- /dev/null +++ b/configs/distributed-training/gce.yaml @@ -0,0 +1,8 @@ +head_node_type: + name: head_node_type + instance_type: n2-standard-8 +worker_node_types: +- name: gpu_worker + instance_type: n1-standard-4-nvidia-t4-16gb-1 + min_workers: 1 + max_workers: 3 diff --git a/templates/distributed-training/README.md b/templates/distributed-training/README.md new file mode 100644 index 000000000..37e61ef61 --- /dev/null +++ b/templates/distributed-training/README.md @@ -0,0 +1,65 @@ +# Fine-Tuning LLMs on Anyscale with DeepSpeed + +In this application you will fine tune an LLM - GPTJ. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information on GPT-J, click [here](https://huggingface.co/docs/transformers/model_doc/gptj). + +The application can be used by developers and datascientists alike. Developers can leverage simple APIs to run fine tuning jobs on the cluster while data scientists and machine learning engineers can dive in deeper to view the underlying code using familiar tools like JupyterLab notebooks or Visual Studio Code. **These tools enable you to easily adapt this example to use other similar models or your own data**. + +| App Details | Description | +| ---------------------- | ----------- | +| Summary | This app loads a pretrained GPTJ model from HuggingFace and fine tunes it on new text data. | +| Time to Run | Around 20-40 minutes to fine tune on all of the data. | +| Minimum Compute Requirements | At least 1 GPU node. The default is 1 node (the head), and up to 15 worker nodes each with 1 NVIDIA T4 GPU. | +| Cluster Environment | This template uses a docker image built on top of the latest Anyscale-provided Ray image using Python 3.10: [`anyscale/ray:latest-py310-cu118`](https://docs.anyscale.com/reference/base-images/overview). See the appendix below for more details. | + +## Using this application +You can use the application via the CLI. 
Navigate to the "terminal" once started and run the following command: +``` +anyscale job submit -- python gptj_deepspeed_fine_tuning.py +``` +Once submitted, you can navigate to the Job page and view the training progress with the Ray Dashboard. +![Ray Dashboard](https://github.com/anyscale/templates/releases/download/media/raydash.png) + +Note: This application is based on an example. If you wish to go step by step and learn more please visit the [Ray docs tutorial](https://docs.ray.io/en/latest/ray-air/examples/gptj_deepspeed_fine_tuning.html) + +### Next Steps + +#### Training on your own data: Modifying the Script +Once your application is ready and launched you may view the script with VSCode or Jupyter and modify to use your own data! Read more about loading data with Ray [from your file store or database here](https://docs.ray.io/en/latest/data/loading-data.html). Make sure the data you use has a similar structure to the [Shakespeare dataset we use.](https://huggingface.co/datasets/tiny_shakespeare) + +Modify the code under the 'loading data' section of the script to load your own fine-tuning dataset. + +Once the code is updated, run the same command as before: +``` +anyscale job submit -- python gptj_deepspeed_fine_tuning.py +``` + +## Saving your model +The fine tuning job automatically saves checkpoints during training in your [default mounted user storage](https://docs.anyscale.com/develop/workspaces/storage#user-storage). You can view the model by navigating to "Files" viewer and selecting "User Storage". +![Files](https://github.com/anyscale/templates/releases/download/media/files.png) + +Within 2 minutes you will be fine-tuning GPT-J on a corpus of Shakspeare data! Let's dive in and explore the power of Anyscale and Ray together. + + +## Appendix + +### Advanced - Workspaces and Configurations +This application makes use of [Anyscale Workspaces](https://docs.anyscale.com/develop/workspaces/get-started) and Ray AIR (with the 🤗 Transformers integration) to fine-tune an LLM. Workspace is a fully managed development environment focused on developer productivity. With workspaces, ML practitioners and ML platform developers can quickly build distributed Ray applications and advance from research to development to production easily, all within single environment. + +To run this example, we've set up your Anyscale Workspace to have access to a head node with one GPU with 16 or more GBs of memory and 15 g4dn.4xlarge instances for the worker node group. This is done by defining a "compute configuration". Learn more about [Compute Configs here](https://docs.anyscale.com/configure/compute-configs/overview). It is easy to change your Compute Config once you launch by clicking "Workspace" and Editing the selection. +![Config](https://github.com/anyscale/templates/releases/download/media/edit.png) + + +When you run the fine tuning job we execute a python script thats distributed with Ray as an [Anyscale Job](https://docs.anyscale.com/productionize/jobs/get-started). + +### Advanced: Build off of this template's cluster environment +#### Option 1: Build a new cluster environment on Anyscale +You'll find a cluster_env.yaml file in the working directory of the template. Feel free to modify this to include more requirements, then follow [this](https://docs.anyscale.com/configure/dependency-management/cluster-environments#creating-a-cluster-environment) guide to use the Anyscale CLI to create a new cluster environment. 
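+
+One possible invocation is sketched below; the exact command and flags depend on your Anyscale CLI version, so treat this as an assumption and follow the linked guide for the authoritative syntax:
+
+```bash
+# "my-distributed-training-env" is a placeholder name; pick your own.
+anyscale cluster-env build cluster_env.yaml --name my-distributed-training-env
+```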
+ +Finally, update your workspace's cluster environment to this new one after it's done building. + +#### Option 2: Build a new docker image with your own infrastructure +Use the following docker pull command if you want to manually build a new Docker image based off of this one. + +```bash +docker pull us-docker.pkg.dev/anyscale-workspace-templates/workspace-templates/fine-tune-gptj:latest +``` \ No newline at end of file diff --git a/templates/distributed-training/cluster_env.yaml b/templates/distributed-training/cluster_env.yaml new file mode 100644 index 000000000..209838289 --- /dev/null +++ b/templates/distributed-training/cluster_env.yaml @@ -0,0 +1,20 @@ +# See https://hub.docker.com/r/anyscale/ray for full list of +# available Ray, Python, and CUDA versions. +base_image: anyscale/ray-ml:nightly-py310-gpu + +env_vars: {} + +debian_packages: [] + +python: + pip_packages: + - accelerate==0.16.0 + - transformers==4.26.0 + - torch==2.0.1 + - deepspeed==0.9.2 + - evaluate==0.4.0 + - datasets==2.13.1 + + conda_packages: [] + +post_build_cmds: [] \ No newline at end of file diff --git a/templates/distributed-training/pytorch.py b/templates/distributed-training/pytorch.py new file mode 100644 index 000000000..1ccabebed --- /dev/null +++ b/templates/distributed-training/pytorch.py @@ -0,0 +1,159 @@ +import argparse +from typing import Dict +from ray.air import session + +import torch +from torch import nn +from torch.utils.data import DataLoader +from torchvision import datasets +from torchvision.transforms import ToTensor + +import ray.train as train +from ray.train.torch import TorchTrainer +from ray.air.config import ScalingConfig + +# Download training data from open datasets. +training_data = datasets.FashionMNIST( + root="~/data", + train=True, + download=True, + transform=ToTensor(), +) + +# Download test data from open datasets. 
+test_data = datasets.FashionMNIST( + root="~/data", + train=False, + download=True, + transform=ToTensor(), +) + + +# Define model +class NeuralNetwork(nn.Module): + def __init__(self): + super(NeuralNetwork, self).__init__() + self.flatten = nn.Flatten() + self.linear_relu_stack = nn.Sequential( + nn.Linear(28 * 28, 512), + nn.ReLU(), + nn.Linear(512, 512), + nn.ReLU(), + nn.Linear(512, 10), + nn.ReLU(), + ) + + def forward(self, x): + x = self.flatten(x) + logits = self.linear_relu_stack(x) + return logits + + +def train_epoch(dataloader, model, loss_fn, optimizer): + size = len(dataloader.dataset) // session.get_world_size() + model.train() + for batch, (X, y) in enumerate(dataloader): + # Compute prediction error + pred = model(X) + loss = loss_fn(pred, y) + + # Backpropagation + optimizer.zero_grad() + loss.backward() + optimizer.step() + + if batch % 100 == 0: + loss, current = loss.item(), batch * len(X) + print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]") + + +def validate_epoch(dataloader, model, loss_fn): + size = len(dataloader.dataset) // session.get_world_size() + num_batches = len(dataloader) + model.eval() + test_loss, correct = 0, 0 + with torch.no_grad(): + for X, y in dataloader: + pred = model(X) + test_loss += loss_fn(pred, y).item() + correct += (pred.argmax(1) == y).type(torch.float).sum().item() + test_loss /= num_batches + correct /= size + print( + f"Test Error: \n " + f"Accuracy: {(100 * correct):>0.1f}%, " + f"Avg loss: {test_loss:>8f} \n" + ) + return test_loss + + +def train_func(config: Dict): + batch_size = config["batch_size"] + lr = config["lr"] + epochs = config["epochs"] + + worker_batch_size = batch_size // session.get_world_size() + + # Create data loaders. + train_dataloader = DataLoader(training_data, batch_size=worker_batch_size) + test_dataloader = DataLoader(test_data, batch_size=worker_batch_size) + + train_dataloader = train.torch.prepare_data_loader(train_dataloader) + test_dataloader = train.torch.prepare_data_loader(test_dataloader) + + # Create model. + model = NeuralNetwork() + model = train.torch.prepare_model(model) + + loss_fn = nn.CrossEntropyLoss() + optimizer = torch.optim.SGD(model.parameters(), lr=lr) + + for _ in range(epochs): + train_epoch(train_dataloader, model, loss_fn, optimizer) + loss = validate_epoch(test_dataloader, model, loss_fn) + session.report(dict(loss=loss)) + + +def train_fashion_mnist(num_workers=2, use_gpu=False): + trainer = TorchTrainer( + train_loop_per_worker=train_func, + train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4}, + scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu), + ) + result = trainer.fit() + print(f"Last result: {result.metrics}") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--address", required=False, type=str, help="the address to use for Ray" + ) + parser.add_argument( + "--num-workers", + "-n", + type=int, + default=2, + help="Sets number of workers for training.", + ) + parser.add_argument( + "--use-gpu", action="store_true", default=False, help="Enables GPU training" + ) + parser.add_argument( + "--smoke-test", + action="store_true", + default=False, + help="Finish quickly for testing.", + ) + + args, _ = parser.parse_known_args() + + import ray + + if args.smoke_test: + # 2 workers + 1 for trainer. 
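+        # A local, 3-CPU Ray instance is enough here because the smoke test
+        # uses two CPU-only training workers plus the Trainer itself.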
+ ray.init(num_cpus=3) + train_fashion_mnist() + else: + ray.init(address=args.address) + train_fashion_mnist(num_workers=args.num_workers, use_gpu=args.use_gpu) \ No newline at end of file diff --git a/templates/distributed-training/tensorflow.py b/templates/distributed-training/tensorflow.py new file mode 100644 index 000000000..26722c33d --- /dev/null +++ b/templates/distributed-training/tensorflow.py @@ -0,0 +1,132 @@ +# This example showcases how to use Tensorflow with Ray Train. +# Original code: +# https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras +import argparse +from filelock import FileLock +import json +import os + +import numpy as np +from ray.air.result import Result +import tensorflow as tf + +from ray.train.tensorflow import TensorflowTrainer +from ray.air.integrations.keras import ReportCheckpointCallback +from ray.air.config import ScalingConfig + + +def mnist_dataset(batch_size: int) -> tf.data.Dataset: + with FileLock(os.path.expanduser("~/.mnist_lock")): + (x_train, y_train), _ = tf.keras.datasets.mnist.load_data() + # The `x` arrays are in uint8 and have values in the [0, 255] range. + # You need to convert them to float32 with values in the [0, 1] range. + x_train = x_train / np.float32(255) + y_train = y_train.astype(np.int64) + train_dataset = ( + tf.data.Dataset.from_tensor_slices((x_train, y_train)) + .shuffle(60000) + .repeat() + .batch(batch_size) + ) + return train_dataset + + +def build_cnn_model() -> tf.keras.Model: + model = tf.keras.Sequential( + [ + tf.keras.Input(shape=(28, 28)), + tf.keras.layers.Reshape(target_shape=(28, 28, 1)), + tf.keras.layers.Conv2D(32, 3, activation="relu"), + tf.keras.layers.Flatten(), + tf.keras.layers.Dense(128, activation="relu"), + tf.keras.layers.Dense(10), + ] + ) + return model + + +def train_func(config: dict): + per_worker_batch_size = config.get("batch_size", 64) + epochs = config.get("epochs", 3) + steps_per_epoch = config.get("steps_per_epoch", 70) + + tf_config = json.loads(os.environ["TF_CONFIG"]) + num_workers = len(tf_config["cluster"]["worker"]) + + strategy = tf.distribute.MultiWorkerMirroredStrategy() + + global_batch_size = per_worker_batch_size * num_workers + multi_worker_dataset = mnist_dataset(global_batch_size) + + with strategy.scope(): + # Model building/compiling need to be within `strategy.scope()`. 
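+        # Variables created inside the scope are mirrored on every worker and
+        # kept in sync by MultiWorkerMirroredStrategy during training.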
+ multi_worker_model = build_cnn_model() + learning_rate = config.get("lr", 0.001) + multi_worker_model.compile( + loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), + optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate), + metrics=["accuracy"], + ) + + history = multi_worker_model.fit( + multi_worker_dataset, + epochs=epochs, + steps_per_epoch=steps_per_epoch, + callbacks=[ReportCheckpointCallback()], + ) + results = history.history + return results + + +def train_tensorflow_mnist( + num_workers: int = 2, use_gpu: bool = False, epochs: int = 4 +) -> Result: + config = {"lr": 1e-3, "batch_size": 64, "epochs": epochs} + trainer = TensorflowTrainer( + train_loop_per_worker=train_func, + train_loop_config=config, + scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu), + ) + results = trainer.fit() + return results + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--address", required=False, type=str, help="the address to use for Ray" + ) + parser.add_argument( + "--num-workers", + "-n", + type=int, + default=2, + help="Sets number of workers for training.", + ) + parser.add_argument( + "--use-gpu", action="store_true", default=False, help="Enables GPU training" + ) + parser.add_argument( + "--epochs", type=int, default=3, help="Number of epochs to train for." + ) + parser.add_argument( + "--smoke-test", + action="store_true", + default=False, + help="Finish quickly for testing.", + ) + + args, _ = parser.parse_known_args() + + import ray + + if args.smoke_test: + # 2 workers, 1 for trainer, 1 for datasets + num_gpus = args.num_workers if args.use_gpu else 0 + ray.init(num_cpus=4, num_gpus=num_gpus) + train_tensorflow_mnist(num_workers=2, use_gpu=args.use_gpu) + else: + ray.init(address=args.address) + train_tensorflow_mnist( + num_workers=args.num_workers, use_gpu=args.use_gpu, epochs=args.epochs + ) \ No newline at end of file From 36542b48c9814451e73a3d283e31d91a9916eee1 Mon Sep 17 00:00:00 2001 From: mjconnor Date: Wed, 12 Jul 2023 12:25:27 -0700 Subject: [PATCH 2/6] updating --- templates/distributed-training/README.md | 93 +++++++++++++------- templates/distributed-training/pytorch.py | 17 ++-- templates/distributed-training/tensorflow.py | 25 ++---- 3 files changed, 72 insertions(+), 63 deletions(-) diff --git a/templates/distributed-training/README.md b/templates/distributed-training/README.md index 37e61ef61..f79d0dd54 100644 --- a/templates/distributed-training/README.md +++ b/templates/distributed-training/README.md @@ -1,55 +1,84 @@ -# Fine-Tuning LLMs on Anyscale with DeepSpeed +# Distributed Training With PyTorch and TensorFlow on Fashion MNIST -In this application you will fine tune an LLM - GPTJ. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information on GPT-J, click [here](https://huggingface.co/docs/transformers/model_doc/gptj). +In this tutorial you will train models with Fashion MNIST dataset using both PyTorch and TensorFlow. -The application can be used by developers and datascientists alike. Developers can leverage simple APIs to run fine tuning jobs on the cluster while data scientists and machine learning engineers can dive in deeper to view the underlying code using familiar tools like JupyterLab notebooks or Visual Studio Code. **These tools enable you to easily adapt this example to use other similar models or your own data**. 
-| App Details | Description | +| Details | Description | | ---------------------- | ----------- | -| Summary | This app loads a pretrained GPTJ model from HuggingFace and fine tunes it on new text data. | -| Time to Run | Around 20-40 minutes to fine tune on all of the data. | -| Minimum Compute Requirements | At least 1 GPU node. The default is 1 node (the head), and up to 15 worker nodes each with 1 NVIDIA T4 GPU. | -| Cluster Environment | This template uses a docker image built on top of the latest Anyscale-provided Ray image using Python 3.10: [`anyscale/ray:latest-py310-cu118`](https://docs.anyscale.com/reference/base-images/overview). See the appendix below for more details. | +| Summary | This tutorial demonstrates how to set up distributed training with PyTorch or TensorFlow using the MNIST dataset and run on Anyscale| +| Time to Run | Less than 5 minutes | +| Compute Requirements | We recommend at least 1 GPU node. The default will scale up to 3 worker nodes each with 1 NVIDIA T4 GPU. | +| Cluster Environment | This template uses a docker image built on top of the latest Anyscale-provided Ray image using Python 3.10, which comes with PyTorch and TensorFlow: [`anyscale/ray-ml:2.5.1-py310-gpu`](https://docs.anyscale.com/reference/base-images/overview). See the appendix below for more details. | -## Using this application -You can use the application via the CLI. Navigate to the "terminal" once started and run the following command: +## Running the tutorial +### Background +The tutorial is run from an Anyscale Workspace. Anyscale Workspaces is a fully managed development environment that enables ML practitioners to build distributed Ray applications and advance from research to development to production easily, all within a single environment. The Workspace provides developer friendly tools like VSCode and Jupyter backed by a remote scaling Ray Cluster for development. + +Anyscale requires 2 configs to start up a Workspace Cluster: +1. A cluster environment that handles dependencies. +2. A compute configuration that determines how many nodes of each type to bring up. This also configures how many nodes are available for autoscaling. + +Those have been set by default in this tutorial but can be edited and updated if needed and is covered in the appendix. + +### Run +There are two python scripts available with this tutorial - one for PyTorch and one for TensorFlow. You can run these training scripts directly from the workspace terminal. + +To run the PyTorch example: +```bash +python pytorch.py ``` -anyscale job submit -- python gptj_deepspeed_fine_tuning.py +And the TensorFlow version: +```bash +python tensorflow.py ``` -Once submitted, you can navigate to the Job page and view the training progress with the Ray Dashboard. -![Ray Dashboard](https://github.com/anyscale/templates/releases/download/media/raydash.png) -Note: This application is based on an example. If you wish to go step by step and learn more please visit the [Ray docs tutorial](https://docs.ray.io/en/latest/ray-air/examples/gptj_deepspeed_fine_tuning.html) +You'll see training iterations and metrics as the training executes. -### Next Steps +### Monitor +After launching the script, you can look at the Ray dashboard. It can be accessed from the Workspace home page and enables users to track things like CPU/GPU utilization, GPU memory usage, remote task statuses, and more! 
-#### Training on your own data: Modifying the Script -Once your application is ready and launched you may view the script with VSCode or Jupyter and modify to use your own data! Read more about loading data with Ray [from your file store or database here](https://docs.ray.io/en/latest/data/loading-data.html). Make sure the data you use has a similar structure to the [Shakespeare dataset we use.](https://huggingface.co/datasets/tiny_shakespeare) +![Dash](https://github.com/anyscale/templates/releases/download/media/workspacedash.png) -Modify the code under the 'loading data' section of the script to load your own fine-tuning dataset. +[See here for more extensive documentation on the dashboard.](https://docs.ray.io/en/latest/ray-observability/getting-started.html) -Once the code is updated, run the same command as before: -``` -anyscale job submit -- python gptj_deepspeed_fine_tuning.py -``` +### Model Saving +The model will be saved in the Anyscale Artifact Store, which is automatically set up and configured with your Anyscale deployment. -## Saving your model -The fine tuning job automatically saves checkpoints during training in your [default mounted user storage](https://docs.anyscale.com/develop/workspaces/storage#user-storage). You can view the model by navigating to "Files" viewer and selecting "User Storage". -![Files](https://github.com/anyscale/templates/releases/download/media/files.png) +For every Anyscale Cloud, a default object storage bucket is configured during the Cloud deployment. All the Workspaces, Jobs, and Services Clusters within an Anyscale Cloud have permission to read and write to its default bucket. -Within 2 minutes you will be fine-tuning GPT-J on a corpus of Shakspeare data! Let's dive in and explore the power of Anyscale and Ray together. +Use the following environment variables to access the default bucket: +ANYSCALE_CLOUD_STORAGE_BUCKET: the name of the bucket. +ANYSCALE_CLOUD_STORAGE_BUCKET_REGION: the region of the bucket. +ANYSCALE_ARTIFACT_STORAGE: the URI to the pre-generated folder for storing your artifacts while keeping them separate them from Anyscale-generated ones. +AWS: s3:////artifact_storage/ +GCP: gs:////artifact_storage/ -## Appendix +### Submit as Anyscale Production Job +From within your Anyscale Workspace, you can run your script as an Anyscale Job. This might be useful if you want to run things in production and have a long running job. You can test that each Anyscale Job will spin up its own cluster (with the same compute config and cluster environment as the Workspace) and run the script. The Anyscale Job will automatically retry in event of failure and provides monitoring via the Ray Dashboard and Grafana. -### Advanced - Workspaces and Configurations -This application makes use of [Anyscale Workspaces](https://docs.anyscale.com/develop/workspaces/get-started) and Ray AIR (with the 🤗 Transformers integration) to fine-tune an LLM. Workspace is a fully managed development environment focused on developer productivity. With workspaces, ML practitioners and ML platform developers can quickly build distributed Ray applications and advance from research to development to production easily, all within single environment. +To submit as a Production Job you can run: -To run this example, we've set up your Anyscale Workspace to have access to a head node with one GPU with 16 or more GBs of memory and 15 g4dn.4xlarge instances for the worker node group. This is done by defining a "compute configuration". 
Learn more about [Compute Configs here](https://docs.anyscale.com/configure/compute-configs/overview). It is easy to change your Compute Config once you launch by clicking "Workspace" and Editing the selection. -![Config](https://github.com/anyscale/templates/releases/download/media/edit.png) +```bash +anyscale job submit -- python pytorch.py +``` + +[You can learn more about Anyscale Jobs here.](https://docs.anyscale.com/productionize/jobs/get-started) + +### Next Steps + +#### Training on your own data: Modifying the Script +You can easily modify the script in VSCode or Jupyter to use your own data, add data pre-processing logic, or change the model architecture! Read more about loading data with Ray [from your file store or database here](https://docs.ray.io/en/latest/data/loading-data.html). +Once the code is updated, run the same command as before to kick off your training job: +```bash +anyscale job submit -- python pytorch.py +``` -When you run the fine tuning job we execute a python script thats distributed with Ray as an [Anyscale Job](https://docs.anyscale.com/productionize/jobs/get-started). +## Appendix +### Advanced - Workspaces and Configurations +To run this example, we've set up your Anyscale Workspace to have access to a head node with CPUs and woker nodes with GPUs.This is done by defining a "compute configuration". Learn more about [Compute Configs here](https://docs.anyscale.com/configure/compute-configs/overview). It is easy to change your Compute Config once you launch by clicking "Workspace" and Editing the selection. +![Config](https://github.com/anyscale/templates/releases/download/media/edit.png) ### Advanced: Build off of this template's cluster environment #### Option 1: Build a new cluster environment on Anyscale diff --git a/templates/distributed-training/pytorch.py b/templates/distributed-training/pytorch.py index 1ccabebed..0e81f9488 100644 --- a/templates/distributed-training/pytorch.py +++ b/templates/distributed-training/pytorch.py @@ -114,7 +114,7 @@ def train_func(config: Dict): session.report(dict(loss=loss)) -def train_fashion_mnist(num_workers=2, use_gpu=False): +def train_fashion_mnist(num_workers=2, use_gpu=True): trainer = TorchTrainer( train_loop_per_worker=train_func, train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4}, @@ -126,9 +126,7 @@ def train_fashion_mnist(num_workers=2, use_gpu=False): if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument( - "--address", required=False, type=str, help="the address to use for Ray" - ) + parser.add_argument( "--num-workers", "-n", @@ -137,7 +135,7 @@ def train_fashion_mnist(num_workers=2, use_gpu=False): help="Sets number of workers for training.", ) parser.add_argument( - "--use-gpu", action="store_true", default=False, help="Enables GPU training" + "--use-gpu", action="store_true", default=True, help="Enables GPU training" ) parser.add_argument( "--smoke-test", @@ -150,10 +148,5 @@ def train_fashion_mnist(num_workers=2, use_gpu=False): import ray - if args.smoke_test: - # 2 workers + 1 for trainer. 
- ray.init(num_cpus=3) - train_fashion_mnist() - else: - ray.init(address=args.address) - train_fashion_mnist(num_workers=args.num_workers, use_gpu=args.use_gpu) \ No newline at end of file + + train_fashion_mnist(num_workers=args.num_workers, use_gpu=args.use_gpu) \ No newline at end of file diff --git a/templates/distributed-training/tensorflow.py b/templates/distributed-training/tensorflow.py index 26722c33d..992bcd303 100644 --- a/templates/distributed-training/tensorflow.py +++ b/templates/distributed-training/tensorflow.py @@ -79,7 +79,7 @@ def train_func(config: dict): def train_tensorflow_mnist( - num_workers: int = 2, use_gpu: bool = False, epochs: int = 4 + num_workers: int = 2, use_gpu: bool = True, epochs: int = 4 ) -> Result: config = {"lr": 1e-3, "batch_size": 64, "epochs": epochs} trainer = TensorflowTrainer( @@ -93,9 +93,7 @@ def train_tensorflow_mnist( if __name__ == "__main__": parser = argparse.ArgumentParser() - parser.add_argument( - "--address", required=False, type=str, help="the address to use for Ray" - ) + parser.add_argument( "--num-workers", "-n", @@ -104,29 +102,18 @@ def train_tensorflow_mnist( help="Sets number of workers for training.", ) parser.add_argument( - "--use-gpu", action="store_true", default=False, help="Enables GPU training" + "--use-gpu", action="store_true", default=True, help="Enables GPU training" ) parser.add_argument( "--epochs", type=int, default=3, help="Number of epochs to train for." ) - parser.add_argument( - "--smoke-test", - action="store_true", - default=False, - help="Finish quickly for testing.", - ) + args, _ = parser.parse_known_args() import ray - if args.smoke_test: - # 2 workers, 1 for trainer, 1 for datasets - num_gpus = args.num_workers if args.use_gpu else 0 - ray.init(num_cpus=4, num_gpus=num_gpus) - train_tensorflow_mnist(num_workers=2, use_gpu=args.use_gpu) - else: - ray.init(address=args.address) - train_tensorflow_mnist( + + train_tensorflow_mnist( num_workers=args.num_workers, use_gpu=args.use_gpu, epochs=args.epochs ) \ No newline at end of file From 5fe2ca28345c142c8d43658c6a8d3060033fcbb9 Mon Sep 17 00:00:00 2001 From: mjconnor Date: Wed, 12 Jul 2023 13:17:07 -0700 Subject: [PATCH 3/6] finalizing training template --- templates/distributed-training/README.md | 16 +-- templates/distributed-training/pytorch.py | 4 + templates/distributed-training/tensorflow.py | 119 ------------------- 3 files changed, 10 insertions(+), 129 deletions(-) delete mode 100644 templates/distributed-training/tensorflow.py diff --git a/templates/distributed-training/README.md b/templates/distributed-training/README.md index f79d0dd54..6be69fb0c 100644 --- a/templates/distributed-training/README.md +++ b/templates/distributed-training/README.md @@ -1,14 +1,14 @@ -# Distributed Training With PyTorch and TensorFlow on Fashion MNIST +# Distributed Training With PyTorch on Fashion MNIST -In this tutorial you will train models with Fashion MNIST dataset using both PyTorch and TensorFlow. +In this tutorial you will train models on the Fashion MNIST dataset using PyTorch. | Details | Description | | ---------------------- | ----------- | -| Summary | This tutorial demonstrates how to set up distributed training with PyTorch or TensorFlow using the MNIST dataset and run on Anyscale| +| Summary | This tutorial demonstrates how to set up distributed training with PyTorch using the MNIST dataset and run on Anyscale| | Time to Run | Less than 5 minutes | | Compute Requirements | We recommend at least 1 GPU node. 
The default will scale up to 3 worker nodes each with 1 NVIDIA T4 GPU. | -| Cluster Environment | This template uses a docker image built on top of the latest Anyscale-provided Ray image using Python 3.10, which comes with PyTorch and TensorFlow: [`anyscale/ray-ml:2.5.1-py310-gpu`](https://docs.anyscale.com/reference/base-images/overview). See the appendix below for more details. | +| Cluster Environment | This template uses a docker image built on top of the latest Anyscale-provided Ray image using Python 3.10, which comes with PyTorch: [`anyscale/ray-ml:2.5.1-py310-gpu`](https://docs.anyscale.com/reference/base-images/overview). See the appendix below for more details. | ## Running the tutorial ### Background @@ -21,16 +21,12 @@ Anyscale requires 2 configs to start up a Workspace Cluster: Those have been set by default in this tutorial but can be edited and updated if needed and is covered in the appendix. ### Run -There are two python scripts available with this tutorial - one for PyTorch and one for TensorFlow. You can run these training scripts directly from the workspace terminal. +The tutorial includes a pyton script with the code to do distributed training. You can execute this training script directly from the workspace terminal. -To run the PyTorch example: +To run: ```bash python pytorch.py ``` -And the TensorFlow version: -```bash -python tensorflow.py -``` You'll see training iterations and metrics as the training executes. diff --git a/templates/distributed-training/pytorch.py b/templates/distributed-training/pytorch.py index 0e81f9488..ef4bfeebd 100644 --- a/templates/distributed-training/pytorch.py +++ b/templates/distributed-training/pytorch.py @@ -11,6 +11,8 @@ import ray.train as train from ray.train.torch import TorchTrainer from ray.air.config import ScalingConfig +from ray.air.config import RunConfig + # Download training data from open datasets. training_data = datasets.FashionMNIST( @@ -119,6 +121,8 @@ def train_fashion_mnist(num_workers=2, use_gpu=True): train_loop_per_worker=train_func, train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4}, scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu), + run_config=RunConfig(storage_path="$ANYSCALE_ARTIFACT_STORAGE") + ) result = trainer.fit() print(f"Last result: {result.metrics}") diff --git a/templates/distributed-training/tensorflow.py b/templates/distributed-training/tensorflow.py deleted file mode 100644 index 992bcd303..000000000 --- a/templates/distributed-training/tensorflow.py +++ /dev/null @@ -1,119 +0,0 @@ -# This example showcases how to use Tensorflow with Ray Train. -# Original code: -# https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras -import argparse -from filelock import FileLock -import json -import os - -import numpy as np -from ray.air.result import Result -import tensorflow as tf - -from ray.train.tensorflow import TensorflowTrainer -from ray.air.integrations.keras import ReportCheckpointCallback -from ray.air.config import ScalingConfig - - -def mnist_dataset(batch_size: int) -> tf.data.Dataset: - with FileLock(os.path.expanduser("~/.mnist_lock")): - (x_train, y_train), _ = tf.keras.datasets.mnist.load_data() - # The `x` arrays are in uint8 and have values in the [0, 255] range. - # You need to convert them to float32 with values in the [0, 1] range. 
- x_train = x_train / np.float32(255) - y_train = y_train.astype(np.int64) - train_dataset = ( - tf.data.Dataset.from_tensor_slices((x_train, y_train)) - .shuffle(60000) - .repeat() - .batch(batch_size) - ) - return train_dataset - - -def build_cnn_model() -> tf.keras.Model: - model = tf.keras.Sequential( - [ - tf.keras.Input(shape=(28, 28)), - tf.keras.layers.Reshape(target_shape=(28, 28, 1)), - tf.keras.layers.Conv2D(32, 3, activation="relu"), - tf.keras.layers.Flatten(), - tf.keras.layers.Dense(128, activation="relu"), - tf.keras.layers.Dense(10), - ] - ) - return model - - -def train_func(config: dict): - per_worker_batch_size = config.get("batch_size", 64) - epochs = config.get("epochs", 3) - steps_per_epoch = config.get("steps_per_epoch", 70) - - tf_config = json.loads(os.environ["TF_CONFIG"]) - num_workers = len(tf_config["cluster"]["worker"]) - - strategy = tf.distribute.MultiWorkerMirroredStrategy() - - global_batch_size = per_worker_batch_size * num_workers - multi_worker_dataset = mnist_dataset(global_batch_size) - - with strategy.scope(): - # Model building/compiling need to be within `strategy.scope()`. - multi_worker_model = build_cnn_model() - learning_rate = config.get("lr", 0.001) - multi_worker_model.compile( - loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), - optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate), - metrics=["accuracy"], - ) - - history = multi_worker_model.fit( - multi_worker_dataset, - epochs=epochs, - steps_per_epoch=steps_per_epoch, - callbacks=[ReportCheckpointCallback()], - ) - results = history.history - return results - - -def train_tensorflow_mnist( - num_workers: int = 2, use_gpu: bool = True, epochs: int = 4 -) -> Result: - config = {"lr": 1e-3, "batch_size": 64, "epochs": epochs} - trainer = TensorflowTrainer( - train_loop_per_worker=train_func, - train_loop_config=config, - scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu), - ) - results = trainer.fit() - return results - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - - parser.add_argument( - "--num-workers", - "-n", - type=int, - default=2, - help="Sets number of workers for training.", - ) - parser.add_argument( - "--use-gpu", action="store_true", default=True, help="Enables GPU training" - ) - parser.add_argument( - "--epochs", type=int, default=3, help="Number of epochs to train for." 
- ) - - - args, _ = parser.parse_known_args() - - import ray - - - train_tensorflow_mnist( - num_workers=args.num_workers, use_gpu=args.use_gpu, epochs=args.epochs - ) \ No newline at end of file From fe043be618e283da946796f57e2d45ffa6dd4b3b Mon Sep 17 00:00:00 2001 From: mjconnor Date: Wed, 12 Jul 2023 13:40:42 -0700 Subject: [PATCH 4/6] final updates --- templates/distributed-training/README.md | 18 ----------------- .../distributed-training/cluster_env.yaml | 20 ------------------- 2 files changed, 38 deletions(-) delete mode 100644 templates/distributed-training/cluster_env.yaml diff --git a/templates/distributed-training/README.md b/templates/distributed-training/README.md index 6be69fb0c..f72c3446c 100644 --- a/templates/distributed-training/README.md +++ b/templates/distributed-training/README.md @@ -69,22 +69,4 @@ You can easily modify the script in VSCode or Jupyter to use your own data, add Once the code is updated, run the same command as before to kick off your training job: ```bash anyscale job submit -- python pytorch.py -``` - -## Appendix -### Advanced - Workspaces and Configurations -To run this example, we've set up your Anyscale Workspace to have access to a head node with CPUs and woker nodes with GPUs.This is done by defining a "compute configuration". Learn more about [Compute Configs here](https://docs.anyscale.com/configure/compute-configs/overview). It is easy to change your Compute Config once you launch by clicking "Workspace" and Editing the selection. -![Config](https://github.com/anyscale/templates/releases/download/media/edit.png) - -### Advanced: Build off of this template's cluster environment -#### Option 1: Build a new cluster environment on Anyscale -You'll find a cluster_env.yaml file in the working directory of the template. Feel free to modify this to include more requirements, then follow [this](https://docs.anyscale.com/configure/dependency-management/cluster-environments#creating-a-cluster-environment) guide to use the Anyscale CLI to create a new cluster environment. - -Finally, update your workspace's cluster environment to this new one after it's done building. - -#### Option 2: Build a new docker image with your own infrastructure -Use the following docker pull command if you want to manually build a new Docker image based off of this one. - -```bash -docker pull us-docker.pkg.dev/anyscale-workspace-templates/workspace-templates/fine-tune-gptj:latest ``` \ No newline at end of file diff --git a/templates/distributed-training/cluster_env.yaml b/templates/distributed-training/cluster_env.yaml deleted file mode 100644 index 209838289..000000000 --- a/templates/distributed-training/cluster_env.yaml +++ /dev/null @@ -1,20 +0,0 @@ -# See https://hub.docker.com/r/anyscale/ray for full list of -# available Ray, Python, and CUDA versions. 
-base_image: anyscale/ray-ml:nightly-py310-gpu - -env_vars: {} - -debian_packages: [] - -python: - pip_packages: - - accelerate==0.16.0 - - transformers==4.26.0 - - torch==2.0.1 - - deepspeed==0.9.2 - - evaluate==0.4.0 - - datasets==2.13.1 - - conda_packages: [] - -post_build_cmds: [] \ No newline at end of file From ba705af7d43dcd28640cdd6e17e5e3a58ad786e1 Mon Sep 17 00:00:00 2001 From: mjconnor Date: Wed, 12 Jul 2023 14:26:51 -0700 Subject: [PATCH 5/6] artifact storage --- templates/distributed-training/pytorch.py | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/templates/distributed-training/pytorch.py b/templates/distributed-training/pytorch.py index ef4bfeebd..d4b3c43a3 100644 --- a/templates/distributed-training/pytorch.py +++ b/templates/distributed-training/pytorch.py @@ -1,6 +1,7 @@ import argparse from typing import Dict from ray.air import session +import os import torch from torch import nn @@ -115,13 +116,15 @@ def train_func(config: Dict): loss = validate_epoch(test_dataloader, model, loss_fn) session.report(dict(loss=loss)) +artifact_storage_path = os.environ['ANYSCALE_ARTIFACT_STORAGE'] +'/pytorch-tutorial' +print(f"Storing artifacts in {artifact_storage_path}") def train_fashion_mnist(num_workers=2, use_gpu=True): trainer = TorchTrainer( train_loop_per_worker=train_func, train_loop_config={"lr": 1e-3, "batch_size": 64, "epochs": 4}, scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu), - run_config=RunConfig(storage_path="$ANYSCALE_ARTIFACT_STORAGE") + run_config=RunConfig(storage_path=artifact_storage_path) ) result = trainer.fit() From f17c34ce0f0d31d6a1281737e23ad7381d003dac Mon Sep 17 00:00:00 2001 From: mjconnor Date: Wed, 12 Jul 2023 14:38:28 -0700 Subject: [PATCH 6/6] more info on seeing model in artifact store --- templates/distributed-training/README.md | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/templates/distributed-training/README.md b/templates/distributed-training/README.md index f72c3446c..df012668c 100644 --- a/templates/distributed-training/README.md +++ b/templates/distributed-training/README.md @@ -44,11 +44,21 @@ For every Anyscale Cloud, a default object storage bucket is configured during t Use the following environment variables to access the default bucket: -ANYSCALE_CLOUD_STORAGE_BUCKET: the name of the bucket. -ANYSCALE_CLOUD_STORAGE_BUCKET_REGION: the region of the bucket. -ANYSCALE_ARTIFACT_STORAGE: the URI to the pre-generated folder for storing your artifacts while keeping them separate them from Anyscale-generated ones. -AWS: s3:////artifact_storage/ -GCP: gs:////artifact_storage/ +1. ANYSCALE_CLOUD_STORAGE_BUCKET: the name of the bucket. +2. ANYSCALE_CLOUD_STORAGE_BUCKET_REGION: the region of the bucket. +3. ANYSCALE_ARTIFACT_STORAGE: the URI to the pre-generated folder for storing your artifacts while keeping them separate them from Anyscale-generated ones. + + +You can view the saved model and artifacts by running: +```bash +aws s3 ls $ANYSCALE_ARTIFACT_STORAGE +``` + +Or, if you are on GCP: +```bash +gsutil ls $ANYSCALE_ARTIFACT_STORAGE +``` +Authentication is automatcially handled by default. ### Submit as Anyscale Production Job From within your Anyscale Workspace, you can run your script as an Anyscale Job. This might be useful if you want to run things in production and have a long running job. You can test that each Anyscale Job will spin up its own cluster (with the same compute config and cluster environment as the Workspace) and run the script. 
The Anyscale Job will automatically retry in the event of a failure and provides monitoring via the Ray Dashboard and Grafana.