Deploying a GPT-2 Model using Python Backend and Iterative Scheduling

Navigate to Part 6: Building Complex Pipelines: Stable Diffusion | Documentation: Iterative Scheduling

In this tutorial, we will deploy a GPT-2 model using the Python backend and demonstrate the iterative scheduling feature.

Prerequisites

Before getting started with this tutorial, make sure you're familiar with the following concepts:

Iterative Scheduling

Iterative scheduling is a technique that allows the Triton Inference Server to schedule the same request multiple times with the same input. This is useful for models that have an auto-regressive loop. Iterative scheduling enables Triton Server to implement in-flight batching for your models and gives you the ability to combine new sequences with in-flight sequences as they arrive.
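
On the server side, a model opts into this behavior through its model configuration (the Triton documentation describes a sequence_batching { iterative_sequence : true } setting) and a model.py that asks Triton to reschedule a request after each generation step. The following is only a minimal sketch of that rescheduling pattern, not the tutorial's actual model.py; the tensor names, the token-generation stub, and the stopping condition are placeholders.

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    # Minimal sketch of the request-rescheduling pattern behind iterative
    # scheduling -- not the tutorial's actual model.py.

    def execute(self, requests):
        for request in requests:
            # Placeholder tensor names; the real model defines its own I/O.
            ids = pb_utils.get_input_tensor_by_name(request, "input_ids").as_numpy()

            # Run one auto-regressive step and stream the new token back
            # through the decoupled response sender.
            next_token = self._generate_one_token(ids)
            sender = request.get_response_sender()
            out = pb_utils.Tensor("output_ids", np.array([next_token], dtype=np.int64))
            sender.send(pb_utils.InferenceResponse(output_tensors=[out]))

            if next_token == self._eos_token():
                # Sequence finished: close the response stream and let Triton
                # release the request normally.
                sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
            else:
                # Sequence not finished: ask Triton to hand this same request
                # back in a later batch, where it can be grouped with newly
                # arrived sequences (in-flight batching).
                request.set_release_flags(
                    pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE
                )
        # Decoupled models return None; responses go through the sender above.
        return None

    def _generate_one_token(self, ids):
        # Placeholder: a real model would run GPT-2 here.
        return 0

    def _eos_token(self):
        # Placeholder end-of-sequence token id.
        return 0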

Tutorial Overview

In this tutorial we deploy two models:

  • simple-gpt2: This model receives a batch of requests and proceeds to the next batch only when it is done generating tokens for the current batch.

  • iterative-gpt2: This model uses iterative scheduling to process new sequences in a batch even when it is still generating tokens for the previous sequences.

Demo

(asciinema demo recording)

Step 1: Prepare the Server Environment

  • First, run the Triton Inference Server container:
# Replace yy.mm with the release year and month. Use the 24.04 release or later.
docker run --gpus=all --name iterative-scheduling -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash
  • Next, install the dependencies required by the models running in the Python backend, and log in with your Hugging Face token (a Hugging Face account is required), e.g. with huggingface-cli login.
pip install transformers[torch]

Note

Optional: If you want to avoid installing the dependencies each time you run the container, you can run docker commit iterative-scheduling iterative-scheduling-image to save the container as an image and use that image for subsequent runs.

Then, start the server:

tritonserver --model-repository=/models

Step 2: Install the client-side dependencies

In another terminal, install the client dependencies:

pip3 install tritonclient[grpc]
pip3 install tqdm
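
Once the server from Step 1 is running, you can optionally verify from Python that it is reachable before starting the clients. This is a small sketch that assumes the default gRPC port mapping (-p8001:8001) from the docker run command above.

import tritonclient.grpc as grpcclient

# Connect to the gRPC endpoint exposed by the container (-p8001:8001).
client = grpcclient.InferenceServerClient(url="localhost:8001")

print("server ready:", client.is_server_ready())
print("simple-gpt2 ready:", client.is_model_ready("simple-gpt2"))
print("iterative-gpt2 ready:", client.is_model_ready("iterative-gpt2"))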

Step 3: Run the client

The simple-gpt2 model doesn't use iterative scheduling and will proceed to the next batch only when it is done generating tokens for the current batch.

Run the following command to start the client:

python3 client/client.py --model simple-gpt2

As you can see, the tokens for one request are generated to completion before the next request is processed.

Press Ctrl+C to stop the client.

The iterative scheduler, by contrast, is able to incorporate new requests as they arrive at the server.

Run the following command to start the client:

python3 client/client.py --model iterative-gpt2

As you can see, the tokens for both prompts are generated simultaneously.
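
If you want to experiment beyond client/client.py, the decoupled responses can also be consumed directly through the gRPC streaming API. The sketch below only illustrates that API; the tensor names text_input/text_output, the input shape, and the fixed number of responses read are assumptions rather than the tutorial's actual client interface (check the model's config.pbtxt for the real names).

import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient

results = queue.Queue()

def on_response(q, result, error):
    # The callback runs on a background thread; hand results to the main thread.
    q.put((result, error))

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=partial(on_response, results))

# Assumed tensor name and shape -- check the model configuration for the real ones.
prompt = np.array([b"The weather in Paris is"], dtype=np.object_)
text_input = grpcclient.InferInput("text_input", prompt.shape, "BYTES")
text_input.set_data_from_numpy(prompt)

client.async_stream_infer(model_name="iterative-gpt2", inputs=[text_input])

# Read a handful of streamed responses; the tutorial's client keeps reading
# until the model signals that the sequence is complete.
for _ in range(10):
    result, error = results.get()
    if error is not None:
        raise error
    print(result.as_numpy("text_output"))

client.stop_stream()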

Next Steps

We plan to integrate a KV-Cache with these models for better performance. For now, the main goal of this tutorial is to demonstrate how to use iterative scheduling with the Python backend.