This project provides an OpenAI API compatible proxy for NVIDIA Triton Inference Server. LLMs on NVIDIA GPUs can benefit from the high-performance inference offered by the TensorRT-LLM backend running on Triton Inference Server, compared to using llama.cpp.
Triton Inference Server supports HTTP/REST and GRPC inference protocols based on the community-developed KServe protocol, but those are not usable with existing OpenAI API clients.
This proxy bridges that gap. It currently supports only the text-generation OpenAI API endpoints, which is sufficient for Open WebUI and similar OpenAI clients-
- GET|POST /v1/models (or /models)
- GET /v1/models/{model} (or /models/{model})
- POST /v1/chat/completions (or /v1/completions) - streaming supported
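For example, once the proxy is running (see the Docker instructions below; it listens on port 11434), a chat completion can be requested with plain curl. The model name llama3 is only a placeholder here; substitute whatever model name your Triton deployment exposes.
# Ask the proxy for a chat completion (adjust host, port and model name to your deployment)
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'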
Recommended: Use the pre-published Docker image-
docker image pull visitsb/tritonserver:24.07-trtllm-python-py3
Alternatively, use the Dockerfile to build a local image. The proxy is built on top of the existing Triton Inference Server Docker image, which includes the TensorRT-LLM backend-
# Pull upstream NVIDIA docker image
docker image pull nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
# Clone this repository
git clone <this repository>
cd triton-inference-server-openai-api
# Build your custom docker image with proxy bundled
docker buildx build --no-cache --tag myimages/tritonserver:24.07-trtllm-python-py3 .
Once your image is pulled (or built locally), you can run it directly using Docker-
# Run Triton Inference Server along with the proxy, as shown in the `sh -c` command
docker run --rm --tty --interactive \
--gpus all --shm-size 4g --memory 32g \
--cpuset-cpus 0-3 --publish 11434:11434/tcp \
--volume <your Triton models folder>:/models:rw \
--name triton \
visitsb/tritonserver:24.07-trtllm-python-py3 \
sh -c '/opt/tritonserver/bin/tritonserver \
--model-store /models/mymodel/model \
& /opt/tritonserver/bin/tritonopenaiserver \
--tokenizer_dir /models/mymodel/tokenizer \
--engine_dir /models/mymodel/engine'
Alternatively, use docker-compose.yml-
triton:
image: visitsb/tritonserver:24.07-trtllm-python-py3
command: >
sh -c '/opt/tritonserver/bin/tritonserver --model-store /models/mymodel/model & /opt/tritonserver/bin/tritonopenaiserver --tokenizer_dir /models/mymodel/tokenizer --engine_dir /models/mymodel/engine'
ports:
- "11434:11434/tcp" # OpenAI API Proxy
- "8000:8000/tcp" # HTTP
- "8001:8001/tcp" # GRPC
- "8080:8080/tcp" # Sagemaker, Vertex
- "8002:8002/tcp" # Prometheus metrics
volumes:
- <your Triton models folder>:/models:rw
shm_size: "4G"
deploy:
resources:
limits:
memory: 32G
reservations:
memory: 8G
devices:
- driver: nvidia
count: all
capabilities: [compute,video,utility]
ulimits:
stack: 67108864
memlock:
soft: -1
hard: -1
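With the compose file above (service name triton, as shown), the stack can be brought up in the background and sanity-checked against the proxy's models endpoint, for example-
# Start Triton Inference Server plus the proxy in the background
docker compose up -d triton
# The proxy should list the deployed model(s)
curl http://localhost:11434/v1/models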
Using NVIDIA GenAI-Perf to measure performance of meta-llama/Meta-Llama-3-8B-Instruct on an NVIDIA RTX 4090 GPU, the following was observed. For the llama.cpp comparison, QuantFactory/Meta-Llama-3-8B-GGUF (Meta-Llama-3-8B.Q8_0.gguf) was used-
Backend           Loaded model size (GPU memory)   GPU util   Tokens/sec
---------------   ------------------------------   --------   ----------
TensorRT (gRPC)   15879MiB / 24564MiB              91%        97.04
TensorRT (HTTP)   15879MiB / 24564MiB              91%        56.73
llama.cpp         9491MiB / 24564MiB               74%        70.23
In summary, TensorRT over gRPC clearly outperforms llama.cpp, while TensorRT over HTTP lands in the same range as llama.cpp (somewhat lower in this test: 56.73 vs 70.23 tokens/sec).
The raw performance numbers are as below-
[INFO] genai_perf.wrapper:135 - Running Perf Analyzer : 'perf_analyzer -m llama3 --async --service-kind triton -u triton:8001 --measurement-interval 4000 --stability-percentage 999 -i grpc --streaming --shape max_tokens:1 --shape text_input:1 --concurrency-range 1'
LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ Request latency (ns) │ 1,081… │ 1,048… │ 1,311,… │ 1,284… │ 1,083,… │ 1,064… │
│ Num output token │ 105 │ 100 │ 110 │ 110 │ 109 │ 107 │
│ Num input token │ 200 │ 200 │ 200 │ 200 │ 200 │ 200 │
└──────────────────────┴────────┴────────┴─────────┴────────┴─────────┴────────┘
Output token throughput (per sec): 97.04
Request throughput (per sec): 0.92
[INFO] genai_perf.wrapper:135 - Running Perf Analyzer : 'perf_analyzer -m llama3 --async --endpoint v1/chat/completions --service-kind openai -u triton:11434 --measurement-interval 4000 --stability-percentage 999 -i http --concurrency-range 1'
LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ Request latency (ns) │ 2,033… │ 1,732… │ 3,856,… │ 3,723… │ 2,525,… │ 1,802… │
│ Num output token │ 115 │ 110 │ 121 │ 121 │ 120 │ 119 │
│ Num input token │ 200 │ 200 │ 200 │ 200 │ 200 │ 200 │
└──────────────────────┴────────┴────────┴─────────┴────────┴─────────┴────────┘
Output token throughput (per sec): 56.73
Request throughput (per sec): 0.49
[INFO] genai_perf.wrapper:135 - Running Perf Analyzer : 'perf_analyzer -m llama3 --async --endpoint v1/chat/completions --service-kind openai -u llama:11434 --measurement-interval 4000 --stability-percentage 999 -i http --concurrency-range 1'
LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━┩
│ Request latency (ns) │ 1,656… │ 1,596… │ 1,822,… │ 1,810… │ 1,701,… │ 1,649… │
│ Num output token │ 116 │ 104 │ 149 │ 147 │ 132 │ 118 │
│ Num input token │ 200 │ 200 │ 200 │ 200 │ 200 │ 200 │
└──────────────────────┴────────┴────────┴─────────┴────────┴─────────┴────────┘
Output token throughput (per sec): 70.23
Request throughput (per sec): 0.60
Note: This proxy currently talks to Triton over HTTP, so the performance numbers above should be read as relative rather than absolute. Performance for TensorRT-LLM models will vary with the build and deployment options used.
Additional optimizations like speculative sampling and FP8 quantization can further improve throughput. For more on the throughput levels that are possible with TensorRT-LLM for different combinations of model, hardware, and workload, see the official benchmarks.
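As one illustration, FP8 post-training quantization is typically applied to the Hugging Face checkpoint before the engine is built, using the quantization example bundled with the TensorRT-LLM toolbox in this image. Treat the following as a sketch only: the script location under /opt/tritonserver/third-party-src/ and the exact flags vary between TensorRT-LLM releases, and the /models/mymodel paths simply mirror the Docker examples above. The quantized checkpoint then goes through the usual trtllm-build step (see the model-build sketch further below).
# Sketch only: quantize a Hugging Face checkpoint to FP8 prior to the engine build
python /opt/tritonserver/third-party-src/tensorrt_llm/examples/quantization/quantize.py \
  --model_dir /models/mymodel/tokenizer \
  --dtype float16 \
  --qformat fp8 \
  --output_dir /models/mymodel/checkpoint_fp8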
The image includes the TensorRT-LLM toolbox and backend for building your own TensorRT-LLM models. Both can be found under /opt/tritonserver/third-party-src/ inside the Docker image.
The basic steps to build a TensorRT model are outlined here (a sketch follows the list); they essentially involve-
- Downloading a Hugging Face model of your choice,
- Converting it to a TensorRT format, and
- Lastly building a compiled model that can be deployed on Triton Inference Server.
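A minimal sketch of that workflow, using the Llama example from the bundled toolbox. The path under /opt/tritonserver/third-party-src/ and the exact flags are assumptions that vary between TensorRT-LLM releases; the /models/mymodel paths match the Docker examples above.
# 1. Download a Hugging Face model (gated models such as Llama 3 require an HF token)
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir /models/mymodel/tokenizer
# 2. Convert the checkpoint to the TensorRT-LLM format
python /opt/tritonserver/third-party-src/tensorrt_llm/examples/llama/convert_checkpoint.py \
  --model_dir /models/mymodel/tokenizer \
  --output_dir /models/mymodel/checkpoint \
  --dtype float16
# 3. Build the compiled engine that Triton Inference Server (and this proxy) will serve
trtllm-build --checkpoint_dir /models/mymodel/checkpoint \
  --output_dir /models/mymodel/engine \
  --gemm_plugin float16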
You can also follow the steps mentioned here to build your TensorRT model. Once your model is built, you can deploy it on Triton Inference Server and use it through the OpenAI API proxy.
- Benchmarking NVIDIA TensorRT-LLM - TensorRT-LLM was 30-70% faster than llama.cpp on the same hardware, consumes less memory on consecutive runs (with marginally higher GPU VRAM utilization), and produces compiled models that are 20%+ smaller than llama.cpp's.
- Use Llama 3 with NVIDIA TensorRT-LLM and Triton Inference Server - a 30-minute tutorial showing how to use TensorRT-LLM to build TensorRT engines containing state-of-the-art optimizations for efficient inference on NVIDIA GPUs, using the Llama 3 model as an example.
- A similar guide is available at Serverless TensorRT-LLM (LLaMA 3 8B) - it shows how to use the TensorRT-LLM framework to serve Meta's LLaMA 3 8B model at a total throughput of roughly 4,500 output tokens per second on a single NVIDIA A100 40GB GPU.