
Distributed LLM Inference Service

Introduction

This project designs and deploys a scalable, efficient cloud-based inference service for large language models (LLMs) using Kubernetes on Google Cloud. It leverages vLLM, an open-source library for high-throughput LLM serving, to address the memory-consumption and latency challenges of LLM inference.


Prerequisites

Before proceeding, ensure you have:

  • A Google Cloud Platform (GCP) account
  • gcloud CLI installed and authenticated
  • kubectl CLI installed and configured
  • Docker installed and configured to push images to a container registry

Deployment Steps

1. Set Up a GKE Cluster

Create a GKE cluster on GCP, then fetch its credentials so kubectl points at it:

gcloud container clusters get-credentials final-project --region us-central1 --project cml-finals
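If the cluster does not exist yet, a command along these lines creates one with a GPU node pool first; the machine type, accelerator type, and node count below are placeholder choices to adapt to your quota:

gcloud container clusters create final-project \
     --region us-central1 --project cml-finals \
     --num-nodes 1 --machine-type n1-standard-4 \
     --accelerator type=nvidia-tesla-t4,count=1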

2. Verify Kubernetes Deployments and Services

Check existing deployments and services:

kubectl get deployments
kubectl get svc
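On a fresh cluster, it is also worth confirming that the nodes have registered and are Ready:

kubectl get nodes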

3. Deploy Database Service

cd db-service
kubectl apply -f deployment-db.yaml
kubectl apply -f service-db.yaml
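To block until the database pods are actually up, you can watch the rollout; the Deployment name here is an assumption, so check deployment-db.yaml for the real one:

kubectl rollout status deployment/db-deployment  # deployment name assumed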

4. Deploy Pub-Sub Service

cd pub-sub-service/deployment
kubectl apply -f deployment-pub-sub.yaml
kubectl apply -f service-pub-sub.yaml

5. Test Database Service and RabbitMQ

Ensure both services are running before proceeding.
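One way to check RabbitMQ is to port-forward its management UI to localhost; the Service name below is an assumption, and 15672 is RabbitMQ's standard management port:

kubectl port-forward svc/pub-sub-service 15672:15672  # service name assumed
# Then open http://localhost:15672 (default credentials are guest/guest unless overridden)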

6. Install the NVIDIA GPU Device Plugin (If Required)

GPU nodes need the NVIDIA device plugin before Kubernetes can expose nvidia.com/gpu resources to pods:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
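Once the plugin's DaemonSet is running, GPU nodes should advertise an nvidia.com/gpu resource; verify with:

kubectl describe nodes | grep -i "nvidia.com/gpu"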

7. Build and Deploy LLM Service

cd llm-service
docker build . -t ad060398/llm-service --no-cache --platform=linux/amd64
docker push ad060398/llm-service
kubectl apply -f deployment/deployment-llm-service.yaml
kubectl apply -f deployment/service-llm-service.yaml
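To follow the rollout and stream logs from the new pod (the Deployment name is assumed; check deployment-llm-service.yaml for the real one):

kubectl rollout status deployment/llm-service  # deployment name assumed
kubectl logs -f deployment/llm-service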

8. Build and Deploy API Server

cd api-server
docker build . -t ad060398/api-server --no-cache --platform=linux/amd64
docker push ad060398/api-server
kubectl apply -f deployment/deployment-api-server.yaml
kubectl apply -f deployment/service-api-server.yaml
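Both docker push steps above assume you are already authenticated to the target registry (here Docker Hub); if a push fails with an authorization error, log in first:

docker login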

9. Verify Running Services

kubectl get pods  # List running pods
kubectl logs <pod_name>  # View logs for a specific pod
kubectl get svc  # List services

10. Send API Requests

Retrieve the external IP of the API server from kubectl get svc and use it to test the service:

curl -X POST http://<external-ip>/chat \
     -H "Content-Type: application/json" \
     -d '{"text": "Hello, LLM!"}'

curl http://<external-ip>/status/<job_id>
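The IP lookup can also be scripted; a sketch, assuming the Service is named api-server:

EXTERNAL_IP=$(kubectl get svc api-server -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -X POST "http://$EXTERNAL_IP/chat" \
     -H "Content-Type: application/json" \
     -d '{"text": "Hello, LLM!"}'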

Performance Testing

1. Run Load Test with Locust

locust -f load_test.py

2. Open Locust UI

Access the Locust dashboard at:

http://localhost:8089

Configure and start the test from the web interface.
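For unattended runs, Locust can also be driven headlessly from the command line; the user count, spawn rate, and duration below are arbitrary examples:

locust -f load_test.py --headless -u 50 -r 5 -t 2m --host http://<external-ip>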

Benchmark Results

[Benchmark results screenshot, captured 2025-01-30]

Conclusion

Following these steps will set up and deploy all services required for the project. Ensure each service is running correctly before proceeding to the next step. If you encounter issues, use kubectl logs and kubectl describe to debug any deployment errors.
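For example:

kubectl describe pod <pod_name>  # shows scheduling, image-pull, and probe events
kubectl get events --sort-by=.metadata.creationTimestamp  # cluster-wide event stream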
