The project aims to design and deploy a scalable, efficient cloud-based inference service for large language models (LLMs) using Kubernetes on Google Cloud. It uses vLLM, an open-source library for fast LLM inference and serving, to address challenges in memory consumption and latency.
Before proceeding, ensure you have:
- A Google Cloud Platform (GCP) account
- gcloud CLI installed and authenticated
- kubectl CLI installed and configured
- Docker installed and configured to push images to a container registry
Create a GKE cluster on GCP and authenticate with:
gcloud container clusters get-credentials final-project --region us-central1 --project cml-finals
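The command above only fetches credentials for a cluster that already exists. If the cluster still has to be created, the following is a minimal sketch of one way to do it. The machine type, GPU type, and node count are assumptions for illustration and are not taken from this project; check GPU availability and quota in your region before running it.

```shell
# Sketch: create a regional GKE cluster with one GPU per node.
# Usage: create_gpu_cluster <cluster-name> <region> <project-id>
# Assumed values (not from this project): n1-standard-4 nodes, one NVIDIA T4 each.
# With --region, --num-nodes is the node count per zone, not the cluster total.
create_gpu_cluster() {
  gcloud container clusters create "$1" \
    --region "$2" \
    --project "$3" \
    --num-nodes 1 \
    --machine-type n1-standard-4 \
    --accelerator type=nvidia-tesla-t4,count=1
}

# Example, using the names from the get-credentials command above:
#   create_gpu_cluster final-project us-central1 cml-finals
```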
Check existing deployments and services:
kubectl get deployments
kubectl get svc
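Deploy the database service. The cd commands in this guide are relative to the repository root:
cd db-service
kubectl apply -f deployment-db.yaml
kubectl apply -f service-db.yaml
Deploy the pub/sub service:
cd pub-sub-service/deployment
kubectl apply -f deployment-pub-sub.yaml
kubectl apply -f service-pub-sub.yaml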
Ensure both services are running before proceeding.
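One way to confirm that both services are running is to block until each Deployment finishes rolling out. This is a sketch only: the Deployment names are defined in the YAML manifests and are not shown in this guide, so the names in the example are hypothetical. Use the names printed by kubectl get deployments.

```shell
# Sketch: wait until each named Deployment has finished rolling out.
# Usage: wait_for_deployments <timeout> <deployment-name>...
# Returns non-zero as soon as one rollout fails or times out.
wait_for_deployments() {
  rollout_timeout="$1"
  shift
  for name in "$@"; do
    kubectl rollout status "deployment/${name}" --timeout="${rollout_timeout}" || return 1
  done
}

# Example (hypothetical Deployment names):
#   wait_for_deployments 120s db-service pub-sub-service
```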
Install the NVIDIA device plugin so that Kubernetes can schedule GPU workloads:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
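After the plugin is applied, it can help to check that the nodes actually advertise GPUs before deploying the LLM service. The sketch below lists the allocatable nvidia.com/gpu resource per node; it assumes the cluster has GPU nodes, and a node without GPUs shows <none> in the GPU column.

```shell
# Sketch: show how many GPUs each node reports as allocatable.
# The backslash escapes the dot inside the resource name nvidia.com/gpu.
gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Example:
#   gpu_capacity
```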
Build, push, and deploy the LLM service:
cd llm-service
docker build . -t ad060398/llm-service --no-cache --platform=linux/amd64
docker push ad060398/llm-service
kubectl apply -f deployment/deployment-llm-service.yaml
kubectl apply -f deployment/service-llm-service.yaml
Build, push, and deploy the API server:
cd api-server
docker build . -t ad060398/api-server --no-cache --platform=linux/amd64
docker push ad060398/api-server
kubectl apply -f deployment/deployment-api-server.yaml
kubectl apply -f deployment/service-api-server.yaml
Monitor the deployment:
kubectl get pods              # List running pods
kubectl logs <pod_name>       # View logs for a specific pod
kubectl get svc               # List services
Retrieve the external IP of the API server from kubectl get svc and use it to test the service:
curl -X POST http://<external-ip>/chat \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, LLM!"}'
Check the status of a job:
curl http://<external-ip>/status/<job_id>
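A LoadBalancer Service can take a minute or two to receive its external IP, so reading it by hand from kubectl get svc may show <pending> at first. The sketch below polls for the address instead. It assumes the API server is exposed as a LoadBalancer Service; the Service name in the example is hypothetical, since the manifest contents are not shown here.

```shell
# Sketch: print the external IP of a LoadBalancer Service, retrying until assigned.
# Usage: external_ip <service-name> [max-attempts] [sleep-seconds]
external_ip() {
  svc="$1"
  attempts="${2:-30}"
  delay="${3:-5}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    ip="$(kubectl get svc "$svc" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
    if [ -n "$ip" ]; then
      printf '%s\n' "$ip"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no external IP for service ${svc} after ${attempts} attempts" >&2
  return 1
}

# Example (hypothetical Service name):
#   API_IP="$(external_ip api-server)"
#   curl -X POST "http://${API_IP}/chat" \
#     -H "Content-Type: application/json" \
#     -d '{"text": "Hello, LLM!"}'
```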
Run the load test with Locust:
locust -f load_test.py
Open the Locust dashboard at:
http://localhost:8089
Configure and start the test from the web interface.
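For repeatable runs, for example from a script, Locust can also run without the web interface. The following is a sketch under assumptions: the user count, spawn rate, and duration are illustrative defaults rather than values from this project, and load_test.py is expected to be in the current directory.

```shell
# Sketch: run the load test headless against the API server.
# Usage: run_load_test <external-ip> [users] [spawn-rate] [run-time]
# Defaults (assumed, tune for your cluster): 10 users, 2 users/s, 1 minute.
run_load_test() {
  target="$1"
  users="${2:-10}"
  rate="${3:-2}"
  duration="${4:-1m}"
  locust -f load_test.py --headless \
    --host "http://${target}" \
    --users "$users" --spawn-rate "$rate" --run-time "$duration"
}

# Example:
#   run_load_test <external-ip> 50 5 5m
```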

Following these steps sets up and deploys all services required for the project. Ensure each service is running correctly before proceeding to the next step. If you encounter issues, use kubectl logs and kubectl describe to debug deployment errors.
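As a starting point for debugging, the sketch below gathers the usual diagnostics for one pod in a single pass. The tail length of 100 lines is an arbitrary choice, and the function assumes the pod is in the current namespace.

```shell
# Sketch: collect common diagnostics for a single pod.
# Usage: debug_pod <pod_name>
debug_pod() {
  pod="$1"
  # Scheduling state, container status, and pod events (image pull errors, missing GPUs).
  kubectl describe pod "$pod"
  # Recent logs from the running container.
  kubectl logs "$pod" --tail=100
  # Logs from the previous container instance, if the pod has restarted.
  kubectl logs "$pod" --previous --tail=100 2>/dev/null || true
  # Events in the current namespace, oldest first.
  kubectl get events --sort-by=.lastTimestamp
}

# Example:
#   debug_pod <pod_name>
```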