Add vLLM+HPA support to ChatQnA Helm chart #610

Merged · 6 commits · Dec 18, 2024
5 changes: 5 additions & 0 deletions helm-charts/chatqna/Chart.yaml
@@ -18,6 +18,11 @@ dependencies:
  - name: tgi
    version: 0-latest
    repository: "file://../common/tgi"
    condition: tgi.enabled
  - name: vllm
    version: 0-latest
    repository: "file://../common/vllm"
    condition: vllm.enabled
  - name: tei
    version: 0-latest
    repository: "file://../common/tei"
11 changes: 7 additions & 4 deletions helm-charts/chatqna/README.md
@@ -11,6 +11,7 @@ Helm chart for deploying ChatQnA service. ChatQnA depends on the following servi
- [teirerank](../common/teirerank/README.md)
- [llm-uservice](../common/llm-uservice/README.md)
- [tgi](../common/tgi/README.md)
- [vllm](../common/vllm/README.md)

## Installing the Chart

@@ -26,13 +27,15 @@ export MODELNAME="Intel/neural-chat-7b-v3-3"
# If you would like to use the traditional UI, please change the image as well as the containerport within the values
# append these at the end of the command "--set chatqna-ui.image.repository=opea/chatqna-ui,chatqna-ui.image.tag=latest,chatqna-ui.containerPort=5173"
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME}
# To use Gaudi device
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-values.yaml
# To use Gaudi device with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-tgi-values.yaml
# To use Gaudi device with vLLM
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-vllm-values.yaml
# To use Nvidia GPU
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/nv-values.yaml
# To include guardrail component in chatqna on Xeon
# To include guardrail component in chatqna on Xeon with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-values.yaml
# To include guardrail component in chatqna on Gaudi
# To include guardrail component in chatqna on Gaudi with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-gaudi-values.yaml
```

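The switch between serving backends is driven by the `tgi.enabled` / `vllm.enabled` flags that this PR adds to `values.yaml` and wires into `Chart.yaml` as dependency conditions. As a minimal sketch (the file name here is hypothetical), the override passed with an extra `-f` to serve the model with vLLM instead of TGI boils down to:

```yaml
# vllm-override.yaml (hypothetical) -- switch the LLM backend from TGI to vLLM
tgi:
  enabled: false   # condition tgi.enabled skips the tgi subchart
vllm:
  enabled: true    # condition vllm.enabled pulls in the vllm subchart instead
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3   # model served by vLLM
```

The bundled `gaudi-vllm-values.yaml` below does exactly this, plus the Gaudi-specific image, resource, probe, and argument overrides.
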
1 change: 1 addition & 0 deletions helm-charts/chatqna/ci-gaudi-tgi-values.yaml
1 change: 0 additions & 1 deletion helm-charts/chatqna/ci-gaudi-values.yaml

This file was deleted.

1 change: 1 addition & 0 deletions helm-charts/chatqna/ci-gaudi-vllm-values.yaml
63 changes: 63 additions & 0 deletions helm-charts/chatqna/gaudi-vllm-values.yaml
@@ -0,0 +1,63 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Accelerate inferencing in heaviest components to improve performance
# by overriding their subchart values

tgi:
  enabled: false

vllm:
  enabled: true
  accelDevice: "gaudi"
  image:
    repository: opea/vllm-gaudi
    tag: "latest"
  resources:
    limits:
      habana.ai/gaudi: 1
  startupProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 1
    failureThreshold: 120
  readinessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 1
  livenessProbe:
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 1

  PT_HPU_ENABLE_LAZY_COLLECTIVES: "true"
  OMPI_MCA_btl_vader_single_copy_mechanism: "none"

  extraCmdArgs: [
    "--tensor-parallel-size", "1",
    "--block-size", "128",
    "--max-num-seqs", "256",
    "--max-seq_len-to-capture", "2048"
  ]


# Reranking: second largest bottleneck when reranking is in use
# (i.e. query context docs have been uploaded with data-prep)
#
# TODO: could vLLM be used also for reranking / embedding?
teirerank:
  accelDevice: "gaudi"
  OMPI_MCA_btl_vader_single_copy_mechanism: "none"
  MAX_WARMUP_SEQUENCE_LENGTH: "512"
  image:
    repository: ghcr.io/huggingface/tei-gaudi
    tag: 1.5.0
  resources:
    limits:
      habana.ai/gaudi: 1
  securityContext:
    readOnlyRootFilesystem: false
  livenessProbe:
    timeoutSeconds: 1
  readinessProbe:
    timeoutSeconds: 1
6 changes: 5 additions & 1 deletion helm-charts/chatqna/hpa-values.yaml
@@ -4,7 +4,7 @@
# Enable HorizontalPodAutoscaler (HPA)
#
# That will overwrite named PrometheusAdapter configMap with ChatQnA specific
# custom metric queries for embedding, reranking, tgi services.
# custom metric queries for embedding, reranking, and LLM services.
#
# Default upstream configMap is in:
# - https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml
@@ -15,6 +15,10 @@ autoscaling:
# Override values in specific subcharts

# Enabling "autoscaling" for any of the subcharts requires enabling it also above!
vllm:
  autoscaling:
    maxReplicas: 4
    enabled: true
tgi:
  autoscaling:
    maxReplicas: 4
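These per-subchart blocks only take effect together with the file's top-level switch that the comment above refers to; roughly (a sketch of the unchanged top of `hpa-values.yaml`, not shown in this hunk):

```yaml
# global HPA switch at the top of hpa-values.yaml (sketch of unchanged content)
autoscaling:
  enabled: true
```

A Gaudi vLLM deployment with autoscaling would then combine the values files, e.g. `-f chatqna/gaudi-vllm-values.yaml -f chatqna/hpa-values.yaml`.
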
23 changes: 18 additions & 5 deletions helm-charts/chatqna/templates/custom-metrics-configmap.yaml
@@ -13,10 +13,27 @@ metadata:
data:
  config.yaml: |
    rules:
    {{- if .Values.tgi.autoscaling.enabled }}
    {{- if and .Values.vllm.enabled .Values.vllm.autoscaling.enabled }}
    # check metric with:
    # kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric> | jq
    #
    - seriesQuery: '{__name__="vllm:time_per_output_token_seconds_sum",service="{{ include "vllm.fullname" .Subcharts.vllm }}"}'
      # Average output token latency from vLLM histograms, over 1 min
      # (interval should be at least 4x serviceMonitor query interval,
      # 0.001 divider add is to make sure there's always a valid value)
      metricsQuery: 'rate(vllm:time_per_output_token_seconds_sum{service="{{ include "vllm.fullname" .Subcharts.vllm }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(vllm:time_per_output_token_seconds_count{service="{{ include "vllm.fullname" .Subcharts.vllm }}",<<.LabelMatchers>>}[1m]))'
      name:
        matches: ^vllm:time_per_output_token_seconds_sum
        as: "{{ include "vllm.metricPrefix" .Subcharts.vllm }}_token_latency"
      resources:
        # HPA needs both namespace + suitable object resource for its query paths:
        # /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric>
        # (pod is not suitable object type for matching as each instance has different name)
        overrides:
          namespace: {resource: "namespace"}
          service: {resource: "service"}
    {{- end }}
    {{- if and .Values.tgi.enabled .Values.tgi.autoscaling.enabled }}
    {{- if .Values.tgi.accelDevice }}
    - seriesQuery: '{__name__="tgi_queue_size",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
      # TGI instances queue_size sum
@@ -27,16 +44,12 @@ data:
    {{- else }}
    - seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
      # Average request latency from TGI histograms, over 1 min
      # (0.001 divider add is to make sure there's always a valid value)
      metricsQuery: 'rate(tgi_request_inference_duration_sum{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(tgi_request_inference_duration_count{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]))'
      name:
        matches: ^tgi_request_inference_duration_sum
        as: "{{ include "tgi.metricPrefix" .Subcharts.tgi }}_request_latency"
    {{- end }}
      resources:
        # HPA needs both namespace + suitable object resource for its query paths:
        # /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric>
        # (pod is not suitable object type for matching as each instance has different name)
        overrides:
          namespace: {resource: "namespace"}
          service: {resource: "service"}
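To make the templating concrete, this is roughly what the vLLM rule renders to for a release named `chatqna`, where `vllm.fullname` resolves to `chatqna-vllm` and `vllm.metricPrefix` to `chatqna_vllm` (a sketch, not literal rendered output):

```yaml
rules:
- seriesQuery: '{__name__="vllm:time_per_output_token_seconds_sum",service="chatqna-vllm"}'
  # average output token latency = rate of latency sum / rate of token count, over 1 min
  metricsQuery: 'rate(vllm:time_per_output_token_seconds_sum{service="chatqna-vllm",<<.LabelMatchers>>}[1m]) / (0.001+rate(vllm:time_per_output_token_seconds_count{service="chatqna-vllm",<<.LabelMatchers>>}[1m]))'
  name:
    matches: ^vllm:time_per_output_token_seconds_sum
    as: "chatqna_vllm_token_latency"
  resources:
    overrides:
      namespace: {resource: "namespace"}
      service: {resource: "service"}
```

PrometheusAdapter then serves `chatqna_vllm_token_latency` on the custom metrics API path shown in the comment above, which is the metric the vLLM HPA object queries.
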
8 changes: 8 additions & 0 deletions helm-charts/chatqna/templates/deployment.yaml
@@ -35,11 +35,19 @@ spec:
        - name: {{ .Release.Name }}
          env:
            - name: LLM_SERVER_HOST_IP
              {{- if .Values.vllm.enabled }}
              value: {{ .Release.Name }}-vllm
              {{- else }}
              value: {{ .Release.Name }}-tgi
              {{- end }}
            - name: LLM_SERVER_PORT
              value: "80"
            - name: LLM_MODEL
              {{- if .Values.vllm.enabled }}
              value: {{ .Values.vllm.LLM_MODEL_ID | quote }}
              {{- else }}
              value: {{ .Values.tgi.LLM_MODEL_ID | quote }}
              {{- end }}
            - name: RERANK_SERVER_HOST_IP
              value: {{ .Release.Name }}-teirerank
            - name: RERANK_SERVER_PORT
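For example, with `vllm.enabled=true` and a release named `chatqna`, the LLM-related part of the env block above renders roughly as:

```yaml
# sketch of the rendered env (release name "chatqna", default model)
- name: LLM_SERVER_HOST_IP
  value: chatqna-vllm           # service name of the vllm subchart
- name: LLM_SERVER_PORT
  value: "80"
- name: LLM_MODEL
  value: "Intel/neural-chat-7b-v3-3"
```
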
4 changes: 4 additions & 0 deletions helm-charts/chatqna/values.yaml
@@ -67,6 +67,10 @@ autoscaling:

# Override values in specific subcharts
tgi:
  enabled: true
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
vllm:
  enabled: false
  LLM_MODEL_ID: Intel/neural-chat-7b-v3-3

# disable guardrails-usvc by default
2 changes: 1 addition & 1 deletion helm-charts/common/agent/values.yaml
@@ -14,7 +14,7 @@ tgi:
vllm:
  enabled: false
  LLM_MODEL_ID: "mistralai/Mistral-7B-Instruct-v0.3"
  extraCmdArgs: ["/bin/bash", "-c", "python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model mistralai/Mistral-7B-Instruct-v0.3 --tensor-parallel-size 1 --host 0.0.0.0 --port 2080 --download-dir /data --block-size 128 --max-num-seqs 4096 --max-seq_len-to-capture 8192 --enable-auto-tool-choice --tool-call-parser mistral"]
  extraCmdArgs: ["--tensor-parallel-size", "1", "--block-size", "128", "--max-num-seqs", "4096", "--max-seq_len-to-capture", "8192", "--enable-auto-tool-choice", "--tool-call-parser", "mistral"]

replicaCount: 1
llm_endpoint_url: ""
@@ -13,7 +13,7 @@ vllm:
tag: "latest"
LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
OMPI_MCA_btl_vader_single_copy_mechanism: none
extraCmdArgs: ["--enforce-eager","--tensor-parallel-size","1","--block-size","128","--max-num-seqs","256","--max-seq_len-to-capture","2048"]
extraCmdArgs: ["--tensor-parallel-size","1","--block-size","128","--max-num-seqs","256","--max-seq_len-to-capture","2048"]
resources:
limits:
habana.ai/gaudi: 1
2 changes: 1 addition & 1 deletion helm-charts/common/tei/values.yaml
@@ -9,7 +9,7 @@ replicaCount: 1

# Enabling HPA will:
# - Ignore above replica count, as it will be controlled by HPA
# - Add example HPA scaling rules with thresholds suitable for Xeon deployments
# - Add example HPA scaling rules with custom metrics thresholds
# - Require custom metrics ConfigMap available in the main application chart
autoscaling:
  maxReplicas: 2
2 changes: 1 addition & 1 deletion helm-charts/common/teirerank/values.yaml
@@ -9,7 +9,7 @@ replicaCount: 1

# Enabling HPA will:
# - Ignore above replica count, as it will be controlled by HPA
# - Add example HPA scaling rules with thresholds suitable for Xeon deployments
# - Add example HPA scaling rules with custom metrics thresholds
# - Require custom metrics ConfigMap available in the main application chart
autoscaling:
  maxReplicas: 3
2 changes: 1 addition & 1 deletion helm-charts/common/tgi/values.yaml
@@ -9,7 +9,7 @@ replicaCount: 1

# Enabling HPA will:
# - Ignore above replica count, as it will be controlled by HPA
# - Add example HPA scaling rules with thresholds suitable for Xeon deployments
# - Add example HPA scaling rules with custom metrics thresholds
# - Require custom metrics ConfigMap available in the main application chart
autoscaling:
  maxReplicas: 4
2 changes: 2 additions & 0 deletions helm-charts/common/vllm/README.md
@@ -51,3 +51,5 @@ curl http://localhost:2080/v1/completions \
| global.modelUseHostPath | string | `""` | Cached models directory; vLLM will not download the model if it is already cached here. The host path "modelUseHostPath" is mounted into the container as the /data directory. Setting this to null/empty forces the model to be downloaded. |
| image.repository | string | `"opea/vllm"` | |
| image.tag | string | `"latest"` | |
| autoscaling.enabled | bool | `false` | Enable HPA autoscaling for the service deployment based on metrics it provides. See [HPA instructions](../../HPA.md) before enabling! |
| global.monitoring | bool | `false` | Enable usage metrics for the service. Required for HPA. See [monitoring instructions](../../monitoring.md) before enabling! |
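These two flags gate the new HPA template further down; a minimal sketch of a values override enabling both for the vllm chart on its own (option names as documented above):

```yaml
# sketch: enable metrics scraping and HPA for the vllm chart
global:
  monitoring: true   # deploy a ServiceMonitor so Prometheus scrapes vLLM metrics
autoscaling:
  enabled: true      # render the HorizontalPodAutoscaler template
  maxReplicas: 4     # same cap as ChatQnA's hpa-values.yaml uses
```
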
6 changes: 3 additions & 3 deletions helm-charts/common/vllm/gaudi-values.yaml
@@ -5,15 +5,15 @@
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

accelDevice: "gaudi"

image:
  repository: opea/vllm-gaudi
  tag: "latest"

# VLLM_CPU_KVCACHE_SPACE: "40"
OMPI_MCA_btl_vader_single_copy_mechanism: none
extraCmdArgs: ["--enforce-eager","--tensor-parallel-size","1","--block-size","128","--max-num-seqs","256","--max-seq_len-to-capture","2048"]
# Workaround for current HPU image with start command /bin/bash
# extraCmdArgs: ["/bin/bash","-c","python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model Intel/neural-chat-7b-v3-3 --tensor-parallel-size 1 --host 0.0.0.0 --port 2080 --download-dir /data --block-size 128 --max-num-seqs 256 --max-seq_len-to-capture 2048"]
extraCmdArgs: ["--tensor-parallel-size","1","--block-size","128","--max-num-seqs","256","--max-seq_len-to-capture","2048"]
resources:
  limits:
    habana.ai/gaudi: 1
7 changes: 7 additions & 0 deletions helm-charts/common/vllm/templates/_helpers.tpl
@@ -30,6 +30,13 @@ Create chart name and version as used by the chart label.
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Convert chart name to a string suitable as metric prefix
*/}}
{{- define "vllm.metricPrefix" -}}
{{- include "vllm.fullname" . | replace "-" "_" | regexFind "[a-zA-Z_:][a-zA-Z0-9_:]*" }}
{{- end }}

{{/*
Common labels
*/}}
3 changes: 3 additions & 0 deletions helm-charts/common/vllm/templates/configmap.yaml
@@ -25,6 +25,9 @@ data:
  {{- if .Values.VLLM_CPU_KVCACHE_SPACE }}
  VLLM_CPU_KVCACHE_SPACE: {{ .Values.VLLM_CPU_KVCACHE_SPACE | quote}}
  {{- end }}
  {{- if .Values.PT_HPU_ENABLE_LAZY_COLLECTIVES }}
  PT_HPU_ENABLE_LAZY_COLLECTIVES: {{ .Values.PT_HPU_ENABLE_LAZY_COLLECTIVES | quote }}
  {{- end }}
  {{- if .Values.OMPI_MCA_btl_vader_single_copy_mechanism }}
  OMPI_MCA_btl_vader_single_copy_mechanism: {{ .Values.OMPI_MCA_btl_vader_single_copy_mechanism | quote}}
  {{- end }}
7 changes: 7 additions & 0 deletions helm-charts/common/vllm/templates/deployment.yaml
@@ -8,7 +8,10 @@ metadata:
  labels:
    {{- include "vllm.labels" . | nindent 4 }}
spec:
  {{- if ne (int .Values.replicaCount) 1 }}
  # remove if replica count should not be reset on pod update (e.g. with HPA)
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "vllm.selectorLabels" . | nindent 6 }}
@@ -159,3 +162,7 @@ spec:
            matchLabels:
              {{- include "vllm.selectorLabels" . | nindent 14 }}
      {{- end }}
      {{- if not .Values.accelDevice }}
      # extra time to finish processing buffered requests on CPU before pod is forcibly terminated
      terminationGracePeriodSeconds: 120
      {{- end }}
@@ -0,0 +1,57 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "vllm.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "vllm.fullname" . }}
  minReplicas: 1
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
  - type: Object
    object:
      describedObject:
        apiVersion: v1
        # get metric for named object of given type (in same namespace)
        kind: Service
        name: {{ include "vllm.fullname" . }}
      target:
        # Metric is sum from all pods. "AverageValue" divides value returned from
        # the custom metrics API by the number of Pods before comparing to the target:
        # https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
        # https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics
        type: AverageValue
        {{- if .Values.accelDevice }}
        averageValue: 0.1
        {{- else }}
        # allow larger latencies with unaccelerated service
        averageValue: 1.0
        {{- end }}
      metric:
        name: {{ include "vllm.metricPrefix" . }}_token_latency
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 180
      policies:
      - type: Percent
        value: 25
        periodSeconds: 90
    scaleUp:
      selectPolicy: Max
      stabilizationWindowSeconds: 0
      policies:
      # Slow linear rampup in case additional CPU pods go to same node
      # (i.e. interfere with each other)
      - type: Pods
        value: 1
        periodSeconds: 90
      #- type: Percent
      #  value: 25
      #  periodSeconds: 90
{{- end }}
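To spell out the scaling math the comments above refer to (illustrative numbers; the metric name assumes a `chatqna` release):

```yaml
# Worked example for the Gaudi target (averageValue: 0.1):
#   the custom metrics API returns chatqna_vllm_token_latency = 0.45 s (sum over all pods)
#   with 3 replicas the per-pod average is 0.45 / 3 = 0.15 s, above the 0.1 s target,
#   so HPA asks for ceil(3 * 0.15 / 0.1) = 5 replicas (capped by maxReplicas);
#   the scaleUp policy above then limits actual growth to 1 pod per 90 s.
```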