Add vLLM+HPA support to ChatQnA Helm chart (#610)

* Add monitoring support for the vLLM component

Signed-off-by: Eero Tamminen <[email protected]>

* Initial vLLM support for ChatQnA

For now vLLM replaces only TGI, but since it also supports embedding,
the TEI embedding/reranking services may be replaceable later on as well.

Signed-off-by: Eero Tamminen <[email protected]>

* Fix HPA comments in tgi/tei/teirerank values files

Signed-off-by: Eero Tamminen <[email protected]>

* Add HPA scaling support for ChatQnA / vLLM

Signed-off-by: Eero Tamminen <[email protected]>

* Adapt to latest vLLM changes

- Remove --enforce-eager on HPU to improve performance
- Refactor for the upstream Docker entrypoint changes

Fixes issue #631.

Signed-off-by: Lianhao Lu <[email protected]>

* Clean up ChatQnA vLLM Gaudi parameters

Signed-off-by: Eero Tamminen <[email protected]>

---------

Signed-off-by: Eero Tamminen <[email protected]>
Signed-off-by: Lianhao Lu <[email protected]>
Co-authored-by: Lianhao Lu <[email protected]>
eero-t and lianhao authored Dec 18, 2024
1 parent a4a96ab commit baed0b5
Showing 25 changed files with 253 additions and 22 deletions.
5 changes: 5 additions & 0 deletions helm-charts/chatqna/Chart.yaml
@@ -18,6 +18,11 @@ dependencies:
- name: tgi
version: 0-latest
repository: "file://../common/tgi"
condition: tgi.enabled
- name: vllm
version: 0-latest
repository: "file://../common/vllm"
condition: vllm.enabled
- name: tei
version: 0-latest
repository: "file://../common/tei"
11 changes: 7 additions & 4 deletions helm-charts/chatqna/README.md
@@ -11,6 +11,7 @@ Helm chart for deploying ChatQnA service. ChatQnA depends on the following services:
- [teirerank](../common/teirerank/README.md)
- [llm-uservice](../common/llm-uservice/README.md)
- [tgi](../common/tgi/README.md)
- [vllm](../common/vllm/README.md)

## Installing the Chart

@@ -26,13 +27,15 @@ export MODELNAME="Intel/neural-chat-7b-v3-3"
# If you would like to use the traditional UI, please change the image as well as the containerport within the values
# append these at the end of the command "--set chatqna-ui.image.repository=opea/chatqna-ui,chatqna-ui.image.tag=latest,chatqna-ui.containerPort=5173"
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME}
# To use Gaudi device
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-values.yaml
# To use Gaudi device with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-tgi-values.yaml
# To use Gaudi device with vLLM
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-vllm-values.yaml
# To use Nvidia GPU
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/nv-values.yaml
# To include guardrail component in chatqna on Xeon
# To include guardrail component in chatqna on Xeon with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-values.yaml
# To include guardrail component in chatqna on Gaudi
# To include guardrail component in chatqna on Gaudi with TGI
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-gaudi-values.yaml
```
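For a CPU-only deployment, the new enable switches can also be set directly on the command line; a sketch, assuming the same variables as above and the tgi.enabled / vllm.enabled / vllm.LLM_MODEL_ID values added to values.yaml in this commit:

# To use vLLM instead of TGI on Xeon
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.enabled=false --set vllm.enabled=true --set vllm.LLM_MODEL_ID=${MODELNAME}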

1 change: 1 addition & 0 deletions helm-charts/chatqna/ci-gaudi-tgi-values.yaml
1 change: 0 additions & 1 deletion helm-charts/chatqna/ci-gaudi-values.yaml

This file was deleted.

1 change: 1 addition & 0 deletions helm-charts/chatqna/ci-gaudi-vllm-values.yaml
File renamed without changes.
63 changes: 63 additions & 0 deletions helm-charts/chatqna/gaudi-vllm-values.yaml
@@ -0,0 +1,63 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Accelerate inferencing in heaviest components to improve performance
# by overriding their subchart values

tgi:
enabled: false

vllm:
enabled: true
accelDevice: "gaudi"
image:
repository: opea/vllm-gaudi
tag: "latest"
resources:
limits:
habana.ai/gaudi: 1
startupProbe:
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 1
failureThreshold: 120
readinessProbe:
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 1
livenessProbe:
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 1

PT_HPU_ENABLE_LAZY_COLLECTIVES: "true"
OMPI_MCA_btl_vader_single_copy_mechanism: "none"

extraCmdArgs: [
"--tensor-parallel-size", "1",
"--block-size", "128",
"--max-num-seqs", "256",
"--max-seq_len-to-capture", "2048"
]


# Reranking: second largest bottleneck when reranking is in use
# (i.e. query context docs have been uploaded with data-prep)
#
# TODO: could vLLM be used also for reranking / embedding?
teirerank:
accelDevice: "gaudi"
OMPI_MCA_btl_vader_single_copy_mechanism: "none"
MAX_WARMUP_SEQUENCE_LENGTH: "512"
image:
repository: ghcr.io/huggingface/tei-gaudi
tag: 1.5.0
resources:
limits:
habana.ai/gaudi: 1
securityContext:
readOnlyRootFilesystem: false
livenessProbe:
timeoutSeconds: 1
readinessProbe:
timeoutSeconds: 1
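Once the chart is up with these overrides, the vLLM endpoint can be smoke-tested through its OpenAI-compatible completions API; a sketch, where the service name, port mapping, and model are assumptions based on a release named "chatqna" and the chart defaults:

# forward the vLLM service locally (service name/port are assumptions)
kubectl port-forward svc/chatqna-vllm 2080:80 &
curl http://localhost:2080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32}'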
6 changes: 5 additions & 1 deletion helm-charts/chatqna/hpa-values.yaml
@@ -4,7 +4,7 @@
# Enable HorizontalPodAutoscaler (HPA)
#
# That will overwrite named PrometheusAdapter configMap with ChatQnA specific
# custom metric queries for embedding, reranking, tgi services.
# custom metric queries for embedding, reranking, and LLM services.
#
# Default upstream configMap is in:
# - https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml
@@ -15,6 +15,10 @@ autoscaling:
# Override values in specific subcharts

# Enabling "autoscaling" for any of the subcharts requires enabling it also above!
vllm:
autoscaling:
maxReplicas: 4
enabled: true
tgi:
autoscaling:
maxReplicas: 4
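Applying these overrides is a matter of layering the file on top of a normal install; a sketch, which assumes global.monitoring still has to be switched on separately as described in HPA.md:

# HPA-scaled ChatQnA with vLLM (flags are illustrative, see HPA.md before enabling)
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.enabled=false --set vllm.enabled=true --set global.monitoring=true -f chatqna/hpa-values.yaml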
23 changes: 18 additions & 5 deletions helm-charts/chatqna/templates/custom-metrics-configmap.yaml
@@ -13,10 +13,27 @@ metadata:
data:
config.yaml: |
rules:
{{- if .Values.tgi.autoscaling.enabled }}
{{- if and .Values.vllm.enabled .Values.vllm.autoscaling.enabled }}
# check metric with:
# kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric> | jq
#
- seriesQuery: '{__name__="vllm:time_per_output_token_seconds_sum",service="{{ include "vllm.fullname" .Subcharts.vllm }}"}'
# Average output token latency from vLLM histograms, over 1 min
# (interval should be at least 4x serviceMonitor query interval,
# 0.001 divider add is to make sure there's always a valid value)
metricsQuery: 'rate(vllm:time_per_output_token_seconds_sum{service="{{ include "vllm.fullname" .Subcharts.vllm }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(vllm:time_per_output_token_seconds_count{service="{{ include "vllm.fullname" .Subcharts.vllm }}",<<.LabelMatchers>>}[1m]))'
name:
matches: ^vllm:time_per_output_token_seconds_sum
as: "{{ include "vllm.metricPrefix" .Subcharts.vllm }}_token_latency"
resources:
# HPA needs both namespace + suitable object resource for its query paths:
# /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric>
# (pod is not suitable object type for matching as each instance has different name)
overrides:
namespace: {resource: "namespace"}
service: {resource: "service"}
{{- end }}
{{- if and .Values.tgi.enabled .Values.tgi.autoscaling.enabled }}
{{- if .Values.tgi.accelDevice }}
- seriesQuery: '{__name__="tgi_queue_size",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
# TGI instances queue_size sum
@@ -27,16 +44,12 @@ data:
{{- else }}
- seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
# Average request latency from TGI histograms, over 1 min
# (0.001 divider add is to make sure there's always a valid value)
metricsQuery: 'rate(tgi_request_inference_duration_sum{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(tgi_request_inference_duration_count{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]))'
name:
matches: ^tgi_request_inference_duration_sum
as: "{{ include "tgi.metricPrefix" .Subcharts.tgi }}_request_latency"
{{- end }}
resources:
# HPA needs both namespace + suitable object resource for its query paths:
# /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric>
# (pod is not suitable object type for matching as each instance has different name)
overrides:
namespace: {resource: "namespace"}
service: {resource: "service"}
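Once Prometheus and the adapter have picked up the new rule, the metric should be visible through the custom metrics API; a sketch following the query path given in the comments above, with a metric name that assumes a release called "chatqna":

# list registered custom metrics
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq '.resources[].name'
# per-service vLLM token latency used by the HPA rule
kubectl get --raw '/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/services/*/chatqna_vllm_token_latency' | jq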
8 changes: 8 additions & 0 deletions helm-charts/chatqna/templates/deployment.yaml
@@ -35,11 +35,19 @@ spec:
- name: {{ .Release.Name }}
env:
- name: LLM_SERVER_HOST_IP
{{- if .Values.vllm.enabled }}
value: {{ .Release.Name }}-vllm
{{- else }}
value: {{ .Release.Name }}-tgi
{{- end }}
- name: LLM_SERVER_PORT
value: "80"
- name: LLM_MODEL
{{- if .Values.vllm.enabled }}
value: {{ .Values.vllm.LLM_MODEL_ID | quote }}
{{- else }}
value: {{ .Values.tgi.LLM_MODEL_ID | quote }}
{{- end }}
- name: RERANK_SERVER_HOST_IP
value: {{ .Release.Name }}-teirerank
- name: RERANK_SERVER_PORT
4 changes: 4 additions & 0 deletions helm-charts/chatqna/values.yaml
@@ -67,6 +67,10 @@ autoscaling:

# Override values in specific subcharts
tgi:
enabled: true
LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
vllm:
enabled: false
LLM_MODEL_ID: Intel/neural-chat-7b-v3-3

# disable guardrails-usvc by default
2 changes: 1 addition & 1 deletion helm-charts/common/agent/values.yaml
@@ -14,7 +14,7 @@ tgi:
vllm:
enabled: false
LLM_MODEL_ID: "mistralai/Mistral-7B-Instruct-v0.3"
extraCmdArgs: ["/bin/bash", "-c", "python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model mistralai/Mistral-7B-Instruct-v0.3 --tensor-parallel-size 1 --host 0.0.0.0 --port 2080 --download-dir /data --block-size 128 --max-num-seqs 4096 --max-seq_len-to-capture 8192 --enable-auto-tool-choice --tool-call-parser mistral"]
extraCmdArgs: ["--tensor-parallel-size", "1", "--block-size", "128", "--max-num-seqs", "4096", "--max-seq_len-to-capture", "8192", "--enable-auto-tool-choice", "--tool-call-parser", "mistral"]

replicaCount: 1
llm_endpoint_url: ""
2 changes: 1 addition & 1 deletion helm-charts/common/llm-uservice/ci-vllm-gaudi-values.yaml
@@ -13,7 +13,7 @@ vllm:
tag: "latest"
LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
OMPI_MCA_btl_vader_single_copy_mechanism: none
extraCmdArgs: ["--enforce-eager","--tensor-parallel-size","1","--block-size","128","--max-num-seqs","256","--max-seq_len-to-capture","2048"]
extraCmdArgs: ["--tensor-parallel-size","1","--block-size","128","--max-num-seqs","256","--max-seq_len-to-capture","2048"]
resources:
limits:
habana.ai/gaudi: 1
2 changes: 1 addition & 1 deletion helm-charts/common/tei/values.yaml
@@ -9,7 +9,7 @@ replicaCount: 1

# Enabling HPA will:
# - Ignore above replica count, as it will be controlled by HPA
# - Add example HPA scaling rules with thresholds suitable for Xeon deployments
# - Add example HPA scaling rules with custom metrics thresholds
# - Require custom metrics ConfigMap available in the main application chart
autoscaling:
maxReplicas: 2
2 changes: 1 addition & 1 deletion helm-charts/common/teirerank/values.yaml
@@ -9,7 +9,7 @@ replicaCount: 1

# Enabling HPA will:
# - Ignore above replica count, as it will be controlled by HPA
# - Add example HPA scaling rules with thresholds suitable for Xeon deployments
# - Add example HPA scaling rules with custom metrics thresholds
# - Require custom metrics ConfigMap available in the main application chart
autoscaling:
maxReplicas: 3
2 changes: 1 addition & 1 deletion helm-charts/common/tgi/values.yaml
@@ -9,7 +9,7 @@ replicaCount: 1

# Enabling HPA will:
# - Ignore above replica count, as it will be controlled by HPA
# - Add example HPA scaling rules with thresholds suitable for Xeon deployments
# - Add example HPA scaling rules with custom metrics thresholds
# - Require custom metrics ConfigMap available in the main application chart
autoscaling:
maxReplicas: 4
2 changes: 2 additions & 0 deletions helm-charts/common/vllm/README.md
@@ -51,3 +51,5 @@ curl http://localhost:2080/v1/completions \
| global.modelUseHostPath | string | `""` | Cached models directory, vllm will not download if the model is cached here. The host path "modelUseHostPath" will be mounted to container as /data directory. Set this to null/empty will force it to download model. |
| image.repository | string | `"opea/vllm"` | |
| image.tag | string | `"latest"` | |
| autoscaling.enabled | bool | `false` | Enable HPA autoscaling for the service deployment based on metrics it provides. See [HPA instructions](../../HPA.md) before enabling! |
| global.monitoring | bool | `false` | Enable usage metrics for the service. Required for HPA. See [monitoring instructions](../../monitoring.md) before enabling! |
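As a rough illustration of how the two new values plug into an install of this chart (release name, chart path, and token variable are assumptions, not from the README):

helm install myvllm ./helm-charts/common/vllm --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.monitoring=true --set autoscaling.enabled=true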
6 changes: 3 additions & 3 deletions helm-charts/common/vllm/gaudi-values.yaml
@@ -5,15 +5,15 @@
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

accelDevice: "gaudi"

image:
repository: opea/vllm-gaudi
tag: "latest"

# VLLM_CPU_KVCACHE_SPACE: "40"
OMPI_MCA_btl_vader_single_copy_mechanism: none
extraCmdArgs: ["--enforce-eager","--tensor-parallel-size","1","--block-size","128","--max-num-seqs","256","--max-seq_len-to-capture","2048"]
# Workaround for current HPU image with start command /bin/bash
# extraCmdArgs: ["/bin/bash","-c","python3 -m vllm.entrypoints.openai.api_server --enforce-eager --model Intel/neural-chat-7b-v3-3 --tensor-parallel-size 1 --host 0.0.0.0 --port 2080 --download-dir /data --block-size 128 --max-num-seqs 256 --max-seq_len-to-capture 2048"]
extraCmdArgs: ["--tensor-parallel-size","1","--block-size","128","--max-num-seqs","256","--max-seq_len-to-capture","2048"]
resources:
limits:
habana.ai/gaudi: 1
7 changes: 7 additions & 0 deletions helm-charts/common/vllm/templates/_helpers.tpl
@@ -30,6 +30,13 @@ Create chart name and version as used by the chart label.
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Convert chart name to a string suitable as metric prefix
*/}}
{{- define "vllm.metricPrefix" -}}
{{- include "vllm.fullname" . | replace "-" "_" | regexFind "[a-zA-Z_:][a-zA-Z0-9_:]*" }}
{{- end }}

{{/*
Common labels
*/}}
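The new vllm.metricPrefix helper only sanitizes the chart fullname into a Prometheus-legal identifier; a sketch of the expected mapping and one way to eyeball the rendered output, where the release name, paths, and flags are assumptions:

# release "chatqna": vllm.fullname "chatqna-vllm" -> metric prefix "chatqna_vllm"
# the configmap and HPA templates then use the metric name "chatqna_vllm_token_latency"
# (run 'helm dependency update helm-charts/chatqna' first)
helm template chatqna helm-charts/chatqna --set vllm.enabled=true --set tgi.enabled=false --set global.monitoring=true -f helm-charts/chatqna/hpa-values.yaml | grep token_latency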
3 changes: 3 additions & 0 deletions helm-charts/common/vllm/templates/configmap.yaml
@@ -25,6 +25,9 @@ data:
{{- if .Values.VLLM_CPU_KVCACHE_SPACE }}
VLLM_CPU_KVCACHE_SPACE: {{ .Values.VLLM_CPU_KVCACHE_SPACE | quote}}
{{- end }}
{{- if .Values.PT_HPU_ENABLE_LAZY_COLLECTIVES }}
PT_HPU_ENABLE_LAZY_COLLECTIVES: {{ .Values.PT_HPU_ENABLE_LAZY_COLLECTIVES | quote }}
{{- end }}
{{- if .Values.OMPI_MCA_btl_vader_single_copy_mechanism }}
OMPI_MCA_btl_vader_single_copy_mechanism: {{ .Values.OMPI_MCA_btl_vader_single_copy_mechanism | quote}}
{{- end }}
7 changes: 7 additions & 0 deletions helm-charts/common/vllm/templates/deployment.yaml
@@ -8,7 +8,10 @@ metadata:
labels:
{{- include "vllm.labels" . | nindent 4 }}
spec:
{{- if ne (int .Values.replicaCount) 1 }}
# remove if replica count should not be reset on pod update (e.g. with HPA)
replicas: {{ .Values.replicaCount }}
{{- end }}
selector:
matchLabels:
{{- include "vllm.selectorLabels" . | nindent 6 }}
@@ -159,3 +162,7 @@ spec:
matchLabels:
{{- include "vllm.selectorLabels" . | nindent 14 }}
{{- end }}
{{- if not .Values.accelDevice }}
# extra time to finish processing buffered requests on CPU before pod is forcibly terminated
terminationGracePeriodSeconds: 120
{{- end }}
57 changes: 57 additions & 0 deletions helm-charts/common/vllm/templates/horizontal-pod-autoscaler.yaml
@@ -0,0 +1,57 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: {{ include "vllm.fullname" . }}
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: {{ include "vllm.fullname" . }}
minReplicas: 1
maxReplicas: {{ .Values.autoscaling.maxReplicas }}
metrics:
- type: Object
object:
describedObject:
apiVersion: v1
# get metric for named object of given type (in same namespace)
kind: Service
name: {{ include "vllm.fullname" . }}
target:
# Metric is sum from all pods. "AverageValue" divides value returned from
# the custom metrics API by the number of Pods before comparing to the target:
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics
type: AverageValue
{{- if .Values.accelDevice }}
averageValue: 0.1
{{- else }}
# allow larger latencies with unaccelerated service
averageValue: 1.0
{{- end }}
metric:
name: {{ include "vllm.metricPrefix" . }}_token_latency
behavior:
scaleDown:
stabilizationWindowSeconds: 180
policies:
- type: Percent
value: 25
periodSeconds: 90
scaleUp:
selectPolicy: Max
stabilizationWindowSeconds: 0
policies:
# Slow linear rampup in case additional CPU pods go to same node
# (i.e. interfere with each other)
- type: Pods
value: 1
periodSeconds: 90
#- type: Percent
# value: 25
# periodSeconds: 90
{{- end }}
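After deployment, the autoscaler can be watched with standard tooling; object names and labels below are assumptions for a release called "chatqna":

# metric value vs. target and current replica count
kubectl get hpa chatqna-vllm
# scaling events and the custom metric being tracked
kubectl describe hpa chatqna-vllm
# watch vLLM pods come and go under load
kubectl get pods -l app.kubernetes.io/name=vllm -w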