Vllm ns merged 209 7d49516 (#272)
* add benchmark run script, visualize script

* upd

* update multi replicas

* use --result-dir to parse results

* fix ci proxy

* add test ci

* add license

* fix

* fix

* add autoscaling config

* fix ci

* fix ci

* add package matplotlib

* verify CI test

* verify CI test

* create assets folder to place pictures

* verify CI test

* support openai autoscaling

* remove

* integrate vllm and ns

Signed-off-by: Jiafu Zhang <[email protected]>

* update config file

* integrate vllm and ns

Signed-off-by: Jiafu Zhang <[email protected]>

* integrate vllm and ns

Signed-off-by: Jiafu Zhang <[email protected]>

* remove .eggs

Signed-off-by: Jiafu Zhang <[email protected]>

* integration adjustment

Signed-off-by: Jiafu Zhang <[email protected]>

* llm on ray deployed

Signed-off-by: Jiafu Zhang <[email protected]>

* llm on ray deployed

Signed-off-by: Jiafu Zhang <[email protected]>

* llm on ray deployed

Signed-off-by: Jiafu Zhang <[email protected]>

* more doc

Signed-off-by: Jiafu Zhang <[email protected]>

* more doc for installing vllm ext

Signed-off-by: Jiafu Zhang <[email protected]>

* bug fix

Signed-off-by: Jiafu Zhang <[email protected]>

* save

Signed-off-by: Jiafu Zhang <[email protected]>

* add vllm-ext/requirements.txt

Signed-off-by: Jiafu Zhang <[email protected]>

* add CMakeLists.txt

Signed-off-by: Jiafu Zhang <[email protected]>

* changed benchmarks

Signed-off-by: Jiafu Zhang <[email protected]>

* tuned graph build

Signed-off-by: Jiafu Zhang <[email protected]>

* graph build time reduced

Signed-off-by: Jiafu Zhang <[email protected]>

* graph build time reduced

Signed-off-by: Jiafu Zhang <[email protected]>

* configurable perf stats and copy quant config automatically

Signed-off-by: Jiafu Zhang <[email protected]>

* save test script

Signed-off-by: Jiafu Zhang <[email protected]>

* add max_batched_tokens parameter

Signed-off-by: Jiafu Zhang <[email protected]>

* adjustment and ray-vllm-examples

Signed-off-by: Jiafu Zhang <[email protected]>

* perf tuned and improved by disabling mmap for multiple instances

Signed-off-by: Jiafu Zhang <[email protected]>
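
A minimal sketch of the mmap trade-off behind the change above, using numpy stand-ins rather than the actual Neural Speed loader (the flag name is an assumption):

```python
import numpy as np

def load_weights(path: str, use_mmap: bool = True):
    # mmap shares read-only pages through the OS page cache; with several
    # engine instances loading the same weights file this can cause
    # contention, so copying the weights into private process memory
    # can be faster overall.
    if use_mmap:
        return np.memmap(path, dtype=np.float16, mode="r")
    return np.fromfile(path, dtype=np.float16)
```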

* remove unnecessary thread sync in kernels

Signed-off-by: Jiafu Zhang <[email protected]>

* merged ns PR 209 7d49516

Signed-off-by: Jiafu Zhang <[email protected]>

* change loop order: batch size first, then iteration

Signed-off-by: Jiafu Zhang <[email protected]>
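
The reordering above, sketched with hypothetical helper names: running all iterations of one batch size before moving to the next keeps each batch size's measurements contiguous.

```python
from collections import defaultdict

def sweep(batch_sizes, num_iters, run_benchmark):
    # Batch size is the outer loop, iteration the inner one, so all
    # measurements for a given batch size are collected back to back.
    results = defaultdict(list)
    for bs in batch_sizes:
        for _ in range(num_iters):
            results[bs].append(run_benchmark(batch_size=bs))
    return results
```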

* modified some examples

Signed-off-by: Jiafu Zhang <[email protected]>

* add more parameters for vllm-ns test

Signed-off-by: JoshuaL3000 <[email protected]>

* add more parameters for vllm-ns test

Signed-off-by: JoshuaL3000 <[email protected]>

* add more parameters for vllm-ns test

Signed-off-by: JoshuaL3000 <[email protected]>

* prevent quantization from being messed up by multiple processes

Signed-off-by: Jiafu Zhang <[email protected]>

* fix merge error

Signed-off-by: Jiafu Zhang <[email protected]>

* rename py to sh

Signed-off-by: Jiafu Zhang <[email protected]>

* fix formatting issue

Signed-off-by: Jiafu Zhang <[email protected]>

* fix formatting issue

Signed-off-by: Jiafu Zhang <[email protected]>

* fix merge error

Signed-off-by: JoshuaL3000 <[email protected]>

* add vllm-ns ci

Signed-off-by: Jiafu Zhang <[email protected]>

* remove unnecessary logs

Signed-off-by: Jiafu Zhang <[email protected]>

* remove some debug code

Signed-off-by: Jiafu Zhang <[email protected]>

* add '--privileged' to docker run

Signed-off-by: Jiafu Zhang <[email protected]>

* set unlimited max lock memory for neural speed engine

Signed-off-by: Jiafu Zhang <[email protected]>
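
For reference, the max-locked-memory change corresponds to lifting RLIMIT_MEMLOCK; a sketch of doing that from Python (the shell equivalent is `ulimit -l unlimited`):

```python
import resource

# Allow the process to mlock as much memory as it needs; locked pages
# cannot be swapped out. Raising the hard limit requires elevated
# privileges, which is why the CI `docker run` gained `--privileged`
# a few commits earlier.
resource.setrlimit(
    resource.RLIMIT_MEMLOCK,
    (resource.RLIM_INFINITY, resource.RLIM_INFINITY),
)
```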

* llama-3-8B support

Signed-off-by: Jiafu Zhang <[email protected]>

* extend token length limit to 8192 for mha

Signed-off-by: Jiafu Zhang <[email protected]>

* extend token length limit to 8192 for mha

Signed-off-by: Jiafu Zhang <[email protected]>

* extend token length limit to 8192 for mha (fix) and support different threads for prompt decoding and next token decoding

Signed-off-by: Jiafu Zhang <[email protected]>

* extend token length limit to 8192 for mha (fix) and support different threads for prompt decoding and next token decoding

Signed-off-by: Jiafu Zhang <[email protected]>
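
A sketch of the split thread setting, assuming a torch-style backend (the two knobs are hypothetical names, not the actual config keys):

```python
import torch

def set_phase_threads(is_prompt: bool, prompt_threads: int = 32, decode_threads: int = 8):
    # Prompt decoding (prefill) is compute-bound and benefits from more
    # threads; next-token decoding is latency-bound and may prefer fewer.
    torch.set_num_threads(prompt_threads if is_prompt else decode_threads)
```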

* add llama3 for plain cpu

Signed-off-by: Jiafu Zhang <[email protected]>

* benchmark idc simple/medium/complex/verycomplex prompts

Signed-off-by: Jiafu Zhang <[email protected]>

* benchmark idc simple/medium/complex/verycomplex prompts

Signed-off-by: Jiafu Zhang <[email protected]>

* benchmark idc simple/medium/complex/verycomplex prompts

Signed-off-by: Jiafu Zhang <[email protected]>

* add inference_engine and app_router resources to distinguish engine workers from router workers, since they have different resource configs

Signed-off-by: Jiafu Zhang <[email protected]>
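
A sketch of how such custom Ray resources are declared and consumed; the resource names match the ones above, the counts and actor classes are placeholders:

```python
import ray

# On a node this maps to `ray start --resources='{...}'`; for a local
# cluster the same mapping can be passed to ray.init().
ray.init(resources={"app_router": 2, "inference_engine": 8})

@ray.remote(resources={"inference_engine": 1})
class EngineWorker:  # placeholder standing in for the engine worker
    def ping(self):
        return "engine"

@ray.remote(resources={"app_router": 1})
class RouterWorker:  # placeholder standing in for the router worker
    def ping(self):
        return "router"
```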

* enhanced benchmark script to support IDC test data

Signed-off-by: Jiafu Zhang <[email protected]>

* updated ray startup script to add resources for app_router and inference_engine

Signed-off-by: Jiafu Zhang <[email protected]>

* fix first-token latency and next-token latency issue in OpenAI mode in the benchmark script

Signed-off-by: Jiafu Zhang <[email protected]>
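
The fix concerns how the benchmark derives the two latencies from a streamed response; a minimal sketch of that measurement over any token stream (not the actual benchmark code):

```python
import time

def stream_latencies(stream):
    # First-token latency: request start to first streamed chunk.
    # Next-token latency: mean gap between subsequent chunks.
    start = time.perf_counter()
    stamps = [time.perf_counter() for _ in stream]
    if not stamps:
        return 0.0, 0.0
    first_token = stamps[0] - start
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    next_token = sum(gaps) / len(gaps) if gaps else 0.0
    return first_token, next_token
```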

* updated ray startup script to add resources for app_router and inference_engine

Signed-off-by: Jiafu Zhang <[email protected]>

* addressed some review comments

Signed-off-by: Jiafu Zhang <[email protected]>

* fix lint issue

Signed-off-by: Jiafu Zhang <[email protected]>

* address review comment by getting the number of threads from Ray num-cpus, and setting threads for cases not running with Ray

Signed-off-by: Jiafu Zhang <[email protected]>
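
A sketch of the thread selection described above, with a hypothetical helper name; the Ray-assigned value would come from the worker's num-cpus setting:

```python
import os

def pick_num_threads(ray_num_cpus=None):
    # Prefer the CPU count Ray assigned to this worker (its num-cpus);
    # outside Ray, fall back to the host's CPU count.
    if ray_num_cpus:
        return int(ray_num_cpus)
    return os.cpu_count() or 1
```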

---------

Signed-off-by: Jiafu Zhang <[email protected]>
Signed-off-by: JoshuaL3000 <[email protected]>
Co-authored-by: KepingYan <[email protected]>
Co-authored-by: JoshuaL3000 <[email protected]>
3 people authored Jul 29, 2024
1 parent f5e42a5 commit 2eb6c2b
Showing 185 changed files with 81,963 additions and 72 deletions.
1 change: 1 addition & 0 deletions .github/license/header_exclude_files.txt
@@ -0,0 +1 @@
+vllm-ext/vllm/extension/ns/__init__.py
8 changes: 6 additions & 2 deletions .github/workflows/workflow_inference.yml
@@ -34,7 +34,7 @@ jobs:
     name: inference
     strategy:
       matrix:
-        model: [ gpt-j-6b, gpt2, bloom-560m, opt-125m, mpt-7b, mistral-7b-v0.1, mpt-7b-ipex-llm, neural-chat-7b-v3-1, CodeLlama-7b-hf, falcon-7b, starcoder, llama-2-7b-chat-hf, llama-2-7b-chat-hf-vllm, gemma-2b, deepseek-coder-33b-instruct]
+        model: [ gpt-j-6b, gpt2, bloom-560m, opt-125m, mpt-7b, mistral-7b-v0.1, mpt-7b-ipex-llm, neural-chat-7b-v3-1, CodeLlama-7b-hf, falcon-7b, starcoder, llama-2-7b-chat-hf, llama-2-7b-chat-hf-vllm, llama-2-7b-chat-hf-vllm-ns, gemma-2b, deepseek-coder-33b-instruct]
         isPR:
           - ${{inputs.ci_type == 'pr'}}
@@ -97,7 +97,11 @@ jobs:
       run: |
         TARGET=${{steps.target.outputs.target}}
         source dev/scripts/ci-functions.sh
-        strat_ray ${TARGET}
+        if [[ "$TARGET" == *ns ]]; then
+          start_ray ${TARGET} 1
+        else
+          start_ray ${TARGET}
+        fi
     - name: Run Inference Test
       run: |
2 changes: 1 addition & 1 deletion .github/workflows/workflow_inference_gaudi2.yml
@@ -104,7 +104,7 @@ jobs:
          # check and remove exited container
          cid=$(docker ps -a -q --filter "name=${TARGET}")
          if [[ ! -z "$cid" ]]; then docker rm $cid; fi
-          docker run -tid --name="${TARGET}" --hostname="${TARGET}-container" --runtime=habana -v /home/yizhong/Model-References:/root/Model-References -v ${{ inputs.code_checkout_path }}:/root/llm-on-ray -v ${{ inputs.model_cache_path }}:/root/.cache/huggingface/hub/ -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --cap-add sys_ptrace --net=host --ipc=host ${TARGET}:habana
+          docker run -tid --privileged --name="${TARGET}" --hostname="${TARGET}-container" --runtime=habana -v /home/yizhong/Model-References:/root/Model-References -v ${{ inputs.code_checkout_path }}:/root/llm-on-ray -v ${{ inputs.model_cache_path }}:/root/.cache/huggingface/hub/ -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --cap-add sys_ptrace --net=host --ipc=host ${TARGET}:habana
      - name: Start Ray Cluster
        run: |
          TARGET=${{steps.target.outputs.target}}
2 changes: 1 addition & 1 deletion .github/workflows/workflow_test_benchmark.yml
@@ -80,7 +80,7 @@ jobs:
          # check and remove exited container
          cid=$(docker ps -a -q --filter "name=${TARGET}")
          if [[ ! -z "$cid" ]]; then docker rm $cid; fi
-          docker run -tid -v ${{ inputs.model_cache_path }}:/root/.cache/huggingface/hub -v ${{ inputs.code_checkout_path }}:/root/llm-on-ray -e http_proxy=${{ inputs.http_proxy }} -e https_proxy=${{ inputs.https_proxy }} --name="${TARGET}" --hostname="${TARGET}-container" ${TARGET}:latest
+          docker run -tid --privileged -v ${{ inputs.model_cache_path }}:/root/.cache/huggingface/hub -v ${{ inputs.code_checkout_path }}:/root/llm-on-ray -e http_proxy=${{ inputs.http_proxy }} -e https_proxy=${{ inputs.https_proxy }} --name="${TARGET}" --hostname="${TARGET}-container" ${TARGET}:latest
      - name: Start Ray Cluster
        run: |
2 changes: 1 addition & 1 deletion .github/workflows/workflow_tests.yml
@@ -176,7 +176,7 @@ jobs:
      run: |
        TARGET=${{steps.target.outputs.target}}
        source dev/scripts/ci-functions.sh
-        strat_ray ${TARGET}
+        start_ray ${TARGET}
    - name: Run Tests
      run: |
6 changes: 6 additions & 0 deletions .gitignore
@@ -5,3 +5,9 @@ build/lib/
 *.json
 *.txt
 *.egg-info
+.eggs
+*.log
+*.so
+*.ninja_log
+build/
+runtime_outs/
19 changes: 18 additions & 1 deletion .pre-commit-config.yaml
@@ -7,6 +7,12 @@ repos:
     hooks:
       - id: ruff
         args: [ --fix, --exit-non-zero-on-fix, --ignore=E402, --ignore=E501, --ignore=E731, --ignore=F401]
+        exclude: |
+          (?x)^(
+              examples/inference/vllm/ray-vllm-examples/llm.py|
+              vllm-ext/vllm/extension/ns/__init__.py|
+          )$
 
 # Black needs to be ran after ruff with --fix
 - repo: https://github.com/psf/black
@@ -18,7 +24,18 @@ repos:
     rev: "v0.981"
     hooks:
       - id: mypy
-        exclude: tests
+        exclude: |
+          (?x)^(
+              tests|
+              vllm-ext/vllm/extension/ns/model/ns_loader.py|
+              vllm-ext/vllm/extension/ns/kv_cache/ns_cache.py|
+              vllm-ext/inference_engine/python/inference_engine/|
+              vllm-ext/setup.py|
+              examples/inference/vllm/ray-vllm-examples/llm.py|
+              llm_on_ray/inference/inference_config.py|
+              vllm-ext/vllm/extension/ns/
+          )$
         additional_dependencies:
           - mypy-extensions
           - pydantic==1.10.0
(remaining changed files not shown)
