Commit
* add benchmark run script and visualize script
* upd
* update multi replicas
* use --result-dir to parse results
* fix ci proxy
* add test ci
* add license
* fix
* fix
* add autoscaling config
* fix ci
* fix ci
* add package matplotlib
* verify CI test
* verify CI test
* create assets folder to place pictures
* verify CI test
* support openai autoscaling
* remove
* integrate vllm and ns Signed-off-by: Jiafu Zhang <[email protected]>
* update config file
* integrate vllm and ns Signed-off-by: Jiafu Zhang <[email protected]>
* integrate vllm and ns Signed-off-by: Jiafu Zhang <[email protected]>
* remove .eggs Signed-off-by: Jiafu Zhang <[email protected]>
* integration adjustment Signed-off-by: Jiafu Zhang <[email protected]>
* llm on ray deployed Signed-off-by: Jiafu Zhang <[email protected]>
* llm on ray deployed Signed-off-by: Jiafu Zhang <[email protected]>
* llm on ray deployed Signed-off-by: Jiafu Zhang <[email protected]>
* more doc Signed-off-by: Jiafu Zhang <[email protected]>
* more doc for installing vllm ext Signed-off-by: Jiafu Zhang <[email protected]>
* bug fix Signed-off-by: Jiafu Zhang <[email protected]>
* save Signed-off-by: Jiafu Zhang <[email protected]>
* add vllm-ext/requirements.txt Signed-off-by: Jiafu Zhang <[email protected]>
* add CMakeLists.txt Signed-off-by: Jiafu Zhang <[email protected]>
* changed benchmarks Signed-off-by: Jiafu Zhang <[email protected]>
* tuned graph build Signed-off-by: Jiafu Zhang <[email protected]>
* graph build time reduced Signed-off-by: Jiafu Zhang <[email protected]>
* graph build time reduced Signed-off-by: Jiafu Zhang <[email protected]>
* configurable perf stats and copy quant config automatically Signed-off-by: Jiafu Zhang <[email protected]>
* save test script Signed-off-by: Jiafu Zhang <[email protected]>
* add max_batched_tokens parameter Signed-off-by: Jiafu Zhang <[email protected]>
* adjustment and ray-vllm-examples Signed-off-by: Jiafu Zhang <[email protected]>
* perf tuned and improved by disabling mmap for multiple instances Signed-off-by: Jiafu Zhang <[email protected]>
* remove unnecessary thread sync in kernels Signed-off-by: Jiafu Zhang <[email protected]>
* merged ns PR 209 7d49516 Signed-off-by: Jiafu Zhang <[email protected]>
* change order of loops: batch size first, then iteration Signed-off-by: Jiafu Zhang <[email protected]>
* modified some examples Signed-off-by: Jiafu Zhang <[email protected]>
* add more parameters for vllm-ns test Signed-off-by: JoshuaL3000 <[email protected]>
* add more parameters for vllm-ns test Signed-off-by: JoshuaL3000 <[email protected]>
* add more parameters for vllm-ns test Signed-off-by: JoshuaL3000 <[email protected]>
* prevent quantization from being messed up by multiple processes Signed-off-by: Jiafu Zhang <[email protected]>
* fix merge error Signed-off-by: Jiafu Zhang <[email protected]>
* rename py to sh Signed-off-by: Jiafu Zhang <[email protected]>
* fix formatting issue Signed-off-by: Jiafu Zhang <[email protected]>
* fix formatting issue Signed-off-by: Jiafu Zhang <[email protected]>
* fix merge error Signed-off-by: JoshuaL3000 <[email protected]>
* add vllm-ns ci Signed-off-by: Jiafu Zhang <[email protected]>
* remove unnecessary logs Signed-off-by: Jiafu Zhang <[email protected]>
* remove some debug code Signed-off-by: Jiafu Zhang <[email protected]>
* add '--privileged' to docker run Signed-off-by: Jiafu Zhang <[email protected]>
* set unlimited max locked memory for the neural speed engine Signed-off-by: Jiafu Zhang <[email protected]>
* llama-3-8B support Signed-off-by: Jiafu Zhang <[email protected]>
* extend token length limit to 8192 for mha Signed-off-by: Jiafu Zhang <[email protected]>
* extend token length limit to 8192 for mha Signed-off-by: Jiafu Zhang <[email protected]>
* extend token length limit to 8192 for mha (fix) and support different thread counts for prompt decoding and next-token decoding Signed-off-by: Jiafu Zhang <[email protected]>
* extend token length limit to 8192 for mha (fix) and support different thread counts for prompt decoding and next-token decoding Signed-off-by: Jiafu Zhang <[email protected]>
* add llama3 for plain cpu Signed-off-by: Jiafu Zhang <[email protected]>
* benchmark IDC simple/medium/complex/verycomplex prompts Signed-off-by: Jiafu Zhang <[email protected]>
* benchmark IDC simple/medium/complex/verycomplex prompts Signed-off-by: Jiafu Zhang <[email protected]>
* benchmark IDC simple/medium/complex/verycomplex prompts Signed-off-by: Jiafu Zhang <[email protected]>
* add inference_engine resource and app_router resource to distinguish engine workers from router workers, since they have different resource configs Signed-off-by: Jiafu Zhang <[email protected]>
* enhanced benchmark script to support IDC test data Signed-off-by: Jiafu Zhang <[email protected]>
* updated ray startup script to add resources for app_router and inference_engine Signed-off-by: Jiafu Zhang <[email protected]>
* fix first-token latency and next-token latency issue in OpenAI mode in benchmark script Signed-off-by: Jiafu Zhang <[email protected]>
* updated ray startup script to add resources for app_router and inference_engine Signed-off-by: Jiafu Zhang <[email protected]>
* addressed some review comments Signed-off-by: Jiafu Zhang <[email protected]>
* fix lint issue Signed-off-by: Jiafu Zhang <[email protected]>
* address review comment by getting the number of threads from ray num-cpus and setting threads for cases not run with ray Signed-off-by: Jiafu Zhang <[email protected]>

---------

Signed-off-by: Jiafu Zhang <[email protected]>
Signed-off-by: JoshuaL3000 <[email protected]>
Co-authored-by: KepingYan <[email protected]>
Co-authored-by: JoshuaL3000 <[email protected]>