Commit
* add benchmark run script and visualize script
* upd
* update multi replicas
* use --result-dir to parse results
* fix ci proxy
* add test ci
* add license
* fix
* fix
* add autoscaling config
* fix ci
* fix ci
* add package matplotlib
* verify CI test
* verify CI test
* create assets folder to place pictures
* verify CI test
* support openai autoscaling
* remove
* integrate vllm and ns Signed-off-by: Jiafu Zhang <[email protected]>
* update config file
* integrate vllm and ns Signed-off-by: Jiafu Zhang <[email protected]>
* integrate vllm and ns Signed-off-by: Jiafu Zhang <[email protected]>
* remove .eggs Signed-off-by: Jiafu Zhang <[email protected]>
* integration adjustment Signed-off-by: Jiafu Zhang <[email protected]>
* llm on ray deployed Signed-off-by: Jiafu Zhang <[email protected]>
* llm on ray deployed Signed-off-by: Jiafu Zhang <[email protected]>
* llm on ray deployed Signed-off-by: Jiafu Zhang <[email protected]>
* more doc Signed-off-by: Jiafu Zhang <[email protected]>
* more doc for installing vllm ext Signed-off-by: Jiafu Zhang <[email protected]>
* bug fix Signed-off-by: Jiafu Zhang <[email protected]>
* save Signed-off-by: Jiafu Zhang <[email protected]>
* add vllm-ext/requirements.txt Signed-off-by: Jiafu Zhang <[email protected]>
* add CMakeLists.txt Signed-off-by: Jiafu Zhang <[email protected]>
* changed benchmarks Signed-off-by: Jiafu Zhang <[email protected]>
* tuned graph build Signed-off-by: Jiafu Zhang <[email protected]>
* graph build time reduced Signed-off-by: Jiafu Zhang <[email protected]>
* graph build time reduced Signed-off-by: Jiafu Zhang <[email protected]>
* configurable perf stats and copy quant config automatically Signed-off-by: Jiafu Zhang <[email protected]>
* save test script Signed-off-by: Jiafu Zhang <[email protected]>
* add max_batched_tokens parameter Signed-off-by: Jiafu Zhang <[email protected]>
* adjustment and ray-vllm-examples Signed-off-by: Jiafu Zhang <[email protected]>
* perf tuned and improved by disabling mmap for multiple instances Signed-off-by: Jiafu Zhang <[email protected]>
* remove unnecessary thread sync in kernels Signed-off-by: Jiafu Zhang <[email protected]>
* merged ns PR 209 7d49516 Signed-off-by: Jiafu Zhang <[email protected]>
* change order of loops: batch size first, then iteration Signed-off-by: Jiafu Zhang <[email protected]>
* modified some examples Signed-off-by: Jiafu Zhang <[email protected]>
* add more parameters for vllm-ns test Signed-off-by: JoshuaL3000 <[email protected]>
* add more parameters for vllm-ns test Signed-off-by: JoshuaL3000 <[email protected]>
* add more parameters for vllm-ns test Signed-off-by: JoshuaL3000 <[email protected]>
* prevent quantization from being messed up by multiple processes Signed-off-by: Jiafu Zhang <[email protected]>
* fix merge error Signed-off-by: Jiafu Zhang <[email protected]>
* rename py to sh Signed-off-by: Jiafu Zhang <[email protected]>
* fix formatting issue Signed-off-by: Jiafu Zhang <[email protected]>
* fix formatting issue Signed-off-by: Jiafu Zhang <[email protected]>
* fix merge error Signed-off-by: JoshuaL3000 <[email protected]>
* add vllm-ns ci Signed-off-by: Jiafu Zhang <[email protected]>
* remove unnecessary logs Signed-off-by: Jiafu Zhang <[email protected]>
* remove some debug code Signed-off-by: Jiafu Zhang <[email protected]>
* add '--privileged' to docker run Signed-off-by: Jiafu Zhang <[email protected]>
* set unlimited max locked memory for the neural speed engine Signed-off-by: Jiafu Zhang <[email protected]>
* llama-3-8B support Signed-off-by: Jiafu Zhang <[email protected]>
* extend token length limit to 8192 for mha Signed-off-by: Jiafu Zhang <[email protected]>
* extend token length limit to 8192 for mha Signed-off-by: Jiafu Zhang <[email protected]>
* extend token length limit to 8192 for mha (fix) and support different thread counts for prompt decoding and next-token decoding Signed-off-by: Jiafu Zhang <[email protected]>
* extend token length limit to 8192 for mha (fix) and support different thread counts for prompt decoding and next-token decoding Signed-off-by: Jiafu Zhang <[email protected]>
* add llama3 for plain cpu Signed-off-by: Jiafu Zhang <[email protected]>
* benchmark IDC simple/medium/complex/verycomplex prompts Signed-off-by: Jiafu Zhang <[email protected]>
* benchmark IDC simple/medium/complex/verycomplex prompts Signed-off-by: Jiafu Zhang <[email protected]>
* benchmark IDC simple/medium/complex/verycomplex prompts Signed-off-by: Jiafu Zhang <[email protected]>
* add inference_engine resource and app_router resource to distinguish engine workers from router workers, since they have different resource configs Signed-off-by: Jiafu Zhang <[email protected]>
* enhanced benchmark script to support IDC test data Signed-off-by: Jiafu Zhang <[email protected]>
* updated ray startup script to add resources for app_router and inference_engine Signed-off-by: Jiafu Zhang <[email protected]>
* fix first-token latency and next-token latency issue in OpenAI mode in benchmark script Signed-off-by: Jiafu Zhang <[email protected]>
* updated ray startup script to add resources for app_router and inference_engine Signed-off-by: Jiafu Zhang <[email protected]>
* addressed some review comments Signed-off-by: Jiafu Zhang <[email protected]>
* fix lint issue Signed-off-by: Jiafu Zhang <[email protected]>
* address review comment by getting the number of threads from ray num-cpus and setting threads for cases not run with ray Signed-off-by: Jiafu Zhang <[email protected]>

---------

Signed-off-by: Jiafu Zhang <[email protected]>
Signed-off-by: JoshuaL3000 <[email protected]>
Co-authored-by: KepingYan <[email protected]>
Co-authored-by: JoshuaL3000 <[email protected]>