From 76ddad7d2e559484d0e0ba03b7a9f95840e9832e Mon Sep 17 00:00:00 2001
From: Duyi-Wang
Date: Wed, 5 Jun 2024 13:06:03 +0800
Subject: [PATCH] [Version] v1.7.0. (#433)

---
 CHANGELOG.md | 22 ++++++++++++++++++++++
 VERSION      |  2 +-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 97c91e16..e58d48ab 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,4 +1,26 @@
 # CHANGELOG
+# [Version v1.7.0](https://github.com/intel/xFasterTransformer/releases/tag/v1.7.0)
+v1.7.0 - Continuous batching feature supported.
+
+## Functionality
+- Refactor the framework to support the continuous batching feature. `vllm-xft`, a fork of vLLM, integrates the xFasterTransformer backend and maintains compatibility with most of the official vLLM's features.
+- Remove the FP32 data type option for the KV cache.
+- Add a `get_env()` Python API to get the recommended LD_PRELOAD settings.
+- Add a GPU build option for the Intel Arc GPU series.
+- Expose the interfaces of the LLaMA model, including Attention and decoder.
+
+## Performance
+- Update xDNN to release `v1.5.1`.
+- Baichuan series models support a full FP16 pipeline to improve performance.
+- More FP16 data type kernels added, including MHA, MLP, YARN rotary_embedding, rmsnorm and rope.
+- Kernel implementation of crossAttnByHead.
+
+## Dependency
+- Bump `torch` to `2.3.0`.
+
+## BUG fix
+- Fixed the segmentation fault error when running with more than 4 ranks.
+- Fixed core dump and hang bugs when running across nodes.
 
 # [Version v1.6.0](https://github.com/intel/xFasterTransformer/releases/tag/v1.6.0)
 v1.6.0 - Llama3 and Qwen2 series models supported.
diff --git a/VERSION b/VERSION
index ce6a70b9..9dbb0c00 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-1.6.0
\ No newline at end of file
+1.7.0
\ No newline at end of file