Highlights
- We're excited to announce SGLang v0.4.1, which now supports DeepSeek V3 - currently the strongest open-source LLM, even surpassing GPT-4o. The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one. SGLang already supported MLA optimizations and DP attention, making it one of the best open-source LLM engines for running DeepSeek models. Special thanks to Meituan's Search & Recommend Platform Team (@ispobock, @HandH1998) and Baseten's Model Performance Team for implementing the model, and to DataCrunch for providing GPU resources. A minimal query sketch follows this list.
- Various improvements to the cache-aware sglang router, torchao integration, and server termination.
- Added a standalone package, sgl-kernel, to support more custom kernels in the codebase.
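For readers who want to try the DeepSeek V3 support right away, here is a minimal sketch of querying a locally served model through SGLang's OpenAI-compatible API. It assumes a server was started with something like `python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`; the port, model path, and parallelism settings are illustrative, not prescriptive.

```python
# Minimal sketch: query a running SGLang server via its OpenAI-compatible endpoint.
# Assumes the server listens on the default port 30000; adjust as needed.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",  # SGLang serves the launched model under this alias
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```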
What's Changed
- Adding SGLang FP8 Utils by @HaiShaw in #2348
- docs: add SGLang v0.4 blog by @zhyncs in #2341
- MLA prefill w/o weight absorption by @ispobock in #2349
- Check gpu availability at server args creation by @MrAta in #2340
- minor: limit the range of vllm versions by @zhyncs in #2350
- Fix Docs CI When Compile Error by @zhaochenyang20 in #2323
- Add Docs For SGLang Native Router by @zhaochenyang20 in #2308
- Make torch TP composable with torch.compile by @kwen2501 in #2352
- move apply_torchao_config_ to model_runner by @jerryzh168 in #2342
- [Minor] Code style improvements by @merrymercy in #2355
- Fix AWQ with enable MLA by @ispobock in #2364
- MoE Expert Parallel by @xiaobochen123 in #2371
- Move FP8 to SGLang by @zhyncs in #2370
- optimize cuda graph max_bs_settings on low-end gpus by @BBuf in #2360
- Add more support for intel Gaudi accelerators by @YangQun1 in #2357
- [router] support `/add_worker` api by @ByronHsu in #2369 (usage sketch after this list)
- docs: update adoption (Meituan) by @zhyncs in #2373
- Use proc.join instead of busy waiting by @merrymercy in #2374
- docs: Improve instructions for supporting new models by @vchzls in #2363
- Fix the overlap for xgrammar by @merrymercy in #2377
- Release v0.4.0.post1 by @merrymercy in #2375
- [Router] remove duplicate char count by @ByronHsu in #2378
- [router] add remove tenant method in the radix tree by @ByronHsu in #2379
- [router] Add remove worker api by @ByronHsu in #2380
- fix: resolve fp8 moe issue by @zhyncs in #2387
- fix: update xgrammar v0.1.6 by @zhyncs in #2390
- Fp8 MoE optimizations on AMD by @HaiShaw in #2388
- minor: update killall script by @zhyncs in #2391
- [router] Health check on worker before added to the router by @ByronHsu in #2392
- Fix shape error that occurred when loading lora weight of gemma2 model. by @upskyy in #2330
- nit: Remove busy waiting on scheduler by @rkooo567 in #2382
- Optimize Triton decoding kernel for long context by @ispobock in #2394
- Update killall_sglang.sh by @merrymercy in #2397
- Remove unused vars in the triton backend by @ispobock in #2401
- Fix a bug with logprob streaming + chunked prefill by @merrymercy in #2403
- fix: specify dtype with begin_forward aka plan by @zhyncs in #2404
- Fix recv_requests by @merrymercy in #2405
- minor: update correct measurement unit by @zhyncs in #2406
- feat: support custom task runner by @zhyncs in #2407
- minor: add random use case by @zhyncs in #2408
- minor: add random flashinfer vs triton use case by @zhyncs in #2409
- Simplify stream_output by @merrymercy in #2398
- [router] Improve cleanup logic by @ByronHsu in #2411
- [Router] fix interrupt from terminal by @ByronHsu in #2413
- [router] defer health checking to router init by @ByronHsu in #2393
- reduce watchdog interval to 5s by @ByronHsu in #2410
- Add a unittest for fused_moe by @BBuf in #2416
- [Minor] Improve code style by @merrymercy in #2419
- [Minor] Improve code style by @merrymercy in #2422
- [feat] Enable chunked prefill for llava-onevision by @Ying1123 in #2412
- Typo fix in router.md by @adarshxs in #2424
- feat: support sgl-kernel PyPI by @zhyncs in #2433
- fix: use manylinux2014_x86_64 tag by @zhyncs in #2434
- fix: compatible with PEP 440 by @zhyncs in #2435
- [router] Refactor: decouple select and send stage by @ByronHsu in #2440
- [router] Use borrow if possible to save cost by @ByronHsu in #2441
- Make torch TP composable with torchao by @kwen2501 in #2436
- chore: update ao v0.7.0 by @zhyncs in #2447
- decoding attention kernel benchmark by @bjmsong in #2425
- Fix model loader for more quantization formats by @merrymercy in #2448
- Fix warmup in bench_offline_throughput.py by @merrymercy in #2449
- Add support for IBM Granite 3.x models by @frreiss in #2437
- [router] Add retries based fault tolerance by @ByronHsu in #2452
- [router] remove main.rs because only lib.rs is used for py binding by @ByronHsu in #2453
- [Core] in batch prefix caching by delay scheduling by @rkooo567 in #2442
- [router] Update doc for dynamic scaling and fault tolerance by @ByronHsu in #2454
- [router] Release router 0.1.0 with dynamic scaling and fault tolerance by @ByronHsu in #2455
- Make request payload size configurable by @MrAta in #2444
- Include version info into the router package by @MrAta in #2456
- Bump sglang-router to 0.1.1 by @MrAta in #2459
- chore: bump v0.0.2 for sgl-kernel by @zhyncs in #2462
- minor: update pypi tag by @zhyncs in #2463
- fix: set runtime path by @zhyncs in #2466
- Rename rust folder to sgl-router by @MrAta in #2464
- feat: support dev image by @zhyncs in #2469
- [Minor] Fix grok model loader by @merrymercy in #2473
- Fix correctness issue for triton decoding kernel by @ispobock in #2479
- format: add clang-format for sgl-kernel by @zhyncs in #2483
- Remove cuda graph batch size adjustment for dp attention by @ispobock in #2484
- hotfix: checking for HIP by @zhyncs in #2485
- sgl-kernel adapt tensorrt llm custom allreduce by @yizhang2077 in #2481
- fix typo by @zhyncs in #2487
- [Benchmark] add a benchmark for hf/vllm/sglang rmsnorm by @BBuf in #2486
- fix moe-ep accuracy issue for fp8 by @xiaobochen123 in #2489
- minor: update flashinfer nightly by @zhyncs in #2490
- Small fixes for torchao quant by @jerryzh168 in #2476
- Simplify pytorch sampling kernel and logit processor by @merrymercy in #2491
- Temporarily disable unit test of torch native attention backend by @merrymercy in #2492
- Revert "Small fixes for torchao quant" by @merrymercy in #2493
- Add a benchmark script for in-batch prefix caching by @merrymercy in #2494
- Small fix for the order of apply_torchao_config by @merrymercy in #2495
- benchmark decoding attention kernel with cudnn by @bjmsong in #2467
- Clean up GPU memory after killing sglang processes by @MrAta in #2457
- ROCm support for sglang.check_env by @hliuca in #2426
- Add lora_path to chat completion by @ccchow in #2438
- Fix openai protocols and pass top_k, min_p by @merrymercy in #2499
- Update readme by @merrymercy in #2500
- feat: add llama3 eval by @zhyncs in #2515
- docs: update README by @zhyncs in #2516
- fix: continue to use flashinfer 0.1.6 temporarily by @zhyncs in #2517
- fix followup #2517 by @zhyncs in #2524
- Add integration with gemlite weight only quant by @jerryzh168 in #2528
- chore: bump v0.4.0.post2 by @zhyncs in #2525
- fix #2528 by @zhyncs in #2541
- Add lora_paths to v1_chat_generate_request by @ccchow in #2529
- docs: update sponsorship (DataCrunch) by @zhyncs in #2523
- [kernel optimize] benchmark write_req_to_token_pool_triton and optimize kernel by @BBuf in #2509
- A better aio rwlock that guarantees the order by @merrymercy in #2547
- Updated documentation for Grammar Backend by @shuaills in #2545
- Fix gemlite import by @merrymercy in #2553
- Reorg moe code by @ispobock in #2563
- [Bench] Flush cache before benchmarking by @Ying1123 in #2566
- Refactor MoE by @HandH1998 in #2575
- fix moe_align_block_size_kernel for shared memory issue by @zhyncs in #2579
- chore: bump 0.0.2.post8 for sgl-kernel by @zhyncs in #2580
- use sgl-kernel moe_align_block_size by @zhyncs in #2581
- chore: bump v0.4.1 by @zhyncs in #2582
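Several of the router changes above (#2369, #2380, #2392, #2455) add dynamic scaling: workers can be attached to or detached from a running router over HTTP. A hedged sketch, assuming the default router port and placeholder worker URLs, with endpoint names as introduced in those PRs:

```python
# Sketch of dynamic scaling against a running sglang-router.
# Host/port values are placeholders for your deployment.
import requests

ROUTER = "http://127.0.0.1:30000"

# Attach a freshly launched worker; the router health-checks it
# before adding it to the pool (#2392).
requests.post(f"{ROUTER}/add_worker", params={"url": "http://127.0.0.1:31000"})

# Detach the worker again, e.g. before scaling the deployment down (#2380).
requests.post(f"{ROUTER}/remove_worker", params={"url": "http://127.0.0.1:31000"})
```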
New Contributors
- @vchzls made their first contribution in #2363
- @upskyy made their first contribution in #2330
- @rkooo567 made their first contribution in #2382
- @adarshxs made their first contribution in #2424
- @frreiss made their first contribution in #2437
- @ccchow made their first contribution in #2438
- @shuaills made their first contribution in #2545
Full Changelog: v0.4.0...v0.4.1