Highlights
- We're excited to announce SGLang v0.4.1, which now supports DeepSeek V3 - currently the strongest open-source LLM, even surpassing GPT-4o. The SGLang and DeepSeek teams worked together to get DeepSeek V3 FP8 running on NVIDIA and AMD GPUs from day one. SGLang already supported MLA optimizations and DP attention, making it one of the best open-source LLM engines for running DeepSeek models. Special thanks to Meituan's Search & Recommend Platform Team (@ispobock, @HandH1998) and Baseten's Model Performance Team for implementing the model, and to DataCrunch for providing GPU resources. A minimal query sketch follows this list.
- Various improvements to the cache-aware sglang router, torchao integration, and server termination.
- Added a standalone package, sgl-kernel, to support more custom kernels in the codebase.
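For readers who want to try the DeepSeek V3 support right away, here is a minimal sketch of querying a locally served model through SGLang's OpenAI-compatible API. It assumes a server was started with something like `python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code`; the port, model path, and parallelism settings are illustrative, not prescriptive.

```python
# Minimal sketch: query a running SGLang server via its OpenAI-compatible endpoint.
# Assumes the server listens on the default port 30000; adjust as needed.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",  # SGLang serves the launched model under this alias
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```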
What's Changed
- Adding SGLang FP8 Utils by @HaiShaw in #2348
- docs: add SGLang v0.4 blog by @zhyncs in #2341
- MLA prefill w/o weight absorption by @ispobock in #2349
- Check gpu availability at server args creation by @MrAta in #2340
- minor: limit the range of vllm versions by @zhyncs in #2350
- Fix Docs CI When Compile Error by @zhaochenyang20 in #2323
- Add Docs For SGLang Native Router by @zhaochenyang20 in #2308
- Make torch TP composable with torch.compile by @kwen2501 in #2352
- move apply_torchao_config_ to model_runner by @jerryzh168 in #2342
- [Minor] Code style improvements by @merrymercy in #2355
- Fix AWQ with enable MLA by @ispobock in #2364
- MoE Expert Parallel by @xiaobochen123 in #2371
- Move FP8 to SGLang by @zhyncs in #2370
- optimize cuda graph max_bs_settings on low-end gpus by @BBuf in #2360
- Add more support for intel Gaudi accelerators by @YangQun1 in #2357
- [router] support `/add_worker` api by @ByronHsu in #2369 (usage sketch after this list)
- docs: update adoption (Meituan) by @zhyncs in #2373
- Use proc.join instead of busy waiting by @merrymercy in #2374
- docs: Improve instructions for supporting new models by @vchzls in #2363
- Fix the overlap for xgrammar by @merrymercy in #2377
- Release v0.4.0.post1 by @merrymercy in #2375
- [Router] remove duplicate char count by @ByronHsu in #2378
- [router] add remove tenant method in the radix tree by @ByronHsu in #2379
- [router] Add remove worker api by @ByronHsu in #2380
- fix: resolve fp8 moe issue by @zhyncs in #2387
- fix: update xgrammar v0.1.6 by @zhyncs in #2390
- Fp8 MoE optimizations on AMD by @HaiShaw in #2388
- minor: update killall script by @zhyncs in #2391
- [router] Health check on worker before added to the router by @ByronHsu in #2392
- Fix shape error that occurred when loading lora weight of gemma2 model. by @upskyy in #2330
- nit: Remove busy waiting on scheduler by @rkooo567 in #2382
- Optimize Triton decoding kernel for long context by @ispobock in #2394
- Update killall_sglang.sh by @merrymercy in #2397
- Remove unused vars in the triton backend by @ispobock in #2401
- Fix a bug with logprob streaming + chunked prefill by @merrymercy in #2403
- fix: specify dtype with begin_forward aka plan by @zhyncs in #2404
- Fix recv_requests by @merrymercy in #2405
- minor: update correct measurement unit by @zhyncs in #2406
- feat: support custom task runner by @zhyncs in #2407
- minor: add random use case by @zhyncs in #2408
- minor: add random flashinfer vs triton use case by @zhyncs in #2409
- Simplify stream_output by @merrymercy in #2398
- [router] Improve cleanup logic by @ByronHsu in #2411
- [Router] fix interrupt from terminal by @ByronHsu in #2413
- [router] defer health checking to router init by @ByronHsu in #2393
- reduce watchdog interval to 5s by @ByronHsu in #2410
- Add a unittest for fused_moe by @BBuf in #2416
- [Minor] Improve code style by @merrymercy in #2419
- [Minor] Improve code style by @merrymercy in #2422
- [feat] Enable chunked prefill for llava-onevision by @Ying1123 in #2412
- Typo fix in router.md by @adarshxs in #2424
- feat: support sgl-kernel PyPI by @zhyncs in #2433
- fix: use manylinux2014_x86_64 tag by @zhyncs in #2434
- fix: compatible with PEP 440 by @zhyncs in #2435
- [router] Refactor: decouple select and send stage by @ByronHsu in #2440
- [router] Use borrow if possible to save cost by @ByronHsu in #2441
- Make torch TP composable with torchao by @kwen2501 in #2436
- chore: update ao v0.7.0 by @zhyncs in #2447
- decoding attention kernel benchmark by @bjmsong in #2425
- Fix model loader for more quantization formats by @merrymercy in #2448
- Fix warmup in bench_offline_throughput.py by @merrymercy in #2449
- Add support for IBM Granite 3.x models by @frreiss in #2437
- [router] Add retries based fault tolerance by @ByronHsu in #2452
- [router] remove main.rs because only lib.rs is used for py binding by @ByronHsu in #2453
- [Core] in batch prefix caching by delay scheduling by @rkooo567 in #2442
- [router] Update doc for dynamic scaling and fault tolerance by @ByronHsu in #2454
- [router] Release router 0.1.0 with dynamic scaling and fault tolerance by @ByronHsu in #2455
- Make request payload size configurable by @MrAta in #2444
- Include version info into the router package by @MrAta in #2456
- Bump sglang-router to 0.1.1 by @MrAta in #2459
- chore: bump v0.0.2 for sgl-kernel by @zhyncs in #2462
- minor: update pypi tag by @zhyncs in #2463
- fix: set runtime path by @zhyncs in #2466
- Rename rust folder to sgl-router by @MrAta in #2464
- feat: support dev image by @zhyncs in #2469
- [Minor] Fix grok model loader by @merrymercy in #2473
- Fix correctness issue for triton decoding kernel by @ispobock in #2479
- format: add clang-format for sgl-kernel by @zhyncs in #2483
- Remove cuda graph batch size adjustment for dp attention by @ispobock in #2484
- hotfix: checking for HIP by @zhyncs in #2485
- sgl-kernel adapt tensorrt llm custom allreduce by @yizhang2077 in #2481
- fix typo by @zhyncs in #2487
- [Benchmark] add a benchmark for hf/vllm/sglang rmsnorm by @BBuf in #2486
- fix moe-ep accuracy issue for fp8 by @xiaobochen123 in #2489
- minor: update flashinfer nightly by @zhyncs in #2490
- Small fixes for torchao quant by @jerryzh168 in #2476
- Simplify pytorch sampling kernel and logit processor by @merrymercy in #2491
- Temporarily disable unit test of torch native attention backend by @merrymercy in #2492
- Revert "Small fixes for torchao quant" by @merrymercy in #2493
- Add a benchmark script for in-batch prefix caching by @merrymercy in #2494
- Small fix for the order of apply_torchao_config by @merrymercy in #2495
- benchmark decoding attention kernel with cudnn by @bjmsong in #2467
- Clean up GPU memory after killing sglang processes by @MrAta in #2457
- ROCm support for sglang.check_env by @hliuca in #2426
- Add lora_path to chat completion by @ccchow in #2438
- Fix openai protocols and pass top_k, min_p by @merrymercy in #2499
- Update readme by @merrymercy in #2500
- feat: add llama3 eval by @zhyncs in #2515
- docs: update README by @zhyncs in #2516
- fix: continue to use flashinfer 0.1.6 temporarily by @zhyncs in #2517
- fix followup #2517 by @zhyncs in #2524
- Add integration with gemlite weight only quant by @jerryzh168 in #2528
- chore: bump v0.4.0.post2 by @zhyncs in #2525
- fix #2528 by @zhyncs in #2541
- Add lora_paths to v1_chat_generate_request by @ccchow in #2529
- docs: update sponsorship (DataCrunch) by @zhyncs in #2523
- [kernel optimize] benchmark write_req_to_token_pool_triton and optimize kernel by @BBuf in #2509
- A better aio rwlock that guarantees the order by @merrymercy in #2547
- Updated documentation for Grammar Backend by @shuaills in #2545
- Fix gemlite import by @merrymercy in #2553
- Reorg moe code by @ispobock in #2563
- [Bench] Flush cache before benchmarking by @Ying1123 in #2566
- Refactor MoE by @HandH1998 in #2575
- fix moe_align_block_size_kernel for shared memory issue by @zhyncs in #2579
- chore: bump 0.0.2.post8 for sgl-kernel by @zhyncs in #2580
- use sgl-kernel moe_align_block_size by @zhyncs in #2581
- chore: bump v0.4.1 by @zhyncs in #2582
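Several of the router changes above (#2369, #2380, #2392, #2455) add dynamic scaling: workers can be attached to or detached from a running router over HTTP. A hedged sketch, assuming the default router port and placeholder worker URLs, with endpoint names as introduced in those PRs:

```python
# Sketch of dynamic scaling against a running sglang-router.
# Host/port values are placeholders for your deployment.
import requests

ROUTER = "http://127.0.0.1:30000"

# Attach a freshly launched worker; the router health-checks it
# before adding it to the pool (#2392).
requests.post(f"{ROUTER}/add_worker", params={"url": "http://127.0.0.1:31000"})

# Detach the worker again, e.g. before scaling the deployment down (#2380).
requests.post(f"{ROUTER}/remove_worker", params={"url": "http://127.0.0.1:31000"})
```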
New Contributors
- @vchzls made their first contribution in #2363
- @upskyy made their first contribution in #2330
- @rkooo567 made their first contribution in #2382
- @adarshxs made their first contribution in #2424
- @frreiss made their first contribution in #2437
- @ccchow made their first contribution in #2438
- @shuaills made their first contribution in #2545
Full Changelog: v0.4.0...v0.4.1