Highlights
This release features approximately 660 commits from 251 contributors (86 new contributors).
Breaking Changes:
- Async scheduling is now enabled by default. Users who experience issues can disable it with --no-async-scheduling. Some not-yet-supported configurations are excluded: pipeline parallelism, the CPU backend, and non-MTP/Eagle speculative decoding.
- PyTorch 2.9.1 is now required and the default wheel is compiled against cu129.
- Deprecated quantization schemes have been removed (#31688, #31285).
- When using speculative decoding, unsupported sampling parameters will fail rather than being silently ignored (#31982).
Key Improvements:
- Async scheduling enabled by default (#27614): Overlaps engine core scheduling with GPU execution, improving throughput without user configuration. Now also works with speculative decoding (#31998) and structured outputs (#29821).
- gRPC server entrypoint (#30190): An alternative to the REST API with a binary protocol and HTTP/2 multiplexing.
- --max-model-len auto (#29431): Automatically fits the context length to available GPU memory, eliminating OOM failures at startup.
- Model inspection view (#29450): View your model's modules, attention backends, and quantization in vLLM by setting VLLM_LOG_MODEL_INSPECTION=1 or by simply printing the LLM object.
- Model Runner V2 enhancements: UVA block tables (#31965), M-RoPE (#32143), logit_bias/allowed_token_ids/min_tokens support (#32163). Note that Model Runner V2 is still experimental and disabled by default.
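A minimal sketch of the model inspection view described above (the environment variable name is from this release; the model id and `LLM` usage are illustrative and assume vLLM is installed):

```python
import os

# Enable vLLM's model inspection logging before the engine is created.
os.environ["VLLM_LOG_MODEL_INSPECTION"] = "1"

# With vLLM installed, the inspection view can also be printed directly:
# from vllm import LLM
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model id
# print(llm)

print(os.environ["VLLM_LOG_MODEL_INSPECTION"])  # → 1
```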
Model Support
New Model Architectures:
- Grok-2 with tiktoken tokenizer (#31847)
- LFM2-VL vision-language model (#31758)
- MiMo-V2-Flash (#30836)
- openPangu MoE (#28775)
- IQuestCoder (#31575)
- Nemotron Parse 1.1 (#30864)
- GLM-ASR audio (#31436)
- Isaac vision model v0.1/v0.2 (#28367, #31550)
- Kanana-1.5-v-3b-instruct (#29384)
- K-EXAONE-236B-A23B MoE (#31621)
LoRA Support Expansion:
- Multimodal tower/connector LoRA (#26674): LLaVA (#31513), BLIP2 (#31620), PaliGemma (#31656), Pixtral (#31724), DotsOCR (#31825), GLM4-V (#31652)
- DeepSeek-OCR (#31569), Qwen3-Next (#31719), NemotronH (#31539), PLaMo 2/3 (#31322)
- Vision LoRA mm_processor_cache support (#31927)
- MoE expert base_layer loading (#31104)
Model Enhancements:
- Qwen3-VL as reranker (#31890)
- DeepSeek v3.2 chat prefix completion (#31147)
- GLM-4.5/GLM-4.7 enable_thinking: false support (#31788)
- Ernie4.5-VL video timestamps (#31274)
- Score template expansion (#31335)
- LLaMa4 vision encoder compilation (#30709)
- NemotronH quantized attention (#31898)
Engine Core
- Async scheduling default with spec decode (#27614, #31998) and structured outputs (#29821)
- Hybrid allocator + KV connector (#30166) with multiple KV cache groups (#31707)
- Triton attention: encoder-only/cross attention (#31406), cross-layer blocks (#30687)
- Mamba2 prefix cache optimization (#28047)
- Batch invariant LoRA (#30097)
- LoRA name in BlockStored for KV-cache reconstruction (#27577)
- Request ID collision prevention (#27987)
- Dense model DP without overhead (#30739)
- Async + spec decode penalties/bad_words (#30495)
Hardware & Performance
CUTLASS MoE Optimizations:
- 2.9% throughput and 10.8% TTFT improvement via fill(0) optimization (#31754)
- 5.3% throughput and 2.2% TTFT improvement via problem-size calculation (#31830)
- Fused SiLU+Mul+Quant for NVFP4 (#31832)
- NVFP4 stride fusion (#31837)
Other Performance:
- GDN attention decode speedup (Qwen3-Next) (#31722)
- Fused RoPE + MLA KV-cache write (#25774)
- Sliding window attention optimization (#31984)
- FlashInfer DeepGEMM swapAB SM90 (#29213)
- Unpermute-aware fused MoE + small-batch fallback (#29354)
- GDN Attention blocking copy removal (#31167)
- FusedMoE LoRA small rank performance (#32019)
- EPLB numpy optimization (#29499)
- FlashInfer rotary for DeepSeek (#30729)
- Vectorized activations (#29512)
- NUMA interleaved memory (#30800)
- Async spec decode logprobs (#31336)
Hardware Configs:
- SM103 support (#30705, #31150)
- B300 Blackwell MoE configs (#30629)
- Qwen3-Next FP8 CUTLASS configs (#29553)
- Qwen3Moe B200 Triton configs (#31448)
- GLM-4.5/4.6 RTX Pro 6000 kernels (#31407)
- MiniMax-M2/M2.1 QKNorm (#31493)
- NVFP4 small batch tuning (#30897)
Platform:
- ROCm: AITER RMSNorm fusion (#26575), MTP for AITER MLA (#28624), moriio connector (#29304), xgrammar upstream (#31327)
- XPU: FP8 streaming quant (#30944), custom workers (#30935)
- CPU: Head sizes 80/112 (#31968), async disabled by default (#31525), LoRA MoE CPU pinning (#31317)
- TPU: tpu-inference path (#30808), Sophgo docs (#30949)
Large Scale Serving
- XBO (Extended Dual-Batch Overlap) (#30120)
- NIXL asymmetric TP (P > D tensor-parallel-size) (#27274)
- NIXL heterogeneous BlockSize/kv_layout (#30275)
- Cross-layers KV layout for MultiConnector (#30761)
- Mooncake protocol expansion (#30133)
- LMCache KV cache registration (#31397)
- EPLB default all2all backend (#30559)
Quantization
- Marlin for Turing (sm75) (#29901, #31000)
- Quark int4-fp8 w4a8 MoE (#30071)
- MXFP4 W4A16 dense models (#31926)
- ModelOpt FP8 variants (FP8_PER_CHANNEL_PER_TOKEN, FP8_PB_WO) (#30957)
- ModelOpt KV cache quantization update (#31895)
- NVFP4 Marlin for NVFP4A16 MoEs (#30881)
- Static quant all group shapes (#30833)
- Default MXFP4 LoRA backend: Marlin (#30598)
- compressed-tensors 0.13.0 (#30799)
API & Frontend
New Features:
- gRPC server (#30190)
- --max-model-len auto (#29431)
- Model inspection view (#29450)
- Offline FastAPI docs (#30184)
- attention_config in LLM() (#30710)
- MFU metrics (#30738)
- Iteration logging + NVTX (#31193)
- reasoning_effort parameter (#31956)
Tool Calling:
- FunctionGemma parser (#31218)
- GLM-4.7 parser (#30876)
- Kimi K2 update (#31207)
CLI:
- -ep shorthand for --enable-expert-parallel (#30890)
- Complete help messages (#31226)
- Bench serve auto-discovery + --input-len (#30816)
- Spec decode acceptance stats (#31739)
- --enable-log-deltas (renamed) (#32020)
- --default-chat-template-kwargs (#31343)
API:
- /server_info environment info (#31899)
- MCP streaming in Responses API (#31761)
- /embeddings continue_final_message (#31497)
- Reranking score templates (#30550)
- Chat template warmup (#30700)
- Configurable handshake timeout (#27444)
- Better 500 errors (#20610)
- Worker init logging (#29493)
- Bench error reporting (#31808)
- Corrupted video recovery (#29197)
- Spec-decode param validation (#31982)
- Validation error metadata (#30134)
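As a sketch of how a new request-level parameter from this release might appear on the wire, here is a hypothetical OpenAI-compatible chat payload using reasoning_effort (#31956); the model name and message content are illustrative:

```python
import json

# Hypothetical chat completion request body using the new
# reasoning_effort parameter; model name is illustrative.
payload = {
    "model": "my-reasoning-model",
    "messages": [{"role": "user", "content": "Summarize the v0.14.0 release."}],
    "reasoning_effort": "low",
}
body = json.dumps(payload)
print(body)
```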
Security
- Prevent token leaks in crash logs (#30751)
- weights_only=True in torch.load (#32045)
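The weights_only=True hardening matters because torch.load is built on pickle, and unpickling untrusted data can execute arbitrary code. A stdlib-only sketch of the underlying risk (no torch required; run_side_effect is a hypothetical stand-in for attacker-controlled code):

```python
import pickle

executed = []

def run_side_effect(msg):
    # Stand-in for arbitrary code an attacker could embed in a checkpoint.
    executed.append(msg)
    return msg

class Payload:
    # __reduce__ tells pickle to call an arbitrary callable on load,
    # which is why loading untrusted checkpoints is dangerous.
    def __reduce__(self):
        return (run_side_effect, ("side effect ran",))

blob = pickle.dumps(Payload())
obj = pickle.loads(blob)  # unpickling invokes run_side_effect(...)
print(executed)           # → ['side effect ran']
```

With weights_only=True, torch.load restricts unpickling to tensor and primitive types, refusing payloads like this.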
Dependencies
- PyTorch 2.9.1 (#28495)
- compressed-tensors 0.13.0 (#30799)
- CUDA 13 LMCache/NIXL in Docker (#30913)
- Configurable NVSHMEM version (#30732)
Bug Fixes (User-Facing)
- Invalid UTF-8 tokens (#28874)
- CPU RoPE gibberish with --enforce-eager (#31643)
- Tool call streaming finish chunk (#31438)
- Encoder cache leak causing stuck CPU scheduling (#31857)
- Engine crash: tools + response_format (#32127)
- Voxtral transcription API (#31388)
- Safetensors download optimization (#30537)
Deprecations
- Deprecated quantization schemes removed (#31688, #31285)
- seed_everything deprecated (#31659)
Documentation
- vllm-metal plugin docs (#31174)
- Claude Code example (#31188)
- CustomOp developer guide (#30886)
New Contributors 🎉
- @penfree made their first contribution in https://github.com/vllm-project/vllm/pull/30237
- @jiangkuaixue123 made their first contribution in https://github.com/vllm-project/vllm/pull/30120
- @jr-shen made their first contribution in https://github.com/vllm-project/vllm/pull/29663
- @grzegorz-k-karch made their first contribution in https://github.com/vllm-project/vllm/pull/30795
- @shanjiaz made their first contribution in https://github.com/vllm-project/vllm/pull/30799
- @Somoku made their first contribution in https://github.com/vllm-project/vllm/pull/29569
- @baoqian426 made their first contribution in https://github.com/vllm-project/vllm/pull/30841
- @SongDI911 made their first contribution in https://github.com/vllm-project/vllm/pull/30852
- @www-spam made their first contribution in https://github.com/vllm-project/vllm/pull/30827
- @Xunzhuo made their first contribution in https://github.com/vllm-project/vllm/pull/30844
- @TheCodeWrangler made their first contribution in https://github.com/vllm-project/vllm/pull/30700
- @SungMinCho made their first contribution in https://github.com/vllm-project/vllm/pull/30738
- @sarathc-cerebras made their first contribution in https://github.com/vllm-project/vllm/pull/30188
- @wzyrrr made their first contribution in https://github.com/vllm-project/vllm/pull/30949
- @navmarri14 made their first contribution in https://github.com/vllm-project/vllm/pull/30629
- @HaloWorld made their first contribution in https://github.com/vllm-project/vllm/pull/30867
- @jeffreywang-anyscale made their first contribution in https://github.com/vllm-project/vllm/pull/31013
- @AmeenP made their first contribution in https://github.com/vllm-project/vllm/pull/31093
- @westers made their first contribution in https://github.com/vllm-project/vllm/pull/31071
- @CedricHwong made their first contribution in https://github.com/vllm-project/vllm/pull/30957
- @c0de128 made their first contribution in https://github.com/vllm-project/vllm/pull/31114
- @Bounty-hunter made their first contribution in https://github.com/vllm-project/vllm/pull/30242
- @jzakrzew made their first contribution in https://github.com/vllm-project/vllm/pull/30550
- @1643661061leo made their first contribution in https://github.com/vllm-project/vllm/pull/30760
- @NickCao made their first contribution in https://github.com/vllm-project/vllm/pull/30070
- @amithkk made their first contribution in https://github.com/vllm-project/vllm/pull/31212
- @gateremark made their first contribution in https://github.com/vllm-project/vllm/pull/31218
- @Tiiiktak made their first contribution in https://github.com/vllm-project/vllm/pull/31274
- @oscardev256 made their first contribution in https://github.com/vllm-project/vllm/pull/28367
- @Jzz1943 made their first contribution in https://github.com/vllm-project/vllm/pull/31448
- @mratsim made their first contribution in https://github.com/vllm-project/vllm/pull/31407
- @twjww made their first contribution in https://github.com/vllm-project/vllm/pull/31445
- @amittell made their first contribution in https://github.com/vllm-project/vllm/pull/31438
- @ricky-chaoju made their first contribution in https://github.com/vllm-project/vllm/pull/30184
- @effortprogrammer made their first contribution in https://github.com/vllm-project/vllm/pull/31343
- @ZT-AIA made their first contribution in https://github.com/vllm-project/vllm/pull/31408
- @rogerxfeng8 made their first contribution in https://github.com/vllm-project/vllm/pull/31522
- @kevin-pw made their first contribution in https://github.com/vllm-project/vllm/pull/31497
- @vintipandey made their first contribution in https://github.com/vllm-project/vllm/pull/31505
- @SameerAsal made their first contribution in https://github.com/vllm-project/vllm/pull/31520
- @Dylan1229 made their first contribution in https://github.com/vllm-project/vllm/pull/31546
- @reaganjlee made their first contribution in https://github.com/vllm-project/vllm/pull/29105
- @zhima771 made their first contribution in https://github.com/vllm-project/vllm/pull/31569
- @jayhemnani9910 made their first contribution in https://github.com/vllm-project/vllm/pull/31513
- @Tmn07 made their first contribution in https://github.com/vllm-project/vllm/pull/31572
- @vsourirajan made their first contribution in https://github.com/vllm-project/vllm/pull/31549
- @labAxiaoming made their first contribution in https://github.com/vllm-project/vllm/pull/31601
- @massif-01 made their first contribution in https://github.com/vllm-project/vllm/pull/31604
- @PHOEBEMOON0802 made their first contribution in https://github.com/vllm-project/vllm/pull/31147
- @tpopp made their first contribution in https://github.com/vllm-project/vllm/pull/29993
- @ppppqp made their first contribution in https://github.com/vllm-project/vllm/pull/31620
- @zzzzwwjj made their first contribution in https://github.com/vllm-project/vllm/pull/31674
- @Catacomba made their first contribution in https://github.com/vllm-project/vllm/pull/30322
- @kunpengW-code made their first contribution in https://github.com/vllm-project/vllm/pull/31669
- @johncalesp made their first contribution in https://github.com/vllm-project/vllm/pull/28874
- @BlankRH made their first contribution in https://github.com/vllm-project/vllm/pull/31800
- @guicho271828 made their first contribution in https://github.com/vllm-project/vllm/pull/20610
- @ReinforcedKnowledge made their first contribution in https://github.com/vllm-project/vllm/pull/31055
- @vSeamar made their first contribution in https://github.com/vllm-project/vllm/pull/29197
- @A1c0r-Z made their first contribution in https://github.com/vllm-project/vllm/pull/31656
- @MrIceCreamMan made their first contribution in https://github.com/vllm-project/vllm/pull/31465
- @tianshu-Michael-yu made their first contribution in https://github.com/vllm-project/vllm/pull/31841
- @weiyu0824 made their first contribution in https://github.com/vllm-project/vllm/pull/30808
- @andyl98 made their first contribution in https://github.com/vllm-project/vllm/pull/31757
- @JaredforReal made their first contribution in https://github.com/vllm-project/vllm/pull/31779
- @katec846 made their first contribution in https://github.com/vllm-project/vllm/pull/29213
- @kfirtoledo made their first contribution in https://github.com/vllm-project/vllm/pull/30761
- @Ayobami-00 made their first contribution in https://github.com/vllm-project/vllm/pull/31868
- @ShaanveerS made their first contribution in https://github.com/vllm-project/vllm/pull/31825
- @Zyyeric made their first contribution in https://github.com/vllm-project/vllm/pull/31652
- @wangshangsam made their first contribution in https://github.com/vllm-project/vllm/pull/31775
- @devbyteai made their first contribution in https://github.com/vllm-project/vllm/pull/31536
- @BJWang-ant made their first contribution in https://github.com/vllm-project/vllm/pull/31719
- @dangoldbj made their first contribution in https://github.com/vllm-project/vllm/pull/31847
- @maylikenoother made their first contribution in https://github.com/vllm-project/vllm/pull/31610
- @yxing-bj made their first contribution in https://github.com/vllm-project/vllm/pull/31575
- @xbfs made their first contribution in https://github.com/vllm-project/vllm/pull/31948
- @RunkaiTao made their first contribution in https://github.com/vllm-project/vllm/pull/29354
- @AkshatSh made their first contribution in https://github.com/vllm-project/vllm/pull/31550
- @frelam made their first contribution in https://github.com/vllm-project/vllm/pull/31857
- @shyeh25 made their first contribution in https://github.com/vllm-project/vllm/pull/31617
- @andikarachman made their first contribution in https://github.com/vllm-project/vllm/pull/32092
- @minimAluminiumalism made their first contribution in https://github.com/vllm-project/vllm/pull/32158
- @andyzhangx made their first contribution in https://github.com/vllm-project/vllm/pull/32185
- @sanghoon-yn made their first contribution in https://github.com/vllm-project/vllm/pull/31956
- @potatosalad made their first contribution in https://github.com/vllm-project/vllm/pull/32212
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.13.0...v0.14.0