vLLM v0.14.0

Highlights

This release features approximately 660 commits from 251 contributors, 86 of whom are new.

Breaking Changes:

  • Async scheduling is now enabled by default. Users who experience issues can disable it with --no-async-scheduling.
    • The default excludes some not-yet-supported configurations: pipeline parallelism, the CPU backend, and speculative decoding other than MTP/Eagle.
  • PyTorch 2.9.1 is now required, and the default wheel is compiled against CUDA 12.9 (cu129).
  • Deprecated quantization schemes have been removed (#31688, #31285).
  • When using speculative decoding, unsupported sampling parameters now fail with an error instead of being silently ignored (#31982).
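
If the new async-scheduling default causes problems in your deployment, it can be switched off at launch. A minimal sketch (the model name is only a placeholder):

```shell
# Launch the OpenAI-compatible server with async scheduling disabled.
vllm serve Qwen/Qwen2.5-7B-Instruct --no-async-scheduling
```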

Key Improvements:

  • Async scheduling enabled by default (#27614): Overlaps engine core scheduling with GPU execution, improving throughput without user configuration. Now also works with speculative decoding (#31998) and structured outputs (#29821).
  • gRPC server entrypoint (#30190): An alternative to the REST API that uses a binary protocol with HTTP/2 multiplexing.
  • --max-model-len auto (#29431): Automatically fits the context length to available GPU memory, eliminating out-of-memory failures at startup.
  • Model inspection view (#29450): Inspect the modules, attention backends, and quantization of your model by setting VLLM_LOG_MODEL_INSPECTION=1 or by simply printing the LLM object.
  • Model Runner V2 enhancements: UVA block tables (#31965), M-RoPE (#32143), logit_bias/allowed_token_ids/min_tokens support (#32163).
    • Note that Model Runner V2 is still experimental and disabled by default.
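
The two startup-time improvements above can be combined on the command line. A minimal sketch (the model name is only a placeholder):

```shell
# Fit the context length to available GPU memory automatically, and log
# the model inspection view (modules, attention backends, quantization).
VLLM_LOG_MODEL_INSPECTION=1 vllm serve Qwen/Qwen2.5-0.5B-Instruct \
    --max-model-len auto
```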

Model Support

New Model Architectures:

  • Grok-2 with tiktoken tokenizer (#31847)
  • LFM2-VL vision-language model (#31758)
  • MiMo-V2-Flash (#30836)
  • openPangu MoE (#28775)
  • IQuestCoder (#31575)
  • Nemotron Parse 1.1 (#30864)
  • GLM-ASR audio (#31436)
  • Isaac vision model v0.1/v0.2 (#28367, #31550)
  • Kanana-1.5-v-3b-instruct (#29384)
  • K-EXAONE-236B-A23B MoE (#31621)

LoRA Support Expansion:

  • Multimodal tower/connector LoRA (#26674): LLaVA (#31513), BLIP2 (#31620), PaliGemma (#31656), Pixtral (#31724), DotsOCR (#31825), GLM4-V (#31652)
  • DeepSeek-OCR (#31569), Qwen3-Next (#31719), NemotronH (#31539), PLaMo 2/3 (#31322)
  • Vision LoRA mm_processor_cache support (#31927)
  • MoE expert base_layer loading (#31104)

Model Enhancements:

  • Qwen3-VL as reranker (#31890)
  • DeepSeek v3.2 chat prefix completion (#31147)
  • GLM-4.5/GLM-4.7 enable_thinking: false (#31788)
  • Ernie4.5-VL video timestamps (#31274)
  • Score template expansion (#31335)
  • LLaMa4 vision encoder compilation (#30709)
  • NemotronH quantized attention (#31898)

Engine Core

  • Async scheduling default with spec decode (#27614, #31998) and structured outputs (#29821)
  • Hybrid allocator + KV connector (#30166) with multiple KV cache groups (#31707)
  • Triton attention: encoder-only/cross attention (#31406), cross-layer blocks (#30687)
  • Mamba2 prefix cache optimization (#28047)
  • Batch invariant LoRA (#30097)
  • LoRA name in BlockStored for KV-cache reconstruction (#27577)
  • Request ID collision prevention (#27987)
  • Dense model DP without overhead (#30739)
  • Async + spec decode penalties/bad_words (#30495)

Hardware & Performance

CUTLASS MoE Optimizations:

  • 2.9% throughput + 10.8% TTFT via fill(0) optimization (#31754)
  • 5.3% throughput + 2.2% TTFT via problem size calculation (#31830)
  • Fused SiLU+Mul+Quant for NVFP4 (#31832)
  • NVFP4 stride fusion (#31837)

Other Performance:

  • GDN attention decode speedup (Qwen3-Next) (#31722)
  • Fused RoPE + MLA KV-cache write (#25774)
  • Sliding window attention optimization (#31984)
  • FlashInfer DeepGEMM swapAB SM90 (#29213)
  • Unpermute-aware fused MoE + small-batch fallback (#29354)
  • GDN Attention blocking copy removal (#31167)
  • FusedMoE LoRA small rank performance (#32019)
  • EPLB numpy optimization (#29499)
  • FlashInfer rotary for DeepSeek (#30729)
  • Vectorized activations (#29512)
  • NUMA interleaved memory (#30800)
  • Async spec decode logprobs (#31336)

Hardware Configs:

  • SM103 support (#30705, #31150)
  • B300 Blackwell MoE configs (#30629)
  • Qwen3-Next FP8 CUTLASS configs (#29553)
  • Qwen3Moe B200 Triton configs (#31448)
  • GLM-4.5/4.6 RTX Pro 6000 kernels (#31407)
  • MiniMax-M2/M2.1 QKNorm (#31493)
  • NVFP4 small batch tuning (#30897)

Platform:

  • ROCm: AITER RMSNorm fusion (#26575), MTP for AITER MLA (#28624), moriio connector (#29304), xgrammar upstream (#31327)
  • XPU: FP8 streaming quant (#30944), custom workers (#30935)
  • CPU: Head sizes 80/112 (#31968), async disabled by default (#31525), LoRA MoE CPU pinning (#31317)
  • TPU: tpu-inference path (#30808), Sophgo docs (#30949)

Large Scale Serving

  • XBO (Extended Dual-Batch Overlap) (#30120)
  • NIXL asymmetric TP (prefill tensor-parallel size larger than decode's) (#27274)
  • NIXL heterogeneous BlockSize/kv_layout (#30275)
  • Cross-layers KV layout for MultiConnector (#30761)
  • Mooncake protocol expansion (#30133)
  • LMCache KV cache registration (#31397)
  • EPLB default all2all backend (#30559)

Quantization

  • Marlin for Turing (sm75) (#29901, #31000)
  • Quark int4-fp8 w4a8 MoE (#30071)
  • MXFP4 W4A16 dense models (#31926)
  • ModelOpt FP8 variants (FP8_PER_CHANNEL_PER_TOKEN, FP8_PB_WO) (#30957)
  • ModelOpt KV cache quantization update (#31895)
  • NVFP4 Marlin for NVFP4A16 MoEs (#30881)
  • Static quant all group shapes (#30833)
  • Default MXFP4 LoRA backend: Marlin (#30598)
  • compressed-tensors 0.13.0 (#30799)

API & Frontend

New Features:

  • gRPC server (#30190)
  • --max-model-len auto (#29431)
  • Model inspection view (#29450)
  • Offline FastAPI docs (#30184)
  • attention_config in LLM() (#30710)
  • MFU metrics (#30738)
  • Iteration logging + NVTX (#31193)
  • reasoning_effort parameter (#31956)
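
As a hypothetical request sketch for the new reasoning_effort parameter: this assumes a vLLM server on localhost:8000 and that the OpenAI-compatible chat endpoint accepts reasoning_effort in the request body the way the OpenAI API does (the model name and port are placeholders):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-reasoning-model",
        "messages": [{"role": "user", "content": "Summarize this release."}],
        "reasoning_effort": "low"
      }'
```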

Tool Calling:

  • FunctionGemma parser (#31218)
  • GLM-4.7 parser (#30876)
  • Kimi K2 update (#31207)

CLI:

  • -ep for --enable-expert-parallel (#30890)
  • Complete help messages (#31226)
  • Bench serve auto-discovery + --input-len (#30816)
  • Spec decode acceptance stats (#31739)
  • --enable-log-deltas (renamed) (#32020)
  • --default-chat-template-kwargs (#31343)

API:

  • /server_info env info (#31899)
  • MCP streaming in Responses API (#31761)
  • /embeddings continue_final_message (#31497)
  • Reranking score templates (#30550)
  • Chat template warmup (#30700)
  • Configurable handshake timeout (#27444)
  • Better 500 errors (#20610)
  • Worker init logging (#29493)
  • Bench error reporting (#31808)
  • Corrupted video recovery (#29197)
  • Spec-decode param validation (#31982)
  • Validation error metadata (#30134)

Security

  • Prevent token leaks in crash logs (#30751)
  • weights_only=True in torch.load (#32045)
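
The weights_only change mirrors PyTorch's own hardening of checkpoint loading: with weights_only=True, the unpickler reconstructs only tensors and a small set of primitive containers, so a malicious checkpoint cannot execute arbitrary code on load. A minimal standalone sketch:

```python
import io

import torch

# Serialize a plain state-dict-like object to an in-memory buffer.
buf = io.BytesIO()
torch.save({"w": torch.zeros(2, 3)}, buf)
buf.seek(0)

# weights_only=True restricts unpickling to tensors and primitive
# containers, rejecting arbitrary Python objects in the checkpoint.
state = torch.load(buf, weights_only=True)
print(state["w"].shape)  # torch.Size([2, 3])
```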

Dependencies

  • PyTorch 2.9.1 (#28495)
  • compressed-tensors 0.13.0 (#30799)
  • CUDA 13 LMCache/NIXL in Docker (#30913)
  • Configurable NVSHMEM version (#30732)

Bug Fixes (User-Facing)

  • Invalid UTF-8 tokens (#28874)
  • CPU RoPE gibberish with --enforce-eager (#31643)
  • Tool call streaming finish chunk (#31438)
  • Encoder cache leak that left CPU scheduling stuck (#31857)
  • Engine crash: tools + response_format (#32127)
  • Voxtral transcription API (#31388)
  • Safetensors download optimization (#30537)

Deprecations

  • Deprecated quantization schemes removed (#31688, #31285)
  • seed_everything deprecated (#31659)

Documentation

  • vllm-metal plugin docs (#31174)
  • Claude Code example (#31188)
  • CustomOp developer guide (#30886)

New Contributors 🎉

  • @penfree made their first contribution in https://github.com/vllm-project/vllm/pull/30237
  • @jiangkuaixue123 made their first contribution in https://github.com/vllm-project/vllm/pull/30120
  • @jr-shen made their first contribution in https://github.com/vllm-project/vllm/pull/29663
  • @grzegorz-k-karch made their first contribution in https://github.com/vllm-project/vllm/pull/30795
  • @shanjiaz made their first contribution in https://github.com/vllm-project/vllm/pull/30799
  • @Somoku made their first contribution in https://github.com/vllm-project/vllm/pull/29569
  • @baoqian426 made their first contribution in https://github.com/vllm-project/vllm/pull/30841
  • @SongDI911 made their first contribution in https://github.com/vllm-project/vllm/pull/30852
  • @www-spam made their first contribution in https://github.com/vllm-project/vllm/pull/30827
  • @Xunzhuo made their first contribution in https://github.com/vllm-project/vllm/pull/30844
  • @TheCodeWrangler made their first contribution in https://github.com/vllm-project/vllm/pull/30700
  • @SungMinCho made their first contribution in https://github.com/vllm-project/vllm/pull/30738
  • @sarathc-cerebras made their first contribution in https://github.com/vllm-project/vllm/pull/30188
  • @wzyrrr made their first contribution in https://github.com/vllm-project/vllm/pull/30949
  • @navmarri14 made their first contribution in https://github.com/vllm-project/vllm/pull/30629
  • @HaloWorld made their first contribution in https://github.com/vllm-project/vllm/pull/30867
  • @jeffreywang-anyscale made their first contribution in https://github.com/vllm-project/vllm/pull/31013
  • @AmeenP made their first contribution in https://github.com/vllm-project/vllm/pull/31093
  • @westers made their first contribution in https://github.com/vllm-project/vllm/pull/31071
  • @CedricHwong made their first contribution in https://github.com/vllm-project/vllm/pull/30957
  • @c0de128 made their first contribution in https://github.com/vllm-project/vllm/pull/31114
  • @Bounty-hunter made their first contribution in https://github.com/vllm-project/vllm/pull/30242
  • @jzakrzew made their first contribution in https://github.com/vllm-project/vllm/pull/30550
  • @1643661061leo made their first contribution in https://github.com/vllm-project/vllm/pull/30760
  • @NickCao made their first contribution in https://github.com/vllm-project/vllm/pull/30070
  • @amithkk made their first contribution in https://github.com/vllm-project/vllm/pull/31212
  • @gateremark made their first contribution in https://github.com/vllm-project/vllm/pull/31218
  • @Tiiiktak made their first contribution in https://github.com/vllm-project/vllm/pull/31274
  • @oscardev256 made their first contribution in https://github.com/vllm-project/vllm/pull/28367
  • @Jzz1943 made their first contribution in https://github.com/vllm-project/vllm/pull/31448
  • @mratsim made their first contribution in https://github.com/vllm-project/vllm/pull/31407
  • @twjww made their first contribution in https://github.com/vllm-project/vllm/pull/31445
  • @amittell made their first contribution in https://github.com/vllm-project/vllm/pull/31438
  • @ricky-chaoju made their first contribution in https://github.com/vllm-project/vllm/pull/30184
  • @effortprogrammer made their first contribution in https://github.com/vllm-project/vllm/pull/31343
  • @ZT-AIA made their first contribution in https://github.com/vllm-project/vllm/pull/31408
  • @rogerxfeng8 made their first contribution in https://github.com/vllm-project/vllm/pull/31522
  • @kevin-pw made their first contribution in https://github.com/vllm-project/vllm/pull/31497
  • @vintipandey made their first contribution in https://github.com/vllm-project/vllm/pull/31505
  • @SameerAsal made their first contribution in https://github.com/vllm-project/vllm/pull/31520
  • @Dylan1229 made their first contribution in https://github.com/vllm-project/vllm/pull/31546
  • @reaganjlee made their first contribution in https://github.com/vllm-project/vllm/pull/29105
  • @zhima771 made their first contribution in https://github.com/vllm-project/vllm/pull/31569
  • @jayhemnani9910 made their first contribution in https://github.com/vllm-project/vllm/pull/31513
  • @Tmn07 made their first contribution in https://github.com/vllm-project/vllm/pull/31572
  • @vsourirajan made their first contribution in https://github.com/vllm-project/vllm/pull/31549
  • @labAxiaoming made their first contribution in https://github.com/vllm-project/vllm/pull/31601
  • @massif-01 made their first contribution in https://github.com/vllm-project/vllm/pull/31604
  • @PHOEBEMOON0802 made their first contribution in https://github.com/vllm-project/vllm/pull/31147
  • @tpopp made their first contribution in https://github.com/vllm-project/vllm/pull/29993
  • @ppppqp made their first contribution in https://github.com/vllm-project/vllm/pull/31620
  • @zzzzwwjj made their first contribution in https://github.com/vllm-project/vllm/pull/31674
  • @Catacomba made their first contribution in https://github.com/vllm-project/vllm/pull/30322
  • @kunpengW-code made their first contribution in https://github.com/vllm-project/vllm/pull/31669
  • @johncalesp made their first contribution in https://github.com/vllm-project/vllm/pull/28874
  • @BlankRH made their first contribution in https://github.com/vllm-project/vllm/pull/31800
  • @guicho271828 made their first contribution in https://github.com/vllm-project/vllm/pull/20610
  • @ReinforcedKnowledge made their first contribution in https://github.com/vllm-project/vllm/pull/31055
  • @vSeamar made their first contribution in https://github.com/vllm-project/vllm/pull/29197
  • @A1c0r-Z made their first contribution in https://github.com/vllm-project/vllm/pull/31656
  • @MrIceCreamMan made their first contribution in https://github.com/vllm-project/vllm/pull/31465
  • @tianshu-Michael-yu made their first contribution in https://github.com/vllm-project/vllm/pull/31841
  • @weiyu0824 made their first contribution in https://github.com/vllm-project/vllm/pull/30808
  • @andyl98 made their first contribution in https://github.com/vllm-project/vllm/pull/31757
  • @JaredforReal made their first contribution in https://github.com/vllm-project/vllm/pull/31779
  • @katec846 made their first contribution in https://github.com/vllm-project/vllm/pull/29213
  • @kfirtoledo made their first contribution in https://github.com/vllm-project/vllm/pull/30761
  • @Ayobami-00 made their first contribution in https://github.com/vllm-project/vllm/pull/31868
  • @ShaanveerS made their first contribution in https://github.com/vllm-project/vllm/pull/31825
  • @Zyyeric made their first contribution in https://github.com/vllm-project/vllm/pull/31652
  • @wangshangsam made their first contribution in https://github.com/vllm-project/vllm/pull/31775
  • @devbyteai made their first contribution in https://github.com/vllm-project/vllm/pull/31536
  • @BJWang-ant made their first contribution in https://github.com/vllm-project/vllm/pull/31719
  • @dangoldbj made their first contribution in https://github.com/vllm-project/vllm/pull/31847
  • @maylikenoother made their first contribution in https://github.com/vllm-project/vllm/pull/31610
  • @yxing-bj made their first contribution in https://github.com/vllm-project/vllm/pull/31575
  • @xbfs made their first contribution in https://github.com/vllm-project/vllm/pull/31948
  • @RunkaiTao made their first contribution in https://github.com/vllm-project/vllm/pull/29354
  • @AkshatSh made their first contribution in https://github.com/vllm-project/vllm/pull/31550
  • @frelam made their first contribution in https://github.com/vllm-project/vllm/pull/31857
  • @shyeh25 made their first contribution in https://github.com/vllm-project/vllm/pull/31617
  • @andikarachman made their first contribution in https://github.com/vllm-project/vllm/pull/32092
  • @minimAluminiumalism made their first contribution in https://github.com/vllm-project/vllm/pull/32158
  • @andyzhangx made their first contribution in https://github.com/vllm-project/vllm/pull/32185
  • @sanghoon-yn made their first contribution in https://github.com/vllm-project/vllm/pull/31956
  • @potatosalad made their first contribution in https://github.com/vllm-project/vllm/pull/32212

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.13.0...v0.14.0