Highlights
This release features 335 commits from 158 contributors (39 new)!
Model Support
- New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456).
- LoRA support: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763).
- Speculative decoding: EAGLE3 for Pixtral/`LlavaForConditionalGeneration` (#32542), Qwen3 VL MoE (#32048), and draft model support (#24322); a configuration sketch follows this list.
- Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings (#14526).
- Model enhancements: Voxtral streaming architecture (#32861), SharedFusedMoE for Qwen3MoE (#32082), dynamic resolution for Nemotron Nano VL (#32121), Molmo2 vision backbone quantization (#32385).
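
For the draft-model speculative decoding item above, a minimal offline sketch, assuming vLLM's existing `speculative_config` interface; both checkpoint names are placeholders:

```python
from vllm import LLM, SamplingParams

# Minimal sketch of draft-model speculative decoding; both checkpoints are
# placeholders and the speculative_config keys follow vLLM's existing
# draft-model interface.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # small draft model
        "num_speculative_tokens": 4,                  # tokens proposed per step
    },
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```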
Engine Core
- Async scheduling + pipeline parallelism: `--async-scheduling` now works with pipeline parallelism (#32359).
- Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models via `--enable-prefix-caching --mamba-cache-mode align`, achieving ~2x speedups by caching Mamba states directly (#30877); a usage sketch follows this list.
- Session-based streaming input: New incremental input support for interactive workloads such as ASR. The engine accepts async generators producing `StreamingInput` objects while maintaining KV cache alignment (#28973); a shape sketch follows this list.
- Model Runner V2: VLM support (#32546) and general architecture improvements.
- LoRA: In-place loading for memory efficiency (#31326).
- AOT compilation: Support for torch.compile Inductor artifacts (#25205).
- Performance: Avoid redundant loads during KV cache offloading (#29087); split attention computation from cache updates in FlashAttention (#25954).
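
For the Mamba prefix caching item above, a minimal sketch of how the flags might map onto the offline API. The `mamba_cache_mode` argument name is an assumption inferred from the `--mamba-cache-mode` CLI flag, and the model name is a placeholder hybrid checkpoint:

```python
from vllm import LLM

llm = LLM(
    model="ibm-ai-platform/Bamba-9B",  # placeholder Mamba/attention hybrid
    enable_prefix_caching=True,        # existing prefix caching switch
    mamba_cache_mode="align",          # assumed Python-side name for --mamba-cache-mode
)
```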
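
For the session-based streaming input item above, a shape-only sketch of an incremental input generator. The real `StreamingInput` class ships with #28973; the stand-in dataclass and its field below are illustrative assumptions:

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator

@dataclass
class StreamingInput:
    """Stand-in for the type added in #28973; the real fields may differ."""
    data: bytes

async def session_inputs(frames: AsyncIterator[bytes]) -> AsyncIterator[StreamingInput]:
    # The engine consumes an async generator like this one, appending each
    # increment to the same session so the KV cache stays aligned with
    # everything fed so far.
    async for frame in frames:
        yield StreamingInput(data=frame)

async def main() -> None:
    async def fake_asr_frames() -> AsyncIterator[bytes]:
        for chunk in (b"hello ", b"world"):
            yield chunk

    async for item in session_inputs(fake_asr_frames()):
        print(item)

asyncio.run(main())
```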
Hardware & Performance
NVIDIA
- Blackwell defaults: FlashInfer MLA is now the default MLA backend on Blackwell, with TRTLLM as the default prefill backend (#32615).
- MoE performance: 1.2-2% E2E throughput improvement via grouped topk kernel fusion (#32058), NVFP4 small-batch decoding improvement (#30885), faster cold start for MoEs with torch.compile (#32805).
- FP4 kernel optimization: Up to 65% faster FP4 quantization on Blackwell (SM100F) using 256-bit loads, ~4% E2E throughput improvement (#32520).
- Kernel improvements: topk_sigmoid kernel for MoE routing (#31246), atomics reduce counting for SplitK skinny GEMMs (#29843), fused cat+quant for FP8 KV cache in MLA (#32950).
- torch.compile: SiluAndMul and QuantFP8 CustomOp compilation (#32806), Triton prefill attention performance (#32403).
AMD ROCm
- MoRI EP: High-performance all2all backend for Expert Parallel (#28664).
- Attention improvements: Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#29887).
- FP4 support: MLA projection GEMMs with dynamic quantization (#32238).
- Consumer GPU support: Flash Attention Triton backend on RDNA3/RDNA4 (#32944).
Other Platforms
- TPU: Pipeline parallelism support (#28506), backend option (#32438).
- Intel XPU: AgRsAll2AllManager for distributed communication (#32654).
- CPU: NUMA-aware acceleration for TP/DP inference on ARM (#32792), PyTorch 2.10 support (#32869).
- Whisper: torch.compile support (#30385).
- WSL: Platform compatibility fix for Windows Subsystem for Linux (#32749).
Quantization
- MXFP4: W4A16 support for compressed-tensors MoE models (#32285).
- Non-gated MoE: Quantization support with Marlin, NVFP4 CUTLASS, FP8, INT8, and compressed-tensors (#32257).
- Intel: Quantization Toolkit integration (#31716).
- FP8 KV cache: Per-tensor and per-attention-head quantization via llmcompressor (#30141); a serving sketch follows this list.
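
For the FP8 KV cache item above, a minimal serving sketch. The checkpoint name is a placeholder for a model whose KV-cache scales were produced with llmcompressor; `kv_cache_dtype="fp8"` is vLLM's existing FP8 KV-cache switch:

```python
from vllm import LLM

llm = LLM(
    model="your-org/Llama-3.1-8B-FP8-KV",  # placeholder llmcompressor checkpoint
    kv_cache_dtype="fp8",                  # use the quantized KV-cache scales
)
```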
API & Frontend
- Responses API: Partial message generation (#32100), `include_stop_str_in_output` tuning (#32383), and `prompt_cache_key` support (#32824).
- OpenAI API: `skip_special_tokens` configuration (#32345); a client sketch follows this list.
- Score endpoint: Flexible input formats with `data_1`/`data_2` and `queries`/`documents` (#32577); a request sketch follows this list.
- Render endpoints: New endpoints for prompt preprocessing (#32473).
- Whisper API: `avg_logprob` and `compression_ratio` in `verbose_json` segments (#31059).
- Security: FIPS 140-3 compliant hash option for enterprise/government users (#32386); `--ssl-ciphers` CLI argument (#30937).
- UX improvements: Automatic `api_server_count` based on `dp_size` (#32525), wheel variant auto-detection during installation (#32948), custom profiler URI schemes (#32393).
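
For the OpenAI API item above, a client sketch assuming vLLM's usual `extra_body` passthrough for engine-specific sampling parameters; the server URL and model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder served model
    messages=[{"role": "user", "content": "Hello!"}],
    # vLLM-specific knob passed through the OpenAI client's extra_body.
    extra_body={"skip_special_tokens": False},
)
print(response.choices[0].message.content)
```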
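
For the score endpoint item above, a request sketch using the new `data_1`/`data_2` field names from #32577; host, port, and model name are placeholders:

```python
import requests

response = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "BAAI/bge-reranker-v2-m3",  # placeholder reranker model
        "data_1": "What is the capital of France?",
        "data_2": ["Paris is the capital of France.", "Berlin is in Germany."],
    },
)
print(response.json())
```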
Dependencies
- FlashInfer v0.6.1 (#30993)
- Transformers 4.57.5 (#32287)
- PyTorch 2.10 for CPU backend (#32869)
- DeepGEMM updated to a newer version (#32479)
Breaking Changes & Deprecations
- Metrics: Removed the deprecated `vllm:time_per_output_token_seconds` metric; use `vllm:inter_token_latency_seconds` instead (#32661).
- Environment variables: Removed previously deprecated environment variables (#32812).
- Quantization: DeepSpeedFp8 removed (#32679), RTN removed (#32697), HQQ deprecated (#32681).
Bug Fixes
- Speculative decoding: Eagle `draft_model_config` fix (#31753).
- DeepSeek: DeepSeek-V3.1 + DeepGEMM incompatible scale shapes fix (#32361).
- Distributed: DP+MoE inference fix via CpuCommunicator (#31867), P/D with non-MoE DP fix (#33037).
- EPLB: Possible deadlock fix (#32418).
- NIXL: UCX memory leak fix by exporting `UCX_MEM_MMAP_HOOK_MODE=none` (#32181).
- Structured output: Outlines byte fallback handling fix (#31391).
New Contributors 🎉
- @YunzhuLu made their first contribution in https://github.com/vllm-project/vllm/pull/32126
- @emricksini-h made their first contribution in https://github.com/vllm-project/vllm/pull/30784
- @dsfaccini made their first contribution in https://github.com/vllm-project/vllm/pull/32289
- @ofirzaf made their first contribution in https://github.com/vllm-project/vllm/pull/32312
- @seekskyworld made their first contribution in https://github.com/vllm-project/vllm/pull/32321
- @brian033 made their first contribution in https://github.com/vllm-project/vllm/pull/31715
- @TomerBN-Nvidia made their first contribution in https://github.com/vllm-project/vllm/pull/32257
- @vanshilshah97 made their first contribution in https://github.com/vllm-project/vllm/pull/32448
- @George-Polya made their first contribution in https://github.com/vllm-project/vllm/pull/32385
- @T1mn made their first contribution in https://github.com/vllm-project/vllm/pull/32411
- @mritunjaysharma394 made their first contribution in https://github.com/vllm-project/vllm/pull/31492
- @randzero made their first contribution in https://github.com/vllm-project/vllm/pull/32511
- @DemingCheng made their first contribution in https://github.com/vllm-project/vllm/pull/32556
- @iboiko-habana made their first contribution in https://github.com/vllm-project/vllm/pull/32471
- @honglyua-il made their first contribution in https://github.com/vllm-project/vllm/pull/32462
- @hyeongyun0916 made their first contribution in https://github.com/vllm-project/vllm/pull/32473
- @DanielMe made their first contribution in https://github.com/vllm-project/vllm/pull/32560
- @netanel-haber made their first contribution in https://github.com/vllm-project/vllm/pull/32121
- @longregen made their first contribution in https://github.com/vllm-project/vllm/pull/28784
- @jasonyanwenl made their first contribution in https://github.com/vllm-project/vllm/pull/32749
- @Wauplin made their first contribution in https://github.com/vllm-project/vllm/pull/32788
- @ikaadil made their first contribution in https://github.com/vllm-project/vllm/pull/32775
- @alexsun07 made their first contribution in https://github.com/vllm-project/vllm/pull/28664
- @liranschour made their first contribution in https://github.com/vllm-project/vllm/pull/30207
- @AuYang261 made their first contribution in https://github.com/vllm-project/vllm/pull/32844
- @diviramon made their first contribution in https://github.com/vllm-project/vllm/pull/32393
- @RishabhSaini made their first contribution in https://github.com/vllm-project/vllm/pull/32884
- @MatteoFari made their first contribution in https://github.com/vllm-project/vllm/pull/32397
- @peakcrosser7 made their first contribution in https://github.com/vllm-project/vllm/pull/30877
- @orionr made their first contribution in https://github.com/vllm-project/vllm/pull/30443
- @marksverdhei made their first contribution in https://github.com/vllm-project/vllm/pull/32614
- @joninco made their first contribution in https://github.com/vllm-project/vllm/pull/32935
- @monajafi-amd made their first contribution in https://github.com/vllm-project/vllm/pull/32944
- @ruizcrp made their first contribution in https://github.com/vllm-project/vllm/pull/32988
- @sjhddh made their first contribution in https://github.com/vllm-project/vllm/pull/32983
- @HirokenOvo made their first contribution in https://github.com/vllm-project/vllm/pull/32646
- @Chenhao-Guan made their first contribution in https://github.com/vllm-project/vllm/pull/32763
- @joshuadeng made their first contribution in https://github.com/vllm-project/vllm/pull/28973
- @ZhanqiuHu made their first contribution in https://github.com/vllm-project/vllm/pull/33016
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.14.1...v0.15.0