vLLM v0.15.0

Highlights

This release features 335 commits from 158 contributors (39 new)!

Model Support

  • New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456).
  • LoRA support extended to Nemotron-H (#30802), InternVL2 (#32397), and MiniMax M2 (#32763).
  • Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322).
  • Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings (#14526).
  • Model enhancements: Voxtral streaming architecture (#32861), SharedFusedMoE for Qwen3MoE (#32082), dynamic resolution for Nemotron Nano VL (#32121), Molmo2 vision backbone quantization (#32385).
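The draft-model speculative decoding added in #24322 is driven by a speculative config on the engine. A minimal sketch of building that config; the model names are placeholders and the dict-based `speculative_config` shape is an assumption to verify against your vLLM version:

```python
# Sketch: configuring draft-model speculative decoding (#24322).
# Model names are placeholders; the dict-based speculative_config shape
# is assumed here, not taken verbatim from the release.

def make_speculative_config(draft_model: str, num_speculative_tokens: int = 4) -> dict:
    """Build a config asking a smaller draft model to propose tokens."""
    return {
        "model": draft_model,                              # draft model that proposes tokens
        "num_speculative_tokens": num_speculative_tokens,  # proposals verified per step
    }

config = make_speculative_config("my-org/tiny-draft-model")

# With vLLM installed, this would be passed to the engine, e.g.:
# from vllm import LLM
# llm = LLM(model="my-org/big-target-model", speculative_config=config)
print(config)
```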

Engine Core

  • Async scheduling + Pipeline Parallelism: --async-scheduling now works with pipeline parallelism (#32359).
  • Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models with --enable-prefix-caching --mamba-cache-mode align. Achieves ~2x speedup by caching Mamba states directly (#30877).
  • Session-based streaming input: New incremental input support for interactive workloads like ASR. Accepts async generators producing StreamingInput objects while maintaining KV cache alignment (#28973).
  • Model Runner V2: VLM support (#32546), architecture improvements.
  • LoRA: Inplace loading for memory efficiency (#31326).
  • AOT compilation: torch.compile inductor artifacts support (#25205).
  • Performance: redundant loads avoided during KV cache offloading (#29087), FlashAttention attention computation separated from cache updates (#25954).
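The Mamba prefix-caching and async-scheduling flags above combine on the server command line. A sketch that assembles the command; the model name is a placeholder, and the flag spellings follow the notes above:

```python
# Sketch: assembling a `vllm serve` command that enables block-aligned
# Mamba prefix caching (#30877) together with async scheduling (#32359).
# The model name is a placeholder; flag names are taken from this release's notes.

def build_serve_command(model: str) -> list[str]:
    return [
        "vllm", "serve", model,
        "--enable-prefix-caching",      # turn on prefix caching
        "--mamba-cache-mode", "align",  # block-aligned Mamba state caching
        "--async-scheduling",           # now compatible with pipeline parallelism
    ]

cmd = build_serve_command("my-org/hybrid-mamba-model")
print(" ".join(cmd))
# To actually launch the server: subprocess.run(cmd)
```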

Hardware & Performance

NVIDIA

  • Blackwell defaults: FlashInfer MLA is now the default MLA backend on Blackwell, with TRTLLM as the default prefill backend (#32615).
  • MoE performance: 1.2-2% E2E throughput improvement via grouped topk kernel fusion (#32058), NVFP4 small-batch decoding improvement (#30885), faster cold start for MoEs with torch.compile (#32805).
  • FP4 kernel optimization: Up to 65% faster FP4 quantization on Blackwell (SM100F) using 256-bit loads, ~4% E2E throughput improvement (#32520).
  • Kernel improvements: topk_sigmoid kernel for MoE routing (#31246), atomics reduce counting for SplitK skinny GEMMs (#29843), fused cat+quant for FP8 KV cache in MLA (#32950).
  • torch.compile: SiluAndMul and QuantFP8 CustomOp compilation (#32806), Triton prefill attention performance (#32403).

AMD ROCm

  • MoRI EP: High-performance all2all backend for Expert Parallel (#28664).
  • Attention improvements: Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#29887).
  • FP4 support: MLA projection GEMMs with dynamic quantization (#32238).
  • Consumer GPU support: Flash Attention Triton backend on RDNA3/RDNA4 (#32944).

Other Platforms

  • TPU: Pipeline parallelism support (#28506), backend option (#32438).
  • Intel XPU: AgRsAll2AllManager for distributed communication (#32654).
  • CPU: NUMA-aware acceleration for TP/DP inference on ARM (#32792), PyTorch 2.10 (#32869).
  • Whisper: torch.compile support (#30385).
  • WSL: Platform compatibility fix for Windows Subsystem for Linux (#32749).

Quantization

  • MXFP4: W4A16 support for compressed-tensors MoE models (#32285).
  • Non-gated MoE: Quantization support with Marlin, NVFP4 CUTLASS, FP8, INT8, and compressed-tensors (#32257).
  • Intel: Quantization Toolkit integration (#31716).
  • FP8 KV cache: Per-tensor and per-attention-head quantization via llmcompressor (#30141).
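For the FP8 KV cache support, a minimal sketch of the engine arguments involved. The model name is a placeholder, and the argument names follow common vLLM conventions rather than this release's notes, so verify them against your version:

```python
# Sketch: engine arguments for serving a checkpoint that carries
# FP8 KV-cache scales produced by llmcompressor (#30141).
# Model name is a placeholder; argument names are assumed vLLM conventions.

engine_args = {
    "model": "my-org/llama-fp8-kv",  # checkpoint with per-tensor/per-head KV scales
    "kv_cache_dtype": "fp8",         # store the KV cache in FP8
}

# With vLLM installed:
# from vllm import LLM
# llm = LLM(**engine_args)
print(engine_args)
```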

API & Frontend

  • Responses API: Partial message generation (#32100), include_stop_str_in_output tuning (#32383), prompt_cache_key support (#32824).
  • OpenAI API: skip_special_tokens configuration (#32345).
  • Score endpoint: Flexible input formats with data_1/data_2 and queries/documents (#32577).
  • Render endpoints: New endpoints for prompt preprocessing (#32473).
  • Whisper API: avg_logprob and compression_ratio in verbose_json segments (#31059).
  • Security: FIPS 140-3 compliant hash option for enterprise/government users (#32386), --ssl-ciphers CLI argument (#30937).
  • UX improvements: Auto api_server_count based on dp_size (#32525), wheel variant auto-detection during install (#32948), custom profiler URI schemes (#32393).
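The new score-endpoint input formats can be exercised with a plain HTTP request. A sketch of the request body; the server URL and model name are placeholders, while the data_1/data_2 field names follow #32577:

```python
import json

# Sketch: building a request body for the /score endpoint using the
# data_1/data_2 input format added in #32577.
# Server URL and model name are placeholders.

def score_payload(model: str, query: str, documents: list[str]) -> dict:
    return {
        "model": model,
        "data_1": query,      # single query string
        "data_2": documents,  # documents to score against the query
    }

payload = score_payload("my-org/reranker", "what is vllm?", ["doc a", "doc b"])
body = json.dumps(payload)

# To send against a running server:
# import urllib.request
# req = urllib.request.Request("http://localhost:8000/score", data=body.encode(),
#                              headers={"Content-Type": "application/json"})
# resp = urllib.request.urlopen(req)
print(body)
```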

Dependencies

  • FlashInfer v0.6.1 (#30993)
  • Transformers 4.57.5 (#32287)
  • PyTorch 2.10 for CPU backend (#32869)
  • DeepGEMM updated to a newer version (#32479)

Breaking Changes & Deprecations

  • Metrics: Removed the deprecated vllm:time_per_output_token_seconds metric; use vllm:inter_token_latency_seconds instead (#32661).
  • Environment variables: Several deprecated environment variables removed (#32812).
  • Quantization: DeepSpeedFp8 removed (#32679), RTN removed (#32697), HQQ deprecated (#32681).
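Dashboards that query the removed metric need a rename. A sketch of the migration; the PromQL expression here is illustrative, but both metric names come from the note above:

```python
# Sketch: migrating a PromQL query from the removed
# vllm:time_per_output_token_seconds metric to its replacement (#32661).

OLD = "vllm:time_per_output_token_seconds"
NEW = "vllm:inter_token_latency_seconds"

def migrate_query(promql: str) -> str:
    """Rewrite any reference to the removed metric to the new name."""
    return promql.replace(OLD, NEW)

query = "histogram_quantile(0.99, rate(vllm:time_per_output_token_seconds_bucket[5m]))"
print(migrate_query(query))
```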

Bug Fixes

  • Speculative decoding: Eagle draft_model_config fix (#31753).
  • DeepSeek: DeepSeek-V3.1 + DeepGEMM incompatible scale shapes fix (#32361).
  • Distributed: DP+MoE inference fix via CpuCommunicator (#31867), P/D with non-MoE DP fix (#33037).
  • EPLB: Possible deadlock fix (#32418).
  • NIXL: UCX memory leak fix by exporting UCX_MEM_MMAP_HOOK_MODE=none (#32181).
  • Structured output: Outlines byte fallback handling fix (#31391).

New Contributors 🎉

  • @YunzhuLu made their first contribution in https://github.com/vllm-project/vllm/pull/32126
  • @emricksini-h made their first contribution in https://github.com/vllm-project/vllm/pull/30784
  • @dsfaccini made their first contribution in https://github.com/vllm-project/vllm/pull/32289
  • @ofirzaf made their first contribution in https://github.com/vllm-project/vllm/pull/32312
  • @seekskyworld made their first contribution in https://github.com/vllm-project/vllm/pull/32321
  • @brian033 made their first contribution in https://github.com/vllm-project/vllm/pull/31715
  • @TomerBN-Nvidia made their first contribution in https://github.com/vllm-project/vllm/pull/32257
  • @vanshilshah97 made their first contribution in https://github.com/vllm-project/vllm/pull/32448
  • @George-Polya made their first contribution in https://github.com/vllm-project/vllm/pull/32385
  • @T1mn made their first contribution in https://github.com/vllm-project/vllm/pull/32411
  • @mritunjaysharma394 made their first contribution in https://github.com/vllm-project/vllm/pull/31492
  • @randzero made their first contribution in https://github.com/vllm-project/vllm/pull/32511
  • @DemingCheng made their first contribution in https://github.com/vllm-project/vllm/pull/32556
  • @iboiko-habana made their first contribution in https://github.com/vllm-project/vllm/pull/32471
  • @honglyua-il made their first contribution in https://github.com/vllm-project/vllm/pull/32462
  • @hyeongyun0916 made their first contribution in https://github.com/vllm-project/vllm/pull/32473
  • @DanielMe made their first contribution in https://github.com/vllm-project/vllm/pull/32560
  • @netanel-haber made their first contribution in https://github.com/vllm-project/vllm/pull/32121
  • @longregen made their first contribution in https://github.com/vllm-project/vllm/pull/28784
  • @jasonyanwenl made their first contribution in https://github.com/vllm-project/vllm/pull/32749
  • @Wauplin made their first contribution in https://github.com/vllm-project/vllm/pull/32788
  • @ikaadil made their first contribution in https://github.com/vllm-project/vllm/pull/32775
  • @alexsun07 made their first contribution in https://github.com/vllm-project/vllm/pull/28664
  • @liranschour made their first contribution in https://github.com/vllm-project/vllm/pull/30207
  • @AuYang261 made their first contribution in https://github.com/vllm-project/vllm/pull/32844
  • @diviramon made their first contribution in https://github.com/vllm-project/vllm/pull/32393
  • @RishabhSaini made their first contribution in https://github.com/vllm-project/vllm/pull/32884
  • @MatteoFari made their first contribution in https://github.com/vllm-project/vllm/pull/32397
  • @peakcrosser7 made their first contribution in https://github.com/vllm-project/vllm/pull/30877
  • @orionr made their first contribution in https://github.com/vllm-project/vllm/pull/30443
  • @marksverdhei made their first contribution in https://github.com/vllm-project/vllm/pull/32614
  • @joninco made their first contribution in https://github.com/vllm-project/vllm/pull/32935
  • @monajafi-amd made their first contribution in https://github.com/vllm-project/vllm/pull/32944
  • @ruizcrp made their first contribution in https://github.com/vllm-project/vllm/pull/32988
  • @sjhddh made their first contribution in https://github.com/vllm-project/vllm/pull/32983
  • @HirokenOvo made their first contribution in https://github.com/vllm-project/vllm/pull/32646
  • @Chenhao-Guan made their first contribution in https://github.com/vllm-project/vllm/pull/32763
  • @joshuadeng made their first contribution in https://github.com/vllm-project/vllm/pull/28973
  • @ZhanqiuHu made their first contribution in https://github.com/vllm-project/vllm/pull/33016

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.14.1...v0.15.0