vLLM v0.13.0 Release Notes
Highlights
This release features 442 commits from 207 contributors (61 new contributors)!
Breaking Changes: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading.
Model Support
- New models: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), latent MoE architecture support (#30203).
- Tool parsers: DeepSeek-V3.2 (#29848), Gigachat 3 (#29905), Holo2 reasoning (#30048).
- Model enhancements: Qwen3-VL embeddings support (#30037), Qwen3-VL EVS (Efficient Video Sampling) (#29752), DeepSeek V3.2 proper drop_thinking logic (#30490), DeepSeek V3.2 top-k fix (#27568).
- Task expansion: Automatic TokenClassification model conversion (#30666), Ultravox v0.7 transformer projector (#30089).
- Quantization: BitsAndBytes for Qwen3-Omni-MoE (#29896).
- Speculative decoding: Eagle/Eagle3 Transformers backend (#30340), Mamba selective_state_update spec decode (#29488).
Engine Core
- Compilation: Conditional compilation via compile_ranges for selective kernel compilation (#24252).
- Prefix caching: xxHash high-performance hash option (#29163).
- Attention: PrefixLM support for FlexAttention (#27938) and TritonAttention (#30386), CUDA graphs for 3D Triton attention (#28306), TRITON_MLA without prefix-caching (#29125).
- Batch invariance: FA2 and LoRA batch-invariant support (#30018).
- Pooling: Chunked prefill for ALL pooling tasks (#27145), multi-vector retrieval API (#26686).
- Model Runner V2: Min-p sampling (#30171), NaN detection in logits (#30187).
- Speculative decoding: Medusa GPU-CPU sync avoidance (#29723), async spec-decode improvements (#29624).
- Whisper: Major performance improvements; V1 is now faster than V0 (~3x speedup vs v0.12.0). Encoder batching (#29421), FULL_DECODE_ONLY CUDA graph (#30072), CPU backend support (#30062).
- Performance: Fused blockwise quant RMS norm (#27883), MoE LoRA loading reduction (#30243), encoder cache optimization (#30475), CPU KV offloading streams (#29013).
Hardware & Performance
- NVIDIA Blackwell Ultra: SM103 (GB300) support with CUDA 13 (#30484).
- DeepSeek optimizations (benchmarked on DeepSeek-V3.1):
  - DeepEP High-Throughput CUDA graph enabled by default: 5.3% throughput, 4.4% TTFT improvement (#29558)
  - DeepGEMM fused layout kernel: 4.3% throughput, 10.7% TTFT improvement (#29546)
  - DeepGEMM experts initialization: 3.9% TTFT improvement (#30494)
  - group_topk kernel: 1.9% throughput, 2.1% TPOT improvement (#30159)
  - Sparse prefill kernel for FP8 KV-cache in DeepSeek-V3.2 (#27532)
  - MLA FP8 optimization with ReduceScatterSum (#29795), direct k_nope/k_pe copy (#29710)
- CPU: Whisper support (#30062), Arm Optimized Routines vectorized exp (#30068), x86 CPU wheel pipeline (#28848).
- AMD ROCm: Aiter quantization kernels (#25552), torch.compile layernorm/silu + FP8 quant (#25693), Triton ScaledMM fallback (#26668), MXFP4 w4a4 inference (#29775).
- Intel XPU: wNa16 compressed tensors (#29484).
- Build: CUDA 13 aarch64 wheels (#30341), Docker kernel build stage (#29452), Ascend NPU Docker (#30015).
Large Scale Serving & Disaggregated Prefill/Decode
- KV connectors: Mooncake Transfer Engine (#24718), cache reset via /reset_prefix_cache (#27170; see the sketch after this list), KV events (#28309), failure recovery config (#26813).
- NIXL: Compatibility checking in handshake (#29503), large batch proxy support (#28782).
- EPLB: NVFP4 support (#29804), algorithm abstraction (#26471).
- Multi-node: External launcher mode (#29833).
- Hybrid allocator: Optional KV connector integration (#29805).
- Performance: silu_mul_per_token_group_quant_fp8 kernel for DP/EP (#29470).
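
For operators using the KV connector features above, the cache reset endpoint can be exercised directly against a running API server. The following is a minimal sketch rather than official documentation: the localhost address and port are assumptions, and whether connector-side caches are also cleared depends on the connector setup (#27170).

```python
# Minimal sketch: ask a running vLLM OpenAI-compatible server to drop its
# prefix cache. Host and port are placeholders for a local deployment.
import requests

resp = requests.post("http://localhost:8000/reset_prefix_cache")
resp.raise_for_status()
print("Prefix cache reset requested, status:", resp.status_code)
```
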
Quantization
- New: W4A8 grouped GEMM on Hopper (#29691), online FP8 with streaming post-processing (#29196), FP8 weight reloading for RLHF (#28480).
- MoE + LoRA: AWQ Marlin (#30442) and GPTQ Marlin (#30254) support.
- GGUF: MoE + GGUF restored for Qwen3 MoE (#30116), Qwen2 MoE (#30307), HF defaults override (#30118).
- Compatibility: Transformers v5 RoPE support (#30046).
API & Frontend
- Responses API: MCP type infrastructure (#30054), Browser/Container MCP tools (#29989), full MCP Python loop (#29798), extra body parameters (#30532).
- Configuration: AttentionConfig replaces the VLLM_ATTENTION_BACKEND env var (#26315).
- Chat templates: DeepSeek-V3.2 (#29837), DeepSeek-V3.2 developer tools (#30040).
- Anthropic API: Streaming fixes (#29971, #30266).
- Embeddings: Binary format with encoding_format=bytes_only (#30249; see the sketch after this list), multiple image/audio per request (#29988), tokenization_kwargs override (#29794).
- Metrics: Prefill KV compute metric excluding cached tokens (#30189).
- Profiling: Layer-wise NVTX (#29990), profiling CLI config (#29912).
- UX: Better OOM errors (#28051), ModelConfig validation (#30213), distributed executor errors (#30140).
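
As a rough illustration of the new binary embedding output referenced above, a request might look like the sketch below. The model name, host, and port are placeholders; only the encoding_format=bytes_only parameter comes from this release (#30249), and the exact byte layout of the response should be checked against that PR before parsing.

```python
# Sketch only: request embeddings in the new binary (bytes_only) format
# from a vLLM OpenAI-compatible server. Names and addresses are placeholders.
import requests

payload = {
    "model": "intfloat/e5-mistral-7b-instruct",  # placeholder embedding model
    "input": ["vLLM v0.13.0 release notes"],
    "encoding_format": "bytes_only",             # new in #30249
}
resp = requests.post("http://localhost:8000/v1/embeddings", json=payload)
resp.raise_for_status()
# With bytes_only the body carries raw embedding bytes rather than
# base64-encoded floats in JSON; see #30249 for the exact layout.
print("Received", len(resp.content), "bytes")
```
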
Security
- Additional protection for CVE-2025-62164 (#30649).
Dependencies
- NVSHMEM 3.3.24 + CUDA 13 fix (#30149).
- TPU tpu-inference 0.12.0 (#30221).
Breaking Changes & Deprecations
- PassConfig flags renamed per RFC #27995 (#29646)
- Attention env vars → CLI args: VLLM_ATTENTION_BACKEND replaced with --attention-backend (#26315); see the migration sketch after this list
- Removed -O.xx flag (#29991)
- Removed deprecated plugin/compilation fields (#30396)
- Removed deprecated task, seed, MM settings (#30397)
- Removed embed_input_ids/embed_multimodal fallbacks (#30458)
- Removed tokenizer setter (#30400)
- Deprecations: merge_by_field_config (#30035, #30170), --convert reward → --convert embed (#30463)
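
To make the attention backend change concrete, here is a minimal migration sketch. Only the --attention-backend flag name comes from these notes (#26315); the model name and the FLASH_ATTN backend value are illustrative placeholders.

```python
# Migration sketch, not official guidance: launch the server with the new
# --attention-backend CLI flag instead of the removed VLLM_ATTENTION_BACKEND
# environment variable.
import subprocess

# v0.12.x (no longer supported in v0.13.0):
#   VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve <model>
# v0.13.0:
subprocess.run([
    "vllm", "serve",
    "meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
    "--attention-backend", "FLASH_ATTN",  # illustrative backend value
], check=True)
```
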
New Contributors 🎉
- @ajpqs made their first contribution in https://github.com/vllm-project/vllm/pull/29905
- @amitz-nv made their first contribution in https://github.com/vllm-project/vllm/pull/29978
- @amrmahdi made their first contribution in https://github.com/vllm-project/vllm/pull/29452
- @andrewbriand made their first contribution in https://github.com/vllm-project/vllm/pull/29804
- @anker-c2 made their first contribution in https://github.com/vllm-project/vllm/pull/30344
- @AuruTus made their first contribution in https://github.com/vllm-project/vllm/pull/30182
- @avigny made their first contribution in https://github.com/vllm-project/vllm/pull/19425
- @Bhanu068 made their first contribution in https://github.com/vllm-project/vllm/pull/30254
- @Copilot made their first contribution in https://github.com/vllm-project/vllm/pull/29025
- @dbotwinick made their first contribution in https://github.com/vllm-project/vllm/pull/30583
- @dependabot[bot] made their first contribution in https://github.com/vllm-project/vllm/pull/30234
- @desertfire made their first contribution in https://github.com/vllm-project/vllm/pull/29919
- @dmitry-tokarev-nv made their first contribution in https://github.com/vllm-project/vllm/pull/30149
- @drslark made their first contribution in https://github.com/vllm-project/vllm/pull/30632
- @dtcccc made their first contribution in https://github.com/vllm-project/vllm/pull/24718
- @elizabetht made their first contribution in https://github.com/vllm-project/vllm/pull/28671
- @Elm8116 made their first contribution in https://github.com/vllm-project/vllm/pull/30068
- @gausah01 made their first contribution in https://github.com/vllm-project/vllm/pull/29604
- @gh-wf made their first contribution in https://github.com/vllm-project/vllm/pull/30285
- @hdlj-h made their first contribution in https://github.com/vllm-project/vllm/pull/30056
- @HF-001 made their first contribution in https://github.com/vllm-project/vllm/pull/30051
- @hzxuzhonghu made their first contribution in https://github.com/vllm-project/vllm/pull/29931
- @JaviS-Rei made their first contribution in https://github.com/vllm-project/vllm/pull/29882
- @johannesflommersfeld made their first contribution in https://github.com/vllm-project/vllm/pull/30390
- @KevinMusgrave made their first contribution in https://github.com/vllm-project/vllm/pull/30529
- @kitaekatt made their first contribution in https://github.com/vllm-project/vllm/pull/30408
- @lashahub made their first contribution in https://github.com/vllm-project/vllm/pull/30539
- @LuminolT made their first contribution in https://github.com/vllm-project/vllm/pull/29163
- @majiayu000 made their first contribution in https://github.com/vllm-project/vllm/pull/30615
- @MaoJianwei made their first contribution in https://github.com/vllm-project/vllm/pull/29797
- @Mercykid-bash made their first contribution in https://github.com/vllm-project/vllm/pull/26471
- @mgehre-amd made their first contribution in https://github.com/vllm-project/vllm/pull/30364
- @mivehk made their first contribution in https://github.com/vllm-project/vllm/pull/30512
- @mondaylord made their first contribution in https://github.com/vllm-project/vllm/pull/30671
- @noa-neria made their first contribution in https://github.com/vllm-project/vllm/pull/29320
- @PatrykSaffer made their first contribution in https://github.com/vllm-project/vllm/pull/30330
- @Peng-YM made their first contribution in https://github.com/vllm-project/vllm/pull/29074
- @realliujiaxu made their first contribution in https://github.com/vllm-project/vllm/pull/30059
- @redwrasse made their first contribution in https://github.com/vllm-project/vllm/pull/29261
- @Ri0S made their first contribution in https://github.com/vllm-project/vllm/pull/30532
- @sarathc-cerebras made their first contribution in https://github.com/vllm-project/vllm/pull/30188
- @scratch-ml made their first contribution in https://github.com/vllm-project/vllm/pull/30351
- @seokhyunan made their first contribution in https://github.com/vllm-project/vllm/pull/30648
- @shaharmor98 made their first contribution in https://github.com/vllm-project/vllm/pull/30203
- @taoyun951753 made their first contribution in https://github.com/vllm-project/vllm/pull/30037
- @tom-zju made their first contribution in https://github.com/vllm-project/vllm/pull/30057
- @tomtomjhj made their first contribution in https://github.com/vllm-project/vllm/pull/29692
- @vkuzo made their first contribution in https://github.com/vllm-project/vllm/pull/29196
- @vladnosiv made their first contribution in https://github.com/vllm-project/vllm/pull/30490
- @weiguihua2 made their first contribution in https://github.com/vllm-project/vllm/pull/30042
- @wenqiglantz made their first contribution in https://github.com/vllm-project/vllm/pull/30649
- @wkcn made their first contribution in https://github.com/vllm-project/vllm/pull/29879
- @wu-kan made their first contribution in https://github.com/vllm-project/vllm/pull/21804
- @wz1qqx made their first contribution in https://github.com/vllm-project/vllm/pull/30376
- @xyDong0223 made their first contribution in https://github.com/vllm-project/vllm/pull/30455
- @yifant-code made their first contribution in https://github.com/vllm-project/vllm/pull/30213
- @yjc9696 made their first contribution in https://github.com/vllm-project/vllm/pull/30040
- @yurekami made their first contribution in https://github.com/vllm-project/vllm/pull/30552
- @yuttian1 made their first contribution in https://github.com/vllm-project/vllm/pull/30102
- @ZhijianJiang made their first contribution in https://github.com/vllm-project/vllm/pull/30219
- @ZhiweiYan-96 made their first contribution in https://github.com/vllm-project/vllm/pull/29773
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.12.0...v0.13.0