vLLM v0.13.0 Release Notes
Highlights
This release features 442 commits from 207 contributors (61 new contributors)!
Breaking Changes: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading.
Model Support
- New models: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), latent MoE architecture support (#30203).
- Tool parsers: DeepSeek-V3.2 (#29848), Gigachat 3 (#29905), Holo2 reasoning (#30048).
- Model enhancements: Qwen3-VL embeddings support (#30037), Qwen3-VL EVS (Efficient Video Sampling) (#29752), DeepSeek V3.2 proper drop_thinking logic (#30490), DeepSeek V3.2 top-k fix (#27568).
- Task expansion: Automatic TokenClassification model conversion (#30666), Ultravox v0.7 transformer projector (#30089).
- Quantization: BitsAndBytes for Qwen3-Omni-MoE (#29896).
- Speculative decoding: Eagle/Eagle3 Transformers backend (#30340), Mamba selective_state_update spec decode (#29488).
Engine Core
- Compilation: Conditional compilation via compile_ranges for selective kernel compilation (#24252).
- Prefix caching: xxHash high-performance hash option (#29163).
- Attention: PrefixLM support for FlexAttention (#27938) and TritonAttention (#30386), CUDA graphs for 3D Triton attention (#28306), TRITON_MLA without prefix-caching (#29125).
- Batch invariance: FA2 and LoRA batch-invariant support (#30018).
- Pooling: Chunked prefill for ALL pooling tasks (#27145), multi-vector retrieval API (#26686).
- Model Runner V2: Min-p sampling (#30171), NaN detection in logits (#30187).
- Speculative decoding: Medusa GPU-CPU sync avoidance (#29723), async spec-decode improvements (#29624).
- Whisper: Major performance improvements; V1 is now faster than V0 (~3x speedup vs v0.12.0). Encoder batching (#29421), FULL_DECODE_ONLY CUDA graph (#30072), CPU backend support (#30062).
- Performance: Fused blockwise quant RMS norm (#27883), MoE LoRA loading reduction (#30243), encoder cache optimization (#30475), CPU KV offloading streams (#29013).
Hardware & Performance
- NVIDIA Blackwell Ultra: SM103 (GB300) support with CUDA 13 (#30484).
- DeepSeek optimizations (benchmarked on DeepSeek-V3.1):
  - DeepEP High-Throughput CUDA graph enabled by default: 5.3% throughput, 4.4% TTFT improvement (#29558)
  - DeepGEMM fused layout kernel: 4.3% throughput, 10.7% TTFT improvement (#29546)
  - DeepGEMM experts initialization: 3.9% TTFT improvement (#30494)
  - group_topk kernel: 1.9% throughput, 2.1% TPOT improvement (#30159)
  - Sparse prefill kernel for FP8 KV-cache in DeepSeek-V3.2 (#27532)
  - MLA FP8 optimization with ReduceScatterSum (#29795), direct k_nope/k_pe copy (#29710)
- CPU: Whisper support (#30062), Arm Optimized Routines vectorized exp (#30068), x86 CPU wheel pipeline (#28848).
- AMD ROCm: Aiter quantization kernels (#25552), torch.compile layernorm/silu + FP8 quant (#25693), Triton ScaledMM fallback (#26668), MXFP4 w4a4 inference (#29775).
- Intel XPU: wNa16 compressed tensors (#29484).
- Build: CUDA 13 aarch64 wheels (#30341), Docker kernel build stage (#29452), Ascend NPU Docker (#30015).
Large Scale Serving & Disaggregated Prefill/Decode
- KV connectors: Mooncake Transfer Engine (#24718), cache reset via /reset_prefix_cache (#27170; see the sketch after this list), KV events (#28309), failure recovery config (#26813).
- NIXL: Compatibility checking in handshake (#29503), large batch proxy support (#28782).
- EPLB: NVFP4 support (#29804), algorithm abstraction (#26471).
- Multi-node: External launcher mode (#29833).
- Hybrid allocator: Optional KV connector integration (#29805).
- Performance: silu_mul_per_token_group_quant_fp8 kernel for DP/EP (#29470).
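
For operators using the KV connector features above, the cache reset endpoint can be exercised directly against a running API server. The following is a minimal sketch rather than official documentation: the localhost address and port are assumptions, and whether connector-side caches are also cleared depends on the connector setup (#27170).

```python
# Minimal sketch: ask a running vLLM OpenAI-compatible server to drop its
# prefix cache. Host and port are placeholders for a local deployment.
import requests

resp = requests.post("http://localhost:8000/reset_prefix_cache")
resp.raise_for_status()
print("Prefix cache reset requested, status:", resp.status_code)
```
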
Quantization
- New: W4A8 grouped GEMM on Hopper (#29691), online FP8 with streaming post-processing (#29196), FP8 weight reloading for RLHF (#28480).
- MoE + LoRA: AWQ Marlin (#30442) and GPTQ Marlin (#30254) support.
- GGUF: MoE + GGUF restored for Qwen3 MoE (#30116), Qwen2 MoE (#30307), HF defaults override (#30118).
- Compatibility: Transformers v5 RoPE support (#30046).
API & Frontend
- Responses API: MCP type infrastructure (#30054), Browser/Container MCP tools (#29989), full MCP Python loop (#29798), extra body parameters (#30532).
- Configuration: AttentionConfig replaces the VLLM_ATTENTION_BACKEND env var (#26315).
- Chat templates: DeepSeek-V3.2 (#29837), DeepSeek-V3.2 developer tools (#30040).
- Anthropic API: Streaming fixes (#29971, #30266).
- Embeddings: Binary format with encoding_format=bytes_only (#30249; see the sketch after this list), multiple image/audio per request (#29988), tokenization_kwargs override (#29794).
- Metrics: Prefill KV compute metric excluding cached tokens (#30189).
- Profiling: Layer-wise NVTX (#29990), profiling CLI config (#29912).
- UX: Better OOM errors (#28051), ModelConfig validation (#30213), distributed executor errors (#30140).
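
As a rough illustration of the new binary embedding output referenced above, a request might look like the sketch below. The model name, host, and port are placeholders; only the encoding_format=bytes_only parameter comes from this release (#30249), and the exact byte layout of the response should be checked against that PR before parsing.

```python
# Sketch only: request embeddings in the new binary (bytes_only) format
# from a vLLM OpenAI-compatible server. Names and addresses are placeholders.
import requests

payload = {
    "model": "intfloat/e5-mistral-7b-instruct",  # placeholder embedding model
    "input": ["vLLM v0.13.0 release notes"],
    "encoding_format": "bytes_only",             # new in #30249
}
resp = requests.post("http://localhost:8000/v1/embeddings", json=payload)
resp.raise_for_status()
# With bytes_only the body carries raw embedding bytes rather than
# base64-encoded floats in JSON; see #30249 for the exact layout.
print("Received", len(resp.content), "bytes")
```
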
Security
- Additional protection for CVE-2025-62164 (#30649).
Dependencies
- NVSHMEM 3.3.24 + CUDA 13 fix (#30149).
- TPU tpu-inference 0.12.0 (#30221).
Breaking Changes & Deprecations
- PassConfig flags renamed per RFC #27995 (#29646)
- Attention env vars → CLI args: VLLM_ATTENTION_BACKEND replaced with --attention-backend (#26315); see the migration sketch after this list
- Removed -O.xx flag (#29991)
- Removed deprecated plugin/compilation fields (#30396)
- Removed deprecated task, seed, MM settings (#30397)
- Removed embed_input_ids/embed_multimodal fallbacks (#30458)
- Removed tokenizer setter (#30400)
- Deprecations: merge_by_field_config (#30035, #30170), --convert reward → --convert embed (#30463)
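
To make the attention backend change concrete, here is a minimal migration sketch. Only the --attention-backend flag name comes from these notes (#26315); the model name and the FLASH_ATTN backend value are illustrative placeholders.

```python
# Migration sketch, not official guidance: launch the server with the new
# --attention-backend CLI flag instead of the removed VLLM_ATTENTION_BACKEND
# environment variable.
import subprocess

# v0.12.x (no longer supported in v0.13.0):
#   VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve <model>
# v0.13.0:
subprocess.run([
    "vllm", "serve",
    "meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
    "--attention-backend", "FLASH_ATTN",  # illustrative backend value
], check=True)
```
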
New Contributors 🎉
- @ajpqs made their first contribution in https://github.com/vllm-project/vllm/pull/29905
- @amitz-nv made their first contribution in https://github.com/vllm-project/vllm/pull/29978
- @amrmahdi made their first contribution in https://github.com/vllm-project/vllm/pull/29452
- @andrewbriand made their first contribution in https://github.com/vllm-project/vllm/pull/29804
- @anker-c2 made their first contribution in https://github.com/vllm-project/vllm/pull/30344
- @AuruTus made their first contribution in https://github.com/vllm-project/vllm/pull/30182
- @avigny made their first contribution in https://github.com/vllm-project/vllm/pull/19425
- @Bhanu068 made their first contribution in https://github.com/vllm-project/vllm/pull/30254
- @Copilot made their first contribution in https://github.com/vllm-project/vllm/pull/29025
- @dbotwinick made their first contribution in https://github.com/vllm-project/vllm/pull/30583
- @dependabot[bot] made their first contribution in https://github.com/vllm-project/vllm/pull/30234
- @desertfire made their first contribution in https://github.com/vllm-project/vllm/pull/29919
- @dmitry-tokarev-nv made their first contribution in https://github.com/vllm-project/vllm/pull/30149
- @drslark made their first contribution in https://github.com/vllm-project/vllm/pull/30632
- @dtcccc made their first contribution in https://github.com/vllm-project/vllm/pull/24718
- @elizabetht made their first contribution in https://github.com/vllm-project/vllm/pull/28671
- @Elm8116 made their first contribution in https://github.com/vllm-project/vllm/pull/30068
- @gausah01 made their first contribution in https://github.com/vllm-project/vllm/pull/29604
- @gh-wf made their first contribution in https://github.com/vllm-project/vllm/pull/30285
- @hdlj-h made their first contribution in https://github.com/vllm-project/vllm/pull/30056
- @HF-001 made their first contribution in https://github.com/vllm-project/vllm/pull/30051
- @hzxuzhonghu made their first contribution in https://github.com/vllm-project/vllm/pull/29931
- @JaviS-Rei made their first contribution in https://github.com/vllm-project/vllm/pull/29882
- @johannesflommersfeld made their first contribution in https://github.com/vllm-project/vllm/pull/30390
- @KevinMusgrave made their first contribution in https://github.com/vllm-project/vllm/pull/30529
- @kitaekatt made their first contribution in https://github.com/vllm-project/vllm/pull/30408
- @lashahub made their first contribution in https://github.com/vllm-project/vllm/pull/30539
- @LuminolT made their first contribution in https://github.com/vllm-project/vllm/pull/29163
- @majiayu000 made their first contribution in https://github.com/vllm-project/vllm/pull/30615
- @MaoJianwei made their first contribution in https://github.com/vllm-project/vllm/pull/29797
- @Mercykid-bash made their first contribution in https://github.com/vllm-project/vllm/pull/26471
- @mgehre-amd made their first contribution in https://github.com/vllm-project/vllm/pull/30364
- @mivehk made their first contribution in https://github.com/vllm-project/vllm/pull/30512
- @mondaylord made their first contribution in https://github.com/vllm-project/vllm/pull/30671
- @noa-neria made their first contribution in https://github.com/vllm-project/vllm/pull/29320
- @PatrykSaffer made their first contribution in https://github.com/vllm-project/vllm/pull/30330
- @Peng-YM made their first contribution in https://github.com/vllm-project/vllm/pull/29074
- @realliujiaxu made their first contribution in https://github.com/vllm-project/vllm/pull/30059
- @redwrasse made their first contribution in https://github.com/vllm-project/vllm/pull/29261
- @Ri0S made their first contribution in https://github.com/vllm-project/vllm/pull/30532
- @sarathc-cerebras made their first contribution in https://github.com/vllm-project/vllm/pull/30188
- @scratch-ml made their first contribution in https://github.com/vllm-project/vllm/pull/30351
- @seokhyunan made their first contribution in https://github.com/vllm-project/vllm/pull/30648
- @shaharmor98 made their first contribution in https://github.com/vllm-project/vllm/pull/30203
- @taoyun951753 made their first contribution in https://github.com/vllm-project/vllm/pull/30037
- @tom-zju made their first contribution in https://github.com/vllm-project/vllm/pull/30057
- @tomtomjhj made their first contribution in https://github.com/vllm-project/vllm/pull/29692
- @vkuzo made their first contribution in https://github.com/vllm-project/vllm/pull/29196
- @vladnosiv made their first contribution in https://github.com/vllm-project/vllm/pull/30490
- @weiguihua2 made their first contribution in https://github.com/vllm-project/vllm/pull/30042
- @wenqiglantz made their first contribution in https://github.com/vllm-project/vllm/pull/30649
- @wkcn made their first contribution in https://github.com/vllm-project/vllm/pull/29879
- @wu-kan made their first contribution in https://github.com/vllm-project/vllm/pull/21804
- @wz1qqx made their first contribution in https://github.com/vllm-project/vllm/pull/30376
- @xyDong0223 made their first contribution in https://github.com/vllm-project/vllm/pull/30455
- @yifant-code made their first contribution in https://github.com/vllm-project/vllm/pull/30213
- @yjc9696 made their first contribution in https://github.com/vllm-project/vllm/pull/30040
- @yurekami made their first contribution in https://github.com/vllm-project/vllm/pull/30552
- @yuttian1 made their first contribution in https://github.com/vllm-project/vllm/pull/30102
- @ZhijianJiang made their first contribution in https://github.com/vllm-project/vllm/pull/30219
- @ZhiweiYan-96 made their first contribution in https://github.com/vllm-project/vllm/pull/29773
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.12.0...v0.13.0