vLLM v0.12.0 Release Notes

Highlights
This release features 474 commits from 213 contributors (57 new)!
Breaking Changes: This release includes the PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including the xformers backend, and scheduled removals; please review the changelog carefully.
Major Features:
- GPU Model Runner V2 (#25266): Major refactoring that removes persistent batch management, introduces GPU-persistent block tables, and features a Triton-native sampler with efficient logprobs support.
- Prefill Context Parallel (PCP) (#28718): Enhances long-sequence inference by partitioning the sequence dimension during prefill, complementing Decode Context Parallel (DCP).
- EAGLE Speculative Decoding: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).
Model Support
- New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
- Format support: Gemma3 GGUF multimodal support (#27772).
- Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
- Performance: QwenVL cos/sin cache optimization (#28798).
Engine Core
- GPU Model Runner V2 (#25266): Complete refactoring of the model execution pipeline:
  - Persistent batch removed, so no "reordering" or complex bookkeeping
  - GPU-persistent block tables for better scalability with max_model_len and num_kv_groups
  - Triton-native sampler: no -1 temperature hack, efficient per-request seeds, memory-efficient prompt logprobs
  - Simplified DP and CUDA graph implementations
  - Efficient structured outputs support
- Prefill Context Parallel (PCP) (#28718): Partitions the sequence dimension during prefill for improved long-sequence inference. Complements the existing Decode Context Parallel (DCP). See RFC #25749 for details.
- RLHF Support: Pause and resume generation for asynchronous RL training (#28037).
- KV Cache Enhancements: Cross-layer KV blocks support (#27743), KV cache residency metrics (#27793).
- Audio Support: Audio embeddings support in chat completions (#29059).
- Speculative Decoding (a configuration sketch follows this section):
  - Multi-step EAGLE with CUDA graph (#29559)
  - EAGLE DP>1 support (#26086)
  - EAGLE3 heads without use_aux_hidden_states (#27688)
  - EAGLE multimodal CUDA graphs with MRoPE (#28896)
  - Logprobs support with spec decode + async scheduling (#29223)
- Configuration: Flexible inputs_embeds_size separate from hidden_size (#29741), --fully-sharded-loras for fused_moe (#28761).
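As referenced in the Speculative Decoding item above, here is a minimal sketch of enabling EAGLE speculative decoding through the speculative_config dictionary. The base model, draft checkpoint, and num_speculative_tokens below are illustrative choices, not values taken from this release.

```python
# Sketch only: enabling EAGLE speculative decoding via speculative_config.
# Model IDs and the speculative token count are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "eagle",                            # EAGLE draft heads
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # illustrative EAGLE draft checkpoint
        "num_speculative_tokens": 3,
    },
)
out = llm.generate(["The quick brown fox"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```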
Hardware & Performance
- NVIDIA Performance:
  - Batch-invariant BMM optimization: 18.1% throughput and 10.7% TTFT improvement on DeepSeek-V3.1 (#29345)
  - Shared experts overlap with FlashInfer DeepGEMM: 2.2% throughput and 3.6% TTFT improvement at batch size 32 (#28879)
  - DeepGEMM N-dim restriction relaxed from a multiple of 128 to a multiple of 64 (#28687)
  - DeepEP low-latency with round-robin expert placement (#28449)
  - NVFP4 MoE CUTLASS support for SM120 (#29242)
  - H200 fused MoE config improvements (#28992)
- AMD ROCm:
  - DeepSeek v3.2 and SparseMLA support (#26670)
  - FP8 MLA decode support (#28032)
  - AITER sampling ops integration (#26084)
  - AITER Triton attention backend (#28701)
  - Bitsandbytes quantization on AMD GPUs with warp size 32 (#27307)
  - Fastsafetensors support (#28225)
  - Sliding window support for AiterFlashAttentionBackend (#29234)
  - Whisper v1 with AITER Unified/Flash Attention (#28376)
- CPU:
  - Paged attention GEMM acceleration on ARM CPUs with NEON (#29193)
  - Parallelization over tokens in int4 MoE (#29600)
  - CPU all-reduce optimization for async_scheduling + DP>1 (#29311)
- Attention: FlashAttention ViT support, now the default backend (#28763).
- Long Context: Optimized gather_and_maybe_dequant_cache kernel for extremely long sequences (#28029).
- Multi-NUMA: Enhanced NUMA functionality for systems with multiple NUMA nodes per socket (#25559).
- Docker: Image size reduced by ~200MB (#29060).
Quantization
- W4A8: Marlin kernel support (#24722).
- NVFP4:
- MoE CUTLASS support for SM120 (#29242)
- TRTLLM MoE NVFP4 kernel (#28892)
- CuteDSL MoE with NVFP4 DeepEP dispatch (#27141)
- Non-gated activations support in modelopt path (#29004)
- AWQ: Compressed-tensors AWQ support for Turing GPUs (#29732); a loading sketch follows this list.
- LoRA: FusedMoE LoRA Triton kernel for MXFP4 (#29708).
- Online quantization: Moved to model.load_weights (#26327).
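As noted in the AWQ item above, here is a minimal loading sketch. The model ID is illustrative (any AWQ / compressed-tensors AWQ checkpoint would do), and float16 is used since Turing GPUs do not support bfloat16.

```python
# Sketch: loading an AWQ-quantized checkpoint on a Turing GPU; the model ID is illustrative.
# vLLM detects the quantization scheme from the checkpoint config, so no explicit
# quantization argument is passed here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    dtype="float16",  # Turing GPUs do not support bfloat16
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```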
API & Frontend
- Responses API:
- Multi-turn support for non-harmony requests (#29175)
- Reasoning item input parsing (#28248)
- Tool Calling:
  - Parsed tool arguments support (#28820)
  - parallel_tool_calls param compliance (#26233)
  - Tool filtering support in ToolServer (#29224)
- Whisper: verbose_json and timestamp features for transcription/translation (#24209).
- Sampling: Flat logprob control moved from an env var to SamplingParams (#28914).
- GGUF: Improved Hugging Face loading UX with repo_id:quant_type syntax (#29137); see the sketch after this list.
- Profiling: Iteration-level profiling for Torch and CUDA profiler (#28987).
- Logs: Colorized log output (#29017).
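A sketch of the repo_id:quant_type shorthand mentioned in the GGUF item above. The repository and quantization names are illustrative, and the assumption is that the shorthand is accepted wherever a model name is normally passed.

```python
# Sketch: loading a GGUF checkpoint from the Hub via the repo_id:quant_type shorthand.
# Repo and quant names are illustrative.
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct-GGUF:Q4_K_M")
print(llm.generate(["What is vLLM?"])[0].outputs[0].text)
```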
Dependencies
- PyTorch 2.9.0 with CUDA 12.9 (#24994): a breaking change requiring environment updates (a quick version check follows this list).
- xgrammar: Updated to 0.1.27 (#28221).
- Transformers: Updated to 4.57.3 (#29418); preparation for v5 with rope_parameters (#28542).
- XPU: torch & IPEX 2.9 upgrade (#29307).
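Since the PyTorch 2.9.0 / CUDA 12.9 upgrade is a breaking change, a quick way to confirm the installed build after upgrading:

```python
# Verify the PyTorch build that vLLM will run against.
import torch

print(torch.__version__)   # expect a 2.9.x build
print(torch.version.cuda)  # expect "12.9" for the CUDA wheels
```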
V0 Deprecation & Breaking Changes
Removed Parameters:
- num_lookahead_slots (#29000)
- best_of (#29090); see the migration sketch after this section
- LoRA extra vocab (#28545)
Deprecated:
- xformers backend (#29262)
- seed=None (#29185)
Scheduled Removals (will be removed in a future release):
- ParallelConfig's direct child EPLB fields (#29324)
- guided_* config fields (#29326)
- override_pooler_config and disable_log_requests (#29402)
- CompilationConfig.use_inductor (#29323)
- Deprecated metrics (#29330)
Other Breaking Changes:
- PyTorch 2.9.0 upgrade requires a CUDA 12.9 environment
- Mistral format auto-detection for model loading (#28659)
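As mentioned under Removed Parameters, here is a migration sketch for code that relied on best_of: request n candidates explicitly and rank them client-side. Ranking by cumulative logprob is one reasonable stand-in for the old behavior, not the only option, and the model ID is illustrative.

```python
# Sketch: replacing the removed best_of parameter with an explicit n plus client-side ranking.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(n=4, temperature=0.8, seed=42, logprobs=1, max_tokens=32)
candidates = llm.generate(["The capital of France is"], params)[0].outputs
best = max(candidates, key=lambda c: c.cumulative_logprob)  # keep the highest-scoring sample
print(best.text)
```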
New Contributors
- @jesse996 made their first contribution in https://github.com/vllm-project/vllm/pull/28846
- @Nepherpitou made their first contribution in https://github.com/vllm-project/vllm/pull/28960
- @Samoed made their first contribution in https://github.com/vllm-project/vllm/pull/27329
- @j20120307 made their first contribution in https://github.com/vllm-project/vllm/pull/28999
- @vnadathur made their first contribution in https://github.com/vllm-project/vllm/pull/26468
- @zhyajie made their first contribution in https://github.com/vllm-project/vllm/pull/28942
- @IzzyPutterman made their first contribution in https://github.com/vllm-project/vllm/pull/28896
- @rjrock-amd made their first contribution in https://github.com/vllm-project/vllm/pull/28905
- @zq1997 made their first contribution in https://github.com/vllm-project/vllm/pull/27715
- @shengliangxu made their first contribution in https://github.com/vllm-project/vllm/pull/28076
- @prashanth058 made their first contribution in https://github.com/vllm-project/vllm/pull/28972
- @qgallouedec made their first contribution in https://github.com/vllm-project/vllm/pull/28820
- @zhanggzh made their first contribution in https://github.com/vllm-project/vllm/pull/19347
- @pandalee99 made their first contribution in https://github.com/vllm-project/vllm/pull/26628
- @dsuhinin made their first contribution in https://github.com/vllm-project/vllm/pull/29100
- @xli made their first contribution in https://github.com/vllm-project/vllm/pull/29124
- @jeremyteboul made their first contribution in https://github.com/vllm-project/vllm/pull/29059
- @soodoshll made their first contribution in https://github.com/vllm-project/vllm/pull/28875
- @bhagyashrigai made their first contribution in https://github.com/vllm-project/vllm/pull/28957
- @skaraban3807 made their first contribution in https://github.com/vllm-project/vllm/pull/25559
- @Victor49152 made their first contribution in https://github.com/vllm-project/vllm/pull/28892
- @rjrock made their first contribution in https://github.com/vllm-project/vllm/pull/29205
- @FlintyLemming made their first contribution in https://github.com/vllm-project/vllm/pull/29182
- @madskildegaard made their first contribution in https://github.com/vllm-project/vllm/pull/29175
- @nandan2003 made their first contribution in https://github.com/vllm-project/vllm/pull/29189
- @michaelact made their first contribution in https://github.com/vllm-project/vllm/pull/29173
- @yongming-qin made their first contribution in https://github.com/vllm-project/vllm/pull/28958
- @joshiemoore made their first contribution in https://github.com/vllm-project/vllm/pull/29249
- @lim4349 made their first contribution in https://github.com/vllm-project/vllm/pull/29068
- @apinge made their first contribution in https://github.com/vllm-project/vllm/pull/28376
- @gbyu-amd made their first contribution in https://github.com/vllm-project/vllm/pull/28032
- @kflu made their first contribution in https://github.com/vllm-project/vllm/pull/29364
- @Inokinoki made their first contribution in https://github.com/vllm-project/vllm/pull/29200
- @GOavi101 made their first contribution in https://github.com/vllm-project/vllm/pull/29313
- @sts07142 made their first contribution in https://github.com/vllm-project/vllm/pull/29137
- @ivanium made their first contribution in https://github.com/vllm-project/vllm/pull/29143
- @geodavic made their first contribution in https://github.com/vllm-project/vllm/pull/28795
- @Yejing-Lai made their first contribution in https://github.com/vllm-project/vllm/pull/29473
- @Adityayxt made their first contribution in https://github.com/vllm-project/vllm/pull/29491
- @guodongxiaren made their first contribution in https://github.com/vllm-project/vllm/pull/29620
- @askliar made their first contribution in https://github.com/vllm-project/vllm/pull/29426
- @scydas made their first contribution in https://github.com/vllm-project/vllm/pull/29589
- @EanWang211123 made their first contribution in https://github.com/vllm-project/vllm/pull/29594
- @qGentry made their first contribution in https://github.com/vllm-project/vllm/pull/29506
- @HappyAmazonian made their first contribution in https://github.com/vllm-project/vllm/pull/29335
- @rgommers made their first contribution in https://github.com/vllm-project/vllm/pull/29241
- @staugust made their first contribution in https://github.com/vllm-project/vllm/pull/28840
- @mertunsall made their first contribution in https://github.com/vllm-project/vllm/pull/29667
- @dublc made their first contribution in https://github.com/vllm-project/vllm/pull/29728
- @nwaughachukwuma made their first contribution in https://github.com/vllm-project/vllm/pull/29671
- @BowTen made their first contribution in https://github.com/vllm-project/vllm/pull/29731
- @omera-nv made their first contribution in https://github.com/vllm-project/vllm/pull/29004
- @zhangruoxu made their first contribution in https://github.com/vllm-project/vllm/pull/29568
- @KKKZOZ made their first contribution in https://github.com/vllm-project/vllm/pull/29783
- @FredericOdermatt made their first contribution in https://github.com/vllm-project/vllm/pull/29784
- @Abdennacer-Badaoui made their first contribution in https://github.com/vllm-project/vllm/pull/29782
- @knlnguyen1802 made their first contribution in https://github.com/vllm-project/vllm/pull/28525
- @finbarrtimbers made their first contribution in https://github.com/vllm-project/vllm/pull/29796
- @hholtmann made their first contribution in https://github.com/vllm-project/vllm/pull/29711
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.11.1...v0.12.0