
vLLM v0.12.0 Release Notes

Highlights

This release features 474 commits from 213 contributors (57 new)!

Breaking Changes: This release includes a PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including the xformers backend, and scheduled removals - please review the changelog carefully.

Major Features:

  • GPU Model Runner V2 (#25266): Major refactoring that removes persistent batch management, introduces GPU-persistent block tables, and features a Triton-native sampler with efficient logprobs support.
  • Prefill Context Parallel (PCP) (#28718): Enhances long-sequence inference by partitioning the sequence dimension during prefill, complementing Decode Context Parallel (DCP).
  • EAGLE Speculative Decoding: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).

Model Support

  • New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
  • Format support: Gemma3 GGUF multimodal support (#27772).
  • Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
  • Performance: QwenVL cos/sin cache optimization (#28798).

Engine Core

  • GPU Model Runner V2 (#25266): Complete refactoring of model execution pipeline:

    • No "reordering" or complex bookkeeping with persistent batch removal
    • GPU-persistent block tables for better scalability with max_model_len and num_kv_groups
    • Triton-native sampler: no -1 temperature hack, efficient per-request seeds, memory-efficient prompt logprobs
    • Simplified DP and CUDA graph implementations
    • Efficient structured outputs support
  • Prefill Context Parallel (PCP) (#28718): Partitions the sequence dimension during prefill for improved long-sequence inference. Complements existing Decode Context Parallel (DCP). See RFC #25749 for details.

  • RLHF Support: Pause and Resume Generation for Asynchronous RL Training (#28037).

  • KV Cache Enhancements: Cross-layer KV blocks support (#27743), KV cache residency metrics (#27793).

  • Audio support: Audio embeddings support in chat completions (#29059).

  • Speculative Decoding (a configuration sketch follows this list):

    • Multi-step Eagle with CUDA graph (#29559)
    • EAGLE DP>1 support (#26086)
    • EAGLE3 heads without use_aux_hidden_states (#27688)
    • Eagle multimodal CUDA graphs with MRoPE (#28896)
    • Logprobs support with spec decode + async scheduling (#29223)
  • Configuration: Flexible inputs_embeds_size separate from hidden_size (#29741), --fully-sharded-loras for fused_moe (#28761).
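
The EAGLE items above build on vLLM's existing speculative decoding configuration. As a reference point, here is a minimal sketch of enabling EAGLE through speculative_config; the target and draft model names and values are placeholders, and the exact keys should be checked against the current documentation.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: enable EAGLE speculative decoding via speculative_config.
# The target/draft model names and values are illustrative placeholders.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
        "num_speculative_tokens": 3,
    },
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```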

Hardware & Performance

  • NVIDIA Performance:

    • Batch invariant BMM optimization: 18.1% throughput improvement, 10.7% TTFT improvement on DeepSeek-V3.1 (#29345)
    • Shared Experts Overlap with FlashInfer DeepGEMM: 2.2% throughput improvement, 3.6% TTFT improvement at batch size 32 (#28879)
    • DeepGEMM N-dimension alignment requirement relaxed from a multiple of 128 to a multiple of 64 (#28687)
    • DeepEP low-latency with round-robin expert placement (#28449)
    • NVFP4 MoE CUTLASS support for SM120 (#29242)
    • H200 Fused MoE Config improvements (#28992)
  • AMD ROCm:

    • DeepSeek v3.2 and SparseMLA support (#26670)
    • FP8 MLA decode support (#28032)
    • AITER sampling ops integration (#26084)
    • AITER triton attention backend (#28701)
    • Bitsandbytes quantization on AMD GPUs with warp size 32 (#27307)
    • Fastsafetensors support (#28225)
    • Sliding window support for AiterFlashAttentionBackend (#29234)
    • Whisper v1 with Aiter Unified/Flash Attention (#28376)
  • CPU:

    • Paged attention GEMM acceleration on ARM CPUs with NEON (#29193)
    • Parallelize over tokens in int4 MoE (#29600)
    • CPU all reduce optimization for async_scheduling + DP>1 (#29311)
  • Attention: FlashAttention support for ViT, now the default ViT attention backend (#28763).

  • Long Context: Optimized gather_and_maybe_dequant_cache kernel for extremely long sequences (#28029).

  • Multi-NUMA: Enhanced NUMA functionality for systems with multiple NUMA nodes per socket (#25559).

  • Docker: Image size reduced by ~200MB (#29060).

Quantization

  • W4A8: Marlin kernel support (#24722).
  • NVFP4:
    • MoE CUTLASS support for SM120 (#29242)
    • TRTLLM MoE NVFP4 kernel (#28892)
    • CuteDSL MoE with NVFP4 DeepEP dispatch (#27141)
    • Non-gated activations support in modelopt path (#29004)
  • AWQ: Compressed-tensors AWQ support for Turing GPUs (#29732).
  • LoRA: FusedMoE LoRA Triton kernel for MXFP4 (#29708).
  • Online quantization: Moved to model.load_weights (#26327).

API & Frontend

  • Responses API:
    • Multi-turn support for non-harmony requests (#29175)
    • Reasoning item input parsing (#28248)
  • Tool Calling:
    • Parsed tool arguments support (#28820)
    • parallel_tool_calls param compliance (#26233)
    • Tool filtering support in ToolServer (#29224)
  • Whisper: verbose_json and timestamp features for transcription/translation (#24209); see the sketch after this list.
  • Sampling: Flat logprob control moved from env var to SamplingParams (#28914).
  • GGUF: Improved Hugging Face loading UX with repo_id:quant_type syntax (#29137); example after this list.
  • Profiling: Iteration-level profiling for Torch and CUDA profiler (#28987).
  • Logs: Colorized log output (#29017).
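
A minimal sketch of the new Whisper transcription options through the OpenAI-compatible endpoint; the server address and served model name are assumptions, and the parameters follow the shape of the OpenAI audio API:

```python
from openai import OpenAI

# Point the standard OpenAI client at a vLLM server hosting a Whisper model.
# The base_url and model name below are assumptions for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio_file,
        response_format="verbose_json",       # segment metadata instead of plain text
        timestamp_granularities=["segment"],  # request segment-level timestamps
    )

for segment in transcript.segments:
    print(segment.start, segment.end, segment.text)
```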
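
And a sketch of the repo_id:quant_type GGUF syntax; the repository and quantization type are placeholders, and the syntax is assumed to apply wherever a model name is passed (e.g. the LLM constructor or vllm serve):

```python
from vllm import LLM

# Sketch of the new "repo_id:quant_type" GGUF syntax; repo and quant type are placeholders.
# As with earlier GGUF support, a separate tokenizer repo may still be required.
llm = LLM(
    model="unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_K_M",
    tokenizer="meta-llama/Llama-3.2-1B-Instruct",
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```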

Dependencies

  • PyTorch 2.9.0 with CUDA 12.9 (#24994) - Breaking change requiring environment updates.
  • xgrammar: Updated to 0.1.27 (#28221).
  • Transformers: Updated to 4.57.3 (#29418), preparation for v5 with rope_parameters (#28542).
  • XPU: torch & IPEX 2.9 upgrade (#29307).

V0 Deprecation & Breaking Changes

Removed Parameters:

  • num_lookahead_slots (#29000)
  • best_of (#29090)
  • LoRA extra vocab (#28545)

Deprecated:

  • xformers backend (#29262)
  • seed=None (#29185)

Scheduled Removals (to be removed in a future release):

  • ParallelConfig's direct child EPLB fields (#29324)
  • guided_* config fields (#29326)
  • override_pooler_config and disable_log_requests (#29402)
  • CompilationConfig.use_inductor (#29323)
  • Deprecated metrics (#29330)

Other Breaking Changes:

  • PyTorch 2.9.0 upgrade requires CUDA 12.9 environment
  • Mistral format auto-detection for model loading (#28659)

New Contributors

  • @jesse996 made their first contribution in https://github.com/vllm-project/vllm/pull/28846
  • @Nepherpitou made their first contribution in https://github.com/vllm-project/vllm/pull/28960
  • @Samoed made their first contribution in https://github.com/vllm-project/vllm/pull/27329
  • @j20120307 made their first contribution in https://github.com/vllm-project/vllm/pull/28999
  • @vnadathur made their first contribution in https://github.com/vllm-project/vllm/pull/26468
  • @zhyajie made their first contribution in https://github.com/vllm-project/vllm/pull/28942
  • @IzzyPutterman made their first contribution in https://github.com/vllm-project/vllm/pull/28896
  • @rjrock-amd made their first contribution in https://github.com/vllm-project/vllm/pull/28905
  • @zq1997 made their first contribution in https://github.com/vllm-project/vllm/pull/27715
  • @shengliangxu made their first contribution in https://github.com/vllm-project/vllm/pull/28076
  • @prashanth058 made their first contribution in https://github.com/vllm-project/vllm/pull/28972
  • @qgallouedec made their first contribution in https://github.com/vllm-project/vllm/pull/28820
  • @zhanggzh made their first contribution in https://github.com/vllm-project/vllm/pull/19347
  • @pandalee99 made their first contribution in https://github.com/vllm-project/vllm/pull/26628
  • @dsuhinin made their first contribution in https://github.com/vllm-project/vllm/pull/29100
  • @xli made their first contribution in https://github.com/vllm-project/vllm/pull/29124
  • @jeremyteboul made their first contribution in https://github.com/vllm-project/vllm/pull/29059
  • @soodoshll made their first contribution in https://github.com/vllm-project/vllm/pull/28875
  • @bhagyashrigai made their first contribution in https://github.com/vllm-project/vllm/pull/28957
  • @skaraban3807 made their first contribution in https://github.com/vllm-project/vllm/pull/25559
  • @Victor49152 made their first contribution in https://github.com/vllm-project/vllm/pull/28892
  • @rjrock made their first contribution in https://github.com/vllm-project/vllm/pull/29205
  • @FlintyLemming made their first contribution in https://github.com/vllm-project/vllm/pull/29182
  • @madskildegaard made their first contribution in https://github.com/vllm-project/vllm/pull/29175
  • @nandan2003 made their first contribution in https://github.com/vllm-project/vllm/pull/29189
  • @michaelact made their first contribution in https://github.com/vllm-project/vllm/pull/29173
  • @yongming-qin made their first contribution in https://github.com/vllm-project/vllm/pull/28958
  • @joshiemoore made their first contribution in https://github.com/vllm-project/vllm/pull/29249
  • @lim4349 made their first contribution in https://github.com/vllm-project/vllm/pull/29068
  • @apinge made their first contribution in https://github.com/vllm-project/vllm/pull/28376
  • @gbyu-amd made their first contribution in https://github.com/vllm-project/vllm/pull/28032
  • @kflu made their first contribution in https://github.com/vllm-project/vllm/pull/29364
  • @Inokinoki made their first contribution in https://github.com/vllm-project/vllm/pull/29200
  • @GOavi101 made their first contribution in https://github.com/vllm-project/vllm/pull/29313
  • @sts07142 made their first contribution in https://github.com/vllm-project/vllm/pull/29137
  • @ivanium made their first contribution in https://github.com/vllm-project/vllm/pull/29143
  • @geodavic made their first contribution in https://github.com/vllm-project/vllm/pull/28795
  • @Yejing-Lai made their first contribution in https://github.com/vllm-project/vllm/pull/29473
  • @Adityayxt made their first contribution in https://github.com/vllm-project/vllm/pull/29491
  • @guodongxiaren made their first contribution in https://github.com/vllm-project/vllm/pull/29620
  • @askliar made their first contribution in https://github.com/vllm-project/vllm/pull/29426
  • @scydas made their first contribution in https://github.com/vllm-project/vllm/pull/29589
  • @EanWang211123 made their first contribution in https://github.com/vllm-project/vllm/pull/29594
  • @qGentry made their first contribution in https://github.com/vllm-project/vllm/pull/29506
  • @HappyAmazonian made their first contribution in https://github.com/vllm-project/vllm/pull/29335
  • @rgommers made their first contribution in https://github.com/vllm-project/vllm/pull/29241
  • @staugust made their first contribution in https://github.com/vllm-project/vllm/pull/28840
  • @mertunsall made their first contribution in https://github.com/vllm-project/vllm/pull/29667
  • @dublc made their first contribution in https://github.com/vllm-project/vllm/pull/29728
  • @nwaughachukwuma made their first contribution in https://github.com/vllm-project/vllm/pull/29671
  • @BowTen made their first contribution in https://github.com/vllm-project/vllm/pull/29731
  • @omera-nv made their first contribution in https://github.com/vllm-project/vllm/pull/29004
  • @zhangruoxu made their first contribution in https://github.com/vllm-project/vllm/pull/29568
  • @KKKZOZ made their first contribution in https://github.com/vllm-project/vllm/pull/29783
  • @FredericOdermatt made their first contribution in https://github.com/vllm-project/vllm/pull/29784
  • @Abdennacer-Badaoui made their first contribution in https://github.com/vllm-project/vllm/pull/29782
  • @knlnguyen1802 made their first contribution in https://github.com/vllm-project/vllm/pull/28525
  • @finbarrtimbers made their first contribution in https://github.com/vllm-project/vllm/pull/29796
  • @hholtmann made their first contribution in https://github.com/vllm-project/vllm/pull/29711

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.11.1...v0.12.0