
Highlights

This release contains 740 commits from 266 contributors (97 new)!

Breaking Changes: This release includes the PyTorch 2.8.0 upgrade, V0 deprecations, and API changes; please review the changelog carefully.

aarch64 support: This release adds native aarch64 support, enabling vLLM on the GB200 platform. The docker image vllm/vllm-openai should already be multiplatform. To install the wheels, download them from this release's artifacts or install via

uv pip install vllm==0.10.2 --extra-index-url https://wheels.vllm.ai/0.10.2/ --torch-backend=auto
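
To sanity-check an aarch64 (or any) install, a minimal offline-inference script along these lines should run end to end; the model name here is only an example:

    # Minimal smoke test after installation; the model choice is arbitrary.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
    outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.0, max_tokens=32))
    print(outputs[0].outputs[0].text)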

Model Support

  • New model families and enhancements: Apertus (#23068), LFM2 (#22845), MiDashengLM (#23652), Motif-1-Tiny (#23414), Seed-Oss (#23241), Google EmbeddingGemma-300m (#24318), GTE sequence classification (#23524), Donut OCR model (#23229), KeyeVL-1.5-8B (#23838), R-4B vision model (#23246), Ernie4.5 VL (#22514), MiniCPM-V 4.5 (#23586), Ovis2.5 (#23084), Qwen3-Next with hybrid attention (#24526), InternVL3.5 with video support (#23658), Qwen2Audio embeddings (#23625), NemotronH Nano VLM (#23644), BLOOM V1 engine support (#23488), and Whisper encoder-decoder for V1 (#21088).
  • Pipeline parallelism expansion: Added PP support for Hunyuan (#24212), Ovis2.5 (#23405), GPT-OSS (#23680), and Kimi-VL-A3B-Thinking-2506 (#23114).
  • Data parallelism for vision models: Enabled DP for ViT across Qwen2.5VL (#22742), MiniCPM-V (#23948, #23327), Kimi-VL (#23817), and GLM-4.5V (#23168).
  • LoRA ecosystem expansion: Added LoRA support to Voxtral (#24517), Qwen-2.5-Omni (#24231), and DeepSeek models V2/V3/R1-0528 (#23971), with significantly faster startup when LoRA is enabled (#23777); see the usage sketch after this list.
  • Classification and pooling enhancements: Multi-label classification support (#23173), logit bias and sigmoid normalization (#24031), and FP32 precision heads for pooling models (#23810).
  • Performance optimizations: Removed unnecessary CUDA sync from GLM-4.1V (#24332) and Qwen2VL (#24334) preprocessing, eliminated a redundant all-reduce in Qwen3 MoE (#23169), optimized InternVL CPU threading (#24519), and sped up GLM-4.5V video frame decoding (#24161).
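
As a rough illustration of the expanded LoRA support noted above, the existing vLLM LoRA flow applies unchanged; the base model and adapter path below are placeholders, not models from this release:

    # Sketch of offline generation with a LoRA adapter; model name and adapter path are placeholders.
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True)
    outputs = llm.generate(
        "Summarize the vLLM 0.10.2 release in one sentence.",
        SamplingParams(max_tokens=64),
        lora_request=LoRARequest("my_adapter", 1, "/path/to/adapter"),
    )
    print(outputs[0].outputs[0].text)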

Engine Core

  • V1 engine maturation: Extended V1 support to compute capability < 8.0 (#23614, #24022), added cross-attention KV cache for encoder-decoder models (#23664), request-level logits processor integration (#23656), and KV events from connectors (#19737).
  • Backend expansion: Terratorch backend integration (#23513), enabling non-language tasks such as semantic segmentation and geospatial applications via --model-impl terratorch.
  • Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#22594), disabled prefix caching for hybrid/Mamba models (#23716), added FP32 SSM kernel support (#23506), full CUDA graph support for Mamba1 (#23035), and V1 as default for Mamba models (#23650).
  • Core performance improvements: --safetensors-load-strategy for faster NFS-based weight loading (#24469), a critical CUDA graph capture throughput fix (#24128), scheduler optimization for single completions (#21917), multi-threaded model weight loading (#23928), and enforced tensor core usage for FlashInfer decode (#23214).
  • Multimodal enhancements: Multimodal cache tracking with mm_hash (#22711), UUID-based multimodal identifiers (#23394), improved V1 video embedding estimation (#24312), and simplified multimodal UUID handling (#24271).
  • Sampling and structured outputs: Support for all prompt logprobs (#23868), final logprobs (#22387), grammar bitmask optimization (#23361), and user-configurable KV cache memory size (#21489).
  • Distributed: Decode Context Parallel (DCP) support for MLA (#23734).

Hardware & Performance

  • NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#22357), MXFP4 fused CUTLASS MoE (#23696), default MXFP4 MoE on Blackwell (#23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#23608).
  • Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#24521).
  • Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#14258, #23958), V1 cross-attention support (#23297), FP8 support for FlashMLA (#22668), fused grouped TopK for MoE (#23274), Flash Linear Attention kernels (#24518), and W4A8 support on Hopper (#23198).
  • Performance improvements: 13.7x speedup for token conversion (#20413), TTIT/TTFT improvements for disaggregated serving (#22760), symmetric memory all-reduce by default (#24111), FlashInfer warmup during startup (#23439), V1 model execution overlap (#23569), and various Triton configuration tuning (#23748, #23939).
  • Platform expansion: Apple Silicon bfloat16 support for M2+ (#24129), IBM Z V1 engine support (#22725), Intel XPU torch.compile (#22609), XPU MoE data parallelism (#22887), XPU Triton attention (#24149), XPU FP8 quantization (#23148), and ROCm pipeline parallelism with Ray (#24275).
  • Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#24698, #24688, #24699, #24695), GLM-4.5-Air-FP8 B200 configs (#23695), Kimi K2 optimization (#24597), and Qwen3 Coder/Thinking configs (#24266, #24330).

Quantization

  • New quantization capabilities: Per-layer quantization routing (#23556), GGUF quantization with layer skipping (#23188), NFP4+FP8 MoE support (#22674), W4A8 channel scales (#23570), and AMD CDNA2/CDNA3 FP4 support (#22527).
  • Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#22486), enabling quantization techniques such as SpinQuant (R1/R2/R4) and QuIP.
  • FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#24197), FP8-qkv attention kernels (#23647), and FP8 per-tensor GEMMs (#22895).
  • Platform-specific quantization: ROCm TorchAO quantization enablement (#24400) and TorchAO module swap configuration (#21982).
  • Performance optimizations: MXFP4 MoE loading cache optimization (#24154) and compressed tensors version updates (#23202).
  • Breaking change: Removed original Marlin quantization format (#23204).

API & Frontend

  • OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#23735), transcription response usage statistics (#23576), and a return_token_ids parameter (#22587); a request sketch follows this list.
  • Response API improvements: Streaming support for non-harmony responses (#23741), non-streaming logprobs (#23319), MCP tool background mode (#23494), MCP streaming+background support (#23927), and tool output token reporting (#24285).
  • Frontend optimizations: Error stack traces with --log-error-stack (#22960), a collective RPC endpoint (#23075), beam search concurrency optimization (#23599), skipping of unnecessary detokenization (#24236), and custom media UUIDs (#23449).
  • Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#23595), EPLB configuration parameter (#20562), embedding endpoint chat request support (#23931), and LM Format Enforcer V1 integration (#22564).
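
For the return_token_ids parameter mentioned above, one way to exercise it is through the OpenAI client's extra_body passthrough against a running vLLM server; this is a hedged sketch, and the server URL, model name, and response handling are assumptions:

    # Hedged sketch: ask a vLLM OpenAI-compatible server to return token IDs (#22587).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-0.5B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}],
        extra_body={"return_token_ids": True},  # vLLM-specific extension
    )
    print(resp.choices[0].message.content)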

Dependencies

  • Major updates: PyTorch 2.8.0 upgrade (#20358), which is a breaking change requiring environment updates; FlashInfer v0.3.0 upgrade (#24086); and a FlashInfer 0.2.14.post1 maintenance update (#23537).
  • Supporting updates: XGrammar 0.1.23 (#22988), TPU core dump fix with tpu_info 0.4.0 (#23135), and compressed tensors version bump (#23202).
  • Deployment improvements: FlashInfer cubin directory environment variable (#22675) for offline environments and pre-cached CUDA binaries.

V0 Deprecation

  • Backend removals: V0 Neuron backend deprecation (#21159), V0 pooling model support removal (#23434), V0 FlashInfer attention backend removal (#22776), and V0 test cleanup (#23418, #23862).
  • API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#18800), LoRA extra vocab size deprecation warning (#23635), LoRA bias parameter deprecation (#24339), and metrics naming change from TPOT to ITL (#24110).
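
For the prompt_token_ids fallback removal, pre-tokenized inputs now go through a prompt object instead of a keyword argument; a minimal migration sketch (model and token IDs are arbitrary examples):

    # Before #18800 (removed): llm.generate(prompt_token_ids=[[1, 2, 3]], ...)
    # After: wrap the token IDs in a TokensPrompt.
    from vllm import LLM, SamplingParams
    from vllm.inputs import TokensPrompt

    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # example model
    prompt = TokensPrompt(prompt_token_ids=[1, 2, 3])  # arbitrary example IDs
    outputs = llm.generate(prompt, SamplingParams(max_tokens=8))
    print(outputs[0].outputs[0].text)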

Breaking Changes

  1. PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions
  2. FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
  3. V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
  4. Quantization removals - Removed the quantized Mixtral hack implementation and the original Marlin format
  5. Metrics renaming - TPOT deprecated in favor of ITL

What's Changed

  • [Misc] Minor code cleanup for _get_prompt_logprobs_dict by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23064
  • [Misc] enhance static type hint by @andyxning in https://github.com/vllm-project/vllm/pull/23059
  • [Bugfix] fix Qwen2.5-Omni processor output mapping by @DoubleVII in https://github.com/vllm-project/vllm/pull/23058
  • [Bugfix][CI] Machete kernels: deterministic ordering for more cache hits by @andylolu2 in https://github.com/vllm-project/vllm/pull/23055
  • [Misc] refactor function name by @andyxning in https://github.com/vllm-project/vllm/pull/23029
  • [Misc] Fix backward compatibility from #23030 by @ywang96 in https://github.com/vllm-project/vllm/pull/23070
  • [XPU] Fix compile size for xpu by @jikunshang in https://github.com/vllm-project/vllm/pull/23069
  • [XPU][CI]add xpu env vars in CI scripts by @jikunshang in https://github.com/vllm-project/vllm/pull/22946
  • [Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23053
  • [Bugfix] fix IntermediateTensors equal method by @andyxning in https://github.com/vllm-project/vllm/pull/23027
  • [Refactor] Get prompt updates earlier by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23097
  • chore: remove unnecessary patch_padding_side for the chatglm model by @carlory in https://github.com/vllm-project/vllm/pull/23090
  • [Bugfix] Support compile for Transformers multimodal by @zucchini-nlp in https://github.com/vllm-project/vllm/pull/23095
  • [CI Bugfix] Pin openai<1.100 to unblock CI by @mgoin in https://github.com/vllm-project/vllm/pull/23118
  • fix: OpenAI SDK compat (ResponseTextConfig) by @h-brenoskuk in https://github.com/vllm-project/vllm/pull/23126
  • Use Blackwell FlashInfer MXFP4 MoE by default if available by @mgoin in https://github.com/vllm-project/vllm/pull/23008
  • Install tpu_info==0.4.0 to fix core dump for TPU by @xiangxu-google in https://github.com/vllm-project/vllm/pull/23135
  • [Misc] Minor refactoring for prepare_inputs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23116
  • [Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23041
  • [Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code by @tdoublep in https://github.com/vllm-project/vllm/pull/23122
  • [CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/22871
  • [V0 Deprecation] Remove V0 FlashInfer attention backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/22776
  • chore: disable enable_cpp_symbolic_shape_guards by @xiszishu in https://github.com/vllm-project/vllm/pull/23048
  • [TPU] make ptxla not imported when using tpu_commons by @yaochengji in https://github.com/vllm-project/vllm/pull/23081
  • [Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes by @nikheal2 in https://github.com/vllm-project/vllm/pull/22725
  • Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/22023
  • [Log] Warning Once for Cutlass MLA by @yewentao256 in https://github.com/vllm-project/vllm/pull/23137
  • [Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Thinking-2506 by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23114
  • [misc] split engine_model into json file for nsys profile tool by @gracehonv in https://github.com/vllm-project/vllm/pull/23117
  • [Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn by @pliops-daniels in https://github.com/vllm-project/vllm/pull/22889
  • Fix GLM-4.5V-FP8 numerical issue by @zixi-qi in https://github.com/vllm-project/vllm/pull/22949
  • [Misc] Add request_id into benchmark_serve.py by @hustxiayang in https://github.com/vllm-project/vllm/pull/23065
  • [Bugfix] Fix broken Minimax-01-VL model by @Isotr0py in https://github.com/vllm-project/vllm/pull/22116
  • [bug fix] Fix llama4 spec decoding by @zixi-qi in https://github.com/vllm-project/vllm/pull/22691
  • [Misc] Avoid accessing req_ids inside a loop by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23159
  • [Doc] use power of 2 by @Tialo in https://github.com/vllm-project/vllm/pull/23172
  • [Misc] Fix seq_lens for graph capture by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23175
  • [NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel by @elvischenv in https://github.com/vllm-project/vllm/pull/21716
  • [Model] Add transformers problem_type (e.g. multi_label_classification) support by @noooop in https://github.com/vllm-project/vllm/pull/23173
  • [Model] support new model ovis2.5 by @myselvess in https://github.com/vllm-project/vllm/pull/23084
  • [Bugfix] Fix benchmark_moe.py by @jeejeelee in https://github.com/vllm-project/vllm/pull/23177
  • [FEAT] [Performance] Enable DP for ViT in Qwen2.5VL by @tjtanaa in https://github.com/vllm-project/vllm/pull/22742
  • [Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBlock by @yiz-liu in https://github.com/vllm-project/vllm/pull/23169
  • Add return_token_ids parameter to OpenAI API endpoints by @ultmaster in https://github.com/vllm-project/vllm/pull/22587
  • Migrate LlavaOnevisionMultiInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/21844
  • [CI/Build] Update transformers to v4.55.2 by @Isotr0py in https://github.com/vllm-project/vllm/pull/23093
  • [Misc] Fix the benchmark's README and improve the error messages for the benchmark's argument checks by @tanruixiang in https://github.com/vllm-project/vllm/pull/22654
  • [Frontend] Add /collective_rpc API endpoint by @22quinn in https://github.com/vllm-project/vllm/pull/23075
  • [Misc] Enable yapf for FlashInfer backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23193
  • [Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. by @bnellnm in https://github.com/vllm-project/vllm/pull/23125
  • fix: use cache_salt for gpt-oss by @dr75 in https://github.com/vllm-project/vllm/pull/23186
  • [Misc] Minor refactoring for FlashInfer backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23147
  • [CI/Build] Add support for Python 3.13 by @mgoin in https://github.com/vllm-project/vllm/pull/13164
  • [NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend by @amirkl94 in https://github.com/vllm-project/vllm/pull/22357
  • [CI/Build] Replace lm-eval gsm8k tests with faster implementation by @mgoin in https://github.com/vllm-project/vllm/pull/23002
  • [BugFix] fix CUTLASS MLA full cudagraph by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/23200
  • [Benchmarks] Add video inputs to ShareGPTDataset. by @huachenheli in https://github.com/vllm-project/vllm/pull/23199
  • [Quantization] Bump Compressed Tensors Version by @kylesayrs in https://github.com/vllm-project/vllm/pull/23202
  • [Core] Optimize scheduler request removal for single completions by @chi2liu in https://github.com/vllm-project/vllm/pull/21917
  • [CI Perf] Only test bfloat16 for tests/compile/test_fusion_all_reduce.py by @mgoin in https://github.com/vllm-project/vllm/pull/23132
  • [Core] Add torch profiler CPU traces for AsyncLLM. by @huachenheli in https://github.com/vllm-project/vllm/pull/21794
  • [Doc] Update V1 status of various pooling models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23189
  • [Attention] Optimize make_local_attention_virtual_batches for Flash Attention by @linzebing in https://github.com/vllm-project/vllm/pull/23185
  • Fix a performance comparison issue in Benchmark Suite by @louie-tsai in https://github.com/vllm-project/vllm/pull/23047
  • chore: support pytorch format in lora by @KilJaeeun in https://github.com/vllm-project/vllm/pull/22790
  • [CI/Build] Also check DP in benchmarks throughput script by @zhewenl in https://github.com/vllm-project/vllm/pull/23038
  • [CI/Build] Sync multimodal tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23181
  • [BugFix] Fix stuck stats/metrics after requests are aborted by @njhill in https://github.com/vllm-project/vllm/pull/22995
  • fix cuda graph by @fsx950223 in https://github.com/vllm-project/vllm/pull/22721
  • [Model] use autoWeightsLoader for gptoss by @calvin0327 in https://github.com/vllm-project/vllm/pull/22446
  • Fix missing quotes by @wzshiming in https://github.com/vllm-project/vllm/pull/23242
  • [Model] Support deepseek with eagle by @xyang16 in https://github.com/vllm-project/vllm/pull/21086
  • [Bugfix] Ensure correctness of Cohere2Vision processing by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23245
  • Update to flashinfer-python==0.2.12 and disable AOT compile for non-release image by @mgoin in https://github.com/vllm-project/vllm/pull/23129
  • [Model][V1] Support Ernie MTP by @xyxinyang in https://github.com/vllm-project/vllm/pull/22169
  • [Model] Improve olmo and olmo2 by @jeejeelee in https://github.com/vllm-project/vllm/pull/23228
  • [Fix] fix offline env use local mode path by @lengrongfu in https://github.com/vllm-project/vllm/pull/22526
  • [Bugfix] Ensure correctness of HCXVision processing by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23254
  • [Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute by @shixianc in https://github.com/vllm-project/vllm/pull/23045
  • [CLI][Doc] Formalize --mm-encoder-tp-mode by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23190
  • [Misc] Add max_seq_len to CommonAttentionMetadata by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23216
  • [FIXBUG ] Allow disabling rocm_aiter_fa backend for ROCm GPUs not compatible with AITER by @JartX in https://github.com/vllm-project/vllm/pull/22795
  • Support conditional torch.compile per module by @sarckk in https://github.com/vllm-project/vllm/pull/22269
  • Migrate Mistral3ImagePixelInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/21945
  • Limit HTTP header count and size by @russellb in https://github.com/vllm-project/vllm/pull/23267
  • Small fix for Command-A-Vision by @dongluw in https://github.com/vllm-project/vllm/pull/23268
  • [Kernel/Quant] Remove the original marlin format and qqq by @mgoin in https://github.com/vllm-project/vllm/pull/23204
  • [Fix] correct tool_id for kimi-k2 when use tool_choice=required by @MoyanZitto in https://github.com/vllm-project/vllm/pull/21259
  • [Frontend] improve error logging of chat completion by @heheda12345 in https://github.com/vllm-project/vllm/pull/22957
  • [Optimization] Speed up function _convert_tokens_to_string_with_added_encoders by 13.7x by @misrasaurabh1 in https://github.com/vllm-project/vllm/pull/20413
  • Do not use eval() to convert unknown types by @russellb in https://github.com/vllm-project/vllm/pull/23266
  • [Feature] use --eplb_config to set eplb param by @lengrongfu in https://github.com/vllm-project/vllm/pull/20562
  • [misc] fix multiple arch wheels for the nightly index by @youkaichao in https://github.com/vllm-project/vllm/pull/23110
  • Remove chunked_prefill_enabled flag in V1 MLA by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/23183
  • Feature/mla tests by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/23195
  • [Fix] remove is_marlin param in benchmark_moe by @shixianc in https://github.com/vllm-project/vllm/pull/23286
  • [EP] Add logging for experts map by @22quinn in https://github.com/vllm-project/vllm/pull/22685
  • Remove duplicate entry in vllm.attention.all by @russellb in https://github.com/vllm-project/vllm/pull/23296
  • [CI Bugfix] Fix CI by fully removing --enable-prompt-adapter by @mgoin in https://github.com/vllm-project/vllm/pull/23284
  • [Optimization] Make new_block_ids None if empty by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23262
  • [CPU] Refactor CPU W8A8 scaled_mm by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/23071
  • [CI/Build] Split out mm processor tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23260
  • [V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support by @Josephasafg in https://github.com/vllm-project/vllm/pull/23035
  • [Compile] Fix Compile Warning SM100 Cutlass MLA by @yewentao256 in https://github.com/vllm-project/vllm/pull/23287
  • [Model][VLM] Support R-4B Model by @yannqi in https://github.com/vllm-project/vllm/pull/23246
  • Delete images older than 24h. by @QiliangCui in https://github.com/vllm-project/vllm/pull/23291
  • [CI] Block the cu126 wheel build while broken by @mgoin in https://github.com/vllm-project/vllm/pull/23285
  • [Sampler] Support returning final logprobs by @22quinn in https://github.com/vllm-project/vllm/pull/22387
  • [Bugfix] Fix extra whitespace in strings caused by newline by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23272
  • [BugFix] Fix Python 3.9 Support by @jaredoconnell in https://github.com/vllm-project/vllm/pull/23306
  • [Model] Add LFM2 architecture by @paulpak58 in https://github.com/vllm-project/vllm/pull/22845
  • [Refactor] Simplify code for MM budget by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23310
  • [Doc] Fix batch-level DP example by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23325
  • [Performance] V1 Pooling Models E2E Performance Optimization by @noooop in https://github.com/vllm-project/vllm/pull/23162
  • [V1] Remove unnecessary check for main thread by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/23298
  • [Bugfix] set system_message in phi4mini chat template by @zhuangqh in https://github.com/vllm-project/vllm/pull/23309
  • [Multimodal] Always enable hashing mm data by @ywang96 in https://github.com/vllm-project/vllm/pull/23308
  • [ci/build] Fix abi tag for aarch64 by @youkaichao in https://github.com/vllm-project/vllm/pull/23329
  • Migrate MolmoImageInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/22022
  • Fix nvfp4 swizzling by @yiliu30 in https://github.com/vllm-project/vllm/pull/23140
  • add tg-mxfp4-moe-test by @IwakuraRein in https://github.com/vllm-project/vllm/pull/22540
  • [Bug] Fix R1 Accuracy 0 Bug by @yewentao256 in https://github.com/vllm-project/vllm/pull/23294
  • [Bugfix] Fix port conflict by obtaining a list of open ports upfront by @minosfuture in https://github.com/vllm-project/vllm/pull/21894
  • [Misc] Misc code cleanup/simplification by @njhill in https://github.com/vllm-project/vllm/pull/23304
  • [BugFix][gpt-oss] Fix Chat Completion with Multiple Output Message by @heheda12345 in https://github.com/vllm-project/vllm/pull/23318
  • [Misc] fix VLLM_TORCH_PROFILER_DIR to absolute path by @andyxning in https://github.com/vllm-project/vllm/pull/23191
  • [Core] Always use tensor cores for Flashinfer Decode Wrapper by @pavanimajety in https://github.com/vllm-project/vllm/pull/23214
  • Make sure that vectorize_with_alignment produced vectorized global loads by @elvircrn in https://github.com/vllm-project/vllm/pull/23182
  • [Structured Outputs] Refactor bitmask construction into get_grammar_bitmask by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23361
  • [CI] Clean up actions: remove helm, publish workflows and improve pr … by @simon-mo in https://github.com/vllm-project/vllm/pull/23377
  • [CI] improve pr comments bot by @simon-mo in https://github.com/vllm-project/vllm/pull/23380
  • [Perf] Small optimizations for silu_mul_fp8_quant_deep_gemm by @mgoin in https://github.com/vllm-project/vllm/pull/23265
  • Always use cache mounts when installing vllm to avoid populating pip cache in the image. Also remove apt cache. by @tvalentyn in https://github.com/vllm-project/vllm/pull/23270
  • [Feature][Responses API] Support logprobs(non-stream) by @kebe7jun in https://github.com/vllm-project/vllm/pull/23319
  • [Core] Support custom executor qualname by @22quinn in https://github.com/vllm-project/vllm/pull/23314
  • [Kernel] Add FP8 support with FlashMLA backend by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/22668
  • [Deprecation] Remove prompt_token_ids arg fallback in LLM.generate and LLM.embed by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/18800
  • Migrate MllamaImagePixelInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/22020
  • [CI/Build] Skip Idefics3 and SmolVLM generation test again by @Isotr0py in https://github.com/vllm-project/vllm/pull/23356
  • [Feature] Enable DeepGEMM Linear on B200; 1.5% E2E throughput improvement by @yewentao256 in https://github.com/vllm-project/vllm/pull/23351
  • [CI] Add end-to-end V1 min_tokens test coverage by @arjunbreddy22 in https://github.com/vllm-project/vllm/pull/22495
  • [Misc] Add gemma3 chat template with pythonic-style function calling by @philipchung in https://github.com/vllm-project/vllm/pull/17149
  • [New Model] Add Seed-Oss model by @FoolPlayer in https://github.com/vllm-project/vllm/pull/23241
  • [Attention] Refactor AttentionMetadata Preparation for Encoder-only Models by @heheda12345 in https://github.com/vllm-project/vllm/pull/23154
  • [P/D][Nixl] Make kv cache register compatible with hybrid memory allocator by @sfeng33 in https://github.com/vllm-project/vllm/pull/23079
  • [gpt-oss] add input/output usage in responses api when harmony context is leveraged by @gcalmettes in https://github.com/vllm-project/vllm/pull/22667
  • Migrate MiniCPMOAudioInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/21847
  • [Bugfix] Fix pooling models on non-CUDA devices by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/23392
  • [V0 Deprecation] Remove V0 LoRA test by @jeejeelee in https://github.com/vllm-project/vllm/pull/23418
  • [Misc] Move M-RoPE init logic to _init_mrope_positions by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23422
  • [Attention] Allow V1 flash_attn to support cross-attention by @russellb in https://github.com/vllm-project/vllm/pull/23297
  • [misc] Remove outdate comment about runai_model_streamer by @carlory in https://github.com/vllm-project/vllm/pull/23421
  • [Doc] Update the doc for log probs + prefix caching by @heheda12345 in https://github.com/vllm-project/vllm/pull/23399
  • [Misc] local import code clean by @andyxning in https://github.com/vllm-project/vllm/pull/23420
  • [Bug fix] Dynamically setting the backend variable for genai_perf_tests in the run-nightly-benchmark script by @namanlalitnyu in https://github.com/vllm-project/vllm/pull/23375
  • [Fix] Bump triton version in rocm-build requirements by @bringlein in https://github.com/vllm-project/vllm/pull/21630
  • [Bugfix]: Installing dev environment due to pydantic incompatible version by @hickeyma in https://github.com/vllm-project/vllm/pull/23353
  • [Speculators][Speculative Decoding] Fix Qwen 2 Eagle3 Support by @PapaGoose in https://github.com/vllm-project/vllm/pull/23337
  • [BugFix] Fix the issue where image embeddings were incorrectly split.… by @bppps in https://github.com/vllm-project/vllm/pull/23366
  • fix(tests): Ensure reliable CUDA cache clearing in MoE test by @AzizCode92 in https://github.com/vllm-project/vllm/pull/23416
  • Add unit tests for batched guided and non-guided requests by @sarckk in https://github.com/vllm-project/vllm/pull/23389
  • [Doc]: fix various typos in multiple files by @didier-durand in https://github.com/vllm-project/vllm/pull/23179
  • [Model] Add Ovis2.5 PP support by @Isotr0py in https://github.com/vllm-project/vllm/pull/23405
  • [Bugfix] Fix broken Florence-2 model by @Isotr0py in https://github.com/vllm-project/vllm/pull/23426
  • [Quantization] Allow GGUF quantization to skip unquantized layer by @Isotr0py in https://github.com/vllm-project/vllm/pull/23188
  • add an env var for path to pre-downloaded flashinfer cubin files by @842974287 in https://github.com/vllm-project/vllm/pull/22675
  • [CI/Build] add EP dependencies to docker by @zhewenl in https://github.com/vllm-project/vllm/pull/21976
  • [PERF] PyTorch Symmetric Memory All-Reduce by @ilmarkov in https://github.com/vllm-project/vllm/pull/20759
  • [BugFix][AMD][Quantization] Fix torch.compile issue where wvSplitKQ not being called when it should when using quantized FP8 model by @rasmith in https://github.com/vllm-project/vllm/pull/22281
  • [NVIDIA] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel by @elvischenv in https://github.com/vllm-project/vllm/pull/22703
  • [BugFix] Fix batch updates for pooling models by @njhill in https://github.com/vllm-project/vllm/pull/23398
  • [BugFix] Fix MinPLogitsProcessor.update_states() by @njhill in https://github.com/vllm-project/vllm/pull/23401
  • [Model] Support DP for ViT on MiniCPM-V-4 by @david6666666 in https://github.com/vllm-project/vllm/pull/23327
  • [UX] Move Dockerfile DeepGEMM install to tools/install_deepgemm.sh by @mgoin in https://github.com/vllm-project/vllm/pull/23360
  • Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs by @fengli1702 in https://github.com/vllm-project/vllm/pull/22527
  • Add glm4.5v tp2,4 fp8 config on H100_80GB by @chenxi-yang in https://github.com/vllm-project/vllm/pull/23443
  • Revert "[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion (#20000)" by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23396
  • fix(tests): Correct unreachable assertion in truncation test by @AzizCode92 in https://github.com/vllm-project/vllm/pull/23425
  • Support DeepSeek-V3.1 tool call by @Xu-Wenqing in https://github.com/vllm-project/vllm/pull/23454
  • [Misc] Modify CacheConfig import by @jeejeelee in https://github.com/vllm-project/vllm/pull/23459
  • [gpt-oss] Streaming Output for Python Tool by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23409
  • Migrate Pixtral inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23472
  • [Bugfix] Add strong reference to CUDA pluggable allocator callbacks by @22quinn in https://github.com/vllm-project/vllm/pull/23477
  • Migrate Paligemma inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23470
  • [kernel] Support W4A8 on Hopper by @czhu-cohere in https://github.com/vllm-project/vllm/pull/23198
  • [Misc] update dict parse to EPLBConfig from json dumps to dict unpacking by @lengrongfu in https://github.com/vllm-project/vllm/pull/23305
  • (Misc): add missing test for zero truncation size. by @teekenl in https://github.com/vllm-project/vllm/pull/23457
  • [New Model]Donut model by @princepride in https://github.com/vllm-project/vllm/pull/23229
  • [Model] Enable BLOOM on V1 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23488
  • [Misc] Remove unused slot_mapping buffer by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23502
  • fix incompatibililty with non cuda platform for nvfp4 by @luccafong in https://github.com/vllm-project/vllm/pull/23478
  • [Doc: ]fix various typos in multiple files by @didier-durand in https://github.com/vllm-project/vllm/pull/23487
  • [Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 by @minosfuture in https://github.com/vllm-project/vllm/pull/23504
  • Frontend: Adding LM Format Enforcer support to V1 engine by @noamgat in https://github.com/vllm-project/vllm/pull/22564
  • [Bugfix] Fix Qwen2.5-VL quantized model weights loading by @zifeitong in https://github.com/vllm-project/vllm/pull/23512
  • [Misc] Unified linear print info by @jeejeelee in https://github.com/vllm-project/vllm/pull/23516
  • Migrate tarsier inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23500
  • Migrate skyworkr1v inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23499
  • Migrate DonutImagePixelInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23509
  • [Bugfix] Fix Dense module loading for sentence-transformers embedding models (simplified V2) by @FFFfff1FFFfff in https://github.com/vllm-project/vllm/pull/23408
  • [gpt-oss] use reasoning channel for reasoning text in serving_chat by @yuguo68 in https://github.com/vllm-project/vllm/pull/22920
  • [Refactor] Dynamic target and content for prompt updates by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23411
  • [Core][Multimodal] Track encode cache entries by mm_hash and enable embedding sharing between requests by @fake0fan in https://github.com/vllm-project/vllm/pull/22711
  • [Fix] DeepSeek V3.1 tool parser error message by @skyloevil in https://github.com/vllm-project/vllm/pull/23492
  • Feature/benchmark/random mm data/images by @h-brenoskuk in https://github.com/vllm-project/vllm/pull/23119
  • [Bugfix] Allow dynamic number of patches for llava_onevision by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23525
  • [misc] add shanghai meetup by @youkaichao in https://github.com/vllm-project/vllm/pull/23535
  • [Attention] Unify mamba and attention backend selection by @ayushsatyam146 in https://github.com/vllm-project/vllm/pull/23171
  • [Doc] Add caution for API server scale-out by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23550
  • [Refactor] Pass tokenizer explicitly instead of binding to prompt update by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23542
  • Updates to Flex + VLLm integration by @drisspg in https://github.com/vllm-project/vllm/pull/21416
  • [Bugfix] Fix Qwen3 MoE GPTQ inference by @Isotr0py in https://github.com/vllm-project/vllm/pull/23490
  • [Refactor] Refactor persistent buffers with CpuGpuBuffer by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23515
  • [test][RL] Add sleep level 2 test and fix reload with sleep mode by @22quinn in https://github.com/vllm-project/vllm/pull/23521
  • [Kernel] Add fused grouped_topk kernel for MoE by @xyang16 in https://github.com/vllm-project/vllm/pull/23274
  • [Bugfix][V1][P/D]Fix the issue where repeated requests for the same input produce abnormal outputs for P2pNcclConnector by @Abatom in https://github.com/vllm-project/vllm/pull/23403
  • [XPU] Delay BF16 check to worker init for spawn compatibility by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/22979
  • [TPU][Bugfix] Fixes prompt_token_ids error in tpu tests. by @patemotter in https://github.com/vllm-project/vllm/pull/23574
  • [Docs] Update Documentation of Cohere Command-A Models by @Terrencezzj in https://github.com/vllm-project/vllm/pull/23584
  • [Misc] Simplify FlashInfer attention metadata by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23585
  • [Misc] Add release note draft to PR template by @simon-mo in https://github.com/vllm-project/vllm/pull/23598
  • [CI Fix] Pin deepep and pplx tags in tools/ep_kernels/, gate multigpu tests by @mgoin in https://github.com/vllm-project/vllm/pull/23568
  • Update Flashinfer to 0.2.14.post1 by @weireweire in https://github.com/vllm-project/vllm/pull/23537
  • [Bug] Fix DeepGEMM Env Control by @yewentao256 in https://github.com/vllm-project/vllm/pull/23591
  • [CI/Build] Use vLLM client's user agent to fetch images by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23561
  • Remove graph_pool as member of VllmBackend and argument to CUDAGraphWrapper by @Copilot in https://github.com/vllm-project/vllm/pull/23385
  • [Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT by @liuzijing2014 in https://github.com/vllm-project/vllm/pull/22760
  • [CI/Build] Fix typo in #23561 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23616
  • [fix] fix seed-oss-parser by @FoolPlayer in https://github.com/vllm-project/vllm/pull/23560
  • [mypy] Fix incorrect type hint for EAGLE3 support by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23617
  • [Benchmarks] add benchmark for embedding models by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23000
  • [Docs] Fix titles for multi-file examples that are rendered in the docs by @hmellor in https://github.com/vllm-project/vllm/pull/23573
  • Fix CLI parameter documentation inconsistency in pooling_models.md by @oneraghavan in https://github.com/vllm-project/vllm/pull/23630
  • [Bugfix] Fix Qwen25VL packed_modules_mapping by @jeejeelee in https://github.com/vllm-project/vllm/pull/23604
  • [Bugfix] Fix scheduling when repeated images in one request by @ywang96 in https://github.com/vllm-project/vllm/pull/23544
  • [V1] Enable V1 for compute capability < 8.0 + FP32 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23614
  • Fix nits from #20059 by @hmellor in https://github.com/vllm-project/vllm/pull/23548
  • Fix writing benchmark results with tuple keys by @huydhn in https://github.com/vllm-project/vllm/pull/23633
  • [Perf] Remove duplicated NVFP4 blockscales to save memory by @mgoin in https://github.com/vllm-project/vllm/pull/23379
  • [Model] fix DeepSeek e_score_correction_bias dtype to fp32 by @jeejeelee in https://github.com/vllm-project/vllm/pull/23640
  • [Bugfix] Add missing enable_log_outputs parameter to init_app_state function by @lordmathis in https://github.com/vllm-project/vllm/pull/23634
  • feat: add usage to TranscriptionResponse (text and json response_format) by @gcalmettes in https://github.com/vllm-project/vllm/pull/23576
  • Support FlashAttention Backend for Hybrid SSM Models by @heheda12345 in https://github.com/vllm-project/vllm/pull/23299
  • [Docs] Fix broken links to docs/api/summary.md by @hmellor in https://github.com/vllm-project/vllm/pull/23637
  • [Hardware][Mac] Fix the installation fail for Apple Silicon (CPU) by @OYE93 in https://github.com/vllm-project/vllm/pull/23565
  • [Kernel] Added flashinfer fp8 per-tensor gemms by @nvjullin in https://github.com/vllm-project/vllm/pull/22895
  • [Doc]: fix various spelling issues in multiple files by @didier-durand in https://github.com/vllm-project/vllm/pull/23636
  • [CPU] add cpu fused moe pytorch native implementation by @TianyuLi0 in https://github.com/vllm-project/vllm/pull/23146
  • [ROCm] Starting to add AMD code reviewers for ROCm components by @hongxiayang in https://github.com/vllm-project/vllm/pull/23496
  • [Docs] Reduce requirements for docs build by @hmellor in https://github.com/vllm-project/vllm/pull/23651
  • [Bugfix] fix bf16 multimodal model hash by @yuekaizhang in https://github.com/vllm-project/vllm/pull/23623
  • [model] support qwen2audio embedding input by @yuekaizhang in https://github.com/vllm-project/vllm/pull/23625
  • [Misc] Add override for allreduce fusion thresholds by @nvjullin in https://github.com/vllm-project/vllm/pull/23639
  • [CI] [Doc]: Add GH Action for auto labeling issues with rocm tag by @vllmellm in https://github.com/vllm-project/vllm/pull/20988
  • [Bugfix] Fix cuda event usage with CPU model runner by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/23643
  • [Docs] Fix warnings in mkdocs build by @Zerohertz in https://github.com/vllm-project/vllm/pull/23649
  • [Docs] [V1] [Hybrid] Update docs to remove FlashInfer constraint for hybrid models by @tdoublep in https://github.com/vllm-project/vllm/pull/23665
  • [v1] Add cross-attention KV cache support for encoder-decoder models by @russellb in https://github.com/vllm-project/vllm/pull/23664
  • [Bugfix] Fix incorrect original shape in hashing by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23672
  • [Misc] Fix comments in tests/kernels/quantization by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23675
  • [Model] Enable video support for InternVL3.5 models by @Isotr0py in https://github.com/vllm-project/vllm/pull/23658
  • [doc] Hybrid KV Cache Manager design doc by @heheda12345 in https://github.com/vllm-project/vllm/pull/22688
  • Enhance the pre-notification policy by @sidhpurwala-huzaifa in https://github.com/vllm-project/vllm/pull/23532
  • [Docs] Move quant supported hardware table to README by @hmellor in https://github.com/vllm-project/vllm/pull/23663
  • [V1][P/D]P2pNcclConnector supports flashinfer by @Abatom in https://github.com/vllm-project/vllm/pull/23536
  • [V1] [Hybrid] Enable Full CUDA graph by default for hybrid models in V1 by @tdoublep in https://github.com/vllm-project/vllm/pull/22594
  • [Compile] Fix Cmake Warning by @yewentao256 in https://github.com/vllm-project/vllm/pull/23689
  • [Bugfix] UnboundLocalError when GptOss reasoning specified by @coval3nte in https://github.com/vllm-project/vllm/pull/23054
  • feat: add triton fused moe config for GLM-4.5-Air-FP8 on B200 by @zixuanzhang226 in https://github.com/vllm-project/vllm/pull/23695
  • [Feature][Responses API] Support MCP tool in background mode by @wuhang2014 in https://github.com/vllm-project/vllm/pull/23494
  • fix pynccl reduce_scatter by @youzhedian in https://github.com/vllm-project/vllm/pull/23648
  • [quantization] use channel scales for w4a8 + misc fixes by @czhu-cohere in https://github.com/vllm-project/vllm/pull/23570
  • [gpt-oss] Enable unit test for response API harmony integration by @heheda12345 in https://github.com/vllm-project/vllm/pull/23533
  • [Bugfix] Lazy import gpt_oss_triton_kernels_moe for mxfp4 by @mgoin in https://github.com/vllm-project/vllm/pull/23678
  • [Docs] Fix math rendering in docs by @hmellor in https://github.com/vllm-project/vllm/pull/23676
  • [Bugfix][gpt-oss] passing the cache config in gpt-oss by @frank-wei in https://github.com/vllm-project/vllm/pull/23613
  • [Bugfix]: Qwen3 Coder Tool Parser by @ranpox in https://github.com/vllm-project/vllm/pull/23099
  • [Core] Asynchronous h2d in merge_multimodal_embeddings via pinned memory. by @huachenheli in https://github.com/vllm-project/vllm/pull/23686
  • [Model] Add Ernie4.5 VL Model Support by @CSWYF3634076 in https://github.com/vllm-project/vllm/pull/22514
  • [Frontend] Add --log-error-stack to print stack trace for error response by @heheda12345 in https://github.com/vllm-project/vllm/pull/22960
  • [Frontend] Optimize beam search performance by limiting concurrency by @heheda12345 in https://github.com/vllm-project/vllm/pull/23599
  • [Quantization] Expand compressed-tensors MoE matching logic to support NFP4 + FP8 MoEs by @dsikka in https://github.com/vllm-project/vllm/pull/22674
  • [XPU] Add xpu torch.compile support by @jikunshang in https://github.com/vllm-project/vllm/pull/22609
  • [CI/Build] Remove redundant LoRA model tests by @jeejeelee in https://github.com/vllm-project/vllm/pull/23706
  • [Bugfix] fix when config.yaml config value is list parse error by @lengrongfu in https://github.com/vllm-project/vllm/pull/23528
  • [Core] Use key-only cache for BaseMultiModalProcessor by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23018
  • [XPU]fix cuda event used in XPU model runner by @jikunshang in https://github.com/vllm-project/vllm/pull/23708
  • [CI/Build] Remove redundant register in model init tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23715
  • [Docs] Fix an admonition important by @windsonsea in https://github.com/vllm-project/vllm/pull/23726
  • Optimize input preparation for FlashInfer [2/N] by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23174
  • [Misc] Move CpuGpuBuffer to vllm/v1/utils.py by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23728
  • [FlashInfer] Cache hyper params in metadata builder by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23732
  • [CI/Build] Reduce LoRA layer test cases by @jeejeelee in https://github.com/vllm-project/vllm/pull/23721
  • [XPU] Fix OOM issue for data parallel with Ray backend by @faaany in https://github.com/vllm-project/vllm/pull/22500
  • [Docs] Fix a 1-2-3 list and style issues in tpu.md by @windsonsea in https://github.com/vllm-project/vllm/pull/23729
  • [model] Support MiniCPM-V 4.5 by @tc-mb in https://github.com/vllm-project/vllm/pull/23586
  • [Bugfix] Fix task field initialization when PYTHONOPTIMIZE is enabled by @cndoit18 in https://github.com/vllm-project/vllm/pull/23718
  • [Misc] Remove unnecessary _send_reconfig_message() in core_client.py by @njhill in https://github.com/vllm-project/vllm/pull/23127
  • [V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models by @tdoublep in https://github.com/vllm-project/vllm/pull/23716
  • [Model] Explicit default_pooling_type interface by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23736
  • Add vLLM Korea Meetup in the README.md and meetups.md by @rebel-hongseok in https://github.com/vllm-project/vllm/pull/23746
  • Fix pre-commit on main by @hmellor in https://github.com/vllm-project/vllm/pull/23747
  • [Model] Interface to enable batch-level DP support by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23733
  • Only run get_attr_docs if generating help text by @hmellor in https://github.com/vllm-project/vllm/pull/23723
  • [Feature] Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt by @yewentao256 in https://github.com/vllm-project/vllm/pull/23666
  • [Model] Enable native HF format InternVL support by @Isotr0py in https://github.com/vllm-project/vllm/pull/23742
  • [Doc]: upgrade version of crate-ci tool for improved typo detection by @didier-durand in https://github.com/vllm-project/vllm/pull/23755
  • [LogitsProcs] Deduplicate built-in LP implementation logic by @njhill in https://github.com/vllm-project/vllm/pull/23362
  • [Docs] Remove in-tree Gaudi install instructions by @hmellor in https://github.com/vllm-project/vllm/pull/23628
  • [BugFix] Fix topk_softmax assert by @ProExpertProg in https://github.com/vllm-project/vllm/pull/19764
  • [Model] Merge SupportsMultiModalWithRawInput with SupportsMultiModal by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23749
  • [V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models by @tdoublep in https://github.com/vllm-project/vllm/pull/22589
  • [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in https://github.com/vllm-project/vllm/pull/23743
  • ci: Add arm64 docker build to release pipeline by @seemethere in https://github.com/vllm-project/vllm/pull/23210
  • Disable torch.compile for dynamic rope models in Transformers backend by @hmellor in https://github.com/vllm-project/vllm/pull/23738
  • [Multimodal] Generate mm_hash based on request metadata when caching is turned off by @ywang96 in https://github.com/vllm-project/vllm/pull/23690
  • [V1][Mamba] - Enable V1 by default for Mamba Models by @Josephasafg in https://github.com/vllm-project/vllm/pull/23650
  • DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100 by @zyongye in https://github.com/vllm-project/vllm/pull/23608
  • [Bugfix] Fix Marlin NVFP4 for modelopt by @mgoin in https://github.com/vllm-project/vllm/pull/23659
  • [Feature] Add VLLM_DISABLE_PAD_FOR_CUDAGRAPH to Avoid Hang Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/23595
  • [Bugfix] Fix for V1 priority scheduling crashes at preemption by @Hanchenli in https://github.com/vllm-project/vllm/pull/23713
  • Migrate Qwen inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23473
  • [Feature] models: pass layer prefix to replace_linear_class for per-layer quantization routing. Addresses #23239 by @Shrey1306 in https://github.com/vllm-project/vllm/pull/23556
  • [Perf] Tune configs for triton block fp8 gemm H100/H200 by @mgoin in https://github.com/vllm-project/vllm/pull/23748
  • Gracefully handle edge cases in harmony utils by @Ithanil in https://github.com/vllm-project/vllm/pull/23155
  • [CI] make all multi-gpu weight loading tests run nightly by @killershrimp in https://github.com/vllm-project/vllm/pull/23792
  • Add deprecation warning for lora_extra_vocab_size by @ahengljh in https://github.com/vllm-project/vllm/pull/23635
  • [Transform] [Quantization] Add transforms to compressed tensors by @kylesayrs in https://github.com/vllm-project/vllm/pull/22486
  • [CI] enable idefics3 and fuyu-8b test in multimodal test by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23790
  • [Bugfix] when set offline model running error by @lengrongfu in https://github.com/vllm-project/vllm/pull/23711
  • [Kernel] cuda kernels for upcoming decode context parallel feature by @youzhedian in https://github.com/vllm-project/vllm/pull/23791
  • [New Model]: Support GteNewModelForSequenceClassification by @noooop in https://github.com/vllm-project/vllm/pull/23524
  • [Model] Add PP support and VLM backbone compatability for GPT-OSS by @Isotr0py in https://github.com/vllm-project/vllm/pull/23680
  • [FIXBUG] Add return_success parameter to moe_wna16_weight_loader function by @JartX in https://github.com/vllm-project/vllm/pull/22797
  • [Doc]: fix typos in .md files (including those of #23751) by @didier-durand in https://github.com/vllm-project/vllm/pull/23825
  • [CI/Build][Bugfix] Fix Qwen VL tests on CPU by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/23818
  • [BugFix][Spec Decode] Use float64 for uniform_probs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23803
  • [Model] [gpt-oss] fix gpt-oss pp support by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23815
  • [Doc]: fix typos in Python scripts by @didier-durand in https://github.com/vllm-project/vllm/pull/23828
  • [Bugfix] Fix benchmark_moe.py for blockwise fp8. by @crischeng in https://github.com/vllm-project/vllm/pull/23823
  • [CI] Fix linting error on main by @tdoublep in https://github.com/vllm-project/vllm/pull/23835
  • [Model][gpt-oss] Support DP+EP for GPT-OSS with FlashInfer trtllm-gen MoE by @nvpohanh in https://github.com/vllm-project/vllm/pull/23819
  • [Bugfix] Add fake mode around passes by @angelayi in https://github.com/vllm-project/vllm/pull/23349
  • [ci] breaks down V1 Test into 3 groups of approx 30 minutes runtime by @jeanschmidt in https://github.com/vllm-project/vllm/pull/23757
  • Add scale_config.yml file for Meta autoscalers for GH Actions by @jeanschmidt in https://github.com/vllm-project/vllm/pull/23840
  • Migrate Llama4ImagePatchInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/22021
  • [ROCm][Aiter] Add triton fp8 bmm kernel for mla by @divakar-amd in https://github.com/vllm-project/vllm/pull/23264
  • [bugfix] [spec-decoding] fix data race in sample_recovered_tokens_kernel (vLLM v1) by @He-Jingkai in https://github.com/vllm-project/vllm/pull/23829
  • [NVIDIA] Support SiluMul + NVFP4 quant fusion by @elvischenv in https://github.com/vllm-project/vllm/pull/23671
  • chore: build release image by default by @simon-mo in https://github.com/vllm-project/vllm/pull/23852
  • [BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr_cpu by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23737
  • [V1] Enable prefill optimization for Gemma3n by @sarckk in https://github.com/vllm-project/vllm/pull/22628
  • [Log] Use Debug Once for DeepGEMM E8M0 When not Enabled by @yewentao256 in https://github.com/vllm-project/vllm/pull/23858
  • [V0 Deprecation] Remove V0 Samplers test by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23862
  • [XPU] support data parallel for MoE models on XPU by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/22887
  • [Models] Improve iteration over layers by @lgeiger in https://github.com/vllm-project/vllm/pull/19497
  • [ROCm][Fix] Fix rocm build caused by #23791 by @charlifu in https://github.com/vllm-project/vllm/pull/23847
  • [tests] Improve speed and reliability of test_transcription_api_correctness by @russellb in https://github.com/vllm-project/vllm/pull/23854
  • [Bugfix] Use ReplicatedLinear for SequenceClassification head by @Isotr0py in https://github.com/vllm-project/vllm/pull/23836
  • [BugFix][AMD][Deepseek] fix a dtype mismatch error for deepseek running on AMD by @KingsleyZhang123 in https://github.com/vllm-project/vllm/pull/23864
  • [Platform] import activation_quant_fusion for CUDA only by @wangxiyuan in https://github.com/vllm-project/vllm/pull/23882
  • Fix(async): Add support for truncate_prompt_tokens in AsyncLLM by @oneraghavan in https://github.com/vllm-project/vllm/pull/23800
  • [CI/Build] Clean up LoRA test by @jeejeelee in https://github.com/vllm-project/vllm/pull/23890
  • [mrope][Qwen2-VL] Fix edge case where getting index of image/video token can potentially throw in default vl mrope implementation. by @huachenheli in https://github.com/vllm-project/vllm/pull/23895
  • [Misc] Fix warnings for mistral model by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23552
  • Better errors for Transformers backend missing features by @hmellor in https://github.com/vllm-project/vllm/pull/23759
  • [V0 Deprecation] Remove pooling model support in V0 by @maxdebayser in https://github.com/vllm-project/vllm/pull/23434
  • [CPU] Enable data parallel for CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/23903
  • [Performance] V1 Classify Models E2E Performance Optimization by @noooop in https://github.com/vllm-project/vllm/pull/23541
  • [Multimodal] Consolidate mm inputs into MultiModalFeatureSpec by @sfeng33 in https://github.com/vllm-project/vllm/pull/23779
  • Update PyTorch to 2.8.0 by @huydhn in https://github.com/vllm-project/vllm/pull/20358
  • Adds json_count_leaves utility function by @aditchawdhary in https://github.com/vllm-project/vllm/pull/23899
  • [MODEL] Apertus and XIELU by @EduardDurech in https://github.com/vllm-project/vllm/pull/23068
  • [Models] Use in-place adds in Idefics2Vision by @lgeiger in https://github.com/vllm-project/vllm/pull/23932
  • [BugFix] Async scheduling and PP compatibility with DP by @njhill in https://github.com/vllm-project/vllm/pull/23770
  • [CI] Add aiter to matching list of issue auto labeller for rocm tag by @vllmellm in https://github.com/vllm-project/vllm/pull/23942
  • [BUGFIX ] fix undefined silu_and_mul_nvfp4_quant by @youzhedian in https://github.com/vllm-project/vllm/pull/23929
  • [RL][BugFix] Fix missing tokenizer error for token-in-token-out by @22quinn in https://github.com/vllm-project/vllm/pull/23904
  • Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj by @mgoin in https://github.com/vllm-project/vllm/pull/23939
  • [Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-based models by @tdoublep in https://github.com/vllm-project/vllm/pull/23824
  • Revert gemma3n fast prefill changes by @sarckk in https://github.com/vllm-project/vllm/pull/23897
  • [Misc] Make download_weights_from_hf more reliable by @hmellor in https://github.com/vllm-project/vllm/pull/23863
  • [CI] Fix unavailable image remote URL by @ywang96 in https://github.com/vllm-project/vllm/pull/23966
  • [Bugfix] Fix --config arg expansion called from api_server.py by @dubejf in https://github.com/vllm-project/vllm/pull/23944
  • Add routed_scaling_factor to MoE grouped topk by @xyang16 in https://github.com/vllm-project/vllm/pull/23123
  • [CI] Move testing image from remote URL to S3 by @ywang96 in https://github.com/vllm-project/vllm/pull/23980
  • [CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant fusion by @sarckk in https://github.com/vllm-project/vllm/pull/23973
  • [Core] Cleanup TPU model runner for MM by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23894
  • [V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba by @tdoublep in https://github.com/vllm-project/vllm/pull/23831
  • [Bugfix] Fix test_lora_resolvers.py by @jeejeelee in https://github.com/vllm-project/vllm/pull/23984
  • [UT] fix unify_kv_cache_configs when kv cache config needs sort by @andyxning in https://github.com/vllm-project/vllm/pull/23843
  • [Model] Enable encoder DP for MiniCPM-V by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23948
  • Add LoRA support for DeepSeek models (V2, V3, R1-0528) by @sadeghja1070 in https://github.com/vllm-project/vllm/pull/23971
  • [Misc] add reorder_batch AttentionMetadataBuilder by @andyxning in https://github.com/vllm-project/vllm/pull/23798
  • [Refactor] Refactor freezing_value/cuda_event initialization outside try/finally by @andyxning in https://github.com/vllm-project/vllm/pull/23758
  • [Misc] enhance type hint for rearrange return value by @andyxning in https://github.com/vllm-project/vllm/pull/23519
  • [LoRA] Much faster startup when LoRA is enabled by @andylolu2 in https://github.com/vllm-project/vllm/pull/23777
  • Fix wrong truncate_prompt_tokens type hint by @gmarinho2 in https://github.com/vllm-project/vllm/pull/22761
  • [Core][Multimodal] Allow passing multi_modal_uuids as multimodal identifiers. by @ywang96 in https://github.com/vllm-project/vllm/pull/23394
  • [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24001
  • vllm fix check on max vocab size by @xw285cornell in https://github.com/vllm-project/vllm/pull/22471
  • [Minor] Fix some random typos in comments by @njhill in https://github.com/vllm-project/vllm/pull/24009
  • v1: Support KV events from connectors by @orozery in https://github.com/vllm-project/vllm/pull/19737
  • [BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ) by @JartX in https://github.com/vllm-project/vllm/pull/23994
  • [Misc] Avoid redundant copy for encoder-only models by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24012
  • Fix the bug related to loading GPTQ INT3 weights. by @Jun-Howie in https://github.com/vllm-project/vllm/pull/23328
  • [Misc] Move fast prefill logic to separate method by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24013
  • [CI/Build] Improve Tensor Schema tests speed by avoid engine core initialization by @Isotr0py in https://github.com/vllm-project/vllm/pull/23357
  • [Misc] Refactor code to use import-as for torch._inductor.config by @andyxning in https://github.com/vllm-project/vllm/pull/23677
  • Migrate Phi4 inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23471
  • [Misc] IO Processor plugins for pooling models by @christian-pinto in https://github.com/vllm-project/vllm/pull/22820
  • [Bugfix] Add support for <tool_call> format in streaming mode for XLAM Tool Parser by @DevonPeroutky in https://github.com/vllm-project/vllm/pull/22769
  • [Misc] add hash_function doc string by @andyxning in https://github.com/vllm-project/vllm/pull/24014
  • [Misc] Enable V1 FP16 inference on pre-Ampere GPUs by @Isotr0py in https://github.com/vllm-project/vllm/pull/24022
  • [Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN by @noooop in https://github.com/vllm-project/vllm/pull/20904
  • [Kernel] Update DeepGEMM to latest commit by @jeejeelee in https://github.com/vllm-project/vllm/pull/23915
  • [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24026
  • [Frontend] Gemma3n audio transcriptions/translations endpoint by @NickLucche in https://github.com/vllm-project/vllm/pull/23735
  • [Doc]: Fix CPU install docs: force torch-backend=cpu to avoid GPU torchvision errors by @yankay in https://github.com/vllm-project/vllm/pull/24033
  • [Model]: support KeyeVL-1_5-8B by @Kwai-Keye in https://github.com/vllm-project/vllm/pull/23838
  • Document multi-proc method selection for profiling by @hypdeb in https://github.com/vllm-project/vllm/pull/23802
  • [Misc] Minor code simplification for spec decode by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24053
  • [docs][misc] IOProcessor plugins fixes by @christian-pinto in https://github.com/vllm-project/vllm/pull/24046
  • [Model] Support DP for ViT on Kimi-VL-A3B-Thinking-2506 by @david6666666 in https://github.com/vllm-project/vllm/pull/23817
  • [Chore][V0 Deprecation] Move LogProb to a separate file by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24055
  • [Bugfix] Fix MTP hidden states by @luccafong in https://github.com/vllm-project/vllm/pull/24056
  • [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24042
  • [V1][Mamba1] - FP32 SSM Kernel Support by @Josephasafg in https://github.com/vllm-project/vllm/pull/23506
  • [Bugfix] Fix the issue that Blip2ForConditionalGeneration' object has… by @DamonJiang777 in https://github.com/vllm-project/vllm/pull/24028
  • Remove runtime checks based on pooling params by @maxdebayser in https://github.com/vllm-project/vllm/pull/24051
  • Migrate OvisImagePatchInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/22024
  • [XPU][Feature] fp8 online quantization support for XPU by @yma11 in https://github.com/vllm-project/vllm/pull/23148
  • Migrate Interns1 inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23510
  • [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24077
  • [Model] Support dp on ViT on GLM-4.5V by @david6666666 in https://github.com/vllm-project/vllm/pull/23168
  • [CI]: reduce HTTP calls inside entrypoints openai tests by @AzizCode92 in https://github.com/vllm-project/vllm/pull/23646
  • correct LWS deployment yaml by @cberge908 in https://github.com/vllm-project/vllm/pull/23104
  • [Gemma3n] Fix audio batching by @NickLucche in https://github.com/vllm-project/vllm/pull/24052
  • [BugFix] Fix EXAONE4 rotary embeddings by @lkm2835 in https://github.com/vllm-project/vllm/pull/23918
  • [Model] Classification models support logit_bias / sigmoid_normalize by @noooop in https://github.com/vllm-project/vllm/pull/24031
  • [CI Failure] Skip failing nvfp4 silu test by @mgoin in https://github.com/vllm-project/vllm/pull/23959
  • [docs] add SYS_NICE cap & security-opt for docker/k8s by @panpan0000 in https://github.com/vllm-project/vllm/pull/24017
  • [Benchmark] Add support for local hf dataset path in benchmark by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23999
  • [Bugfix] Fix transform_config parsing in Compressed Tensors by @kylesayrs in https://github.com/vllm-project/vllm/pull/23945
  • Run ruff format on a few files. by @huachenheli in https://github.com/vllm-project/vllm/pull/24075
  • [Bugfix] Fix packed_factor missing attribute error by @kyuyeunk in https://github.com/vllm-project/vllm/pull/23902
  • [Metrics] Deprecate TPOT in favor of ITL by @markmc in https://github.com/vllm-project/vllm/pull/24110
  • Fix weights loading for Apertus by @nathanrchn in https://github.com/vllm-project/vllm/pull/24100
  • [Log] Only Print Profiler Results on Rank 0 by @yewentao256 in https://github.com/vllm-project/vllm/pull/23370
  • [CI] Enable all hf transformers baselines in test_hybrid by @tdoublep in https://github.com/vllm-project/vllm/pull/23936
  • [AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault by @rasmith in https://github.com/vllm-project/vllm/pull/23692
  • [Bug] R1 Accuracy: Fix routed_scaling_factor Double Mul Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/24119
  • [CI/Build] Disable SiluMul NVFP4 quant fusion tests by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24121
  • [XPU] Fix the bug of LoRA logits on the XPU platform by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/24081
  • Update release pipeline post PyTorch 2.8.0 update by @youkaichao in https://github.com/vllm-project/vllm/pull/24073
  • Upgrade xgrammar to 0.1.23 by @russellb in https://github.com/vllm-project/vllm/pull/22988
  • [V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing by @afeldman-nm in https://github.com/vllm-project/vllm/pull/23656
  • fix some typos by @co63oc in https://github.com/vllm-project/vllm/pull/24071
  • [Compile] Fix Compile Warning for w4a8_mm_entry.cu by @yewentao256 in https://github.com/vllm-project/vllm/pull/23660
  • [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24093
  • [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24115
  • [Misc] Add check for dual_chunk_attention by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24070
  • [BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models by @sarckk in https://github.com/vllm-project/vllm/pull/24132
  • [distributed][rl] remove nccl cumem env var override by @youkaichao in https://github.com/vllm-project/vllm/pull/24141
  • [Nixl] Heterogeneous TP support FlashInfer by @NickLucche in https://github.com/vllm-project/vllm/pull/20189
  • [CI/Build] Serve images used by multimodal tests through local HTTP Server by @divyanshsinghvi in https://github.com/vllm-project/vllm/pull/23907
  • [Misc] Clean up deadcode for legacy processing pipeline by @Isotr0py in https://github.com/vllm-project/vllm/pull/24153
  • [CI] Accelerate mteb test by setting SentenceTransformers mteb score to a constant by @noooop in https://github.com/vllm-project/vllm/pull/24088
  • Support add_generation_prompt in embeddings endpoint with chat request by @biba10 in https://github.com/vllm-project/vllm/pull/23931
  • Fix MiniMax attention module prefix and remove useless code by @qscqesze in https://github.com/vllm-project/vllm/pull/23982
  • FIX: Add libnuma-dev to Dockerfile for dev stage by @dongbo910220 in https://github.com/vllm-project/vllm/pull/20388
  • [Bugfix] Fixing division by zero in triton_attn if query_heads/kv_heads > 16 by @bringlein in https://github.com/vllm-project/vllm/pull/23424
  • [V1] v1 engine + full CUDA graph support for PLaMo2 by @nopperl in https://github.com/vllm-project/vllm/pull/23998
  • [Kernels] Overlap shared experts with send/recv by @bnellnm in https://github.com/vllm-project/vllm/pull/23273
  • Migrate whisper inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23505
  • [Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/23289
  • [Feature][P/D]: Optimize NIXL Connector xfer Launch by @david6666666 in https://github.com/vllm-project/vllm/pull/23887
  • [Bugfix][DP] DP distribution does not require ray[default] by @kebe7jun in https://github.com/vllm-project/vllm/pull/23822
  • [Feature][gpt-oss] Add support for num_cached_tokens and num_reasoning_tokens tracking by @NagyGeorge in https://github.com/vllm-project/vllm/pull/23460
  • Remove deprecated PyNcclConnector by @panpan0000 in https://github.com/vllm-project/vllm/pull/24151
  • [Feature][Responses API]Support MCP tools with streaming mode + background mode by @wuhang2014 in https://github.com/vllm-project/vllm/pull/23927
  • [Kernel][Bugfix] Fix grouped topk cu by @mayuyuace in https://github.com/vllm-project/vllm/pull/24146
  • [Refactor] Introduce basic Renderer for completion-style request by @sfeng33 in https://github.com/vllm-project/vllm/pull/24010
  • Migrate ultravox inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23503
  • [CPU] Refactor CPU unquantized linear by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/24150
  • [Misc] Enhance output readability of helper script by @wdhongtw in https://github.com/vllm-project/vllm/pull/24214
  • [Model] Add MiDashengLM model support by @bingchen-mi in https://github.com/vllm-project/vllm/pull/23652
  • [Core][Model] Terratorch backend integration by @mgazz in https://github.com/vllm-project/vllm/pull/23513
  • Improve flexibility of auto_tune.sh execution. by @anthonsu in https://github.com/vllm-project/vllm/pull/23766
  • [Attention][Platform] Refactor MLA to support Custom Op by @whx-sjtu in https://github.com/vllm-project/vllm/pull/23332
  • [Bugfix] Fix Incremental Detokenization with tokenizers == 0.22.0 by @faaany in https://github.com/vllm-project/vllm/pull/24159
  • [Attention] FlashAttn MLA by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/14258
  • [Hardware][Apple-CPU] Disable OneDNN build for Apple Silicon by @ignaciosica in https://github.com/vllm-project/vllm/pull/24200
  • [Feature][Response API] Add streaming support for non-harmony by @kebe7jun in https://github.com/vllm-project/vllm/pull/23741
  • [Doc] Update vLLM Singapore Meetup info by @tjtanaa in https://github.com/vllm-project/vllm/pull/24234
  • [Model] Add pp support for hunyuan by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24212
  • Use hidden_size_per_head as head_size fallback by @nopperl in https://github.com/vllm-project/vllm/pull/24221
  • [XPU] support Triton Attention backend on Intel GPU by @jikunshang in https://github.com/vllm-project/vllm/pull/24149
  • [LoRA]: Add lora support to qwen-2.5-omni by @pratapyash in https://github.com/vllm-project/vllm/pull/24231
  • [Misc] Removed force_fp8_e4m3fnuz from FP8LinearOp by @nvjullin in https://github.com/vllm-project/vllm/pull/23725
  • [Perf] Freeze core engine proc heap after init by @njhill in https://github.com/vllm-project/vllm/pull/24008
  • [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24173
  • [Misc] Slight improve deepgemm print by @jeejeelee in https://github.com/vllm-project/vllm/pull/24085
  • Upgrade FlashInfer to v0.3.0 by @nvpohanh in https://github.com/vllm-project/vllm/pull/24086
  • QWEN3 Coder Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24266
  • [Misc] Have AsyncLLM custom_stat_loggers extend default logger list by @eicherseiji in https://github.com/vllm-project/vllm/pull/20952
  • [Bugfix][Misc] Fix silu_and_mul_nvfp4_quant issue and extract common utils for nvfp4 kernel source files by @elvischenv in https://github.com/vllm-project/vllm/pull/23727
  • [CI/Build] Reduce the number of redundant cases to test for LoRA by @zhuohan123 in https://github.com/vllm-project/vllm/pull/24276
  • [Frontend] Skip unnecessary detokenization when token_id is requested by @NickLucche in https://github.com/vllm-project/vllm/pull/24236
  • [gpt-oss] tool parser supports for /chat/completions [1/n] by @aarnphm in https://github.com/vllm-project/vllm/pull/22386
  • [XPU][P/D] Add XPU support in NixlConnector by @zhenwei-intel in https://github.com/vllm-project/vllm/pull/22436
  • Adding int4 and int8 models for CPU benchmarking by @louie-tsai in https://github.com/vllm-project/vllm/pull/23709
  • [docs] add shenzhen meetup by @youkaichao in https://github.com/vllm-project/vllm/pull/24326
  • [gpt-oss][Bugfix]Fix streamableparser for missing handling of certain token_ids by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24306
  • [Bugfix] Fix silu_mul+quant fusion test by @elvischenv in https://github.com/vllm-project/vllm/pull/24341
  • [RFC] allow cancelation after shutdown in blocking collective_rpc by @842974287 in https://github.com/vllm-project/vllm/pull/23390
  • [CI] Add timeouts to tests by @rafvasq in https://github.com/vllm-project/vllm/pull/24260
  • [Perf][V1] Fully overlap model execution by @benchislett in https://github.com/vllm-project/vllm/pull/23569
  • Add @22quinn as code reviewer for RL related components by @22quinn in https://github.com/vllm-project/vllm/pull/24346
  • [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24294
  • [KV Sharing] Raise error if using eagle with fast prefill by @sarckk in https://github.com/vllm-project/vllm/pull/24350
  • [Feature] Support Decode Context Parallel (DCP) for MLA by @youzhedian in https://github.com/vllm-project/vllm/pull/23734
  • [Bugfix] Catch and log invalid token ids in detokenizer by @njhill in https://github.com/vllm-project/vllm/pull/24351
  • [Core] Allow disabling TP sharding for parallel Linear layer by @Isotr0py in https://github.com/vllm-project/vllm/pull/23024
  • [New Model]: google/embeddinggemma-300m by @noooop in https://github.com/vllm-project/vllm/pull/24318
  • refactor: Turn GPUModelRunner.inputs_embeds to a CpuGpuBuffer by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/24345
  • [Multimodal] Improve max video embedding length estimation in V1 by @ywang96 in https://github.com/vllm-project/vllm/pull/24312
  • [CI] Disable flaky structured output test from CI by @ywang96 in https://github.com/vllm-project/vllm/pull/24366
  • Add @benchislett to codeowner for spec decode and structured outputs by @benchislett in https://github.com/vllm-project/vllm/pull/24362
  • [Bugfix] Avoid uninitialized usage of azp_val when AZP is false. by @mohankku in https://github.com/vllm-project/vllm/pull/24335
  • [Bugfix] Fix broken deepseek fp8 TP weights loading by @Isotr0py in https://github.com/vllm-project/vllm/pull/24367
  • [Bugfix] Fix test_mixtral_moe by @jeejeelee in https://github.com/vllm-project/vllm/pull/24371
  • LoRA bias (enable_lora_bias) deprecation warning by @ashwin-phadke in https://github.com/vllm-project/vllm/pull/24339
  • [Fix] [gpt-oss] fix non-tool calling path for chat completion by @aarnphm in https://github.com/vllm-project/vllm/pull/24324
  • [Frontend][Responses API] Support reporting tool output tokens and fix reasoning token count by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24285
  • [Bugfix] Fix unstable silu_mul+nvfp4 quant fusion test by @elvischenv in https://github.com/vllm-project/vllm/pull/24370
  • break execute_model in gpu_model_runner into sub-functions for custom scopes by @bangshengtang in https://github.com/vllm-project/vllm/pull/24265
  • [V0 deprecation] Deprecate V0 Neuron backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/21159
  • [attention][DCP] use AttentionImpl.need_to_return_lse_for_decode by @youkaichao in https://github.com/vllm-project/vllm/pull/24372
  • Migrate Qwen2 inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23475
  • [CI][Fix] deterministic seed for flaky CI runs on structured outputs by @aarnphm in https://github.com/vllm-project/vllm/pull/24380
  • [Benchmark] add benchmark for custom activation op by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23908
  • QWEN3 Thinking Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24330
  • [Misc] collect flashinfer version in collect_env.py by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24378
  • [Bugfix] Fix Qwen3-coder moe tuned config by @jeejeelee in https://github.com/vllm-project/vllm/pull/24072
  • [TPU] Remove TopKTopPSampler dependency for TPU sampler by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24391
  • Add renderer-based prompt processing for embedding and classification endpoints by @sfeng33 in https://github.com/vllm-project/vllm/pull/24356
  • Skip MM Encoder for non-first PP ranks by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24387
  • Add @luccafong to codeowner for spec decode by @luccafong in https://github.com/vllm-project/vllm/pull/24397
  • [Kernel] Support decode context parallelism on Blackwell with CUTLASS MLA by @minosfuture in https://github.com/vllm-project/vllm/pull/24385
  • [xpu] upgrade ipex/python3.12 for xpu by @yma11 in https://github.com/vllm-project/vllm/pull/23830
  • [Sampler] Support returning all prompt logprobs by @charlotte12l in https://github.com/vllm-project/vllm/pull/23868
  • [CI/Build] Disable flaky test_structured_output tests by @22quinn in https://github.com/vllm-project/vllm/pull/24404
  • [CI/Build] Fix local image inputs in test_pixtral.py by @huachenheli in https://github.com/vllm-project/vllm/pull/24401
  • [Doc] Fix UTF-8 encoding issues in documentation generation on Windows by @alhridoy in https://github.com/vllm-project/vllm/pull/24361
  • [P/D] Add a shutdown method to the Connector API by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/22699
  • [Model] Remove unnecessary CUDA sync of GLM-4.1V image and video preprocess by @what-in-the-nim in https://github.com/vllm-project/vllm/pull/24332
  • [Model] Remove unnecessary CUDA sync of Qwen2VL image and video preprocess by @what-in-the-nim in https://github.com/vllm-project/vllm/pull/24334
  • [gpt-oss][Responses API] Fix the function call id format by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24409
  • [Docs] Fix a tip indentation and typo by @windsonsea in https://github.com/vllm-project/vllm/pull/24419
  • [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24417
  • [Doc] Fix issues in integrations/llamastack.md by @windsonsea in https://github.com/vllm-project/vllm/pull/24428
  • [Bugfix] Fix get_quant_config when using modelscope by @Potabk in https://github.com/vllm-project/vllm/pull/24421
  • [Bugfix] Fix mamba2 prefill chunking by @tomeras91 in https://github.com/vllm-project/vllm/pull/23279
  • [Misc] Terratorch related fixes by @christian-pinto in https://github.com/vllm-project/vllm/pull/24337
  • Move KVEventsConfig from config/__init__.py to config/kv_events.py by @hmellor in https://github.com/vllm-project/vllm/pull/24433
  • [Frontend] User-provided uuids for medias in chat. (RFC #22044) by @huachenheli in https://github.com/vllm-project/vllm/pull/23449
  • [Docs] Move feature compatibility tables to README by @hmellor in https://github.com/vllm-project/vllm/pull/24431
  • [Doc]: fix 2 hyperlinks leading to Ray site after they changed Ray's doc structure by @didier-durand in https://github.com/vllm-project/vllm/pull/24438
  • [Docs] Add eplb_config param usage docs by @lengrongfu in https://github.com/vllm-project/vllm/pull/24213
  • [Model] Enable BNB support for qwen2_5_omni_thinker by @jeejeelee in https://github.com/vllm-project/vllm/pull/24420
  • [Spec Decode][Benchmark] Add Spec Bench Dataset for benchmarking by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/23563
  • [Spec Decode][Benchmark] Add Blitzedit dataset by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/23605
  • [Model] Remove quantized mixtral by @jeejeelee in https://github.com/vllm-project/vllm/pull/24437
  • [CI] Enable encoder model compilation test by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24442
  • [Model loader]: support multi-thread model weight loading by @BraveY in https://github.com/vllm-project/vllm/pull/23928
  • [Spec Decode] Fix offline spec_decode.py by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24257
  • [Attention] FlashAttention MLA cudagraph support by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/23958
  • [Bugfix] Disable the statslogger if the api_server_count is greater than 1 by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/22227
  • [Hardware][IBM Z] Fix Outlines Core issue for s390x by @R3hankhan123 in https://github.com/vllm-project/vllm/pull/24034
  • [CI] Add nightly multiarch manifests to dockerhub by @csahithi in https://github.com/vllm-project/vllm/pull/24102
  • Update reviewers for modelopt related files by @Edwardf0t1 in https://github.com/vllm-project/vllm/pull/24468
  • [Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/24134
  • [gpt-oss] Harmony changes with container tool support by @morgendave in https://github.com/vllm-project/vllm/pull/23386
  • Bump actions/setup-python from 5.4.0 to 6.0.0 by @dependabot[bot] in https://github.com/vllm-project/vllm/pull/24414
  • [doc] update vllm serve cli args documentation by @cjackal in https://github.com/vllm-project/vllm/pull/24329
  • Bump actions/stale from 9.1.0 to 10.0.0 by @dependabot[bot] in https://github.com/vllm-project/vllm/pull/24412
  • Bump actions/github-script from 7.0.1 to 8.0.0 by @dependabot[bot] in https://github.com/vllm-project/vllm/pull/24413
  • Move KVTransferConfig from config/__init__.py to config/kv_transfer.py by @hmellor in https://github.com/vllm-project/vllm/pull/24434
  • [BugFix][Model] Fix Ernie4.5-VL hanging on long inputs by @CSWYF3634076 in https://github.com/vllm-project/vllm/pull/24074
  • [Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel by @elvischenv in https://github.com/vllm-project/vllm/pull/23647
  • [Core] Use sha256 bytes instead of BlockHash to reduce GC overhead by @linzebing in https://github.com/vllm-project/vllm/pull/23673
  • Add data_parallel_size to VllmConfig string representation by @Prowindy in https://github.com/vllm-project/vllm/pull/24298
  • [Bugfix] Fix Apertus HF repo name by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/24447
  • [Misc] Improve Worker process title and logging prefix by @22quinn in https://github.com/vllm-project/vllm/pull/22205
  • [Doc] mention fpdb for multiprocess breakpoints by @mickaelseznec in https://github.com/vllm-project/vllm/pull/24452
  • [Misc] Support bench serve long context by @minosfuture in https://github.com/vllm-project/vllm/pull/24373
  • [Doc]: fixing typos to improve docs by @didier-durand in https://github.com/vllm-project/vllm/pull/24480
  • [Performance][MM] Building the inverse permutation in O(n) time in Qwen2_5_VisionTransformer by @david6666666 in https://github.com/vllm-project/vllm/pull/24443
  • [Misc] Add claude settings to gitignore by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24492
  • [Misc] Add Codex settings to gitignore by @ywang96 in https://github.com/vllm-project/vllm/pull/24493
  • [gpt-oss] Validate gpt-oss python tool during initialization by @heheda12345 in https://github.com/vllm-project/vllm/pull/23856
  • [RL] fast weight update with zmq + ipc handles by @weixiao-huang in https://github.com/vllm-project/vllm/pull/24295
  • [CI/Build][Doc] Fully deprecate old bench scripts for serving / throughput / latency by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24411
  • [Compilation][WideEP] Enable Piecewise CUDAGraph for DeepEPHT by @yewentao256 in https://github.com/vllm-project/vllm/pull/24123
  • [Model] Systematic support for fp32 head, pooling models part by @noooop in https://github.com/vllm-project/vllm/pull/23810
  • [Bugfix] Handle the edge case in detokenizer where processed tokens contain both stop str and eos token by @dtransposed in https://github.com/vllm-project/vllm/pull/23938
  • [Core] Run garbage collector after CUDA graph capture to fix throughput regression by @micah-wil in https://github.com/vllm-project/vllm/pull/24128
  • [Kernels] Add Flash Linear Attention Kernels by @youkaichao in https://github.com/vllm-project/vllm/pull/24518
  • [ROCm][CI/Build] Sync ROCm dockerfiles with the ROCm fork by @gshtras in https://github.com/vllm-project/vllm/pull/24279
  • [Bugfix] Fix hidden_size for multimodal classification model by @jeejeelee in https://github.com/vllm-project/vllm/pull/24501
  • Extend renderer with embedding support and integrate completion endpoint by @sfeng33 in https://github.com/vllm-project/vllm/pull/24405
  • [Misc] bump outlines_core to fix the version conflicts with outlines >= 1.2.0 by @serihiro in https://github.com/vllm-project/vllm/pull/24368
  • [Docs] Gemma3n transcriptions endpoint support by @NickLucche in https://github.com/vllm-project/vllm/pull/24512
  • [TPU] Fix tpu structured decoding in mixed batches by @Chenyaaang in https://github.com/vllm-project/vllm/pull/24458
  • [CI] execute all piecewise compilation tests together by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24502
  • [Feature] Disallow FlashMLA on Blackwell by @yewentao256 in https://github.com/vllm-project/vllm/pull/24521
  • [Log] Use a relative path in debug-level logs to distinguish files with identical names by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23846
  • [Benchmark] Update bench doc with mtbench, blazedit, spec bench by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24450
  • [Benchmark] Add option to skip oversampling in benchmark by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24457
  • [ROCm][Feature] Enable Pipeline Parallelism with Ray Compiled Graph on ROCm by @charlifu in https://github.com/vllm-project/vllm/pull/24275
  • [Bugfix] Improve EPLB config validation error message by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/24524
  • [Bugfix] Fix for 24530. Fix naive all2all shared expert overlap. by @bnellnm in https://github.com/vllm-project/vllm/pull/24538
  • [Perf] Convert np array to torch tensor to index into block table for attn chunking by @sarckk in https://github.com/vllm-project/vllm/pull/24474
  • Add @heheda12345 to CODEOWNERS of KVCacheManager related code by @heheda12345 in https://github.com/vllm-project/vllm/pull/24546
  • [CI] Retry flaky fp8 cutlass mla tests by @njhill in https://github.com/vllm-project/vllm/pull/24536
  • [Hardware][Apple-CPU] Enable native bfloat16 on Apple Silicon (M2 and later) by @ignaciosica in https://github.com/vllm-project/vllm/pull/24129
  • [BugFix] Fix async core engine client finalizer by @njhill in https://github.com/vllm-project/vllm/pull/24540
  • [CI] Adjust threshold for flaky ngram spec decoding test by @njhill in https://github.com/vllm-project/vllm/pull/24528
  • [KV Connector] More async support for get_num_new_matched_tokens by @ApostaC in https://github.com/vllm-project/vllm/pull/23620
  • [P/D] MultiConnector supports shutdown by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24425
  • [BugFix][Spec Decode] Fix out-of-range index triggered by eagle3; re-enable test for LlamaForCausalLMEagle3 by @wwl2755 in https://github.com/vllm-project/vllm/pull/24392
  • [gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading by @frank-wei in https://github.com/vllm-project/vllm/pull/24154
  • [Core] Simplify and unify mm uuid handling & auto-generated mm hash overrides processing. by @huachenheli in https://github.com/vllm-project/vllm/pull/24271
  • [Bugfix] Update Run:AI Model Streamer Loading Integration by @pwschuurman in https://github.com/vllm-project/vllm/pull/23845
  • [Docs] Enable relative links in examples to function when rendered in the docs by @hmellor in https://github.com/vllm-project/vllm/pull/24041
  • [docs] promo pytorch conf and ray summit by @simon-mo in https://github.com/vllm-project/vllm/pull/24562
  • [Bugfix] Guard _may_reorder_batch for encoder-only models on CPU (#24319) by @comsky in https://github.com/vllm-project/vllm/pull/24348
  • Consolidate rendering parameters into RenderConfig dataclass by @sfeng33 in https://github.com/vllm-project/vllm/pull/24543
  • [Model] Limit CPU threads for image transformations in InternVL to reduce cpu contention. by @li-jinpeng in https://github.com/vllm-project/vllm/pull/24519
  • [Attention] add DCP support for FLASH_ATTN_MLA backend by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/24453
  • [ROCm][Bugfix] Fix Aiter RMSNorm by @vllmellm in https://github.com/vllm-project/vllm/pull/23412
  • [Docs] Improve organisation of API Reference nav by @hmellor in https://github.com/vllm-project/vllm/pull/24569
  • [Docs] Document the extra memory footprint overhead when using EPLB by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/24537
  • Support for NemotronH Nano VLM by @danielafrimi in https://github.com/vllm-project/vllm/pull/23644
  • [Feature] ViT attention unification (#23880) by @baonudesifeizhai in https://github.com/vllm-project/vllm/pull/23978
  • [LoRA]: Add LoRA support to Mistral's Voxtral models by @pratapyash in https://github.com/vllm-project/vllm/pull/24517
  • Move LoadConfig from config/__init__.py to config/load.py by @hmellor in https://github.com/vllm-project/vllm/pull/24566
  • [BugFix][Multi Modal] Fix TensorSchema shape mismatch in Molmo by @wwl2755 in https://github.com/vllm-project/vllm/pull/24559
  • [BugFix][easy] Fix flaky test test_gpt_oss_multi_turn_chat by @lacora in https://github.com/vllm-project/vllm/pull/24549
  • [BugFix] Ensure integrity of reused CPU tensors during async scheduling by @njhill in https://github.com/vllm-project/vllm/pull/24527
  • [CI/Build] split true unit tests to Entrypoints Unit Tests by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24418
  • [rocm] enable torchao quantization for rocm by @draftbk in https://github.com/vllm-project/vllm/pull/24400
  • [CI] Add PPL test for generation models by @noooop in https://github.com/vllm-project/vllm/pull/24485
  • [CI/Build] bump timm dependency by @dtrifiro in https://github.com/vllm-project/vllm/pull/24189
  • fix some typos by @co63oc in https://github.com/vllm-project/vllm/pull/24167
  • Fix Auto_Round Quantization Loading on SM75 and Lower GPUs by @RoadToNowhereX in https://github.com/vllm-project/vllm/pull/24217
  • [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in https://github.com/vllm-project/vllm/pull/24092
  • [BugFix] python collect_env.py and vllm collect-env compatibility with uv venv by @yankay in https://github.com/vllm-project/vllm/pull/24066
  • [Platform] Custom ops support for LMhead and LogitsProcessor by @zzhx1 in https://github.com/vllm-project/vllm/pull/23564
  • [CI] Fix tensorizer test assertion by @pwschuurman in https://github.com/vllm-project/vllm/pull/24545
  • [Core] Split LoRA layers by @jeejeelee in https://github.com/vllm-project/vllm/pull/24574
  • [Doc] Add documentation for GLM-4.5 series models: tool-calling and reasoning parser by @WangErXiao in https://github.com/vllm-project/vllm/pull/24589
  • [Logging] allow config logging stream by @842974287 in https://github.com/vllm-project/vllm/pull/24336
  • [Bugfix] fix modelopt exclude_modules name mapping by @tomeras91 in https://github.com/vllm-project/vllm/pull/24178
  • [Bugfix] Fix DeepEP config for DP4TP4 by @minosfuture in https://github.com/vllm-project/vllm/pull/23619
  • [Core] Support configuration parsing plugin by @charlotte12l in https://github.com/vllm-project/vllm/pull/24277
  • [Misc] Update log level from debug to warning when the process port is in use by @lengrongfu in https://github.com/vllm-project/vllm/pull/24226
  • [Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs by @gau-nernst in https://github.com/vllm-project/vllm/pull/24577
  • [CI] Fail subprocess tests with root-cause error by @njhill in https://github.com/vllm-project/vllm/pull/23795
  • [v1] Add Whisper model support (encoder-decoder) by @russellb in https://github.com/vllm-project/vllm/pull/21088
  • [torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends by @gshtras in https://github.com/vllm-project/vllm/pull/19767
  • [gpt-oss] raise error for flashinfer backend without trtllm by @heheda12345 in https://github.com/vllm-project/vllm/pull/24482
  • [Perf] Warmup FlashInfer attention during startup by @mgoin in https://github.com/vllm-project/vllm/pull/23439
  • [Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration by @hjjq in https://github.com/vllm-project/vllm/pull/21078
  • [Misc] Make timeout passable in init_distributed_environment by @jberkhahn in https://github.com/vllm-project/vllm/pull/24522
  • [Models][Quantization] Add quantization configuration update in Voxtral model by @anmarques in https://github.com/vllm-project/vllm/pull/24122
  • [distributed] update known issues by @youkaichao in https://github.com/vllm-project/vllm/pull/24624
  • Add @chaunceyjiang to codeowner for Reasoning and Tool parser by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24406
  • [Bug] [Spec Decode] Fix model_initialization test and mismatch in aux_hidden_layers by @wwl2755 in https://github.com/vllm-project/vllm/pull/24613
  • [Ultravox] Fix Gemma instantiation, support quantization via --hf-overrides by @petersalas in https://github.com/vllm-project/vllm/pull/24131
  • [Bugfix] Add missing VIT backend dispatch on CPU by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/24623
  • [BugFix] Fix pipeline parallel by @njhill in https://github.com/vllm-project/vllm/pull/24621
  • [Engine][Chore] use local variable and remove output var assignment by @GuyStone in https://github.com/vllm-project/vllm/pull/24554
  • Kimi K2 Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24597
  • Enable --profile in 'vllm bench throughput' by @tomasruizt in https://github.com/vllm-project/vllm/pull/24575
  • [Core] feat: Add --safetensors-load-strategy flag for faster safetensors loading from Lustre by @shengshiqi-google in https://github.com/vllm-project/vllm/pull/24469
  • [Doc]: fixing doc typos by @didier-durand in https://github.com/vllm-project/vllm/pull/24635
  • [Model] New model support for Motif-1-Tiny by @ca1207 in https://github.com/vllm-project/vllm/pull/23414
  • Remove redundant all gather + split by @chenxi-yang in https://github.com/vllm-project/vllm/pull/23441
  • [torchao] Support quantization configs using module swap by @jerryzh168 in https://github.com/vllm-project/vllm/pull/21982
  • Add support for the qwen3 next model (a hybrid attention model). by @sighingnow in https://github.com/vllm-project/vllm/pull/24526
  • [Bugfix] Fix incorrect import of CacheConfig by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/24631
  • [Docs] Revise frameworks/anything-llm.md by @windsonsea in https://github.com/vllm-project/vllm/pull/24489
  • [Docs] Update V1 doc to reflect whisper support by @russellb in https://github.com/vllm-project/vllm/pull/24606
  • [Docs] Use 1-2-3 list for deploy steps in deployment/frameworks/ by @windsonsea in https://github.com/vllm-project/vllm/pull/24633
  • [CI]Add transformers_utils to Async Engine, Inputs, Utils, Worker Test by @charlotte12l in https://github.com/vllm-project/vllm/pull/24615
  • [Bugfix] Fix _synced_weight_loader by @kyuyeunk in https://github.com/vllm-project/vllm/pull/24565
  • [CI] Split pooling from entrypoints Test by @noooop in https://github.com/vllm-project/vllm/pull/24632
  • [Misc] Add @NickLucche to codeowners by @NickLucche in https://github.com/vllm-project/vllm/pull/24647
  • [CI Failure] fix models/language/pooling/test_auto_prefix_cache_support.py by @noooop in https://github.com/vllm-project/vllm/pull/24636
  • Fix typing for safetensors_load_strategy by @hmellor in https://github.com/vllm-project/vllm/pull/24641
  • Move LoRAConfig from config/__init__.py to config/lora.py by @hmellor in https://github.com/vllm-project/vllm/pull/24644
  • [XPU] add missing dependency tblib for XPU CI by @faaany in https://github.com/vllm-project/vllm/pull/24639
  • [Docs] Fixes a typo in the qwen3next model name. by @sighingnow in https://github.com/vllm-project/vllm/pull/24654
  • [build] add torch to tool.uv no-build-isolation-package by @youkaichao in https://github.com/vllm-project/vllm/pull/24303
  • [Bench] Add qwen-next in benchmark_moe.py by @jeejeelee in https://github.com/vllm-project/vllm/pull/24661
  • [CI] Split mteb test from Language Models Test by @noooop in https://github.com/vllm-project/vllm/pull/24634
  • Allow users to specify kv cache memory size by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/21489
  • [HybridKVCache][Platform] Add support_hybrid_kv_cache for platform by @MengqingCao in https://github.com/vllm-project/vllm/pull/24646
  • [Bugfix] Fix qwen-next packed_modules_mapping by @jeejeelee in https://github.com/vllm-project/vllm/pull/24656
  • [Docs] Add transcription support to model by @NickLucche in https://github.com/vllm-project/vllm/pull/24664
  • [Doc] Fix Markdown Pre-commit Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/24670
  • [Docs] Fix typos in EP deployment doc by @hmellor in https://github.com/vllm-project/vllm/pull/24669
  • [VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames by @Isotr0py in https://github.com/vllm-project/vllm/pull/24161
  • [Kernels] Enable Torch Symmetric Memory All-Reduce By Default by @ilmarkov in https://github.com/vllm-project/vllm/pull/24111
  • [Bugfix] Fix platform-specific routing in CustomOp implementations by @kzawora-intel in https://github.com/vllm-project/vllm/pull/24444
  • Fix model name included in responses by @hmellor in https://github.com/vllm-project/vllm/pull/24663
  • fix some typos by @co63oc in https://github.com/vllm-project/vllm/pull/24616
  • [Docs] Fix formatting of transcription doc by @hmellor in https://github.com/vllm-project/vllm/pull/24676
  • [VLM] Migrate remain DP-supported ViT models to use disable_tp by @Isotr0py in https://github.com/vllm-project/vllm/pull/24363
  • [Ultravox] Use wrapped_model_config to instantiate inner model by @petersalas in https://github.com/vllm-project/vllm/pull/24679
  • [Doc] Remove Useless Comments by @yewentao256 in https://github.com/vllm-project/vllm/pull/24687
  • [Qwen3-Next] Add MoE Config for H200 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24688
  • [BugFix] Fix tokenize asyncio task leak by @njhill in https://github.com/vllm-project/vllm/pull/24677
  • update spec decode metrics to use throughput by @qandrew in https://github.com/vllm-project/vllm/pull/24127
  • [Kernel][B200] mxfp4 fused cutlass moe by @djmmoss in https://github.com/vllm-project/vllm/pull/23696
  • [flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention by @mxz297 in https://github.com/vllm-project/vllm/pull/24197
  • [Bugfix] Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False by @yewentao256 in https://github.com/vllm-project/vllm/pull/24696
  • [Qwen3-Next] MoE configs for H200 TP=1,2,4 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24695
  • [CI/Build] Add bc-linter to vLLM CI by @zhewenl in https://github.com/vllm-project/vllm/pull/21234
  • [Qwen3-Next] Add B200 MoE configs for Qwen3-next by @vadiklyutiy in https://github.com/vllm-project/vllm/pull/24698
  • [Bugfix][Attention] Fix FlashInfer MLA block size logic by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24692
  • [Perf] Use upstream CUTLASS for SM90 Block FP8 kernel by @mgoin in https://github.com/vllm-project/vllm/pull/23280
  • [Qwen3-Next] MOE configs for H100 TP4 by @heheda12345 in https://github.com/vllm-project/vllm/pull/24699
  • [Doc] Clarify cudagraph capture size logic and default behavior in scheduler by @Zazzle516 in https://github.com/vllm-project/vllm/pull/18698
  • [Bug] Fix Layer weight_block_size Assertion Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/24674
  • [Startup] Make DeepGEMM warmup scale with max-num-batched-tokens by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/24693
  • [V1] feat: add engine v1 tracing by @RichardoMrMu in https://github.com/vllm-project/vllm/pull/20372
  • [Bugfix] Fix the causal_conv1d_update kernel for non-speculative decoding cases by @sighingnow in https://github.com/vllm-project/vllm/pull/24680
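
A few usage sketches for selected changes above; model names, file paths, and adapter IDs below are illustrative placeholders rather than tested configurations.

FP8 KV cache coverage widened this cycle (for example #24577 enables it for the FlashInfer and Triton backends on non-SM100 GPUs). A minimal offline sketch, assuming a placeholder Llama checkpoint:

from vllm import LLM, SamplingParams

# Request the FP8 KV cache path; the model name is a placeholder.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype="fp8")
outputs = llm.generate(["vLLM is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)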
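
Structured outputs saw dependency and stability work (xgrammar upgraded to 0.1.23 in #22988, deterministic seeding for flaky tests in #24380). A minimal JSON-constrained generation sketch using the existing guided-decoding API; the schema and model are placeholders:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# JSON schema the generated output must satisfy.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64, guided_decoding=GuidedDecodingParams(json=schema))
print(llm.generate(["Return a JSON object describing a person: "], params)[0].outputs[0].text)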
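
For the new google/embeddinggemma-300m support (#24318), a minimal pooling sketch; task="embed" selects the embedding runner, and exact kwargs may differ slightly across versions:

from vllm import LLM

# Load the model as an embedding (pooling) model.
llm = LLM(model="google/embeddinggemma-300m", task="embed")
(out,) = llm.embed(["vLLM supports embedding models."])
print(len(out.outputs.embedding))  # embedding dimensionality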
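
LoRA coverage expanded to DeepSeek V2/V3/R1-0528 (#23971), Voxtral (#24517), and Qwen-2.5-Omni (#24231), alongside much faster LoRA startup (#23777). The request flow itself is unchanged; a sketch with a placeholder base model and adapter path:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora switches on the LoRA machinery; the rank limit depends on your adapter.
llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat", enable_lora=True, max_lora_rank=32)
lora = LoRARequest("my-adapter", 1, "/path/to/lora_adapter")  # name, integer id, local path
out = llm.generate(["Hello"], SamplingParams(max_tokens=32), lora_request=lora)
print(out[0].outputs[0].text)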
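
Caller-supplied multimodal identifiers landed across #23394, #23449, and #24271, letting repeated media be tracked and cached under stable IDs instead of content hashes. A sketch of the offline path; the exact value shape accepted under multi_modal_uuids is an assumption here, so check the multimodal inputs docs for your installed version:

from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # placeholder vision-language model
prompt = {
    "prompt": "USER: <image>\nDescribe the image. ASSISTANT:",
    "multi_modal_data": {"image": Image.open("photo.jpg")},
    "multi_modal_uuids": {"image": "photo-0001"},  # caller-provided ID; value shape assumed, not verified
}
print(llm.generate(prompt, SamplingParams(max_tokens=32))[0].outputs[0].text)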
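
With the Gemma3n transcription/translation endpoints (#23735) and the accompanying docs (#24512, #24664), a server launched with a supported audio model exposes the OpenAI-style /v1/audio/transcriptions route. A client-side sketch, assuming such a server is already running locally with the placeholder model below:

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("sample.wav", "rb") as audio:
    result = client.audio.transcriptions.create(model="google/gemma-3n-E2B-it", file=audio)
print(result.text)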

New Contributors

  • @DoubleVII made their first contribution in https://github.com/vllm-project/vllm/pull/23058
  • @carlory made their first contribution in https://github.com/vllm-project/vllm/pull/23090
  • @nikheal2 made their first contribution in https://github.com/vllm-project/vllm/pull/22725
  • @Tialo made their first contribution in https://github.com/vllm-project/vllm/pull/23172
  • @myselvess made their first contribution in https://github.com/vllm-project/vllm/pull/23084
  • @yiz-liu made their first contribution in https://github.com/vllm-project/vllm/pull/23169
  • @ultmaster made their first contribution in https://github.com/vllm-project/vllm/pull/22587
  • @KilJaeeun made their first contribution in https://github.com/vllm-project/vllm/pull/22790
  • @wzshiming made their first contribution in https://github.com/vllm-project/vllm/pull/23242
  • @misrasaurabh1 made their first contribution in https://github.com/vllm-project/vllm/pull/20413
  • @yannqi made their first contribution in https://github.com/vllm-project/vllm/pull/23246
  • @jaredoconnell made their first contribution in https://github.com/vllm-project/vllm/pull/23306
  • @paulpak58 made their first contribution in https://github.com/vllm-project/vllm/pull/22845
  • @zhuangqh made their first contribution in https://github.com/vllm-project/vllm/pull/23309
  • @tvalentyn made their first contribution in https://github.com/vllm-project/vllm/pull/23270
  • @arjunbreddy22 made their first contribution in https://github.com/vllm-project/vllm/pull/22495
  • @philipchung made their first contribution in https://github.com/vllm-project/vllm/pull/17149
  • @FoolPlayer made their first contribution in https://github.com/vllm-project/vllm/pull/23241
  • @namanlalitnyu made their first contribution in https://github.com/vllm-project/vllm/pull/23375
  • @hickeyma made their first contribution in https://github.com/vllm-project/vllm/pull/23353
  • @PapaGoose made their first contribution in https://github.com/vllm-project/vllm/pull/23337
  • @bppps made their first contribution in https://github.com/vllm-project/vllm/pull/23366
  • @AzizCode92 made their first contribution in https://github.com/vllm-project/vllm/pull/23416
  • @fengli1702 made their first contribution in https://github.com/vllm-project/vllm/pull/22527
  • @FFFfff1FFFfff made their first contribution in https://github.com/vllm-project/vllm/pull/23408
  • @ayushsatyam146 made their first contribution in https://github.com/vllm-project/vllm/pull/23171
  • @patemotter made their first contribution in https://github.com/vllm-project/vllm/pull/23574
  • @Terrencezzj made their first contribution in https://github.com/vllm-project/vllm/pull/23584
  • @Copilot made their first contribution in https://github.com/vllm-project/vllm/pull/23385
  • @oneraghavan made their first contribution in https://github.com/vllm-project/vllm/pull/23630
  • @lordmathis made their first contribution in https://github.com/vllm-project/vllm/pull/23634
  • @OYE93 made their first contribution in https://github.com/vllm-project/vllm/pull/23565
  • @TianyuLi0 made their first contribution in https://github.com/vllm-project/vllm/pull/23146
  • @yuekaizhang made their first contribution in https://github.com/vllm-project/vllm/pull/23623
  • @coval3nte made their first contribution in https://github.com/vllm-project/vllm/pull/23054
  • @youzhedian made their first contribution in https://github.com/vllm-project/vllm/pull/23648
  • @frank-wei made their first contribution in https://github.com/vllm-project/vllm/pull/23613
  • @faaany made their first contribution in https://github.com/vllm-project/vllm/pull/22500
  • @cndoit18 made their first contribution in https://github.com/vllm-project/vllm/pull/23718
  • @rebel-hongseok made their first contribution in https://github.com/vllm-project/vllm/pull/23746
  • @Hanchenli made their first contribution in https://github.com/vllm-project/vllm/pull/23713
  • @Shrey1306 made their first contribution in https://github.com/vllm-project/vllm/pull/23556
  • @Ithanil made their first contribution in https://github.com/vllm-project/vllm/pull/23155
  • @killershrimp made their first contribution in https://github.com/vllm-project/vllm/pull/23792
  • @crischeng made their first contribution in https://github.com/vllm-project/vllm/pull/23823
  • @angelayi made their first contribution in https://github.com/vllm-project/vllm/pull/23349
  • @jeanschmidt made their first contribution in https://github.com/vllm-project/vllm/pull/23757
  • @He-Jingkai made their first contribution in https://github.com/vllm-project/vllm/pull/23829
  • @aditchawdhary made their first contribution in https://github.com/vllm-project/vllm/pull/23899
  • @EduardDurech made their first contribution in https://github.com/vllm-project/vllm/pull/23068
  • @dubejf made their first contribution in https://github.com/vllm-project/vllm/pull/23944
  • @sadeghja1070 made their first contribution in https://github.com/vllm-project/vllm/pull/23971
  • @DevonPeroutky made their first contribution in https://github.com/vllm-project/vllm/pull/22769
  • @hypdeb made their first contribution in https://github.com/vllm-project/vllm/pull/23802
  • @DamonJiang777 made their first contribution in https://github.com/vllm-project/vllm/pull/24028
  • @cberge908 made their first contribution in https://github.com/vllm-project/vllm/pull/23104
  • @lkm2835 made their first contribution in https://github.com/vllm-project/vllm/pull/23918
  • @nathanrchn made their first contribution in https://github.com/vllm-project/vllm/pull/24100
  • @co63oc made their first contribution in https://github.com/vllm-project/vllm/pull/24071
  • @divyanshsinghvi made their first contribution in https://github.com/vllm-project/vllm/pull/23907
  • @biba10 made their first contribution in https://github.com/vllm-project/vllm/pull/23931
  • @dongbo910220 made their first contribution in https://github.com/vllm-project/vllm/pull/20388
  • @NagyGeorge made their first contribution in https://github.com/vllm-project/vllm/pull/23460
  • @wdhongtw made their first contribution in https://github.com/vllm-project/vllm/pull/24214
  • @bingchen-mi made their first contribution in https://github.com/vllm-project/vllm/pull/23652
  • @anthonsu made their first contribution in https://github.com/vllm-project/vllm/pull/23766
  • @whx-sjtu made their first contribution in https://github.com/vllm-project/vllm/pull/23332
  • @pratapyash made their first contribution in https://github.com/vllm-project/vllm/pull/24231
  • @samanamp made their first contribution in https://github.com/vllm-project/vllm/pull/24266
  • @mohankku made their first contribution in https://github.com/vllm-project/vllm/pull/24335
  • @ashwin-phadke made their first contribution in https://github.com/vllm-project/vllm/pull/24339
  • @bangshengtang made their first contribution in https://github.com/vllm-project/vllm/pull/24265
  • @charlotte12l made their first contribution in https://github.com/vllm-project/vllm/pull/23868
  • @alhridoy made their first contribution in https://github.com/vllm-project/vllm/pull/24361
  • @what-in-the-nim made their first contribution in https://github.com/vllm-project/vllm/pull/24332
  • @BraveY made their first contribution in https://github.com/vllm-project/vllm/pull/23928
  • @R3hankhan123 made their first contribution in https://github.com/vllm-project/vllm/pull/24034
  • @csahithi made their first contribution in https://github.com/vllm-project/vllm/pull/24102
  • @Prowindy made their first contribution in https://github.com/vllm-project/vllm/pull/24298
  • @micah-wil made their first contribution in https://github.com/vllm-project/vllm/pull/24128
  • @pwschuurman made their first contribution in https://github.com/vllm-project/vllm/pull/23845
  • @comsky made their first contribution in https://github.com/vllm-project/vllm/pull/24348
  • @li-jinpeng made their first contribution in https://github.com/vllm-project/vllm/pull/24519
  • @baonudesifeizhai made their first contribution in https://github.com/vllm-project/vllm/pull/23978
  • @lacora made their first contribution in https://github.com/vllm-project/vllm/pull/24549
  • @RoadToNowhereX made their first contribution in https://github.com/vllm-project/vllm/pull/24217
  • @zzhx1 made their first contribution in https://github.com/vllm-project/vllm/pull/23564
  • @hjjq made their first contribution in https://github.com/vllm-project/vllm/pull/21078
  • @tomasruizt made their first contribution in https://github.com/vllm-project/vllm/pull/24575
  • @shengshiqi-google made their first contribution in https://github.com/vllm-project/vllm/pull/24469
  • @ca1207 made their first contribution in https://github.com/vllm-project/vllm/pull/23414
  • @qandrew made their first contribution in https://github.com/vllm-project/vllm/pull/24127
  • @Zazzle516 made their first contribution in https://github.com/vllm-project/vllm/pull/18698
  • @RichardoMrMu made their first contribution in https://github.com/vllm-project/vllm/pull/20372

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.10.1.1...v0.10.2rc3