vLLM v0.10.0

Highlights

The v0.10.0 release includes 308 commits from 168 contributors (62 new!).

NOTE: This release begins the cleanup of the V0 engine codebase. We have removed the V0 CPU/XPU/TPU/HPU backends (#20412), long-context LoRA (#21169), Prompt Adapters (#20588), Phi3-Small & BlockSparse Attention (#21217), and Spec Decode workers (#21152) so far, and we plan to continue deleting code that is no longer used.

Model Support

  • New families: Llama 4 with EAGLE support (#20591), EXAONE 4.0 (#21060), Microsoft Phi-4-mini-flash-reasoning (#20702), Hunyuan V1 Dense + A13B with reasoning/tool parsing (#21368, #20625, #20820), Ling MoE models (#20680), JinaVL Reranker (#20260), Nemotron-Nano-VL-8B-V1 (#20349), Arcee (#21296), Voxtral (#20970).
  • Enhanced compatibility: BERT/RoBERTa with AutoWeightsLoader (#20534), HF format support for MiniMax (#20211), Gemini configuration (#20971), GLM-4 updates (#20736).
  • Architecture expansions: Attention-free model support (#20811), Hybrid SSM/Attention models on V1 (#20016), LlamaForSequenceClassification (#20807), expanded Mamba2 layer support (#20660).
  • VLM improvements: VLM support via the Transformers backend (#20543), PrithviMAE on the V1 engine (#20577). A minimal loading sketch follows this list.
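As a quick orientation for the new model families, loading one through the offline LLM API looks like the following. This is a minimal sketch: the model ID is a placeholder rather than a real checkpoint name, and the generation arguments are illustrative.

```python
# Minimal sketch: offline inference with one of the newly supported families.
# The model ID below is a placeholder -- substitute the actual Hugging Face
# repo of the model you want to try.
from vllm import LLM, SamplingParams

llm = LLM(model="org/new-model-name")  # placeholder model ID
params = SamplingParams(temperature=0.7, max_tokens=64)

for out in llm.generate(["Summarize what vLLM v0.10.0 adds."], params):
    print(out.outputs[0].text)
```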

Engine Core

  • Experimental async scheduling: the --async-scheduling flag overlaps engine-core scheduling with the GPU runner (#19970).
  • V1 engine improvements: backend-agnostic local attention (#21093), MLA FlashInfer ragged prefill (#20034), hybrid KV cache with local chunked attention (#19351).
  • Multi-task support: models can now support multiple tasks (#20771), multiple poolers (#21227), and dynamic pooling parameter configuration (#21128).
  • RLHF Support: new RPC methods for runtime weight reloading (#20096) and config updates (#20095), logprobs mode for selecting which stage of logprobs to return (#21398).
  • Enhanced caching: multi-modal caching for the transformers backend (#21358), reproducible prefix cache hashing using SHA-256 + CBOR (#20511); a sketch of the hashing idea follows this list.
  • Reduced startup time by speeding up CUDA graph capture via a frozen garbage collector (#21146).
  • Elastic expert parallelism for dynamic GPU scaling while preserving state (#20775).
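The reproducible prefix-cache hashing item can be pictured as chaining a SHA-256 digest over a canonical CBOR encoding of each block's contents. The sketch below only illustrates the idea and is not vLLM's internal implementation; it assumes the third-party cbor2 package and a simplified block layout.

```python
# Illustration of reproducible prefix-cache hashing: SHA-256 over a CBOR
# encoding of (parent hash, block token ids). Simplified; not vLLM's code.
import hashlib

import cbor2  # third-party: pip install cbor2


def block_hash(parent_hash: bytes, token_ids: list[int]) -> bytes:
    """Hash one KV-cache block from its parent's hash and its token ids."""
    payload = cbor2.dumps([parent_hash, token_ids])  # deterministic byte string
    return hashlib.sha256(payload).digest()


h0 = block_hash(b"", [1, 2, 3, 4])   # first block of the prefix
h1 = block_hash(h0, [5, 6, 7, 8])    # chained to its parent
print(h1.hex())
```

Because both the encoding and the hash are deterministic, the same prefix always maps to the same block hashes across processes and restarts, which is what makes the cache reproducible.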

Hardware & Performance

  • NVIDIA Blackwell/SM100 optimizations: CUTLASS block-scaled group GEMM for smaller batches (#20640), FP8 group GEMM support (#20447), DeepGEMM integration (#20087), FlashInfer MoE blockscale FP8 backend (#20645), cuDNN prefill API for MLA (#20411), Triton Fused MoE kernel config for FP8 E=16 on B200 (#20516).
  • Performance improvements: 48% request duration reduction via microbatch tokenization for concurrent requests (#19334), fused MLA QKV + strided layernorm (#21116), Triton causal-conv1d for Mamba models (#18218).
  • Hardware expansion: ARM CPU int8 quantization (#14129), PPC64LE/ARM V1 support (#20554), Intel XPU ray distributed execution (#20659), shared-memory pipeline parallel for CPU (#21289), FlashInfer ARM CUDA support (#21013).

Quantization

  • New quantization support: MXFP4 for MoE models (#17888), BitsAndBytes (BNB) support for Mixtral and additional MoE models (#20893, #21100), in-flight quantization for MoE (#20061); a usage sketch follows this list.
  • Hardware-specific: FP8 KV cache quantization on TPU (#19292), FP8 support for BatchedTritonExperts (#18864), optimized INT8 vectorization kernels (#20331).
  • Performance optimizations: Triton backend for DeepGEMM per-token group quantization (#20841), CUDA kernel for per-token group quantization (#21083), CustomOp abstraction for FP8 (#19830).
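As a usage sketch, in-flight bitsandbytes quantization and an FP8 KV cache are both enabled through existing engine arguments. The model ID below is a placeholder, and the availability of these options depends on the hardware backend and the installed vLLM build.

```python
# Hedged sketch: quantize weights at load time with bitsandbytes and keep the
# KV cache in FP8. Model ID is a placeholder; option support varies by
# hardware backend and vLLM build.
from vllm import LLM

llm = LLM(
    model="org/some-moe-model",    # placeholder MoE checkpoint
    quantization="bitsandbytes",   # in-flight weight quantization
    kv_cache_dtype="fp8",          # FP8 KV-cache storage
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```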

API & Frontend

  • OpenAI compatibility: Responses API implementation (#20504, #20975), image object support in llm.chat (#19635), tool calling with required choice and $defs (#20629); a client example follows this list.
  • New endpoints: get_tokenizer_info for tokenizer/chat-template information (#20575), cache_salt support for completions/responses (#20981).
  • Model loading: Tensorizer S3 integration with arbitrary arguments (#19619), HF repo paths & URLs for GGUF models (#20793), tokenization_kwargs for embedding truncation (#21033).
  • CLI improvements: --help=page option for enhanced help documentation (#20961), default model changed to Qwen3-0.6B (#20335).
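For the tool-calling item above, a client can exercise tool_choice="required" through the standard OpenAI Python client pointed at a locally running vLLM server. The base URL, model name, and tool schema below are illustrative assumptions; the server must be started separately with a tool-call parser configured.

```python
# Hedged sketch: forcing a tool call against a local vLLM OpenAI-compatible
# server. base_url, model name, and the tool schema are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",  # any served model; this name is an assumption
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="required",   # model must emit at least one tool call
)
print(resp.choices[0].message.tool_calls)
```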

Dependencies

  • Updated PyTorch to 2.7.1 for CUDA (#21011)
  • FlashInfer updated to v0.2.8rc1 (#20718)

What's Changed

  • [Docs] Note that alternative structured output backends are supported by @russellb in https://github.com/vllm-project/vllm/pull/19426
  • [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in https://github.com/vllm-project/vllm/pull/19440
  • [Model] use AutoWeightsLoader for commandr by @py-andy-c in https://github.com/vllm-project/vllm/pull/19399
  • Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in https://github.com/vllm-project/vllm/pull/19401
  • [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in https://github.com/vllm-project/vllm/pull/19390
  • [New Model]: Support Qwen3 Embedding & Reranker by @noooop in https://github.com/vllm-project/vllm/pull/19260
  • [BugFix] Fix docker build cpu-dev image error by @2niuhe in https://github.com/vllm-project/vllm/pull/19394
  • Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in https://github.com/vllm-project/vllm/pull/19451
  • [CI] Disable failing GGUF model test by @mgoin in https://github.com/vllm-project/vllm/pull/19454
  • [Misc] Remove unused MultiModalHasher.hash_prompt_mm_data by @lgeiger in https://github.com/vllm-project/vllm/pull/19422
  • Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in https://github.com/vllm-project/vllm/pull/19455
  • Fix Typo in Documentation and Function Name by @leopardracer in https://github.com/vllm-project/vllm/pull/19442
  • [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in https://github.com/vllm-project/vllm/pull/19405
  • [Kernel] Support deep_gemm for linear methods by @artetaout in https://github.com/vllm-project/vllm/pull/19085
  • [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/19474
  • [Doc] Fix quantization link titles by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/19478
  • [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/19479
  • [Misc] Reduce warning message introduced in env_override by @houseroad in https://github.com/vllm-project/vllm/pull/19476
  • Support non-string values in JSON keys from CLI by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/19471
  • Add cache to cuda get_device_capability by @mgoin in https://github.com/vllm-project/vllm/pull/19436
  • Fix some typo by @Ximingwang-09 in https://github.com/vllm-project/vllm/pull/19475
  • Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in https://github.com/vllm-project/vllm/pull/19241
  • [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in https://github.com/vllm-project/vllm/pull/19453
  • [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in https://github.com/vllm-project/vllm/pull/19297
  • [doc] fix "Other AI accelerators" getting started page by @davidxia in https://github.com/vllm-project/vllm/pull/19457
  • [Misc] Fix misleading ROCm warning by @jeejeelee in https://github.com/vllm-project/vllm/pull/19486
  • [Docs] Remove WIP features in V1 guide by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19498
  • [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in https://github.com/vllm-project/vllm/pull/19168
  • [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in https://github.com/vllm-project/vllm/pull/17331
  • [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/19501
  • [CI/Build] Fix torch nightly CI dependencies by @zou3519 in https://github.com/vllm-project/vllm/pull/19505
  • [CI] change spell checker from codespell to typos by @andyxning in https://github.com/vllm-project/vllm/pull/18711
  • [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/19514
  • Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in https://github.com/vllm-project/vllm/pull/19518
  • [Frontend] Improve error message in tool_choice validation by @22quinn in https://github.com/vllm-project/vllm/pull/19239
  • [BugFix] Work-around incremental detokenization edge case error by @njhill in https://github.com/vllm-project/vllm/pull/19449
  • [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in https://github.com/vllm-project/vllm/pull/19522
  • [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in https://github.com/vllm-project/vllm/pull/19509
  • Fix typo by @2niuhe in https://github.com/vllm-project/vllm/pull/19525
  • [Security] Prevent new imports of (cloud)pickle by @russellb in https://github.com/vllm-project/vllm/pull/18018
  • [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in https://github.com/vllm-project/vllm/pull/19492
  • [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in https://github.com/vllm-project/vllm/pull/19503
  • [Quantization] Improve AWQ logic by @jeejeelee in https://github.com/vllm-project/vllm/pull/19431
  • [Doc] Add V1 column to supported models list by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/19523
  • [NixlConnector] Drop num_blocks check by @NickLucche in https://github.com/vllm-project/vllm/pull/19532
  • [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in https://github.com/vllm-project/vllm/pull/19233
  • Fix TorchAOConfig skip layers by @mobicham in https://github.com/vllm-project/vllm/pull/19265
  • [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in https://github.com/vllm-project/vllm/pull/16756
  • [doc] Make top navigation sticky by @reidliu41 in https://github.com/vllm-project/vllm/pull/19540
  • [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/18847
  • [Misc] Turn MOE_DP_CHUNK_SIZE into an env var by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/19506
  • [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant by @mgoin in https://github.com/vllm-project/vllm/pull/19452
  • [Doc] Unify structured outputs examples by @aarnphm in https://github.com/vllm-project/vllm/pull/18196
  • [V1] Resolve failed concurrent structred output requests by @russellb in https://github.com/vllm-project/vllm/pull/19565
  • Revert "[Build/CI] Add tracing deps to vllm container image (#15224)" by @kouroshHakha in https://github.com/vllm-project/vllm/pull/19378
  • [BugFix] : Fix Batched DeepGemm Experts by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/19515
  • [Bugfix] Fix EAGLE vocab embedding for multimodal target model by @zixi-qi in https://github.com/vllm-project/vllm/pull/19570
  • [Doc] uses absolute links for structured outputs by @aarnphm in https://github.com/vllm-project/vllm/pull/19582
  • [doc] fix incorrect link by @reidliu41 in https://github.com/vllm-project/vllm/pull/19586
  • [Misc] Correct broken docs link by @Zerohertz in https://github.com/vllm-project/vllm/pull/19553
  • [CPU] Refine default config for the CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/19539
  • [Fix] bump mistral common to support magistral by @princepride in https://github.com/vllm-project/vllm/pull/19533
  • [Fix] The zip function in Python 3.9 does not have the strict argument by @princepride in https://github.com/vllm-project/vllm/pull/19549
  • use base version for version comparison by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/19587
  • [torch.compile] reorganize the cache directory to support compiling multiple models by @youkaichao in https://github.com/vllm-project/vllm/pull/19064
  • [BugFix] Honor enable_caching in connector-delayed kvcache load case by @njhill in https://github.com/vllm-project/vllm/pull/19435
  • [Model] Fix minimax model cache & lm_head precision by @qscqesze in https://github.com/vllm-project/vllm/pull/19592
  • [Refactor] Remove unused variables in moe_permute_unpermute_kernel.inl by @yewentao256 in https://github.com/vllm-project/vllm/pull/19573
  • [doc][mkdocs] fix the duplicate Supported features sections in GPU docs by @reidliu41 in https://github.com/vllm-project/vllm/pull/19606
  • [CUDA] Enable full cudagraph for FlashMLA by @ProExpertProg in https://github.com/vllm-project/vllm/pull/18581
  • [Doc] Add troubleshooting section to k8s deployment by @annapendleton in https://github.com/vllm-project/vllm/pull/19377
  • [torch.compile] Use custom ops when use_inductor=False by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19618
  • Adding "AMD: Multi-step Tests" to amdproduction. by @Concurrensee in https://github.com/vllm-project/vllm/pull/19508
  • [BugFix] Fix DP Coordinator incorrect debug log message by @njhill in https://github.com/vllm-project/vllm/pull/19624
  • [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. by @sahelib25 in https://github.com/vllm-project/vllm/pull/18354
  • [Bugfix][1/n] Fix the speculative decoding test by setting the target dtype by @houseroad in https://github.com/vllm-project/vllm/pull/19633
  • [Misc] Modularize CLI Argument Parsing in Benchmark Scripts by @reidliu41 in https://github.com/vllm-project/vllm/pull/19593
  • [Bugfix] Fix auto dtype casting for BatchFeature by @Isotr0py in https://github.com/vllm-project/vllm/pull/19316
  • [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization by @jiahanc in https://github.com/vllm-project/vllm/pull/19500
  • Only build CUTLASS MoE kernels on Hopper by @huydhn in https://github.com/vllm-project/vllm/pull/19648
  • [Bugfix] Don't attempt to use triton if no driver is active by @kzawora-intel in https://github.com/vllm-project/vllm/pull/19561
  • [Fix] Convert kv_transfer_config from dict to KVTransferConfig by @maobaolong in https://github.com/vllm-project/vllm/pull/19262
  • [Perf] Further tunings for SM100 FP8 CUTLASS kernel by @ilmarkov in https://github.com/vllm-project/vllm/pull/19566
  • [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness by @houseroad in https://github.com/vllm-project/vllm/pull/19644
  • [Kernel] Raise verbose error and consolidate num_heads/num_kv_heads divisibility check by @22quinn in https://github.com/vllm-project/vllm/pull/19339
  • [Benchmark] Refactor benchmark script for fp8 & int8 by @yewentao256 in https://github.com/vllm-project/vllm/pull/19627
  • Enable prefix caching with full cuda graphs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19617
  • [CI/Build] Fix torch nightly CI dependencies part 2 by @zou3519 in https://github.com/vllm-project/vllm/pull/19589
  • [Misc] Remove duplicate multiproc method setting for CPU platform by @Isotr0py in https://github.com/vllm-project/vllm/pull/19649
  • [MISC] Remove unused variableds in C++ by @houseroad in https://github.com/vllm-project/vllm/pull/19609
  • [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker by @quanliu1991 in https://github.com/vllm-project/vllm/pull/18957
  • [Misc][Frontend] passthrough bad_words by @f14-bertolotti in https://github.com/vllm-project/vllm/pull/19564
  • [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/19660
  • [TPU] support attention head dim smaller than 128 by @yaochengji in https://github.com/vllm-project/vllm/pull/19620
  • [MISC] typo fix by @andyxning in https://github.com/vllm-project/vllm/pull/19672
  • [CI] Add mteb testing for rerank models by @noooop in https://github.com/vllm-project/vllm/pull/19344
  • [Docs] Move multiproc doc to v1 dir by @russellb in https://github.com/vllm-project/vllm/pull/19651
  • [Kernel] GGUF MMVQ kernel for multiple input vectors by @SzymonOzog in https://github.com/vllm-project/vllm/pull/18754
  • [BugFix] Don't catch BaseException when dumping execute_model errors by @njhill in https://github.com/vllm-project/vllm/pull/19626
  • [DOC] Add reasoning capability to vLLM streamlit code by @Navanit-git in https://github.com/vllm-project/vllm/pull/19557
  • [Feature]:Allow for Granite MoE Hybrid models with only shared experts. by @shawntan in https://github.com/vllm-project/vllm/pull/19652
  • [Bugfix] Fix TP inference for Flex attention backend by @Isotr0py in https://github.com/vllm-project/vllm/pull/19657
  • [MISC] bump huggingface_hub pkg to 0.33.0 by @andyxning in https://github.com/vllm-project/vllm/pull/19547
  • [Bugfix] fix missing 'finish_reason': null in streaming chat by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/19662
  • [Kernels] Use empty for modular MoE workspaces by @bnellnm in https://github.com/vllm-project/vllm/pull/19667
  • [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) by @qscqesze in https://github.com/vllm-project/vllm/pull/19677
  • [V1] Change return type on get_multimodal_embeddings() by @russellb in https://github.com/vllm-project/vllm/pull/19446
  • [Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100 by @dsikka in https://github.com/vllm-project/vllm/pull/19563
  • [Fix] Fall back to Gloo when NCCL backend is unavailable by @conroy-cheers in https://github.com/vllm-project/vllm/pull/19641
  • [doc] add project flag to gcloud TPU command by @davidxia in https://github.com/vllm-project/vllm/pull/19664
  • [Wheel Size] Only build FA2 8.0+PTX by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/19336
  • [Frontend] add chunking audio for > 30s audio by @nguyenhoangthuan99 in https://github.com/vllm-project/vllm/pull/19597
  • [DOC] fix doc typos by @diliu0349 in https://github.com/vllm-project/vllm/pull/19600
  • Fixes IMA for TP w/ flex-attention by @drisspg in https://github.com/vllm-project/vllm/pull/19712
  • [Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager by @quanliu1991 in https://github.com/vllm-project/vllm/pull/19686
  • [Doc] Add missing llava family multi-image examples by @Isotr0py in https://github.com/vllm-project/vllm/pull/19698
  • Add a doc on how to update PyTorch version by @huydhn in https://github.com/vllm-project/vllm/pull/19705
  • [Kernel] Add Split-KV Support to Unified Triton Attention Kernel by @jvlunteren in https://github.com/vllm-project/vllm/pull/19152
  • [doc][mkdocs] Add edit button to documentation by @reidliu41 in https://github.com/vllm-project/vllm/pull/19637
  • [doc] split "Other AI Accelerators" tabs by @davidxia in https://github.com/vllm-project/vllm/pull/19708
  • [V1][Kernel] Flashinfer HND KV cache layout by @NickLucche in https://github.com/vllm-project/vllm/pull/19280
  • [Mis] remove duplicate engine status checks by @googs1025 in https://github.com/vllm-project/vllm/pull/19647
  • [Bugfix] Update multimodel models mapping to fit new checkpoint after Transformers v4.52 by @Isotr0py in https://github.com/vllm-project/vllm/pull/19151
  • [Perf] Optimize moe_align_block_size CUDA kernel by @yewentao256 in https://github.com/vllm-project/vllm/pull/19572
  • Remove sm120 arch from sm100 cutlass kernel arch list by @mgoin in https://github.com/vllm-project/vllm/pull/19716
  • [Misc] Update lmcache connector with the latest connector apis by @YaoJiayi in https://github.com/vllm-project/vllm/pull/19441
  • [Bugfix] Fix faulty triton importing logic when using Ray for DP by @mgoin in https://github.com/vllm-project/vllm/pull/19734
  • [Feature][ROCm] Add full graph capture support for TritonAttentionBackend by @charlifu in https://github.com/vllm-project/vllm/pull/19158
  • [TPU] Update torch version to include paged attention kernel change by @Chenyaaang in https://github.com/vllm-project/vllm/pull/19706
  • [MISC] correct copy_blocks src_to_dists param type by @andyxning in https://github.com/vllm-project/vllm/pull/19696
  • [MISC] correct DeviceConfig device field static type analysis by @andyxning in https://github.com/vllm-project/vllm/pull/19699
  • [Misc] Add str for RequestStatus by @lk-chen in https://github.com/vllm-project/vllm/pull/19780
  • [V1] Add API docs for EncoderCacheManager by @russellb in https://github.com/vllm-project/vllm/pull/19294
  • [V1][P/D] An native implementation of xPyD based on P2P NCCL by @Abatom in https://github.com/vllm-project/vllm/pull/18242
  • [V1] Decouple GPU and TPU InputBatch by @afeldman-nm in https://github.com/vllm-project/vllm/pull/19778
  • [Minor] Zero-initialize attn output buffer by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19784
  • [doc] fix the incorrect label by @reidliu41 in https://github.com/vllm-project/vllm/pull/19787
  • [Platform] Allow platform use V1 Engine by default by @wangxiyuan in https://github.com/vllm-project/vllm/pull/19792
  • [Qwen] Add tagging rule for Qwen related PRs by @houseroad in https://github.com/vllm-project/vllm/pull/19799
  • [Hardware][AMD] integrate aiter chunked prefill into vllm by @Zzz9990 in https://github.com/vllm-project/vllm/pull/18596
  • [Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/19725
  • [Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc by @russellb in https://github.com/vllm-project/vllm/pull/19808
  • [v1] Support mamba2 by @heheda12345 in https://github.com/vllm-project/vllm/pull/19327
  • docs: fix Slack bulletpoint in README by @nathan-weinberg in https://github.com/vllm-project/vllm/pull/19811
  • Disable "Forbid direct 'import triton'" check for vllm/triton_utils/importing.py in an extensible way by @afeldman-nm in https://github.com/vllm-project/vllm/pull/19783
  • [Core] Do not copy array during hashing by @lgeiger in https://github.com/vllm-project/vllm/pull/19484
  • [TPU] Update torch-xla version to include paged attention tuned block change by @QiliangCui in https://github.com/vllm-project/vllm/pull/19813
  • [Core] More fixes to MultiModalEmbeddings type handling by @russellb in https://github.com/vllm-project/vllm/pull/19715
  • [Multimodal] Use fast processor for Qwen2/2.5-VL by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19789
  • [BugFix] Fix use_cudagraph=False by @zou3519 in https://github.com/vllm-project/vllm/pull/19612
  • [Frontend] Expose custom args in OpenAI APIs by @afeldman-nm in https://github.com/vllm-project/vllm/pull/16862
  • Fix FA2 fallback for Blackwell V1 by @mgoin in https://github.com/vllm-project/vllm/pull/19781
  • [Misc][ROCm] Enforce no unused variable in ROCm C++ files by @houseroad in https://github.com/vllm-project/vllm/pull/19796
  • [Quantization] Modify the logic of BNB double quantization by @jeejeelee in https://github.com/vllm-project/vllm/pull/19742
  • Support embedding models in V1 by @maxdebayser in https://github.com/vllm-project/vllm/pull/16188
  • [Bugfix] Fix the linter by @houseroad in https://github.com/vllm-project/vllm/pull/19826
  • [Bugfix] Add check_health to v1 async client. by @kouroshHakha in https://github.com/vllm-project/vllm/pull/19821
  • Mark invariant normalizer in Gemma as non-persistent by @yhtang in https://github.com/vllm-project/vllm/pull/19788
  • [ROCm] [AITER] [Bugfix] Patch for AITER commit 648764942e552a8bb5fe16026703716a81f05374 by @tjtanaa in https://github.com/vllm-project/vllm/pull/18990
  • [Misc] [ROCm] Prevent surplus tensor reshape by @zsolt-borbely-htec in https://github.com/vllm-project/vllm/pull/19803
  • raise exception for pin_lora by @andyxning in https://github.com/vllm-project/vllm/pull/19809
  • [Minor] Allow redirecting model path for HfRunner in test by @Isotr0py in https://github.com/vllm-project/vllm/pull/19795
  • Add xLAM tool parser support by @zuxin666 in https://github.com/vllm-project/vllm/pull/17148
  • [Frontend] Add optional token-level progress bar to LLM.beam_search by @NekoMimiUnagi in https://github.com/vllm-project/vllm/pull/19301
  • Fixing Chunked Prefill Test. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/19762
  • [Doc] Update V1 user guide for embedding models by @22quinn in https://github.com/vllm-project/vllm/pull/19842
  • [CI][CPU] Improve dummy Triton interfaces and fix the CPU CI by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/19838
  • [Core][Bugfix] Fix Online MM Beam Search by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/19688
  • [Frontend] early return chat format resolution when specified by @xzbdmw in https://github.com/vllm-project/vllm/pull/19735
  • [Benchmark][Bugfix] Fix Dataset Length Calculation by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/19868
  • [CI/Build][Bugfix] Fix deadlock on v1 engine test CI by @Isotr0py in https://github.com/vllm-project/vllm/pull/19872
  • [CI][Neuron] Fail and exit on first error by @elaineyz in https://github.com/vllm-project/vllm/pull/19622
  • [Benchmark] Fix Value of type "SampleRequest" is not indexable by @b8zhong in https://github.com/vllm-project/vllm/pull/18032
  • [Chore]: qwen3-moe-type-hints-mistake by @Xerxes-cn in https://github.com/vllm-project/vllm/pull/19860
  • [Bugfix] Enable PP with AITER+V1 by @qli88 in https://github.com/vllm-project/vllm/pull/19822
  • [Bugfix][Ray] Set the cuda context eagerly in the ray worker by @kouroshHakha in https://github.com/vllm-project/vllm/pull/19583
  • [Misc] update cuda version by @reidliu41 in https://github.com/vllm-project/vllm/pull/19526
  • [Misc] refactor example - openai_transcription_client by @reidliu41 in https://github.com/vllm-project/vllm/pull/19851
  • [Kernel] correct cpu worker function parameter type by @andyxning in https://github.com/vllm-project/vllm/pull/19745
  • [Fix] import regex instead of re by @tdoublep in https://github.com/vllm-project/vllm/pull/19875
  • [Model] GPT2ForSequenceClassification model by @nie3e in https://github.com/vllm-project/vllm/pull/19663
  • [custom_op][vllm-plugin] update custom_op class to use op_registry by @xuechendi in https://github.com/vllm-project/vllm/pull/19164
  • Export NaNs in logits to scheduler_stats if output is corrupted by @vladmihailescu in https://github.com/vllm-project/vllm/pull/18777
  • [CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/19901
  • [Kernel] mark TorchSDPABackend swap_blocks NotImplementedError by @andyxning in https://github.com/vllm-project/vllm/pull/19749
  • [Misc] Clean up useless code by @wangxiyuan in https://github.com/vllm-project/vllm/pull/19889
  • Fix: Check the type of params to be a Sequence not list. by @rabinadk1 in https://github.com/vllm-project/vllm/pull/19910
  • [Bugfix] Fix bnb 8bit model weights loading by @Isotr0py in https://github.com/vllm-project/vllm/pull/19917
  • [New model support]Support Tarsier2 by @princepride in https://github.com/vllm-project/vllm/pull/19887
  • [doc] add contact us in community by @reidliu41 in https://github.com/vllm-project/vllm/pull/19922
  • [Multimodal] Optimize Qwen2/2.5-VL startup time by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19756
  • [Docs] Add GPT2ForSequenceClassification to supported models in docs by @nie3e in https://github.com/vllm-project/vllm/pull/19932
  • [Misc] add vllm_config in init by @andyxning in https://github.com/vllm-project/vllm/pull/19866
  • [MISC] add cpu_kvcache_space_bytes to CacheConfig by @andyxning in https://github.com/vllm-project/vllm/pull/19812
  • [Benchmark] fix request loss if "ping" is returned by @sywangyi in https://github.com/vllm-project/vllm/pull/19535
  • [CI/Build] Auto tag perf benchmarks related PRs by @22quinn in https://github.com/vllm-project/vllm/pull/19943
  • [doc] use snippets for contact us by @reidliu41 in https://github.com/vllm-project/vllm/pull/19944
  • [Misc] Update model-specific PR tagging by @ywang96 in https://github.com/vllm-project/vllm/pull/19949
  • [Misc] Simplify vllm bench cli subcommand implementation by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/19948
  • [Chore] dedup logs by @aarnphm in https://github.com/vllm-project/vllm/pull/19955
  • [BugFix] Add an env to disable moe chunking to work around compile incompatibility by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/19642
  • [Perf][CLI] Improve overall startup time by @aarnphm in https://github.com/vllm-project/vllm/pull/19941
  • [Core] feat: Implement Priority Scheduling in V1 Engine by @amitm02 in https://github.com/vllm-project/vllm/pull/19057
  • [Misc] Configurable timeout for execute_model RPC calls via env var by @jinqinn in https://github.com/vllm-project/vllm/pull/19544
  • Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor by @Flink-ddd in https://github.com/vllm-project/vllm/pull/19643
  • [doc] Fold long code blocks to improve readability by @reidliu41 in https://github.com/vllm-project/vllm/pull/19926
  • [P/D][NixlConnector] Support tp_size > num_kv_heads deployments by @NickLucche in https://github.com/vllm-project/vllm/pull/19691
  • [BugFix][P/D] Fix for cases where _recving_transfers can be cleaned up when all transfer done by @lk-chen in https://github.com/vllm-project/vllm/pull/19874
  • [Doc] Update V1 status for decoder-only embedding models by @Isotr0py in https://github.com/vllm-project/vllm/pull/19952
  • [doc] use MkDocs collapsible blocks - supplement by @reidliu41 in https://github.com/vllm-project/vllm/pull/19973
  • [Bugfix] Fix CI bitsandbytes failure by @jeejeelee in https://github.com/vllm-project/vllm/pull/19969
  • [doc] improve readability for long commands by @reidliu41 in https://github.com/vllm-project/vllm/pull/19920
  • [Docs] Fix syntax highlighting of shell commands by @lgeiger in https://github.com/vllm-project/vllm/pull/19870
  • [EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/19885
  • [Bugfix][v1] Fix step pooler implementation and step pooling usage in v1 by @Isotr0py in https://github.com/vllm-project/vllm/pull/19956
  • [Misc] Add type alias ReqId and EngineId for better readability by @lk-chen in https://github.com/vllm-project/vllm/pull/19880
  • [Feature] Support sequence parallelism for static fp8 quantization by @cascade812 in https://github.com/vllm-project/vllm/pull/19181
  • [CI/Build] Push latest tag for cpu and neuron docker image by @22quinn in https://github.com/vllm-project/vllm/pull/19897
  • Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend by @Jun-Howie in https://github.com/vllm-project/vllm/pull/19395
  • [Bugfix][Benchmark] Fix Marlin benchmark by @22quinn in https://github.com/vllm-project/vllm/pull/19929
  • [TPU] Fix tpu model runner test by @Chenyaaang in https://github.com/vllm-project/vllm/pull/19995
  • Update test case parameter to have the throughput above 8.0 by @QiliangCui in https://github.com/vllm-project/vllm/pull/19994
  • [Misc][Tools][Benchmark] Add profile to autotune script by @Chenyaaang in https://github.com/vllm-project/vllm/pull/19711
  • [doc] Fix broken link in the installation for CPU by @yankay in https://github.com/vllm-project/vllm/pull/19980
  • add some examples for other benchmark scripts by @reidliu41 in https://github.com/vllm-project/vllm/pull/19893
  • [PERF] Speedup of MRoPE prepare inputs by @vadiklyutiy in https://github.com/vllm-project/vllm/pull/19939
  • [Bugfix][CPU] Fix InputBatch for pooling models in the CPU v1 by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20014
  • refactor example - qwen3_reranker by @reidliu41 in https://github.com/vllm-project/vllm/pull/19847
  • [Fix][V1] Remove --scheduling-policy oracle by @amitm02 in https://github.com/vllm-project/vllm/pull/20010
  • [Perf] Improve/Fix-regression for FA3 in High QPS regimes by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/19463
  • [Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client. by @dtransposed in https://github.com/vllm-project/vllm/pull/19423
  • [BugFix] Fix multi-node offline data parallel by @njhill in https://github.com/vllm-project/vllm/pull/19937
  • [P/D] Asynchronously do _nixl_handshake by @lk-chen in https://github.com/vllm-project/vllm/pull/19836
  • [Feature] Integrate new deepgemm by @yewentao256 in https://github.com/vllm-project/vllm/pull/19820
  • [Easy] Remove submodule added in #19463 by @b8zhong in https://github.com/vllm-project/vllm/pull/20039
  • use .dev for version comparison with pytorch nightly release by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/20031
  • cmake: Update vllm_flash_attn for vllm_kernels by @seemethere in https://github.com/vllm-project/vllm/pull/20032
  • [Llama4] Update attn_temperature_tuning by @b8zhong in https://github.com/vllm-project/vllm/pull/19997
  • Revert "[Feature] Integrate new deepgemm (#19820)" by @yewentao256 in https://github.com/vllm-project/vllm/pull/20049
  • Revert "Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor" by @Isotr0py in https://github.com/vllm-project/vllm/pull/20030
  • Move to a faster base64 implementation by @h-avsha in https://github.com/vllm-project/vllm/pull/19984
  • [Frontend] speed up import time of vllm.config by @davidxia in https://github.com/vllm-project/vllm/pull/18036
  • [Refactor] Remove duplicate ceil_div by @yewentao256 in https://github.com/vllm-project/vllm/pull/20023
  • [Feat][CLI] enforce-include-usage by @max-wittig in https://github.com/vllm-project/vllm/pull/19695
  • [Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. by @bnellnm in https://github.com/vllm-project/vllm/pull/19717
  • [Chore] debloat some initial logs by @aarnphm in https://github.com/vllm-project/vllm/pull/19438
  • [BugFix] Fix full-cuda-graph illegal memory access in FA3 by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/20057
  • [doc] add reference link for Intel XPU by @reidliu41 in https://github.com/vllm-project/vllm/pull/20064
  • [Doc] Guide for Incremental Compilation Workflow by @mgoin in https://github.com/vllm-project/vllm/pull/19109
  • [V1][Speculative Decoding] Fix DeepSeek MTP by @cjackal in https://github.com/vllm-project/vllm/pull/20022
  • [Frontend] Add /v1/audio/translations OpenAI API endpoint by @NickLucche in https://github.com/vllm-project/vllm/pull/19615
  • [Quantization] Add compressed-tensors emulations support for NVFP4 by @dsikka in https://github.com/vllm-project/vllm/pull/19879
  • [Fix] Support cls pooling in ModernBertPooler by @lsz05 in https://github.com/vllm-project/vllm/pull/20067
  • static_scaled_fp8_quant should not run when scale.numel is not 1 by @eldarkurtic in https://github.com/vllm-project/vllm/pull/20076
  • [PD] let toy proxy handle /chat/completions by @lk-chen in https://github.com/vllm-project/vllm/pull/19730
  • [Misc] Add parallel state node_count function by @njhill in https://github.com/vllm-project/vllm/pull/20045
  • Fix the path to the testing script. by @QiliangCui in https://github.com/vllm-project/vllm/pull/20082
  • [Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine by @izhuhaoran in https://github.com/vllm-project/vllm/pull/20062
  • [TPU][Bugfix] fix kv cache padding by @yaochengji in https://github.com/vllm-project/vllm/pull/20048
  • [P/D] Avoid stranding blocks in P when aborted in D's waiting queue by @njhill in https://github.com/vllm-project/vllm/pull/19223
  • [TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN by @Chenyaaang in https://github.com/vllm-project/vllm/pull/19919
  • [CI] Add SM120 to the Dockerfile by @mgoin in https://github.com/vllm-project/vllm/pull/19794
  • [Bugfix] Fix Mistral tool-parser regex for nested JSON by @mgoin in https://github.com/vllm-project/vllm/pull/20093
  • [PD] Skip tp_size exchange with rank0 by @NickLucche in https://github.com/vllm-project/vllm/pull/19413
  • [Benchmark][Bug] Fix multiple bugs in bench and add args to spec_decode offline by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/20083
  • [Bugfix] Allow CUDA_VISIBLE_DEVICES='' in Platform.device_id_to_physical_device_id by @eicherseiji in https://github.com/vllm-project/vllm/pull/18979
  • [Doc] Update docs for New Model Implementation by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20115
  • [Refactor] Remove unused library by @yewentao256 in https://github.com/vllm-project/vllm/pull/20099
  • [CPU] Fix torch version in x86 CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/19258
  • [Misc] Use collapsible blocks for benchmark examples. by @reidliu41 in https://github.com/vllm-project/vllm/pull/20017
  • [Docs] Improve frameworks/helm.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20113
  • [Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break and Local Attention Logic: affecting Llama4) by @tjtanaa in https://github.com/vllm-project/vllm/pull/19904
  • Revert "[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine" by @mgoin in https://github.com/vllm-project/vllm/pull/20128
  • [Bug Fix] Fix address/port already in use error for pplx test by @yewentao256 in https://github.com/vllm-project/vllm/pull/20094
  • [Doc] Automatically signed-off by PyCharm by @noooop in https://github.com/vllm-project/vllm/pull/20120
  • [Doc] Auto sign-off for VSCode by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20132
  • [Doc] Rename page titles by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20130
  • Spam folks if config.py changes by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/20131
  • [Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. by @jikunshang in https://github.com/vllm-project/vllm/pull/19560
  • [TPU] add kv cache update kernel by @yaochengji in https://github.com/vllm-project/vllm/pull/19928
  • [Refactor] Rename commnication utils by @yewentao256 in https://github.com/vllm-project/vllm/pull/20091
  • [Doc] correct LoRA capitalization by @kyolebu in https://github.com/vllm-project/vllm/pull/20135
  • [Feature] Expert Parallelism Load Balancer (EPLB) by @abmfy in https://github.com/vllm-project/vllm/pull/18343
  • [CI Failure] Fix OOM with test_oot_registration_embedding by @mgoin in https://github.com/vllm-project/vllm/pull/20144
  • [Quantization] Bump to use latest compressed-tensors by @dsikka in https://github.com/vllm-project/vllm/pull/20033
  • [Perf] SM100 FP8 GEMM Optimizations after cutlass_profiler by @ilmarkov in https://github.com/vllm-project/vllm/pull/20071
  • [Bugfix] Build moe_data for both sm100 and sm90 by @mgoin in https://github.com/vllm-project/vllm/pull/20086
  • [Feature][Rocm] add quick all reduce for rocm by @lihaoyang-amd in https://github.com/vllm-project/vllm/pull/19744
  • [CI] Sync test dependency with test.in for torch nightly by @yangw-dev in https://github.com/vllm-project/vllm/pull/19632
  • [Fix] Fix gemma CI test failing on main by @tdoublep in https://github.com/vllm-project/vllm/pull/20124
  • [Model][1/N] Automatic conversion of CrossEncoding model by @noooop in https://github.com/vllm-project/vllm/pull/20012
  • [Perf][Frontend]: eliminate api_key and x_request_id headers middleware overhead by @Yazan-Sharaya in https://github.com/vllm-project/vllm/pull/19946
  • Quick Fix by adding conditional import for flash_attn_varlen_func in flash_attn by @xuechendi in https://github.com/vllm-project/vllm/pull/20143
  • Gemma3n (Text-only) by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/20134
  • [Bugfix] Fix flaky failure when getting DP ports by @mgoin in https://github.com/vllm-project/vllm/pull/20151
  • [Perf][Frontend] Cached resolution for resolving chat templates by @ilyal-cerebras in https://github.com/vllm-project/vllm/pull/20065
  • [Fix][ROCm] Remove unused variables to fix build error on GFX11/12 by @hyoon1 in https://github.com/vllm-project/vllm/pull/19891
  • [Fix][torch.compile] Enable custom ops by default when Inductor off by @ProExpertProg in https://github.com/vllm-project/vllm/pull/20102
  • [Bugfix] Mark 'hidden_states' as mutable in moe_forward registration. by @bnellnm in https://github.com/vllm-project/vllm/pull/20152
  • [Bugfix] Fix some narrowing conversion warnings by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/20141
  • [CI/Build] Allow hermetic builds by @fabiendupont in https://github.com/vllm-project/vllm/pull/18064
  • [CI Fix] Pin tests/models/registry.py MiniMaxText01ForCausalLM to revision due to model changes by @mgoin in https://github.com/vllm-project/vllm/pull/20199
  • [Misc] Add type assertion of request_id for LLMEngine.add_request by @SHA-4096 in https://github.com/vllm-project/vllm/pull/19700
  • Fix num_token_padding support for static per-tensor scaled_fp8_quant by @mgoin in https://github.com/vllm-project/vllm/pull/20188
  • fix ci issue distributed 4 gpu test by @yewentao256 in https://github.com/vllm-project/vllm/pull/20204
  • [Bugfix] Properly reject requests with empty list guided_choice by @mgoin in https://github.com/vllm-project/vllm/pull/20195
  • [BugFix] Fix the incorrect func name in the comments. (config.py) by @1195343015 in https://github.com/vllm-project/vllm/pull/20185
  • [CI/Build] Add new CI job to validate Hybrid Models for every PR by @tdoublep in https://github.com/vllm-project/vllm/pull/20147
  • [Frontend] Generalize v1/audio/transcriptions endpoint by @NickLucche in https://github.com/vllm-project/vllm/pull/20179
  • [Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution by @s3woz in https://github.com/vllm-project/vllm/pull/20137
  • [Refactor] Create a function util and cache the results for has_deepgemm, has_deepep, has_pplx by @yewentao256 in https://github.com/vllm-project/vllm/pull/20187
  • [CI Fix] Try fixing eagle e2e test OOM by reducing block allocation by @mgoin in https://github.com/vllm-project/vllm/pull/20213
  • [Quantization] Add compressed-tensors NVFP4 MoE Support by @dsikka in https://github.com/vllm-project/vllm/pull/19990
  • Fix cuda_archs_loose_intersection when handling sm_*a by @huydhn in https://github.com/vllm-project/vllm/pull/20207
  • [Model] support dots1 by @redmoe-moutain in https://github.com/vllm-project/vllm/pull/18254
  • [BUGFIX][DEEPSEEK][MODEL_LOAD] fix w13, w2 weight not initialized assert by @xuechendi in https://github.com/vllm-project/vllm/pull/20202
  • [Misc] Fix import by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20233
  • [doc] Add Slack and Forum to the top navigation by @reidliu41 in https://github.com/vllm-project/vllm/pull/20208
  • [Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model by @noiji in https://github.com/vllm-project/vllm/pull/19598
  • [Bugfix] Fix processor initialization in transformers 4.53.0 by @Isotr0py in https://github.com/vllm-project/vllm/pull/20244
  • [Quantization] Improve BitsAndBytesModelLoader by @jeejeelee in https://github.com/vllm-project/vllm/pull/20242
  • [Docs] Fix 1-2-3 list in v1/prefix_caching.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20243
  • [Bugfix] fix quark ptpc by @lihaoyang-amd in https://github.com/vllm-project/vllm/pull/20251
  • [Spec Decode] Refactor spec decoding into a separate function by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20238
  • [Spec Decode] Clean up spec decode example by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20240
  • [Optimization] Use Shared CachedRequestData Instance Across All Requests by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20232
  • [Unit Test] Add unit test for deep gemm by @yewentao256 in https://github.com/vllm-project/vllm/pull/20090
  • [Core] [Bugfix] [Multimodal] Fix multimodal profiling and generation for SFT/PTQed models by @kylesayrs in https://github.com/vllm-project/vllm/pull/20058
  • [Refactor] Remove useless pdb comment by @yewentao256 in https://github.com/vllm-project/vllm/pull/20266
  • [Bugfix][V1][P/D]Fix the issue of occasional garbled output for P2pNcclConnector by @Abatom in https://github.com/vllm-project/vllm/pull/20263
  • [CLI] Improve CLI arg parsing for -O/--compilation-config by @ProExpertProg in https://github.com/vllm-project/vllm/pull/20156
  • [Bugfix] Fix include prompt in stream response when echo=true by @fyuan1316 in https://github.com/vllm-project/vllm/pull/15233
  • [Misc] Fix spec decode example by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20296
  • [Example] add one-click runnable example for P2P NCCL XpYd by @KuntaiDu in https://github.com/vllm-project/vllm/pull/20246
  • [CI][Intel Gaudi][vllm-Plugin]Add CI for hpu-plugin-v1-test by @xuechendi in https://github.com/vllm-project/vllm/pull/20196
  • [Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA by @chewong in https://github.com/vllm-project/vllm/pull/15897
  • [Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference by @sakogan in https://github.com/vllm-project/vllm/pull/18768
  • [V1] Only print cudagraph tqdm on rank 0 with is_global_first_rank by @mgoin in https://github.com/vllm-project/vllm/pull/19516
  • Fix numel() downcast in vllm/csrc/moe/moe_align_sum_kernels.cu +2 by @r-barnes in https://github.com/vllm-project/vllm/pull/17082
  • [Misc] add xgrammar for arm64 by @prashantgupta24 in https://github.com/vllm-project/vllm/pull/18359
  • Enable ZP Support for Machete by @czhu-cohere in https://github.com/vllm-project/vllm/pull/20268
  • [CPU] Update custom ops for the CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20255
  • [Bugfix] Fix deepep tests by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/20288
  • [Misc] remove redundant char by @kebe7jun in https://github.com/vllm-project/vllm/pull/20287
  • [BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine by @tywuAMD in https://github.com/vllm-project/vllm/pull/19067
  • [doc] fix the incorrect logo in dark mode by @reidliu41 in https://github.com/vllm-project/vllm/pull/20289
  • [Perf] Validate @config in pre-commit instead of dynamically by @lionelvillard in https://github.com/vllm-project/vllm/pull/20200
  • [Quant] [Bugfix] Fix quantization config matching with hf_to_vllm_mapper by @kylesayrs in https://github.com/vllm-project/vllm/pull/20046
  • [Misc] Minor refactor of NIXL background handshake by @NickLucche in https://github.com/vllm-project/vllm/pull/20068
  • Add GLM-4.1V model by @zRzRzRzRzRzRzR in https://github.com/vllm-project/vllm/pull/19331
  • [Model]Add Tencent HunYuanMoEV1 Model Support by @aiyiwang2025 in https://github.com/vllm-project/vllm/pull/20114
  • [Misc] Minor refactoring for scheduler by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20299
  • [Docs] Update transcriptions API to use openai client with stream=True by @NickLucche in https://github.com/vllm-project/vllm/pull/20271
  • [CUDA graphs] Enable full cuda graphs with FA3 AoT scheduling by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20301
  • [Frontend] Expand tools even if tool_choice="none" by @okdshin in https://github.com/vllm-project/vllm/pull/17177
  • [V1] [ROCm] Enable EP with AITER Fused MoE by @tjtanaa in https://github.com/vllm-project/vllm/pull/20270
  • [Optimization] Cache sampled token ids in model runner by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20291
  • remove unused variables in marlin_template.h by @zhoutianzi666 in https://github.com/vllm-project/vllm/pull/20236
  • [Refactor] Refactor import utils by @yewentao256 in https://github.com/vllm-project/vllm/pull/20269
  • Enable group size 64 for Machete by @czhu-cohere in https://github.com/vllm-project/vllm/pull/20290
  • [Kernel][Bugfix] Fixup some warnings in nvfp4_blockwise_moe when CUDA < 12.8 by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/20324
  • [UT][intel GPU] use current_platform instead of device hardcode in v1 tests by @Liangliang-Ma in https://github.com/vllm-project/vllm/pull/20169
  • [Refactor] Remove duplicate find_free_port by @yewentao256 in https://github.com/vllm-project/vllm/pull/20333
  • [Refactor] Remove Unused Env VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON by @yewentao256 in https://github.com/vllm-project/vllm/pull/20334
  • [Misc][Doc] Add missing comment for LLM by @draftbk in https://github.com/vllm-project/vllm/pull/20285
  • [FIX][Intel GPU]fix ipex flash_attn_varlen_func api missing parameter by @jikunshang in https://github.com/vllm-project/vllm/pull/20348
  • [Bugfix] Fix dynamic rotary embedding by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20343
  • fix[Docs]: link anchor is incorrect #20309 by @yyzxw in https://github.com/vllm-project/vllm/pull/20315
  • [Doc][TPU] Add models and features supporting matrix. by @QiliangCui in https://github.com/vllm-project/vllm/pull/20230
  • [TPU] kv cache update kernel supports dynamic grid by @yaochengji in https://github.com/vllm-project/vllm/pull/20235
  • [Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. by @huachenheli in https://github.com/vllm-project/vllm/pull/20105
  • [Model][VLM] Support Keye-VL-8B-Preview by @Kwai-Keye in https://github.com/vllm-project/vllm/pull/20126
  • [Bugfix] Keye-VL compatibility with tok_kwargs (#20058) by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20353
  • [Docs] Fix indentations for 2-level items in deprecation_policy.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20352
  • [Docs] Make TPU ref prettier in google_tpu.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20356
  • [Model] Add Ernie4.5 and Ernie4.5MoE Model Support by @CSWYF3634076 in https://github.com/vllm-project/vllm/pull/20220
  • [Build/CI] Automatically tag DeepSeek related PRs by @houseroad in https://github.com/vllm-project/vllm/pull/20370
  • [NVIDIA] Support Cutlass w8a8 FP8 for Blackwell Geforce GPUs (sm120) by @kaln27 in https://github.com/vllm-project/vllm/pull/17280
  • [Bugfix] Fix the max_seq_len limit of 16384 for DeepSeek models by @huaqiangwang in https://github.com/vllm-project/vllm/pull/20322
  • [Model] Adds support for SlimMoE models Phi-tiny-MoE-instruct by @zichongli5 in https://github.com/vllm-project/vllm/pull/20286
  • Documentation update tool_calling: mapping back to function from response by @cronoik-inceptionai in https://github.com/vllm-project/vllm/pull/20373
  • [Kernels] MoE refactor by @bnellnm in https://github.com/vllm-project/vllm/pull/19636
  • [V1] LogitsProcessor programming model by @afeldman-nm in https://github.com/vllm-project/vllm/pull/16728
  • [Minor] Clean up incorrect comment in test by @njhill in https://github.com/vllm-project/vllm/pull/20382
  • [Misc] add handler HF_TOKEN is emptry string by @lengrongfu in https://github.com/vllm-project/vllm/pull/20369
  • [ROCm][FEAT] Enable Full Graph Mode in AITER MLA V1 Attn Backend (Decode Phase only) by @vllmellm in https://github.com/vllm-project/vllm/pull/20254
  • [DP] Support external DP Load Balancer mode by @njhill in https://github.com/vllm-project/vllm/pull/19790
  • [Docs] Update EAGLE example by @NickLucche in https://github.com/vllm-project/vllm/pull/20375
  • [Bugfix] Fixes for FlashInfer's TORCH_CUDA_ARCH_LIST by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/20136
  • [BugFix] Fix DP headless mode arg validation by @njhill in https://github.com/vllm-project/vllm/pull/20398
  • Enable CPU nightly performance benchmark and its Markdown report by @louie-tsai in https://github.com/vllm-project/vllm/pull/18444
  • [Bugfix] Fix import of CutlassExpertsFp8 in compressed_tensors_moe.py by @bnellnm in https://github.com/vllm-project/vllm/pull/20381
  • [Misc] Small: Fix video loader return type annotations. by @huachenheli in https://github.com/vllm-project/vllm/pull/20389
  • [Bugfix][CI/CD][CPU] Fix CPU CI tests by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20383
  • [TPU] Add a case to cover RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 by @QiliangCui in https://github.com/vllm-project/vllm/pull/20385
  • [Feature] Support MiniMax-M1 function calls features by @qscqesze in https://github.com/vllm-project/vllm/pull/20297
  • [Tests] Update online DP tests to verify that requests are balanced by @njhill in https://github.com/vllm-project/vllm/pull/20157
  • [Misc] Add rules to label Speculative Decoding Related PRs by @draftbk in https://github.com/vllm-project/vllm/pull/20406
  • [doc] fix link by @reidliu41 in https://github.com/vllm-project/vllm/pull/20417
  • [Docs] Replace two list with tables in intel_gaudi.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20414
  • [Core] Move multimodal placeholder from chat utils to model definition by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20355
  • [Kernel] refactor cpu worker v0 cache dtype by @andyxning in https://github.com/vllm-project/vllm/pull/20080
  • [CI/Build][CPU] Enable cross compilation in CPU release pipeline by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20423
  • [Quantization] Bump to use latest bitsandbytes by @jeejeelee in https://github.com/vllm-project/vllm/pull/20424
  • [Model][2/N] Automatic conversion of CrossEncoding model by @noooop in https://github.com/vllm-project/vllm/pull/19978
  • [Misc] Automatically tag PRs to add new models by @Isotr0py in https://github.com/vllm-project/vllm/pull/20222
  • [Frontend] improve vllm bench <bench_type> --help display by @reidliu41 in https://github.com/vllm-project/vllm/pull/20430
  • [Bugfix] Fix flaky test_streaming_response test by @NickLucche in https://github.com/vllm-project/vllm/pull/20363
  • [Frontend] fix duplicate output for bench subcmd by @reidliu41 in https://github.com/vllm-project/vllm/pull/20446
  • [CI] Trimming some failing test groups from AMDPRODUCTION. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/20390
  • [Misc] Clean up InternVL family config registration by @Isotr0py in https://github.com/vllm-project/vllm/pull/19992
  • [Misc] adjust for ipv6 for mookcacke url parse by @andyxning in https://github.com/vllm-project/vllm/pull/20107
  • [Misc] Remove _maybe_ignore_quant_config from GLM4.1v by @zRzRzRzRzRzRzR in https://github.com/vllm-project/vllm/pull/20432
  • [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in https://github.com/vllm-project/vllm/pull/18864
  • [Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning by @NickLucche in https://github.com/vllm-project/vllm/pull/20400
  • [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in https://github.com/vllm-project/vllm/pull/19510
  • Change warn_for_unimplemented_methods to debug by @mgoin in https://github.com/vllm-project/vllm/pull/20455
  • [Platform] Add custom default max tokens by @gmarinho2 in https://github.com/vllm-project/vllm/pull/18557
  • Add ignore consolidated file in mistral example code by @princepride in https://github.com/vllm-project/vllm/pull/20420
  • [Misc] small update by @reidliu41 in https://github.com/vllm-project/vllm/pull/20462
  • [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in https://github.com/vllm-project/vllm/pull/20365
  • [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in https://github.com/vllm-project/vllm/pull/20331
  • [Misc] Add SPDX-FileCopyrightText by @jeejeelee in https://github.com/vllm-project/vllm/pull/20428
  • Support Llama 4 for fused_marlin_moe by @mgoin in https://github.com/vllm-project/vllm/pull/20457
  • [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in https://github.com/vllm-project/vllm/pull/18809
  • [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in https://github.com/vllm-project/vllm/pull/20168
  • [Doc] Fix classification table in list of supported models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20489
  • [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in https://github.com/vllm-project/vllm/pull/18193
  • [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in https://github.com/vllm-project/vllm/pull/20395
  • Enable V1 for Hybrid SSM/Attention Models by @tdoublep in https://github.com/vllm-project/vllm/pull/20016
  • [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in https://github.com/vllm-project/vllm/pull/19757
  • [CI Bugfix] Fix pre-commit failures on main by @mgoin in https://github.com/vllm-project/vllm/pull/20502
  • [Doc] fix multimodal_inputs.md GH examples link by @GuyStone in https://github.com/vllm-project/vllm/pull/20497
  • [Misc] Add security warning for development mode endpoints by @reidliu41 in https://github.com/vllm-project/vllm/pull/20508
  • [doc] small fix by @reidliu41 in https://github.com/vllm-project/vllm/pull/20506
  • [Misc] Remove the unused LoRA test code by @jeejeelee in https://github.com/vllm-project/vllm/pull/20494
  • Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in https://github.com/vllm-project/vllm/pull/20507
  • [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in https://github.com/vllm-project/vllm/pull/19754
  • [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in https://github.com/vllm-project/vllm/pull/20510
  • [Misc] remove unused import by @reidliu41 in https://github.com/vllm-project/vllm/pull/20517
  • test_attention compat with coming xformers change by @bottler in https://github.com/vllm-project/vllm/pull/20487
  • [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in https://github.com/vllm-project/vllm/pull/20491
  • [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in https://github.com/vllm-project/vllm/pull/20509
  • [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/20513
  • [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in https://github.com/vllm-project/vllm/pull/20339
  • [Frontend] Support image object in llm.chat by @sfeng33 in https://github.com/vllm-project/vllm/pull/19635
  • [Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in https://github.com/vllm-project/vllm/pull/20516
  • [Misc] call the pre-defined func by @reidliu41 in https://github.com/vllm-project/vllm/pull/20518
  • [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20412
  • [V1] Support any head size for FlexAttention backend by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20467
  • [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20530
  • [Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20527
  • Implement OpenAI Responses API [1/N] by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20504
  • [Misc] add a tip for pre-commit by @reidliu41 in https://github.com/vllm-project/vllm/pull/20536
  • [Refactor] Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU by @dbyoung18 in https://github.com/vllm-project/vllm/pull/19410
  • [CI/Build] Enable phi2 lora test by @jeejeelee in https://github.com/vllm-project/vllm/pull/20540
  • [XPU][CI] add v1/core test in xpu hardware ci by @Liangliang-Ma in https://github.com/vllm-project/vllm/pull/20537
  • Add docstrings to url_schemes.py to improve readability by @windsonsea in https://github.com/vllm-project/vllm/pull/20545
  • [XPU] log clean up for XPU platform by @yma11 in https://github.com/vllm-project/vllm/pull/20553
  • [Docs] Clean up tables in supported_models.md by @windsonsea in https://github.com/vllm-project/vllm/pull/20552
  • [Misc] remove unused jinaai_serving_reranking by @Abirdcfly in https://github.com/vllm-project/vllm/pull/18878
  • [Misc] Set the minimum openai version by @jeejeelee in https://github.com/vllm-project/vllm/pull/20539
  • [Doc] Remove extra whitespace from CI failures doc by @hmellor in https://github.com/vllm-project/vllm/pull/20565
  • [Doc] Use gh-pr and gh-issue everywhere we can in the docs by @hmellor in https://github.com/vllm-project/vllm/pull/20564
  • [Doc] Fix internal links so they don't always point to latest by @hmellor in https://github.com/vllm-project/vllm/pull/20563
  • [Doc] Add outline for content tabs by @hmellor in https://github.com/vllm-project/vllm/pull/20571
  • [Doc] Fix some MkDocs snippets used in the installation docs by @hmellor in https://github.com/vllm-project/vllm/pull/20572
  • [Model][Last/4] Automatic conversion of CrossEncoding model by @noooop in https://github.com/vllm-project/vllm/pull/19675
  • [Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled by @panpan0000 in https://github.com/vllm-project/vllm/pull/20486
  • [Feature] microbatch tokenization by @ztang2370 in https://github.com/vllm-project/vllm/pull/19334
  • [DP] Copy environment variables to Ray DPEngineCoreActors by @ruisearch42 in https://github.com/vllm-project/vllm/pull/20344
  • [Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel by @jvlunteren in https://github.com/vllm-project/vllm/pull/20308
  • [Misc] Add fully interleaved support for multimodal 'string' content format by @Dekakhrone in https://github.com/vllm-project/vllm/pull/14047
  • [Misc] feat output content in stream response by @lengrongfu in https://github.com/vllm-project/vllm/pull/19608
  • Fix links in multi-modal model contributing page by @hmellor in https://github.com/vllm-project/vllm/pull/18615
  • [Config] Refactor mistral configs by @patrickvonplaten in https://github.com/vllm-project/vllm/pull/20570
  • [Misc] Improve logging for dynamic shape cache compilation by @kyolebu in https://github.com/vllm-project/vllm/pull/20573
  • [Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe by @minosfuture in https://github.com/vllm-project/vllm/pull/20167
  • [Optimize] Don't send token ids when kv connector is not used by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20586
  • Make distinct code and console admonitions so readers are less likely to miss them by @hmellor in https://github.com/vllm-project/vllm/pull/20585
  • [Bugfix]: Fix messy code when using logprobs by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/19209
  • [Doc] Syntax highlight request responses as JSON instead of bash by @hmellor in https://github.com/vllm-project/vllm/pull/20582
  • [Docs] Rewrite offline inference guide by @crypdick in https://github.com/vllm-project/vllm/pull/20594
  • [Docs] Improve docstring for ray data llm example by @crypdick in https://github.com/vllm-project/vllm/pull/20597
  • [Docs] Add Ray Serve LLM section to openai compatible server guide by @crypdick in https://github.com/vllm-project/vllm/pull/20595
  • [Docs] Add Anyscale to frameworks by @crypdick in https://github.com/vllm-project/vllm/pull/20590
  • [Misc] improve error msg by @reidliu41 in https://github.com/vllm-project/vllm/pull/20604
  • [CI/Build][CPU] Fix CPU CI and remove all CPU V0 files by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20560
  • [TPU] Temporary fix vmem oom for long model len by reducing page size by @Chenyaaang in https://github.com/vllm-project/vllm/pull/20278
  • [Frontend] [Core] Integrate Tensorizer into S3 loading machinery, allow passing arbitrary arguments during save/load by @sangstar in https://github.com/vllm-project/vllm/pull/19619
  • [PD][Nixl] Remote consumer READ timeout for clearing request blocks by @NickLucche in https://github.com/vllm-project/vllm/pull/20139
  • [Docs] Improve documentation for Deepseek R1 on Ray Serve LLM by @crypdick in https://github.com/vllm-project/vllm/pull/20601
  • Remove unnecessary explicit title anchors and use relative links instead by @hmellor in https://github.com/vllm-project/vllm/pull/20620
  • Stop using title frontmatter and fix doc that can only be reached by search by @hmellor in https://github.com/vllm-project/vllm/pull/20623
  • [xpu] feat: support multi-lora on XPU by @yma11 in https://github.com/vllm-project/vllm/pull/20616
  • Update torch/xla pin to 20250703 by @vanbasten23 in https://github.com/vllm-project/vllm/pull/20589
  • [Model] Implement missing get_language_model for Keye-VL by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20631
  • Revert invalid spellchecker fix on deepseek_vl2 by @viravera in https://github.com/vllm-project/vllm/pull/20618
  • [CI] Increase the threshold of the MTEB RERANK tests by @noooop in https://github.com/vllm-project/vllm/pull/20615
  • [Bugfix] Fix topk_ids indices_type for CUTLASS w8a8 FP8 MoE by @minosfuture in https://github.com/vllm-project/vllm/pull/20166
  • [Core] Rename get_max_tokens_per_item for backward compatibility by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20630
  • [Bugfix] Fix GLM-4.1-V video prompt update by @Isotr0py in https://github.com/vllm-project/vllm/pull/20635
  • [TPU][Bugfix] disable phi-3 test by @QiliangCui in https://github.com/vllm-project/vllm/pull/20632
  • Replace multiply_add with homogeneous_multiply_add to Address Clang Template Parameter Issue by @wenxin0319 in https://github.com/vllm-project/vllm/pull/20142
  • [misc] refactor Platform.set_device method by @jikunshang in https://github.com/vllm-project/vllm/pull/20262
  • [tech debt] Revisit lora request model checker by @kouroshHakha in https://github.com/vllm-project/vllm/pull/20636
  • [BugFix][Intel GPU] Use refactored API for dist_backend in V1 worker by @ratnampa in https://github.com/vllm-project/vllm/pull/20596
  • [Docs] Improve documentation for multi-node service helper script by @crypdick in https://github.com/vllm-project/vllm/pull/20600
  • [Hardware][PPC64LE] Enable V1 for ppc64le and ARM by @Akashcodes732 in https://github.com/vllm-project/vllm/pull/20554
  • [Bugfix] set default cuda_graph_sizes to min(self.max_num_seqs * 2, 512) by @izhuhaoran in https://github.com/vllm-project/vllm/pull/20628
  • [feat] enable SM100 CUTLASS block scaled group gemm for smaller batch sizes by @djmmoss in https://github.com/vllm-project/vllm/pull/20640
  • Fix bullets in incremental_build.md by @mgoin in https://github.com/vllm-project/vllm/pull/20642
  • [Misc] Fix the size of batched_dummy_mm_inputs in profile_run by @B-201 in https://github.com/vllm-project/vllm/pull/20434
  • [XPU] Use spawn with XPU multiprocessing by @dvrogozh in https://github.com/vllm-project/vllm/pull/20649
  • [Intel GPU] support ray as distributed executor backend for XPU. by @jikunshang in https://github.com/vllm-project/vllm/pull/20659
  • [Docs] fix minimax tool_calling docs error by @qscqesze in https://github.com/vllm-project/vllm/pull/20667
  • [Bugfix] Fix the issue where reasoning_content is None when Thinking is enabled and tool_choice is set to 'required' by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/20662
  • [V1] [Doc] Update V1 docs for Mamba models by @tdoublep in https://github.com/vllm-project/vllm/pull/20499
  • [Doc] Update notes by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20668
  • [Benchmark] Parameterization of streaming loading of multimodal datasets by @Potabk in https://github.com/vllm-project/vllm/pull/20528
  • [Docs] Improve docs for RLHF co-location example by @crypdick in https://github.com/vllm-project/vllm/pull/20599
  • [doc] update doc format by @reidliu41 in https://github.com/vllm-project/vllm/pull/20673
  • [Bugfix] Fix handling of Tensorizer arguments for LoadConfig by @sangstar in https://github.com/vllm-project/vllm/pull/20643
  • [TPU][Bugfix] fix test_pallas by @yaochengji in https://github.com/vllm-project/vllm/pull/20666
  • [XPU][CI] enhance xpu test support by @Liangliang-Ma in https://github.com/vllm-project/vllm/pull/20652
  • [Bench] Add NVFP4 GEMM benchmark script by @mgoin in https://github.com/vllm-project/vllm/pull/20578
  • [Doc] Update CPU doc by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20676
  • Remove heading from installation inc.md file by @hmellor in https://github.com/vllm-project/vllm/pull/20697
  • [CI/Build] Enlarge tolerance for a CPU multi-modal test by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20684
  • Support Llama 4 for cutlass_moe_fp4 by @mgoin in https://github.com/vllm-project/vllm/pull/20453
  • [Kernel] Triton implementation of causal-conv1d for Mamba-based models by @thoangtrvn in https://github.com/vllm-project/vllm/pull/18218
  • [Kernel] Add Conch backend for mixed-precision linear layer by @jmanning-stackav in https://github.com/vllm-project/vllm/pull/19818
  • [Feature][Quantization] MXFP4 support for MOE models by @fxmarty-amd in https://github.com/vllm-project/vllm/pull/17888
  • [BugFix]: Properly set engine_id when using multi connector by @Missmiaom in https://github.com/vllm-project/vllm/pull/19487
  • [Misc] Simplify the prefix caching logic on draft tokens by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20701
  • [CI/Build] Fix FlashInfer double build in Dockerfile by @mgoin in https://github.com/vllm-project/vllm/pull/20651
  • [Misc] DP : Add ExpertTokensMetadata by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/20332
  • Use NVCC --compress-mode to reduce binary size by 30% by @mgoin in https://github.com/vllm-project/vllm/pull/20694
  • Correct PPMissingLayer handling in Deepseek-V2-Lite PP deployment by @eicherseiji in https://github.com/vllm-project/vllm/pull/20665
  • [Frontend] Support Tool Calling with both tool_choice='required' and $defs. by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/20629
  • [BugFix][CPU] Fix CPU worker dependency on cumem_allocator by @njhill in https://github.com/vllm-project/vllm/pull/20696
  • [BugFix] Fix VllmConfig() construction on all platforms by @njhill in https://github.com/vllm-project/vllm/pull/20695
  • [TPU][Core] Make the "load weight exceeds HBM" error more instructive for customers by @Chenyaaang in https://github.com/vllm-project/vllm/pull/20644
  • [KVConnector] Aggregate finished requests on the scheduler by @orozery in https://github.com/vllm-project/vllm/pull/19555
  • [Misc] loosen new-model tagger conditions by @Isotr0py in https://github.com/vllm-project/vllm/pull/20747
  • [CI/Build] Fix Basic Models Test by @jeejeelee in https://github.com/vllm-project/vllm/pull/20728
  • [Bugfix][Build][Non-CUDA] Only referencing CMAKE_CUDA_COMPILER_VERSION on CUDA where it is defined by @gshtras in https://github.com/vllm-project/vllm/pull/20738
  • [doc] fix ordered list by @reidliu41 in https://github.com/vllm-project/vllm/pull/20749
  • [CI Bugfix] Skip failing Tensorizer+LoRA test by @mgoin in https://github.com/vllm-project/vllm/pull/20724
  • Normalize lm-eval command between baseline and correctness test by @mgoin in https://github.com/vllm-project/vllm/pull/18560
  • [Misc] Clean up mark to fork process in BNB tests by @Isotr0py in https://github.com/vllm-project/vllm/pull/20692
  • [Doc] Add engine args back in to the docs by @hmellor in https://github.com/vllm-project/vllm/pull/20674
  • Update Dockerfile FlashInfer to v0.2.8rc1 by @mgoin in https://github.com/vllm-project/vllm/pull/20718
  • [Hardware][CPU] Vllm int8 quantization enablement for ARM CPU by @nishith-fujitsu in https://github.com/vllm-project/vllm/pull/14129
  • [ROCm][Regression] Remove tensor creation that harms performance on ROCm by @gshtras in https://github.com/vllm-project/vllm/pull/20741
  • [Model] Add reasoning parser for Hunyuan A13B Model by @kzjeef in https://github.com/vllm-project/vllm/pull/20625
  • [Model][VLM] Support JinaVL Reranker by @shineran96 in https://github.com/vllm-project/vllm/pull/20260
  • Fix DeepSeek-R1-0528 chat template by @sfbemerk in https://github.com/vllm-project/vllm/pull/20717
  • [Test] Remove docker build from test. by @QiliangCui in https://github.com/vllm-project/vllm/pull/20542
  • [Bugfix] [CI] Fix Tensorizer LoRA test by @sangstar in https://github.com/vllm-project/vllm/pull/20760
  • [V0][V1][Core] Add outlines integration for V1, and update V0 integration. by @unaidedelf8777 in https://github.com/vllm-project/vllm/pull/15975
  • [CI] Fix pre commit issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/20782
  • [Bugfix] Remove assertion of expert_map being None by @minosfuture in https://github.com/vllm-project/vllm/pull/20714
  • [Core] Add Support for Default Modality Specific LoRAs [generate / chat completions] by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/19126
  • [Bugfix] Fused MoE Modular Kernel chunking loop by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/20392
  • [KVConnector] Always call connector clear_metadata() at end of step by @njhill in https://github.com/vllm-project/vllm/pull/20756
  • [Misc] MoE ModularKernel : Introduce TopKWeightAndReduce by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/20648
  • [Bugfix][Benchmark] Make sure the output length > 0 when testing prefill workload. by @KuntaiDu in https://github.com/vllm-project/vllm/pull/20786
  • [Docs] Lazy import gguf by @simon-mo in https://github.com/vllm-project/vllm/pull/20785
  • [CI Bugfix] Specify same TORCH_CUDA_ARCH_LIST for flashinfer aot and install by @mgoin in https://github.com/vllm-project/vllm/pull/20772
  • Add kimi-k2 tool parser by @MoyanZitto in https://github.com/vllm-project/vllm/pull/20789
  • [fix]: disable cutlass block scaled group gemm for EP by @djmmoss in https://github.com/vllm-project/vllm/pull/20781
  • [Model] Support HF format of minimax by @mgoin in https://github.com/vllm-project/vllm/pull/20211
  • [Attention] MLA - Flashinfer Ragged Prefill by @alexm-redhat in https://github.com/vllm-project/vllm/pull/20034
  • [Feature] Integrate SM100 DeepGEMM support by @yewentao256 in https://github.com/vllm-project/vllm/pull/20087
  • [XPU] XCCL support enabled in torch 2.8.0.dev nightly builds by @ratnampa in https://github.com/vllm-project/vllm/pull/20705
  • [Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf by @ProExpertProg in https://github.com/vllm-project/vllm/pull/19830
  • [V1] Enable Mamba2 layers other than MambaMixer2 in the v1 engine by @nopperl in https://github.com/vllm-project/vllm/pull/20660
  • [doc] fold long code block by @reidliu41 in https://github.com/vllm-project/vllm/pull/20795
  • [Bugfix] Upgrade depyf to 0.19 and streamline custom pass logging by @ProExpertProg in https://github.com/vllm-project/vllm/pull/20777
  • [Quantization][1/N] MoE support BNB-Inflight Quantization by @jeejeelee in https://github.com/vllm-project/vllm/pull/20061
  • [Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). by @pavanimajety in https://github.com/vllm-project/vllm/pull/19825
  • [Bugfix] Refactor /invocations to be task-agnostic by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20764
  • Temporarily suspend google/gemma-3-1b-it. by @QiliangCui in https://github.com/vllm-project/vllm/pull/20722
  • [Bugfix] Add missing field to TritonLanguagePlaceholder by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20812
  • [doc] fix ordered list issue by @reidliu41 in https://github.com/vllm-project/vllm/pull/20819
  • [Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/20449
  • [Kernel] Basic tuned configs for NVFP4 CUTLASS dense GEMM by @mgoin in https://github.com/vllm-project/vllm/pull/20646
  • [Docs] Data Parallel deployment documentation by @njhill in https://github.com/vllm-project/vllm/pull/20768
  • [Bugfix] Fix OOM in language generation test by @Isotr0py in https://github.com/vllm-project/vllm/pull/20814
  • Update kimi-k2 tool calling docs, enable unit tests by @MoyanZitto in https://github.com/vllm-project/vllm/pull/20821
  • [CI Bug] Fix Async Engine, Inputs, Utils, Worker Test: 'State' object has no attribute 'enable_server_load_tracking' by @yewentao256 in https://github.com/vllm-project/vllm/pull/20845
  • Integration SM100 FlashInfer fused allreduce RMSNorm by @ilmarkov in https://github.com/vllm-project/vllm/pull/20691
  • Add pynccl all-gatherv and reducescatterv by @trevor-m in https://github.com/vllm-project/vllm/pull/20154
  • [Misc] Restrict deep_gemm's log output by @jeejeelee in https://github.com/vllm-project/vllm/pull/20827
  • [Bugfix] Lazy import fused_experts in BitsAndBytesMoEMethod to avoid break not-cuda-alike devices by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/20822
  • [Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading by @yurhett in https://github.com/vllm-project/vllm/pull/20682
  • [CI/Build] Ensure compatibility with Transformers v4.53 by @Isotr0py in https://github.com/vllm-project/vllm/pull/20541
  • [Bugfix] : Fix typo - logger.warn_once -> logger.warning_once by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/20852
  • [Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models by @NickLucche in https://github.com/vllm-project/vllm/pull/20637
  • [Bugfix] Replace unavailable video url in multimodal test by @Isotr0py in https://github.com/vllm-project/vllm/pull/20854
  • [Misc] Respect no_use_tqdm_on_load flag while capturing CUDA graph by @lk-chen in https://github.com/vllm-project/vllm/pull/20834
  • [Bug] Fix DeepGemm for EP low latency case by @yewentao256 in https://github.com/vllm-project/vllm/pull/20833
  • [Docs] Update basic.md by @luccafong in https://github.com/vllm-project/vllm/pull/20846
  • [Bugfix] Fix torch.compile x LoRA for PyTorch 2.8 by @zou3519 in https://github.com/vllm-project/vllm/pull/20823
  • [cold start time] add envs.VLLM_COMPILE_DEPYF to guard decompile by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/20790
  • Remove extra tensor on CPU by @maxdebayser in https://github.com/vllm-project/vllm/pull/20693
  • Enable ModelOpt Llama4 fp8 checkpoint deployment by @Edwardf0t1 in https://github.com/vllm-project/vllm/pull/20419
  • Revert "Use NVCC --compress-mode to reduce binary size by 30% #20694" by @mgoin in https://github.com/vllm-project/vllm/pull/20853
  • [Model] New model support for microsoft/Phi-4-mini-flash-reasoning by @congcongchen123 in https://github.com/vllm-project/vllm/pull/20702
  • [Bugfix] Fix Tensor Parallelism Padding Consistency in Granite Models by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/20843
  • [docs] convert supported configs to table by @reidliu41 in https://github.com/vllm-project/vllm/pull/20858
  • [Bugfix] Restrict Machete to only run on Hopper by @mgoin in https://github.com/vllm-project/vllm/pull/20830
  • [Sched] Enhance the logic to remove stopped requests from queues by @WoosukKwon in https://github.com/vllm-project/vllm/pull/20739
  • [Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant by @yewentao256 in https://github.com/vllm-project/vllm/pull/20841
  • [Bugfix] Fix a couple PPLX+CUTLASS MoE bugs by @ElizaWszola in https://github.com/vllm-project/vllm/pull/20825
  • [Refactor] Change the way of import triton by @yewentao256 in https://github.com/vllm-project/vllm/pull/20774
  • [Core] Support multiple tasks per model by @NickLucche in https://github.com/vllm-project/vllm/pull/20771
  • Re-enable google/gemma-3-1b-it accuracy test by @QiliangCui in https://github.com/vllm-project/vllm/pull/20866
  • Support for LlamaForSequenceClassification by @thechaos16 in https://github.com/vllm-project/vllm/pull/20807
  • [Bugfix] Fix: add patch_rope_scaling after hf override by @Wangmerlyn in https://github.com/vllm-project/vllm/pull/20857
  • [Bugfix] fix definition of RerankDocument by @Liuchenlong in https://github.com/vllm-project/vllm/pull/20877
  • [V1] [ROCm] [AITER] Upgrade AITER to commit 916bf3c and bugfix APIs by @tjtanaa in https://github.com/vllm-project/vllm/pull/20880
  • [V1] Hybrid allocator without prefix caching by @nopperl in https://github.com/vllm-project/vllm/pull/20661
  • [Core] Add update_config RPC method by @22quinn in https://github.com/vllm-project/vllm/pull/20095
  • [Prefix Cache] Add reproducible prefix-cache block hashing using SHA-256 + CBOR (64bit) by @vMaroon in https://github.com/vllm-project/vllm/pull/20511
  • Removing redundant python version check by @Dannyso05 in https://github.com/vllm-project/vllm/pull/20888
  • Fix: Add missing EOFError handling in CLI complete command by @reidliu41 in https://github.com/vllm-project/vllm/pull/20896
  • [ROCm] [Bugfix] [Critical]: Fix mamba compilation bug by @tjtanaa in https://github.com/vllm-project/vllm/pull/20883
  • [Quantization] add BNB for MixtralForCausalLM by @jeejeelee in https://github.com/vllm-project/vllm/pull/20893
  • [Refactor][V1] Move outlines utils for V1 imports by @aarnphm in https://github.com/vllm-project/vllm/pull/20878
  • [MISC] Move bind_kv_cache to worker module by @wangxiyuan in https://github.com/vllm-project/vllm/pull/20900
  • [CI/Build] Fix OOM issue in Jina-VL test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20907
  • [Bugfix] Bump up mistral_common to support v13 tokenizer by @22quinn in https://github.com/vllm-project/vllm/pull/20905
  • [Misc] Remove unused function by @reidliu41 in https://github.com/vllm-project/vllm/pull/20909
  • [Bugfix]: Fix messy code when using logprobs by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/20910
  • [Misc] Log the reason for falling back to FlexAttention by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20699
  • [Model] Add Ling implementation by @ant-yy in https://github.com/vllm-project/vllm/pull/20680
  • [CI] cc folks on changes to vllm/compilation by @zou3519 in https://github.com/vllm-project/vllm/pull/20925
  • [CI] Update codeowner for compilation code by @houseroad in https://github.com/vllm-project/vllm/pull/20929
  • [Misc] Clean up Aimv2 config registration in Ovis config by @Isotr0py in https://github.com/vllm-project/vllm/pull/20921
  • [CI/Build] Add Transformers nightly tests in CI by @Isotr0py in https://github.com/vllm-project/vllm/pull/20924
  • Change default model to Qwen3-0.6B by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/20335
  • Add benchmark dataset for mlperf llama tasks by @mgoin in https://github.com/vllm-project/vllm/pull/20338
  • [Misc] ModularKernel : Perform WeightAndReduce inside TritonExperts & DeepGemmExperts by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/20725
  • [Misc] Relax translations tests by @NickLucche in https://github.com/vllm-project/vllm/pull/20856
  • Fix overflow indexing in causal_conv1d kernel by @tdoublep in https://github.com/vllm-project/vllm/pull/20938
  • [Docs] remove outdated performance benchmark by @KuntaiDu in https://github.com/vllm-project/vllm/pull/20935
  • Fall back if flashinfer comm module not found by @sarckk in https://github.com/vllm-project/vllm/pull/20936
  • SM100 Cutlass MLA decode with unrestricted num_heads (< 128) for DeepSeek TP by @alexm-redhat in https://github.com/vllm-project/vllm/pull/20769
  • [BugFix] VLLM_DISABLE_COMPILE_CACHE=1 should disable all reads and writes from the cache by @zou3519 in https://github.com/vllm-project/vllm/pull/20942
  • [Bugfix] Fix incorrect dispatch for CutlassBlockScaledGroupedGemm and DeepGEMM by @mgoin in https://github.com/vllm-project/vllm/pull/20933
  • [CI/Build] Split Entrypoints Test into LLM and API Server by @mgoin in https://github.com/vllm-project/vllm/pull/20945
  • Use w8a8 quantized matmul Pallas kernel by @vanbasten23 in https://github.com/vllm-project/vllm/pull/19170
  • [Docs] Add Kuberay to deployment integrations by @crypdick in https://github.com/vllm-project/vllm/pull/20592
  • feat: add image zoom to improve image viewing experience by @reidliu41 in https://github.com/vllm-project/vllm/pull/20763
  • [CI] Fix flaky test_streaming_response test by @NickLucche in https://github.com/vllm-project/vllm/pull/20913
  • Enabled BnB NF4 inference on Gaudi by @rsshaik1 in https://github.com/vllm-project/vllm/pull/20172
  • [Bugfix] Switch bailout logic for kv-cache-dtype with SM100 Flashinfer by @pavanimajety in https://github.com/vllm-project/vllm/pull/20934
  • [Doc] Clearer mistral3 and pixtral model support description by @Isotr0py in https://github.com/vllm-project/vllm/pull/20926
  • [cold start] replace VLLM_COMPILE_DEPYF with debug_dump_dir by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/20940
  • [Model] Add AutoWeightsLoader support for BERT, RoBERTa by @jennifurhe in https://github.com/vllm-project/vllm/pull/20534
  • Implement Async Scheduling by @WoosukKwon in https://github.com/vllm-project/vllm/pull/19970
  • [Misc] Refactor AllReduceFusionPass. Remove parameter by @ilmarkov in https://github.com/vllm-project/vllm/pull/20918
  • [frontend] Add --help=page option for paginated help output by @reidliu41 in https://github.com/vllm-project/vllm/pull/20961
  • [Docs] Improve documentation for RLHF example by @crypdick in https://github.com/vllm-project/vllm/pull/20598
  • [frontend] Refactor CLI Args for a better modular integration by @kouroshHakha in https://github.com/vllm-project/vllm/pull/20206
  • [Docs] Improve documentation for ray cluster launcher helper script by @crypdick in https://github.com/vllm-project/vllm/pull/20602
  • [TPU] Optimize kv cache update kernel by @tengyifei in https://github.com/vllm-project/vllm/pull/20415
  • [V1] [Hybrid] Refactor mamba state shape calculation; enable V1 via cli by @tdoublep in https://github.com/vllm-project/vllm/pull/20840
  • [MISC] Add init files for python package by @Potabk in https://github.com/vllm-project/vllm/pull/20908
  • [doc] Add more details for Ray-based DP by @ruisearch42 in https://github.com/vllm-project/vllm/pull/20948
  • [Deprecation] Remove TokenizerPoolConfig by @hmellor in https://github.com/vllm-project/vllm/pull/20968
  • [v1][core] Support for attention free models by @christian-pinto in https://github.com/vllm-project/vllm/pull/20811
  • Voxtral by @patrickvonplaten in https://github.com/vllm-project/vllm/pull/20970
  • [CI/Build] Fix wrong path in Transformers Nightly Models Test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20994
  • [Deprecation] Remove everything scheduled for removal in v0.10.0 by @hmellor in https://github.com/vllm-project/vllm/pull/20979
  • Configure Gemini by @hmellor in https://github.com/vllm-project/vllm/pull/20971
  • [Deprecation] Remove nullable_kvs by @hmellor in https://github.com/vllm-project/vllm/pull/20969
  • Add full serve CLI reference back to docs by @hmellor in https://github.com/vllm-project/vllm/pull/20978
  • [ROCm] warpSize is being made non constexpr in ROCm 7.0 by @gshtras in https://github.com/vllm-project/vllm/pull/20330
  • [BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 by @thoangtrvn in https://github.com/vllm-project/vllm/pull/20838
  • [Frontend] Support cache_salt in /v1/completions and /v1/responses by @dr75 in https://github.com/vllm-project/vllm/pull/20981 (see the usage sketch after this list)
  • [Bug Fix] get_distributed_init_method should get the ip from get_ip i… by @Relics in https://github.com/vllm-project/vllm/pull/20889
  • [Nvidia] Integrate SM100 cudnn prefill API to MLA prefill by @elfiegg in https://github.com/vllm-project/vllm/pull/20411
  • [Frontend] OpenAI Responses API supports input image by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/20975
  • [Frontend] Remove print left in FrontendArgs.add_cli_args by @mgoin in https://github.com/vllm-project/vllm/pull/21004
  • [Model] Add ModelConfig class for GraniteMoeHybrid to override default max_seq_len_to_capture by @tdoublep in https://github.com/vllm-project/vllm/pull/20923
  • [Misc] bump xgrammar version to v0.1.21 by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/20992
  • [Chore] Remove outdated transformers check by @b8zhong in https://github.com/vllm-project/vllm/pull/20989
  • [Misc] Refactor: Improve argument handling for conda command by @reidliu41 in https://github.com/vllm-project/vllm/pull/20481
  • [Docs] Enhance Anyscale documentation, add quickstart links for vLLM by @crypdick in https://github.com/vllm-project/vllm/pull/21018
  • [Bugfix] Correct per_act_token in CompressedTensorsW8A8Fp8MoECutlassM… by @minosfuture in https://github.com/vllm-project/vllm/pull/20937
  • Add Dockerfile argument for VLLM_USE_PRECOMPILED environment by @dougbtv in https://github.com/vllm-project/vllm/pull/20943
  • [CI][HPU] update for v0 deprecate by switching to VLLM_TARGET_DEVICE=empty by @xuechendi in https://github.com/vllm-project/vllm/pull/21006
  • [Bugfix] Fix Mistral3 support on SM100/SM120 by @mgoin in https://github.com/vllm-project/vllm/pull/20998
  • [Doc] Remove duplicate docstring by @yewentao256 in https://github.com/vllm-project/vllm/pull/21012
  • [Voxtral] Add more tests by @patrickvonplaten in https://github.com/vllm-project/vllm/pull/21010
  • Avoid direct comparison of floating point numbers by @maxdebayser in https://github.com/vllm-project/vllm/pull/21002
  • [Meta] Llama4 EAGLE Support by @morgendave in https://github.com/vllm-project/vllm/pull/20591
  • [TPU] fix kv_cache_update kernel block size choosing logic by @yaochengji in https://github.com/vllm-project/vllm/pull/21007
  • [BugFix] Fix import error on non-blackwell machines by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/21020
  • Fix inadvertently silenced PP tests for mp, add DeepSeek V2/V3 model family to PP tests by @eicherseiji in https://github.com/vllm-project/vllm/pull/20831
  • [Docs] Add intro and fix 1-2-3 list in frameworks/open-webui.md by @windsonsea in https://github.com/vllm-project/vllm/pull/19199
  • [Model] Consolidate pooler implementations by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/20927
  • feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information by @m-misiura in https://github.com/vllm-project/vllm/pull/20575
  • [fix] fix qwen image_embeds input by @h-avsha in https://github.com/vllm-project/vllm/pull/21049
  • Remove Qwen Omni workaround that's no longer necessary by @hmellor in https://github.com/vllm-project/vllm/pull/21057
  • [Model] Remove model sampler by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21059
  • Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) by @nirda7 in https://github.com/vllm-project/vllm/pull/12010
  • Remove torch_xla.tpu.version() from pallas.py. by @QiliangCui in https://github.com/vllm-project/vllm/pull/21065
  • Update PyTorch to torch==2.7.1 for CUDA by @mgoin in https://github.com/vllm-project/vllm/pull/21011
  • [Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group by @Kevin-XiongC in https://github.com/vllm-project/vllm/pull/21024
  • [Docker] Allow FlashInfer to be built in the ARM CUDA Dockerfile by @mgoin in https://github.com/vllm-project/vllm/pull/21013
  • [TPU] Start using python 3.12 by @vanbasten23 in https://github.com/vllm-project/vllm/pull/21000
  • [Bugfix] Fix Machete zero point issue for GPTQ models on SM90 by @mgoin in https://github.com/vllm-project/vllm/pull/21066
  • [Attention] Refactor attention metadata builder interface by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/20466
  • [V1][P/D]Enhance Performance and code readability for P2pNcclConnector by @Abatom in https://github.com/vllm-project/vllm/pull/20906
  • [V1] [KVConnector] Fix MultiprocExecutor worker output aggregation by @sdavidbd in https://github.com/vllm-project/vllm/pull/21048
  • [Misc] Fix PhiMoE expert mapping by @jeejeelee in https://github.com/vllm-project/vllm/pull/21085
  • [Bugfix]: Fix final_res_batch list index out of range error by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/21055
  • [Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/20903
  • [Model] Add ToolParser and MoE Config for Hunyuan A13B by @kzjeef in https://github.com/vllm-project/vllm/pull/20820
  • [VLM] Add Nemotron-Nano-VL-8B-V1 support by @kylehh in https://github.com/vllm-project/vllm/pull/20349
  • [Docs] Improve docstring formatting for FusedMoEParallelConfig.make by @hmellor in https://github.com/vllm-project/vllm/pull/21117
  • [Misc] Avoid unnecessary import by @wangxiyuan in https://github.com/vllm-project/vllm/pull/21106
  • [Docs] Move code block out of admonition now that it's short by @hmellor in https://github.com/vllm-project/vllm/pull/21118
  • [Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE by @ElizaWszola in https://github.com/vllm-project/vllm/pull/20762
  • [Model] Update pooling model interface by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21058
  • [Misc] Qwen MoE model supports LoRA by @jeejeelee in https://github.com/vllm-project/vllm/pull/20932
  • On environments where NUMA cannot be detected, we get 0 by @ericcurtin in https://github.com/vllm-project/vllm/pull/21115
  • [V0 deprecation] Remove V0 HPU backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/21131
  • [Log] Debugging Log with more Information by @yewentao256 in https://github.com/vllm-project/vllm/pull/20770
  • [Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel by @elvischenv in https://github.com/vllm-project/vllm/pull/21133
  • [Docs] Add minimal demo of Ray Data API usage by @crypdick in https://github.com/vllm-project/vllm/pull/21080
  • [Docs] Update supported models documentation with missing models by @luccafong in https://github.com/vllm-project/vllm/pull/20844
  • [Attention] Make local attention backend agnostic by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/21093
  • [Doc] Add inplace weights loading example by @22quinn in https://github.com/vllm-project/vllm/pull/19640
  • [Core] FlashInfer CUTLASS fused MoE backend (NVFP4) by @wenscarl in https://github.com/vllm-project/vllm/pull/20037
  • [Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm by @shixianc in https://github.com/vllm-project/vllm/pull/20911
  • [Misc] Do not print async output warning for v1 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/21151
  • [benchmark] Sending request strictly follows the random intervals by @Jialin in https://github.com/vllm-project/vllm/pull/21108
  • [Misc] Make MM embedding merge interface explicit in model runner by @ywang96 in https://github.com/vllm-project/vllm/pull/21147
  • [Model] Re-add the implicit conversion feature for as_seq_cls_model by @noooop in https://github.com/vllm-project/vllm/pull/21103
  • [Bugfix] The special_tokens in tokenizer should also be controlled by do_lower_case in encoder_config. by @noooop in https://github.com/vllm-project/vllm/pull/20750
  • [Doc] Fix typo in model name by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21178
  • [Bugfix] Allocate less memory in non-batched CUTLASS MoE by @ElizaWszola in https://github.com/vllm-project/vllm/pull/21121
  • [Core] Set pooling params based on task and model by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21128
  • Let GraniteMoeAttention use YaRN by @tdoublep in https://github.com/vllm-project/vllm/pull/21174
  • [CI] Update CODEOWNERS for vllm/compilation by @zou3519 in https://github.com/vllm-project/vllm/pull/21185
  • [Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 by @zou3519 in https://github.com/vllm-project/vllm/pull/19346
  • [Core] Avoid KVCacheBlock.eq invocations in FreeKVCacheBlockQueue by @JialinOuyang-Meta in https://github.com/vllm-project/vllm/pull/21005
  • [Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) by @hax0r31337 in https://github.com/vllm-project/vllm/pull/21077
  • Elastic Expert Parallel Initial Support by @ruisearch42 in https://github.com/vllm-project/vllm/pull/20775
  • [Quantization] Enable BNB support for more MoE models by @jeejeelee in https://github.com/vllm-project/vllm/pull/21100
  • [Core] Support Local Chunked Attention for Hybrid KV Cache by @luccafong in https://github.com/vllm-project/vllm/pull/19351
  • [Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/21183
  • [V0 Deprecation] Remove V0 Spec Decode workers by @WoosukKwon in https://github.com/vllm-project/vllm/pull/21152
  • [Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/21193
  • [BugFix][CPU] Fix TorchSDPABackendImpl doesn't have use_irope by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/21200
  • [Bug] DeepGemm: Fix TypeError: per_block_cast_to_fp8() missing 1 required positional argument: 'use_ue8m0' for SM100 by @yewentao256 in https://github.com/vllm-project/vllm/pull/21187
  • [Model] EXAONE 4.0 model support by @Deepfocused in https://github.com/vllm-project/vllm/pull/21060
  • [Misc][Tools][Benchmark] Add readme file for auto_tune script by @Chenyaaang in https://github.com/vllm-project/vllm/pull/20779
  • Fix a couple of Voxtral tests by @huydhn in https://github.com/vllm-project/vllm/pull/21218
  • [V0 deprecation] Remove long context LoRA by @jeejeelee in https://github.com/vllm-project/vllm/pull/21169
  • [Bugfix] Fix ndarray video color from VideoAsset by @Isotr0py in https://github.com/vllm-project/vllm/pull/21064
  • [BugFix] Fix potential cuda-graph IMA by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/21196
  • Add torch golden impl for moe_align_block_size kernel test by @shixianc in https://github.com/vllm-project/vllm/pull/20653
  • [NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency by @kaixih in https://github.com/vllm-project/vllm/pull/20645
  • [Bugfix][Frontend] Fix openai CLI arg middleware by @22quinn in https://github.com/vllm-project/vllm/pull/21220
  • [bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/21032
  • Fix/remove some broken model executor tests by @rabi in https://github.com/vllm-project/vllm/pull/21224
  • [CI/CD][bugfix] fix: error argument to loads has incompatible type by @llsj14 in https://github.com/vllm-project/vllm/pull/21223
  • [Docs] Update the link to the 'Prometheus/Grafana' example by @1195343015 in https://github.com/vllm-project/vllm/pull/21225
  • [BugFix] Make PD work with Ray by @kouroshHakha in https://github.com/vllm-project/vllm/pull/21072
  • [V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers by @tdoublep in https://github.com/vllm-project/vllm/pull/21194
  • [V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small by @WoosukKwon in https://github.com/vllm-project/vllm/pull/21217
  • [BugFix] Fix full cuda graph slot_mapping by @fhl2000 in https://github.com/vllm-project/vllm/pull/21228
  • GLM-4 Update by @zRzRzRzRzRzRzR in https://github.com/vllm-project/vllm/pull/20736
  • [Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models. by @tdoublep in https://github.com/vllm-project/vllm/pull/21233
  • [TPU] support fp8 kv cache quantization by @yaochengji in https://github.com/vllm-project/vllm/pull/19292
  • Enable v1 metrics tests by @eicherseiji in https://github.com/vllm-project/vllm/pull/20953
  • [Model] use AutoWeightsLoader for bart by @calvin0327 in https://github.com/vllm-project/vllm/pull/18299
  • [Model] Support VLMs with transformers backend by @zucchini-nlp in https://github.com/vllm-project/vllm/pull/20543
  • [bugfix] fix syntax warning caused by backslash by @1195343015 in https://github.com/vllm-project/vllm/pull/21251
  • [CI] Cleanup modelscope version constraint in Dockerfile by @yankay in https://github.com/vllm-project/vllm/pull/21243
  • [Docs] Add RFC Meeting to Issue Template by @simon-mo in https://github.com/vllm-project/vllm/pull/21279
  • Add the instruction to run e2e validation manually before release by @huydhn in https://github.com/vllm-project/vllm/pull/21023
  • [Bugfix] Fix missing placeholder in logger debug by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21280
  • [Model][1/N] Support multiple poolers at model level by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21227
  • [Docs] Fix hardcoded links in docs by @hmellor in https://github.com/vllm-project/vllm/pull/21287
  • [Docs] Make tables more space efficient in supported_models.md by @hmellor in https://github.com/vllm-project/vllm/pull/21291
  • [Misc] unify variable for LLM instance by @andyxning in https://github.com/vllm-project/vllm/pull/20996
  • Add Nvidia ModelOpt config adaptation by @Edwardf0t1 in https://github.com/vllm-project/vllm/pull/19815
  • [Misc] Add sliding window to flashinfer test by @WoosukKwon in https://github.com/vllm-project/vllm/pull/21282
  • [CPU] Enable shared-memory based pipeline parallel for CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/21289
  • [BugFix] make utils.current_stream thread-safe (#21252) by @simpx in https://github.com/vllm-project/vllm/pull/21253
  • [Misc] Add dummy maverick test by @minosfuture in https://github.com/vllm-project/vllm/pull/21199
  • [Attention] Clean up iRoPE in V1 by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/21188
  • [DP] Fix Prometheus Logging by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/21257
  • Fix bad lm-eval fork by @mgoin in https://github.com/vllm-project/vllm/pull/21318
  • [perf] Speed up align sum kernels by @hj-mistral in https://github.com/vllm-project/vllm/pull/21079
  • [v1][sampler] Inplace logprobs comparison to get the token rank by @houseroad in https://github.com/vllm-project/vllm/pull/21283
  • [XPU] Enable external_launcher to serve as an executor via torchrun by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/21021
  • [Doc] Fix CPU doc format by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/21316
  • [Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU by @ratnampa in https://github.com/vllm-project/vllm/pull/21338
  • Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) by @minosfuture in https://github.com/vllm-project/vllm/pull/21334
  • [Core] Minimize number of dict lookup in _maybe_evict_cached_block by @Jialin in https://github.com/vllm-project/vllm/pull/21281
  • [V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible by @tdoublep in https://github.com/vllm-project/vllm/pull/21300
  • [Refactor] Fix Compile Warning #1444-D by @yewentao256 in https://github.com/vllm-project/vllm/pull/21208
  • Fix kv_cache_dtype handling for out-of-tree HPU plugin by @kzawora-intel in https://github.com/vllm-project/vllm/pull/21302
  • [Misc] DeepEPHighThroughput - Enable Inductor pass by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/21311
  • [Bug] DeepGemm: Fix Cuda Init Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/21312
  • Update fp4 quantize API by @wenscarl in https://github.com/vllm-project/vllm/pull/21327
  • [Feature][eplb] add verify ep or tp or dp by @lengrongfu in https://github.com/vllm-project/vllm/pull/21102
  • Add arcee model by @alyosha-swamy in https://github.com/vllm-project/vllm/pull/21296
  • [Bugfix] Fix cached block eviction logic by @simon-mo in https://github.com/vllm-project/vllm/pull/21357
  • [Misc] Remove deprecated args in v0.10 by @kebe7jun in https://github.com/vllm-project/vllm/pull/21349
  • [Core] Optimize update checks in LogitsProcessor by @Jialin in https://github.com/vllm-project/vllm/pull/21245
  • [benchmark] Port benchmark request sent optimization to benchmark_serving by @Jialin in https://github.com/vllm-project/vllm/pull/21209
  • [Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool by @Jialin in https://github.com/vllm-project/vllm/pull/21222
  • [Misc] unify variable for LLM instance v2 by @andyxning in https://github.com/vllm-project/vllm/pull/21356
  • [perf] Add fused MLA QKV + strided layernorm by @mickaelseznec in https://github.com/vllm-project/vllm/pull/21116
  • [feat]: add SM100 support for cutlass FP8 groupGEMM by @djmmoss in https://github.com/vllm-project/vllm/pull/20447
  • [Perf] Cuda Kernel for Per Token Group Quant by @yewentao256 in https://github.com/vllm-project/vllm/pull/21083
  • Adds parallel model weight loading for runai_streamer by @bbartels in https://github.com/vllm-project/vllm/pull/21330
  • [feat] Enable mm caching for transformers backend by @zucchini-nlp in https://github.com/vllm-project/vllm/pull/21358
  • Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" by @yewentao256 in https://github.com/vllm-project/vllm/pull/21384
  • Add tokenization_kwargs to encode for embedding model truncation by @Receiling in https://github.com/vllm-project/vllm/pull/21033 (see the usage sketch after this list)
  • [Bugfix] Decode Tokenized IDs to Strings for hf_processor in llm.chat() with model_impl=transformers by @ariG23498 in https://github.com/vllm-project/vllm/pull/21353
  • [CI/Build] Fix test failure due to updated model repo by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21375
  • Fix Flashinfer Allreduce+Norm enable disable calculation based on fi_allreduce_fusion_max_token_num by @xinli-git in https://github.com/vllm-project/vllm/pull/21325
  • [Model] Add Qwen3CoderToolParser by @ranpox in https://github.com/vllm-project/vllm/pull/21396
  • [Misc] Copy HF_TOKEN env var to Ray workers by @ruisearch42 in https://github.com/vllm-project/vllm/pull/21406
  • [BugFix] Fix ray import error mem cleanup bug by @joerunde in https://github.com/vllm-project/vllm/pull/21381
  • [CI/Build] Fix model executor tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21387
  • [Bugfix][ROCm][Build] Fix build regression on ROCm by @gshtras in https://github.com/vllm-project/vllm/pull/21393
  • Simplify weight loading in Transformers backend by @hmellor in https://github.com/vllm-project/vllm/pull/21382
  • [BugFix] Update python to python3 calls for image; fix prefix & input calculations. by @ericehanley in https://github.com/vllm-project/vllm/pull/21391
  • [BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update by @xuechendi in https://github.com/vllm-project/vllm/pull/21414
  • [Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported by @elvischenv in https://github.com/vllm-project/vllm/pull/21420
  • Changing "amdproduction" allocation. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/21409
  • [Bugfix] Fix nightly transformers CI failure by @Isotr0py in https://github.com/vllm-project/vllm/pull/21427
  • [Core] Add basic unit test for maybe_evict_cached_block by @Jialin in https://github.com/vllm-project/vllm/pull/21400
  • [Cleanup] Only log MoE DP setup warning if DP is enabled by @mgoin in https://github.com/vllm-project/vllm/pull/21315
  • add clear messages for deprecated models by @youkaichao in https://github.com/vllm-project/vllm/pull/21424
  • [Bugfix] ensure tool_choice is popped when tool_choice:null is passed in json payload by @gcalmettes in https://github.com/vllm-project/vllm/pull/19679
  • Fixed typo in profiling logs by @sergiopaniego in https://github.com/vllm-project/vllm/pull/21441
  • [Docs] Fix bullets and grammars in tool_calling.md by @windsonsea in https://github.com/vllm-project/vllm/pull/21440
  • [Sampler] Introduce logprobs mode for logging by @houseroad in https://github.com/vllm-project/vllm/pull/21398
  • Fix Mamba V2 test not asserting failures by @fabianlim in https://github.com/vllm-project/vllm/pull/21379
  • [Misc] fixed nvfp4_moe test failures due to invalid kwargs by @chenyang78 in https://github.com/vllm-project/vllm/pull/21246
  • [Docs] Clean up v1/metrics.md by @windsonsea in https://github.com/vllm-project/vllm/pull/21449
  • [Model] add Hunyuan V1 Dense Model support. by @kzjeef in https://github.com/vllm-project/vllm/pull/21368
  • [V1] Check all pooling tasks during profiling by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/21299
  • [Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. by @sighingnow in https://github.com/vllm-project/vllm/pull/21364
  • [Tests] Add tests for headless internal DP LB by @njhill in https://github.com/vllm-project/vllm/pull/21450
  • [Core][Model] PrithviMAE Enablement on vLLM v1 engine by @christian-pinto in https://github.com/vllm-project/vllm/pull/20577
  • Add test case for compiling multiple graphs by @sarckk in https://github.com/vllm-project/vllm/pull/21044
  • [TPU][TEST] Fix the downloading issue in TPU v1 test 11. by @QiliangCui in https://github.com/vllm-project/vllm/pull/21418
  • [Core] Add reload_weights RPC method by @22quinn in https://github.com/vllm-project/vllm/pull/20096
  • [V1] Fix local chunked attention always disabled by @sarckk in https://github.com/vllm-project/vllm/pull/21419
  • [V0 Deprecation] Remove Prompt Adapters by @mgoin in https://github.com/vllm-project/vllm/pull/20588
  • [Core] Freeze gc during cuda graph capture to speed up init by @mgoin in https://github.com/vllm-project/vllm/pull/21146
  • feat(gguf_loader): accept HF repo paths & URLs for GGUF by @hardikkgupta in https://github.com/vllm-project/vllm/pull/20793
  • [Frontend] Set MAX_AUDIO_CLIP_FILESIZE_MB via env var instead of hardcoding by @deven-labovitch in https://github.com/vllm-project/vllm/pull/21374
  • [Misc] Add dummy maverick test to CI by @minosfuture in https://github.com/vllm-project/vllm/pull/21324
  • [XPU][UT] increase intel xpu CI test scope by @Liangliang-Ma in https://github.com/vllm-project/vllm/pull/21492
  • [Bugfix] Fix casing warning by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/21468
  • [Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process by @david6666666 in https://github.com/vllm-project/vllm/pull/21437
  • [BugFix]: Batch generation from prompt_embeds fails for long prompts by @KazusatoOoko in https://github.com/vllm-project/vllm/pull/21390
  • [BugFix] Fix KVConnector TP worker aggregation by @njhill in https://github.com/vllm-project/vllm/pull/21473
  • [DP] Internal Load Balancing Per Node [one-pod-per-node] by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/21238
  • Dump input metadata on crash for async scheduling by @WoosukKwon in https://github.com/vllm-project/vllm/pull/21258
  • [BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses by @yinghai in https://github.com/vllm-project/vllm/pull/21211
  • Add think chunk by @juliendenize in https://github.com/vllm-project/vllm/pull/21333
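
Two of the frontend changes above lend themselves to short usage sketches. First, the cache_salt field added to /v1/completions and /v1/responses (#20981): the sketch below is a minimal illustration, not canonical documentation, and the server URL, model name, and salt value are placeholder assumptions; the field is passed through the OpenAI client's extra_body escape hatch.

```python
# Minimal sketch: per-tenant prefix-cache isolation via cache_salt (#20981).
# Assumes a vLLM OpenAI-compatible server is already running; base_url,
# model name, and the salt value are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="Qwen/Qwen3-0.6B",
    prompt="The capital of France is",
    max_tokens=8,
    # Requests using different salts never share cached prefix blocks.
    extra_body={"cache_salt": "tenant-a"},
)
print(resp.choices[0].text)
```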
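Second, a hedged sketch of the tokenization_kwargs argument that #21033 adds to LLM.encode for embedding-model truncation. The model name and the Hugging Face tokenizer keys ("truncation", "max_length") are illustrative assumptions, not prescribed values.

```python
# Minimal sketch: truncating over-long inputs for an embedding model via the
# tokenization_kwargs argument on LLM.encode (#21033). The model name and the
# tokenizer kwargs below are assumptions for illustration.
from vllm import LLM

llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed")

outputs = llm.encode(
    ["A document long enough to overflow the model's maximum sequence length ..."],
    tokenization_kwargs={"truncation": True, "max_length": 512},
)
for out in outputs:
    # The prompt is tokenized (and, here, truncated) before pooling.
    print(len(out.prompt_token_ids))
```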

New Contributors

  • @py-andy-c made their first contribution in https://github.com/vllm-project/vllm/pull/19399
  • @2niuhe made their first contribution in https://github.com/vllm-project/vllm/pull/19394
  • @leopardracer made their first contribution in https://github.com/vllm-project/vllm/pull/19442
  • @artetaout made their first contribution in https://github.com/vllm-project/vllm/pull/19085
  • @runzhen made their first contribution in https://github.com/vllm-project/vllm/pull/19453
  • @strutive07 made their first contribution in https://github.com/vllm-project/vllm/pull/19522
  • @mobicham made their first contribution in https://github.com/vllm-project/vllm/pull/19265
  • @kouroshHakha made their first contribution in https://github.com/vllm-project/vllm/pull/19378
  • @BoyuanFeng made their first contribution in https://github.com/vllm-project/vllm/pull/19587
  • @sahelib25 made their first contribution in https://github.com/vllm-project/vllm/pull/18354
  • @jiahanc made their first contribution in https://github.com/vllm-project/vllm/pull/19500
  • @quanliu1991 made their first contribution in https://github.com/vllm-project/vllm/pull/18957
  • @f14-bertolotti made their first contribution in https://github.com/vllm-project/vllm/pull/19564
  • @Navanit-git made their first contribution in https://github.com/vllm-project/vllm/pull/19557
  • @nguyenhoangthuan99 made their first contribution in https://github.com/vllm-project/vllm/pull/19597
  • @diliu0349 made their first contribution in https://github.com/vllm-project/vllm/pull/19600
  • @Zzz9990 made their first contribution in https://github.com/vllm-project/vllm/pull/18596
  • @yhtang made their first contribution in https://github.com/vllm-project/vllm/pull/19788
  • @zsolt-borbely-htec made their first contribution in https://github.com/vllm-project/vllm/pull/19803
  • @zuxin666 made their first contribution in https://github.com/vllm-project/vllm/pull/17148
  • @NekoMimiUnagi made their first contribution in https://github.com/vllm-project/vllm/pull/19301
  • @xzbdmw made their first contribution in https://github.com/vllm-project/vllm/pull/19735
  • @Xerxes-cn made their first contribution in https://github.com/vllm-project/vllm/pull/19860
  • @nie3e made their first contribution in https://github.com/vllm-project/vllm/pull/19663
  • @vladmihailescu made their first contribution in https://github.com/vllm-project/vllm/pull/18777
  • @rabinadk1 made their first contribution in https://github.com/vllm-project/vllm/pull/19910
  • @amitm02 made their first contribution in https://github.com/vllm-project/vllm/pull/19057
  • @jinqinn made their first contribution in https://github.com/vllm-project/vllm/pull/19544
  • @Flink-ddd made their first contribution in https://github.com/vllm-project/vllm/pull/19643
  • @Jun-Howie made their first contribution in https://github.com/vllm-project/vllm/pull/19395
  • @seemethere made their first contribution in https://github.com/vllm-project/vllm/pull/20032
  • @h-avsha made their first contribution in https://github.com/vllm-project/vllm/pull/19984
  • @max-wittig made their first contribution in https://github.com/vllm-project/vllm/pull/19695
  • @lsz05 made their first contribution in https://github.com/vllm-project/vllm/pull/20067
  • @kyolebu made their first contribution in https://github.com/vllm-project/vllm/pull/20135
  • @lihaoyang-amd made their first contribution in https://github.com/vllm-project/vllm/pull/19744
  • @Yazan-Sharaya made their first contribution in https://github.com/vllm-project/vllm/pull/19946
  • @ilyal-cerebras made their first contribution in https://github.com/vllm-project/vllm/pull/20065
  • @fabiendupont made their first contribution in https://github.com/vllm-project/vllm/pull/18064
  • @SHA-4096 made their first contribution in https://github.com/vllm-project/vllm/pull/19700
  • @1195343015 made their first contribution in https://github.com/vllm-project/vllm/pull/20185
  • @redmoe-moutain made their first contribution in https://github.com/vllm-project/vllm/pull/18254
  • @noiji made their first contribution in https://github.com/vllm-project/vllm/pull/19598
  • @chewong made their first contribution in https://github.com/vllm-project/vllm/pull/15897
  • @sakogan made their first contribution in https://github.com/vllm-project/vllm/pull/18768
  • @czhu-cohere made their first contribution in https://github.com/vllm-project/vllm/pull/20268
  • @aiyiwang2025 made their first contribution in https://github.com/vllm-project/vllm/pull/20114
  • @zhoutianzi666 made their first contribution in https://github.com/vllm-project/vllm/pull/20236
  • @Liangliang-Ma made their first contribution in https://github.com/vllm-project/vllm/pull/20169
  • @yyzxw made their first contribution in https://github.com/vllm-project/vllm/pull/20315
  • @Kwai-Keye made their first contribution in https://github.com/vllm-project/vllm/pull/20126
  • @CSWYF3634076 made their first contribution in https://github.com/vllm-project/vllm/pull/20220
  • @kaln27 made their first contribution in https://github.com/vllm-project/vllm/pull/17280
  • @huaqiangwang made their first contribution in https://github.com/vllm-project/vllm/pull/20322
  • @zichongli5 made their first contribution in https://github.com/vllm-project/vllm/pull/20286
  • @cronoik-inceptionai made their first contribution in https://github.com/vllm-project/vllm/pull/20373
  • @sangbumlikeagod made their first contribution in https://github.com/vllm-project/vllm/pull/18809
  • @djmmoss made their first contribution in https://github.com/vllm-project/vllm/pull/19757
  • @GuyStone made their first contribution in https://github.com/vllm-project/vllm/pull/20497
  • @bottler made their first contribution in https://github.com/vllm-project/vllm/pull/20487
  • @dbyoung18 made their first contribution in https://github.com/vllm-project/vllm/pull/19410
  • @Abirdcfly made their first contribution in https://github.com/vllm-project/vllm/pull/18878
  • @Dekakhrone made their first contribution in https://github.com/vllm-project/vllm/pull/14047
  • @viravera made their first contribution in https://github.com/vllm-project/vllm/pull/20618
  • @wenxin0319 made their first contribution in https://github.com/vllm-project/vllm/pull/20142
  • @ratnampa made their first contribution in https://github.com/vllm-project/vllm/pull/20596
  • @dvrogozh made their first contribution in https://github.com/vllm-project/vllm/pull/20649
  • @thoangtrvn made their first contribution in https://github.com/vllm-project/vllm/pull/18218
  • @jmanning-stackav made their first contribution in https://github.com/vllm-project/vllm/pull/19818
  • @Missmiaom made their first contribution in https://github.com/vllm-project/vllm/pull/19487
  • @orozery made their first contribution in https://github.com/vllm-project/vllm/pull/19555
  • @nishith-fujitsu made their first contribution in https://github.com/vllm-project/vllm/pull/14129
  • @kzjeef made their first contribution in https://github.com/vllm-project/vllm/pull/20625
  • @shineran96 made their first contribution in https://github.com/vllm-project/vllm/pull/20260
  • @unaidedelf8777 made their first contribution in https://github.com/vllm-project/vllm/pull/15975
  • @MoyanZitto made their first contribution in https://github.com/vllm-project/vllm/pull/20789
  • @nopperl made their first contribution in https://github.com/vllm-project/vllm/pull/20660
  • @trevor-m made their first contribution in https://github.com/vllm-project/vllm/pull/20154
  • @yurhett made their first contribution in https://github.com/vllm-project/vllm/pull/20682
  • @thechaos16 made their first contribution in https://github.com/vllm-project/vllm/pull/20807
  • @Wangmerlyn made their first contribution in https://github.com/vllm-project/vllm/pull/20857
  • @Liuchenlong made their first contribution in https://github.com/vllm-project/vllm/pull/20877
  • @vMaroon made their first contribution in https://github.com/vllm-project/vllm/pull/20511
  • @Dannyso05 made their first contribution in https://github.com/vllm-project/vllm/pull/20888
  • @ant-yy made their first contribution in https://github.com/vllm-project/vllm/pull/20680
  • @rsshaik1 made their first contribution in https://github.com/vllm-project/vllm/pull/20172
  • @jennifurhe made their first contribution in https://github.com/vllm-project/vllm/pull/20534
  • @tengyifei made their first contribution in https://github.com/vllm-project/vllm/pull/20415
  • @Relics made their first contribution in https://github.com/vllm-project/vllm/pull/20889
  • @dougbtv made their first contribution in https://github.com/vllm-project/vllm/pull/20943
  • @morgendave made their first contribution in https://github.com/vllm-project/vllm/pull/20591
  • @m-misiura made their first contribution in https://github.com/vllm-project/vllm/pull/20575
  • @nirda7 made their first contribution in https://github.com/vllm-project/vllm/pull/12010
  • @Kevin-XiongC made their first contribution in https://github.com/vllm-project/vllm/pull/21024
  • @ericcurtin made their first contribution in https://github.com/vllm-project/vllm/pull/21115
  • @shixianc made their first contribution in https://github.com/vllm-project/vllm/pull/20911
  • @Jialin made their first contribution in https://github.com/vllm-project/vllm/pull/21108
  • @JialinOuyang-Meta made their first contribution in https://github.com/vllm-project/vllm/pull/21005
  • @hax0r31337 made their first contribution in https://github.com/vllm-project/vllm/pull/21077
  • @Deepfocused made their first contribution in https://github.com/vllm-project/vllm/pull/21060
  • @fhl2000 made their first contribution in https://github.com/vllm-project/vllm/pull/21228
  • @alyosha-swamy made their first contribution in https://github.com/vllm-project/vllm/pull/21296
  • @bbartels made their first contribution in https://github.com/vllm-project/vllm/pull/21330
  • @Receiling made their first contribution in https://github.com/vllm-project/vllm/pull/21033
  • @ariG23498 made their first contribution in https://github.com/vllm-project/vllm/pull/21353
  • @xinli-git made their first contribution in https://github.com/vllm-project/vllm/pull/21325
  • @ranpox made their first contribution in https://github.com/vllm-project/vllm/pull/21396
  • @ericehanley made their first contribution in https://github.com/vllm-project/vllm/pull/21391
  • @sergiopaniego made their first contribution in https://github.com/vllm-project/vllm/pull/21441
  • @hardikkgupta made their first contribution in https://github.com/vllm-project/vllm/pull/20793
  • @deven-labovitch made their first contribution in https://github.com/vllm-project/vllm/pull/21374
  • @MatthewBonanni made their first contribution in https://github.com/vllm-project/vllm/pull/21468
  • @david6666666 made their first contribution in https://github.com/vllm-project/vllm/pull/21437
  • @KazusatoOoko made their first contribution in https://github.com/vllm-project/vllm/pull/21390

Full Changelog: https://github.com/vllm-project/vllm/compare/v0.9.1...v0.10.0