[Kernel] GGUF MoE kernel #14613
Conversation
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Overall LGTM! Just some nits about ops registration, PTAL!
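For context on the ops-registration nit, here is a minimal sketch of the general `torch.library` pattern for exposing a custom kernel op so `torch.compile` can trace it. The op name `_gguf_demo::moe_matmul`, its signature, and the dense-matmul body are illustrative assumptions, not the registration code actually used in this PR.

```python
# Hedged sketch: generic custom-op registration, not this PR's implementation.
import torch

@torch.library.custom_op("_gguf_demo::moe_matmul", mutates_args=())
def moe_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # A real implementation would dispatch to the CUDA kernel; a dense matmul
    # stands in here so the sketch runs on any machine.
    return x @ w

@moe_matmul.register_fake
def _(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Shape-only "fake" implementation used during tracing/compilation.
    return x.new_empty((x.shape[0], w.shape[1]))
```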
template <typename scalar_t, int qk, int qr, int qi, bool need_sum,
          typename block_q_t, int mmq_x, int mmq_y, int nwarps,
          allocate_tiles_cuda_t allocate_tiles, load_tiles_cuda_t load_tiles,
          int vdr, vec_dot_q_mul_mat_cuda_t vec_dot>
static __device__ __forceinline__ void moe_q(
Is this file adapted/copied from somewhere? If so, we need to add the source of it for easier maintenance.
It's just adapted from the mmq kernel that's already in the repo; I'm not sure if I should mention that.
I think it's still fine to mention it since there's no such kernel in llama.cpp, so that other developers interested in this kernel won't be confused. :)
Sure thing, I added paths to both files I took inspiration from.
else:
    for tok, (w, idx) in enumerate(zip(topk_weights, topk_ids)):
Can you add a warning about the performance degradation of this fallback when the user is using an i-matrix quant?
Good idea, added a warning.
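For readers following along, here is a hedged sketch of what such a per-token fallback with a warning can look like. The function name, tensor layouts, and the `dequantize`/`expert_weights` helpers are placeholders for illustration, not the code merged in this PR.

```python
# Hedged sketch of a slow MoE fallback path with the requested warning.
import logging
import torch

logger = logging.getLogger(__name__)

def moe_fallback(x: torch.Tensor,             # [num_tokens, hidden]
                 expert_weights: list,         # one quantized weight per expert
                 topk_weights: torch.Tensor,   # [num_tokens, top_k]
                 topk_ids: torch.Tensor,       # [num_tokens, top_k]
                 dequantize) -> torch.Tensor:
    """Slow path used when no fused MoE kernel exists for the quant type."""
    logger.warning(
        "GGUF MoE: no fused kernel for this quantization type "
        "(e.g. i-matrix quants); falling back to a per-token loop, "
        "which is significantly slower.")
    out = torch.zeros_like(x)
    for tok, (weights, ids) in enumerate(zip(topk_weights, topk_ids)):
        for w, idx in zip(weights.tolist(), ids.tolist()):
            # Dequantize the selected expert and apply it to this token.
            dense_w = dequantize(expert_weights[idx])
            out[tok] += w * (x[tok] @ dense_w)
    return out
```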
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
Amazing achievement! We should explore evals and benchmarks to detail the compression tradeoffs for users.
Overall, this speeds up DeepSeek GGUF and enables graph caching, jumping from 10 to 50 tok/s on 8xH100 for Q4_K quants.
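As a usage note, here is a minimal sketch of exercising the GGUF path through vLLM's offline LLM API; the local model path and tokenizer repo below are placeholders, not the exact setup benchmarked above.

```python
# Hedged sketch: loading a GGUF MoE checkpoint with the offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/deepseek-moe-Q4_K_M.gguf",        # hypothetical local GGUF file
    tokenizer="deepseek-ai/DeepSeek-V2-Lite-Chat",   # tokenizer from the base repo
    tensor_parallel_size=8,
)

outputs = llm.generate(
    ["Explain mixture-of-experts in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```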