
Mseznec/flash attention fp8 #14570

Open

wants to merge 5 commits into base: main from mseznec/flash-attention-fp8

Conversation

@mickaelseznec commented Mar 10, 2025

This PR adds support for FP8 KV cache with FlashAttention 3 (related PR in flash-attn here). cc @LucasWilkinson. Please do not merge this PR until it references vllm-project/flash-attention.

FlashAttention (unlike FlashInfer) performs the attention with Q, K, and V all in FP8.
Performance is usually better than both FlashInfer with an FP8 KV cache and FlashAttention 3 with bf16.
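
For context, here is a minimal conceptual sketch (not vLLM's or flash-attn's actual code) of per-tensor FP8 quantization; the resulting scale is the quantity that the q_scale/k_scale/v_scale parameters correspond to and that the kernel applies as a descale factor:

```python
# Conceptual sketch only, not the PR's implementation: per-tensor FP8
# quantization. The scale maps the tensor into float8_e4m3fn range; the kernel
# later multiplies by the same scale ("descale") inside the attention math.
import torch

def quantize_fp8(x: torch.Tensor):
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = x.abs().amax().clamp(min=1e-12) / finfo.max
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return x_fp8, scale  # scale is reused as the descale factor at kernel time
```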

I added support for v0 and v1, plus some unit tests.

Note that I've added a trick for checkpoints that don't provide q_scale: the k_scale is reused instead (which is something TRTLLM does, fwiw).
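
For illustration, a rough sketch of that fallback (names are illustrative, not necessarily the PR's exact fields):

```python
# Illustrative sketch of the checkpoint fallback, not the PR's exact code:
# when a checkpoint ships k_scale/v_scale but no q_scale, reuse k_scale as the
# query descale factor (mirroring TRTLLM's behavior).
from typing import Optional
import torch

def resolve_q_scale(q_scale: Optional[torch.Tensor],
                    k_scale: torch.Tensor) -> torch.Tensor:
    if q_scale is None:
        # No query scale in the checkpoint: fall back to the key scale.
        return k_scale
    return q_scale
```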

Also, I added a small quality-of-life improvement for debugging v1: workers now send back their traceback when they raise an exception.
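
One way such worker-side error reporting could look (a hedged sketch, not vLLM's actual v1 executor code):

```python
# Hedged sketch, not vLLM's actual v1 worker loop: catch exceptions in the
# worker process and ship the formatted traceback back to the engine, so the
# driver log shows where the worker actually failed, not just the exception repr.
import traceback
from multiprocessing import Queue

def worker_step(fn, response_queue: Queue):
    try:
        response_queue.put(("ok", fn()))
    except Exception as exc:
        response_queue.put(("error", f"{exc!r}\n{traceback.format_exc()}"))
```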


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the other CI tests on top of these by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Mickael Seznec <mickael@mistral.ai> (all commits)

@mickaelseznec force-pushed the mseznec/flash-attention-fp8 branch from bc909f9 to 2b985ed on March 10, 2025 at 15:41
@mickaelseznec (Author)

CI is failing because vllm/tests/entrypoints/openai/test_accuracy.py, referenced here, no longer exists.

@robertgshaw2-redhat any idea how I should fix this? Just rename it in run-tpu-test.sh? (@NickLucche, you moved the file.)

@NickLucche (Contributor)

This is a known issue; there is a PR addressing it here: #13898. It won't block your PR.

@NickLucche (Contributor)

I see there's some other problem with building the image, but CI likely just needs another spin.

@LucasWilkinson (Collaborator)

@mickaelseznec apologies for the delay; vllm-project/flash-attention#50 (review) has been merged, so you can now point to vllm_flash_attn.

We will need to populate the sccache on the server to get it through CI; I can help with this once the tag is updated 👍

@LucasWilkinson (Collaborator) left a comment

Thanks for the contribution! Looks clean 😄. I'll approve once it's updated to use vllm_flash_attn; I added a couple of comments.


q_descale = q_scale.expand((num_seqs, num_kv_heads))
k_descale = k_scale.expand((num_seqs, num_kv_heads))
v_descale = v_scale.expand((num_seqs, num_kv_heads))
Review comment (Collaborator):

nit: could we maybe test per-head scales here too, i.e., also test with non-zero strides?
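
A sketch of what that suggested test variant might look like (hypothetical test code, not part of the PR):

```python
# Hypothetical test variant: per-head descale tensors with real, non-zero
# strides, in contrast to the stride-0 broadcast that .expand() produces above.
import torch

num_seqs, num_kv_heads = 4, 8
q_descale = torch.rand(num_seqs, num_kv_heads, dtype=torch.float32)
k_descale = torch.rand(num_seqs, num_kv_heads, dtype=torch.float32)
v_descale = torch.rand(num_seqs, num_kv_heads, dtype=torch.float32)
# q_scale.expand(...) on a scalar yields strides of (0, 0); these are (num_kv_heads, 1).
assert q_descale.stride() == (num_kv_heads, 1)
```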

@@ -240,15 +240,6 @@ def get_attn_backend_cls(cls, selected_backend, head_size, dtype,
"Cannot use FlashAttention-2 backend for dtype other than "
"torch.float16 or torch.bfloat16.")
target_backend = _Backend.XFORMERS
elif kv_cache_dtype is not None and \
Review comment (Collaborator):

We should keep this check but restrict it to FA2, i.e. check get_flash_attn_version() != 2 (get_flash_attn_version() is in vllm/attention/backends/utils.py).
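
A sketch under those assumptions (the helper name and exact condition are illustrative, not the final diff):

```python
# Illustrative helper, not the final diff: keep rejecting an FP8 KV cache only
# when FlashAttention-2 is in use, since FA3 can handle FP8 directly.
from typing import Optional

from vllm.attention.backends.utils import get_flash_attn_version

def needs_fp8_fallback(kv_cache_dtype: Optional[str]) -> bool:
    # True -> caller should pick a non-FlashAttention backend (e.g. XFORMERS).
    return (kv_cache_dtype is not None
            and kv_cache_dtype.startswith("fp8")
            and get_flash_attn_version() == 2)
```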
