[Kernel] LoRA - Enable CUDAGraphs for V1 #14626

varun-sundar-rabindranath · 2025-03-11T16:57:36Z

Enable CUDAGraphs support for V1

github-actions · 2025-03-11T16:57:50Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

varun-sundar-rabindranath · 2025-03-11T19:07:48Z

vllm/lora/layers.py

+                                        1, 0)
+        embeddings_indices = torch.narrow(
+            self.punica_wrapper._embeddings_indices, 1, 0, x.size(0))
+


^ These changes are to avoid errors such as:

raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['input_ids'].size()[0], L['positions'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['input_ids'].size()[0]) are valid because L['input_ids'].size()[0] was inferred to be a constant (8192).
  - Not all values of RelaxedUnspecConstraint(L['positions'].size()[0]) are valid because L['positions'].size()[0] was inferred to be a constant (8192).
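For readers unfamiliar with torch.narrow, here is a minimal sketch of the pattern used in the diff above: a preallocated buffer is narrowed to the length of the current input, so the slice length follows x.size(0) instead of a fixed number. The buffer name, its shape, and the helper function are illustrative assumptions, not the actual punica_wrapper code.

```python
import torch

# Hypothetical persistent buffer, preallocated for the largest batch.
MAX_TOKENS = 16
embeddings_indices_buf = torch.arange(2 * MAX_TOKENS).reshape(2, MAX_TOKENS)


def select_indices(x: torch.Tensor) -> torch.Tensor:
    # Narrow along dim 1, starting at 0, with a length taken from the
    # input itself so that it tracks the number of tokens in the batch.
    return torch.narrow(embeddings_indices_buf, 1, 0, x.size(0))


x = torch.zeros(5, dtype=torch.long)
view = select_indices(x)

# The result is a view of the buffer (no copy), equivalent to buf[:, :5].
print(view.shape)
print(view.data_ptr() == embeddings_indices_buf.data_ptr())
```

torch.narrow returns a view, so no extra memory is allocated per step.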

varun-sundar-rabindranath · 2025-03-11T20:04:20Z

vllm/lora/layers.py

-        full_output = self.base_layer.forward(
-            x.add_(indices * added_tokens_mask))
+        full_output = self.base_layer.forward(x +
+                                              (indices * added_tokens_mask))


Here x is input_ids. In V1, we don't zero out the CUDA graph padding region,
so avoid the in-place update here to prevent garbage from accumulating in the input buffer.
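A small sketch of the failure mode described above, assuming a persistent input buffer that is reused across steps without being reset. The tensor values and function names are made up for illustration; `offsets` stands in for `indices * added_tokens_mask`.

```python
import torch

# Hypothetical persistent input buffer; the last two slots play the role
# of CUDA graph padding that is never zeroed between steps.
input_ids = torch.tensor([3, 7, 0, 0])
offsets = torch.tensor([1, 1, 1, 1])  # stand-in for indices * added_tokens_mask


def forward_in_place(x: torch.Tensor) -> torch.Tensor:
    # Mutates x, so every call leaves the offsets behind in the buffer.
    return x.add_(offsets)


def forward_out_of_place(x: torch.Tensor) -> torch.Tensor:
    # Allocates a new tensor; the caller's buffer is left untouched.
    return x + offsets


buf_a = input_ids.clone()
forward_out_of_place(buf_a)
forward_out_of_place(buf_a)

buf_b = input_ids.clone()
forward_in_place(buf_b)
forward_in_place(buf_b)

print(buf_a)  # unchanged after two steps
print(buf_b)  # offsets have been added twice
```

With the out-of-place add, replaying the same buffer gives the same input every time.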

varun-sundar-rabindranath · 2025-03-11T20:08:56Z

vllm/config.py

+            vllm_factors.append(
+                hashlib.md5(
+                    str(self.scheduler_config.max_num_batched_tokens).encode()
+                ).hexdigest())


During torch.compile, static LoRA buffers such as those in

vllm/vllm/lora/punica_wrapper/punica_base.py

Line 133 in 5305673

self._token_lora_indices = torch.empty(max_num_batched_tokens,

and

vllm/vllm/lora/ops/triton_ops/v1/v1_kernel_metadata.py

Line 24 in 5305673

token_lora_mapping = torch.empty(max_num_tokens,

get captured along with their sizes and strides (they aren't dynamic).

When max_num_batched_tokens changes and the previously captured graph is executed, we hit assert_size_stride assertions on these tensors. As a solution, we simply recompile whenever max_num_batched_tokens changes.
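To illustrate the idea, here is a rough sketch of how adding a hash of max_num_batched_tokens to a list of factors changes the resulting cache key, which in turn forces a fresh compile. The helper names and the way factors are combined are assumptions for illustration only and do not mirror the real code in vllm/config.py.

```python
import hashlib


def lora_factor(max_num_batched_tokens: int) -> str:
    # Same shape as the diff: hash the stringified value.
    return hashlib.md5(str(max_num_batched_tokens).encode()).hexdigest()


def compile_cache_key(factors: list[str]) -> str:
    # Hypothetical combiner: hash the list of factors into one key.
    return hashlib.md5(str(factors).encode()).hexdigest()


key_8192 = compile_cache_key(["base-config", lora_factor(8192)])
key_8192_again = compile_cache_key(["base-config", lora_factor(8192)])
key_4096 = compile_cache_key(["base-config", lora_factor(4096)])

# Same setting: same key, so a cached compile can be reused.
print(key_8192 == key_8192_again)
# Different setting: different key, so a recompile is triggered.
print(key_8192 == key_4096)
```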

varun-sundar-rabindranath requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners March 11, 2025 16:57

varun-sundar-rabindranath mentioned this pull request Mar 11, 2025

[Do Not Merge] - LoRA V1 Reference PR #11613

Draft

mergify bot added the v1 label Mar 11, 2025

add cudagraph support

f92f3e2

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

varun-sundar-rabindranath force-pushed the varun/v1-lora-cudagraph branch from 544b16c to f92f3e2 Compare March 11, 2025 19:01

varun-sundar-rabindranath mentioned this pull request Mar 12, 2025

[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none. #14435

Open

1 task
