[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none. #14435
Comments
We have a benchmark result at slack_lora_thread. We are aware of this issue and will be optimizing the LoRA performance. Could you please provide your model and LoRA config?
Model: llama-3.3-70b-instruct-awq

{
"alpha_pattern": {},
"auto_mapping": null,
"base_model_name_or_path": "/u01/app/mlo/models/Llama-3.3-70B-Instruct",
"bias": "none",
"fan_in_fan_out": false,
"inference_mode": true,
"init_lora_weights": true,
"layer_replication": null,
"layers_pattern": null,
"layers_to_transform": null,
"loftq_config": {},
"lora_alpha": 512,
"lora_dropout": 0.0,
"megatron_config": null,
"megatron_core": "megatron.core",
"modules_to_save": null,
"peft_type": "LORA",
"r": 256,
"rank_pattern": {},
"revision": null,
"target_modules": [
"gate_proj",
"v_proj",
"down_proj",
"q_proj",
"k_proj",
"o_proj",
"up_proj"
],
"task_type": "CAUSAL_LM",
"use_dora": false,
"use_rslora": false
}
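For a rough sense of why rank 256 is so much heavier than rank 16, here is a back-of-the-envelope sketch (not from this thread) of the extra per-token work the adapter adds across the seven target modules; the Llama-3-70B shapes used (80 layers, hidden size 8192, intermediate size 28672, 8 KV heads of head_dim 128) are assumptions on my part, not values taken from this issue:

```python
# Back-of-the-envelope LoRA overhead per decoded token.
# Shapes below are the commonly published Llama-3-70B dimensions and are
# assumptions here, not values confirmed in this issue.

HIDDEN = 8192
INTERMEDIATE = 28672
KV_DIM = 8 * 128  # grouped-query attention: k_proj / v_proj output dim

# (in_features, out_features) for every module in target_modules
MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_overhead(rank: int, num_layers: int = 80) -> tuple[int, int]:
    """Extra LoRA parameters and extra multiply-adds per decoded token."""
    params = sum(rank * (d_in + d_out) for d_in, d_out in MODULES.values()) * num_layers
    macs_per_token = params  # one multiply-add per LoRA weight per token
    return params, macs_per_token

for r in (16, 256):
    params, macs = lora_overhead(r)
    print(f"r={r:>3}: {params/1e9:.2f}B extra params, ~{macs/1e9:.2f} GMAC per token")
```

Because the extra work scales linearly with r, an r=256 adapter touches roughly 16x more LoRA weight per token than an r=16 one (about 3.3B vs 0.2B extra parameters under these assumptions), which is consistent with decode throughput dropping as the rank grows.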
Hello,
Hi @rtx-8000, if you can use the nightly, can you try setting the environment variable
I see that for lower ranks,
Thank you for your comment. When running the following command:
I got the following errors:
FYI, vllm_infer.py.
I will test #14626 ASAP, and will provide the test results here. @rtx-8000 @varun-sundar-rabindranath
Thank you @jeejeelee.
Your current environment
How would you like to use vllm
I've noticed that using LoRA with rank=256 significantly slows down inference by 4x, as shown below. However, reducing the rank to 8 or 16 brings performance closer to that of no LoRA. I'm currently using two fully-utilized GPUs, without the enforce_eager flag, and have set the maximum LoRA rank accordingly. Interestingly, adjusting the maximum model length had no impact on performance. What steps can I take to optimize performance?
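To make the setup concrete, here is a minimal sketch of the kind of offline configuration described above, using vLLM's offline LLM API; the model/adapter paths, prompt, and sampling settings are placeholders of mine, not the actual vllm_infer.py invocation from this issue:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Paths are placeholders; the key point is that max_lora_rank must be
# raised to cover the adapter's rank (r=256 here).
llm = LLM(
    model="/path/to/llama-3.3-70b-instruct-awq",  # AWQ-quantized base model
    quantization="awq",
    tensor_parallel_size=2,   # two GPUs
    enable_lora=True,
    max_lora_rank=256,        # must be >= the adapter's r
    # enforce_eager is left at its default (False), i.e. CUDA graphs enabled
)

sampling = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(
    ["Explain LoRA in one sentence."],
    sampling,
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```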
No LoRA
Processed prompts: 0%|▏ | 5/2430 [01:28<6:58:39, 10.36s/it, est. speed input: 3.71 toks/s, output: 2.34 toks/s]
Processed prompts: 10%|█████▊ | 240/2430 [05:09<44:09, 1.21s/it, est. speed input: 87.79 toks/s, output: 90.18 toks/s]
WARNING 03-06 17:12:30 scheduler.py:1754] Sequence group 352 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 20%|███████████▏ | 476/2430 [09:38<39:30, 1.21s/it, est. speed input: 106.63 toks/s, output: 117.32 toks/s]
LoRA rank = 16
Processed prompts: 0%| | 0/2430 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 03-07 11:35:15 scheduler.py:1754] Sequence group 238 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 0%| | 3/2430 [01:24<13:43:22, 20.36s/it, est. speed input: 2.31 toks/s, output: 1.25 toks/s]
WARNING 03-07 11:36:05 scheduler.py:1754] Sequence group 187 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 11%|██████▎ | 262/2430 [06:11<42:31, 1.18s/it, est. speed input: 84.40 toks/s, output: 88.40 toks/s]
WARNING 03-07 11:40:46 scheduler.py:1754] Sequence group 342 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=101
Processed prompts: 18%|██████████▍ | 437/2430 [10:07<43:53, 1.32s/it, est. speed input: 96.26 toks/s, output: 105.08 toks/s]
WARNING 03-07 11:44:38 scheduler.py:1754] Sequence group 569 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=151
LoRA rank = 256
Processed prompts: 0%| | 0/2430 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 03-06 17:25:54 scheduler.py:1754] Sequence group 255 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 0%| | 4/2430 [02:52<20:13:48, 30.02s/it, est. speed input: 1.50 toks/s, output: 0.86 toks/s]
Processed prompts: 10%|█████▊ | 246/2430 [10:13<1:19:59, 2.20s/it, est. speed input: 45.74 toks/s, output: 46.86 toks/s]
WARNING 03-06 17:34:07 scheduler.py:1754] Sequence group 356 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 20%|███████████▌ | 476/2430 [18:01<47:13, 1.45s/it, est. speed input: 57.00 toks/s, output: 61.91 toks/s]
Before submitting a new issue...