[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none. #14435

Open
rtx-8000 opened this issue Mar 7, 2025 · 7 comments
Labels: usage How to use vllm

Comments

@rtx-8000

rtx-8000 commented Mar 7, 2025

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I've noticed that using LoRA with rank=256 slows down inference significantly (roughly 2x, as shown below). Reducing the rank to 8 or 16 brings performance closer to that of no LoRA. I'm currently using two fully-utilized GPUs, without the enforce_eager flag, and have set the maximum LoRA rank accordingly. Interestingly, adjusting the maximum model length had no impact on performance. What steps can I take to optimize performance?
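For context, a minimal sketch of the kind of offline setup being compared is below. The model/adapter paths, prompts, and sampling parameters are placeholders, not the exact script used in this issue:

# Hedged sketch of the baseline vs. LoRA comparison described above.
# Paths, prompts, and sampling values are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="/path/to/llama-3.3-70b-instruct-awq",  # hypothetical local path
    quantization="awq",
    tensor_parallel_size=2,        # two fully-utilized GPUs
    enable_lora=True,
    max_lora_rank=256,             # must be >= the adapter's rank
)

prompts = ["Summarize the benefits of LoRA in one sentence."]
sampling = SamplingParams(temperature=0.0, max_tokens=128)

# Baseline: no LoRA request attached.
baseline_out = llm.generate(prompts, sampling)

# Same prompts with the rank-256 adapter attached to every request.
lora_out = llm.generate(
    prompts,
    sampling,
    lora_request=LoRARequest("sft_adapter", 1, "/path/to/lora_r256"),
)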

No LoRA

Processed prompts: 0%|▏ | 5/2430 [01:28<6:58:39, 10.36s/it, est. speed input: 3.71 toks/s, output: 2.34 toks/s]
Processed prompts: 10%|█████▊ | 240/2430 [05:09<44:09, 1.21s/it, est. speed input: 87.79 toks/s, output: 90.18 toks/s]
WARNING 03-06 17:12:30 scheduler.py:1754] Sequence group 352 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 20%|███████████▏ | 476/2430 [09:38<39:30, 1.21s/it, est. speed input: 106.63 toks/s, output: 117.32 toks/s]

LoRA rank = 16

Processed prompts: 0%| | 0/2430 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 03-07 11:35:15 scheduler.py:1754] Sequence group 238 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 0%| | 3/2430 [01:24<13:43:22, 20.36s/it, est. speed input: 2.31 toks/s, output: 1.25 toks/s]
WARNING 03-07 11:36:05 scheduler.py:1754] Sequence group 187 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 11%|██████▎ | 262/2430 [06:11<42:31, 1.18s/it, est. speed input: 84.40 toks/s, output: 88.40 toks/s]
WARNING 03-07 11:40:46 scheduler.py:1754] Sequence group 342 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=101
Processed prompts: 18%|██████████▍ | 437/2430 [10:07<43:53, 1.32s/it, est. speed input: 96.26 toks/s, output: 105.08 toks/s]
WARNING 03-07 11:44:38 scheduler.py:1754] Sequence group 569 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=151

LoRA rank = 256

Processed prompts: 0%| | 0/2430 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 03-06 17:25:54 scheduler.py:1754] Sequence group 255 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 0%| | 4/2430 [02:52<20:13:48, 30.02s/it, est. speed input: 1.50 toks/s, output: 0.86 toks/s]
Processed prompts: 10%|█████▊ | 246/2430 [10:13<1:19:59, 2.20s/it, est. speed input: 45.74 toks/s, output: 46.86 toks/s]
WARNING 03-06 17:34:07 scheduler.py:1754] Sequence group 356 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 20%|███████████▌ | 476/2430 [18:01<47:13, 1.45s/it, est. speed input: 57.00 toks/s, output: 61.91 toks/s]

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
rtx-8000 added the usage (How to use vllm) label on Mar 7, 2025
rtx-8000 changed the title from "[Usage]: VLLM Inference - 4x slower with LoRA rank=256 vs none." to "[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none." on Mar 7, 2025
@jeejeelee
Collaborator

We have benchmark results in the slack_lora_thread. We are aware of this issue and will be optimizing LoRA performance. Could you please provide your model and LoRA config?

@rtx-8000
Author

rtx-8000 commented Mar 7, 2025

Model: llama-3.3-70b-instruct-awq
LoRA config:

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "/u01/app/mlo/models/Llama-3.3-70B-Instruct",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 512,
  "lora_dropout": 0.0,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 256,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "gate_proj",
    "v_proj",
    "down_proj",
    "q_proj",
    "k_proj",
    "o_proj",
    "up_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
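
For a sense of scale, here is a rough back-of-the-envelope sketch of how many extra weights this adapter adds. The layer dimensions are assumed standard Llama-3.x-70B shapes (hidden 8192, intermediate 28672, 80 layers, 8 KV heads of head dim 128) and are not taken from this issue:

# Rough estimate of the extra LoRA weights for the config above.
# Assumed Llama-3.x-70B shapes (not from this issue): hidden=8192,
# intermediate=28672, 80 layers, 8 KV heads with head_dim=128.
hidden, inter, layers, kv_heads, head_dim = 8192, 28672, 80, 8, 128

# (in_features, out_features) of each projection in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_heads * head_dim),
    "v_proj": (hidden, kv_heads * head_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj": (hidden, inter),
    "down_proj": (inter, hidden),
}

def lora_params(rank: int) -> int:
    # Each adapted projection adds A: (in, r) and B: (r, out) per layer.
    return layers * sum(d_in * rank + rank * d_out for d_in, d_out in shapes.values())

for r in (16, 256):
    print(f"rank {r:>3}: ~{lora_params(r) / 1e9:.2f}B extra parameters")
# Prints roughly 0.21B for rank 16 vs 3.31B for rank 256; every generated
# token pays for these additional matmuls on top of the AWQ base model.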

@rtx-8000
Author

Hello,
I ran more tests with max_num_seqs set to 1.
I am now getting worse results for the rank-16 adapter than for the rank-256 one, even though both have the same configuration, i.e. the same target_modules.

@varun-sundar-rabindranath
Contributor

Hi @rtx-8000, if you can use the nightly build, could you try setting the environment variable VLLM_USE_V1=1?
For example,

 VLLM_USE_V1="1" vllm serve  meta-llama/Llama-2-7b-hf --enable-lora --max-loras 4 --max-lora-rank 256 --lora-modules "lora0"="yard1/llama-2-7b-sql-lora-test" "lora1"="yard1/llama-2-7b-sql-lora-test" "lora2"="yard1/llama-2-7b-sql-lora-test" "lora3"="yard1/llama-2-7b-sql-lora-test"

I see that for lower ranks, VLLM_USE_V1="0" is slightly better, but VLLM_USE_V1="1" doesn't seem to be affected by max-lora-rank as much. #14626 should make the low-rank case better 🤞
cc @jeejeelee
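
(If it helps while reproducing: once the server from the command above is up, each adapter can be exercised through the OpenAI-compatible endpoint by passing its --lora-modules name as the model. A minimal sketch, assuming the default host and port:)

# Sketch: query one of the LoRA adapters served by the command above.
# Assumes the default http://localhost:8000/v1 endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="lora0",  # adapter name from --lora-modules; use the base model name for no LoRA
    prompt="Write a SQL query that counts the rows in the users table.",
    max_tokens=64,
)
print(resp.choices[0].text)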

@rtx-8000
Author

rtx-8000 commented Mar 12, 2025

Thank you for your comment. When running the following command:

VLLM_USE_V1="1" python scripts/vllm_infer.py --model_name_or_path /u01/data/analytics/models/llama-3.3-70b-instruct-awq/ --adapter_name_or_path saves/llama3.3-70b/fsdp_qlora_aug_tag_r256/sft/ --dataset sft_dataset_aug_tag --vllm_config "{gpu_memory_utilization: 0.6, max_model_len: 700, max_seq_len_to_capture: 700, max_lora_rank:256, max_num_seqs: 1}"

I got the following errors:

(VllmWorker rank=0 pid=345921) ERROR 03-12 08:59:00 utils.py:608] Cannot use FA version 2 is not supported due to FA3 is only supported on devices with compute capability >= 8 excluding 8.6 and 8.9

(VllmWorker rank=0 pid=345921) ERROR 03-12 08:59:39 multiproc_executor.py:374] ValueError: Unsupported FA version: None

FYI, vllm_infer.py.

@jeejeelee
Collaborator

I will test #14626 ASAP and provide the test results here. @rtx-8000 @varun-sundar-rabindranath

@rtx-8000
Author

Thank you @jeejeelee.
I also noticed something odd: I am testing on two datasets, one of which has longer sequences. When I run the model with LoRA on the shorter-sequence dataset, it is roughly 2 to 3 times slower.
