[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none. #14435

Open
rtx-8000 opened this issue Mar 7, 2025 · 7 comments
Labels: usage How to use vllm

Comments

@rtx-8000

rtx-8000 commented Mar 7, 2025

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

I've noticed that using LoRA with rank=256 slows down inference significantly (roughly 2x, as shown below). Reducing the rank to 8 or 16 brings performance closer to that of no LoRA. I'm currently using two fully-utilized GPUs, without the enforce_eager flag, and have set the maximum LoRA rank accordingly. Interestingly, adjusting the maximum model length had no impact on performance. What steps can I take to optimize performance?
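For context, a minimal sketch of the kind of offline setup being compared is below. The model/adapter paths, prompts, and sampling parameters are placeholders, not the exact script used in this issue:

# Hedged sketch of the baseline vs. LoRA comparison described above.
# Paths, prompts, and sampling values are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="/path/to/llama-3.3-70b-instruct-awq",  # hypothetical local path
    quantization="awq",
    tensor_parallel_size=2,        # two fully-utilized GPUs
    enable_lora=True,
    max_lora_rank=256,             # must be >= the adapter's rank
)

prompts = ["Summarize the benefits of LoRA in one sentence."]
sampling = SamplingParams(temperature=0.0, max_tokens=128)

# Baseline: no LoRA request attached.
baseline_out = llm.generate(prompts, sampling)

# Same prompts with the rank-256 adapter attached to every request.
lora_out = llm.generate(
    prompts,
    sampling,
    lora_request=LoRARequest("sft_adapter", 1, "/path/to/lora_r256"),
)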

No LoRA

Processed prompts: 0%|▏ | 5/2430 [01:28<6:58:39, 10.36s/it, est. speed input: 3.71 toks/s, output: 2.34 toks/s]
Processed prompts: 10%|█████▊ | 240/2430 [05:09<44:09, 1.21s/it, est. speed input: 87.79 toks/s, output: 90.18 toks/s]
WARNING 03-06 17:12:30 scheduler.py:1754] Sequence group 352 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 20%|███████████▏ | 476/2430 [09:38<39:30, 1.21s/it, est. speed input: 106.63 toks/s, output: 117.32 toks/s]

LoRA rank = 16

Processed prompts: 0%| | 0/2430 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 03-07 11:35:15 scheduler.py:1754] Sequence group 238 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 0%| | 3/2430 [01:24<13:43:22, 20.36s/it, est. speed input: 2.31 toks/s, output: 1.25 toks/s]
WARNING 03-07 11:36:05 scheduler.py:1754] Sequence group 187 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 11%|██████▎ | 262/2430 [06:11<42:31, 1.18s/it, est. speed input: 84.40 toks/s, output: 88.40 toks/s]
WARNING 03-07 11:40:46 scheduler.py:1754] Sequence group 342 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=101
Processed prompts: 18%|██████████▍ | 437/2430 [10:07<43:53, 1.32s/it, est. speed input: 96.26 toks/s, output: 105.08 toks/s]
WARNING 03-07 11:44:38 scheduler.py:1754] Sequence group 569 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=151

LoRA rank = 256

Processed prompts: 0%| | 0/2430 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 03-06 17:25:54 scheduler.py:1754] Sequence group 255 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=1
Processed prompts: 0%| | 4/2430 [02:52<20:13:48, 30.02s/it, est. speed input: 1.50 toks/s, output: 0.86 toks/s]
Processed prompts: 10%|█████▊ | 246/2430 [10:13<1:19:59, 2.20s/it, est. speed input: 45.74 toks/s, output: 46.86 toks/s]
WARNING 03-06 17:34:07 scheduler.py:1754] Sequence group 356 is preempted by PreemptionMode.RECOMPUTE mode because there is not enough KV cache space. This can affect the end-to-end performance. Increase gpu_memory_utilization or tensor_parallel_size to provide more KV cache memory. total_num_cumulative_preemption=51
Processed prompts: 20%|███████████▌ | 476/2430 [18:01<47:13, 1.45s/it, est. speed input: 57.00 toks/s, output: 61.91 toks/s]

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
rtx-8000 added the usage (How to use vllm) label on Mar 7, 2025
rtx-8000 changed the title from "[Usage]: VLLM Inference - 4x slower with LoRA rank=256 vs none." to "[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none." on Mar 7, 2025
@jeejeelee
Collaborator

We have benchmark results in the slack_lora_thread. We are aware of this issue and will be optimizing LoRA performance. Could you please provide your model and LoRA config?

@rtx-8000
Author

rtx-8000 commented Mar 7, 2025

Model: llama-3.3-70b-instruct-awq
LoRA config:

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "/u01/app/mlo/models/Llama-3.3-70B-Instruct",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 512,
  "lora_dropout": 0.0,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 256,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "gate_proj",
    "v_proj",
    "down_proj",
    "q_proj",
    "k_proj",
    "o_proj",
    "up_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
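
For a sense of scale, here is a rough back-of-the-envelope sketch of how many extra weights this adapter adds. The layer dimensions are assumed standard Llama-3.x-70B shapes (hidden 8192, intermediate 28672, 80 layers, 8 KV heads of head dim 128) and are not taken from this issue:

# Rough estimate of the extra LoRA weights for the config above.
# Assumed Llama-3.x-70B shapes (not from this issue): hidden=8192,
# intermediate=28672, 80 layers, 8 KV heads with head_dim=128.
hidden, inter, layers, kv_heads, head_dim = 8192, 28672, 80, 8, 128

# (in_features, out_features) of each projection in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_heads * head_dim),
    "v_proj": (hidden, kv_heads * head_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, inter),
    "up_proj": (hidden, inter),
    "down_proj": (inter, hidden),
}

def lora_params(rank: int) -> int:
    # Each adapted projection adds A: (in, r) and B: (r, out) per layer.
    return layers * sum(d_in * rank + rank * d_out for d_in, d_out in shapes.values())

for r in (16, 256):
    print(f"rank {r:>3}: ~{lora_params(r) / 1e9:.2f}B extra parameters")
# Prints roughly 0.21B for rank 16 vs 3.31B for rank 256; every generated
# token pays for these additional matmuls on top of the AWQ base model.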

@rtx-8000
Author

Hello,
I ran more tests with max_num_seqs set to 1.
I am now getting worse results for the rank-16 adapter than for the rank-256 one, even though both have the same configuration, i.e. the same target_modules.

@varun-sundar-rabindranath
Contributor

Hi @rtx-8000, if you can use the nightly build, could you try setting the environment variable VLLM_USE_V1=1?
For example,

 VLLM_USE_V1="1" vllm serve  meta-llama/Llama-2-7b-hf --enable-lora --max-loras 4 --max-lora-rank 256 --lora-modules "lora0"="yard1/llama-2-7b-sql-lora-test" "lora1"="yard1/llama-2-7b-sql-lora-test" "lora2"="yard1/llama-2-7b-sql-lora-test" "lora3"="yard1/llama-2-7b-sql-lora-test"

I see that for lower ranks, VLLM_USE_V1="0" is slightly better, but VLLM_USE_V1="1" doesn't seem to be affected by max-lora-rank as much. #14626 should make the low-rank case better 🤞
cc @jeejeelee
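
(If it helps while reproducing: once the server from the command above is up, each adapter can be exercised through the OpenAI-compatible endpoint by passing its --lora-modules name as the model. A minimal sketch, assuming the default host and port:)

# Sketch: query one of the LoRA adapters served by the command above.
# Assumes the default http://localhost:8000/v1 endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="lora0",  # adapter name from --lora-modules; use the base model name for no LoRA
    prompt="Write a SQL query that counts the rows in the users table.",
    max_tokens=64,
)
print(resp.choices[0].text)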

@rtx-8000
Author

rtx-8000 commented Mar 12, 2025

Thank you for your comment. When running the following command:

VLLM_USE_V1="1" python scripts/vllm_infer.py --model_name_or_path /u01/data/analytics/models/llama-3.3-70b-instruct-awq/ --adapter_name_or_path saves/llama3.3-70b/fsdp_qlora_aug_tag_r256/sft/ --dataset sft_dataset_aug_tag --vllm_config "{gpu_memory_utilization: 0.6, max_model_len: 700, max_seq_len_to_capture: 700, max_lora_rank:256, max_num_seqs: 1}"

I got the following errors:

(VllmWorker rank=0 pid=345921) ERROR 03-12 08:59:00 utils.py:608] Cannot use FA version 2 is not supported due to FA3 is only supported on devices with compute capability >= 8 excluding 8.6 and 8.9

(VllmWorker rank=0 pid=345921) ERROR 03-12 08:59:39 multiproc_executor.py:374] ValueError: Unsupported FA version: None

FYI, vllm_infer.py.

@jeejeelee
Collaborator

I will test #14626 ASAP and provide the test results here. @rtx-8000 @varun-sundar-rabindranath

@rtx-8000
Author

Thank you @jeejeelee.
I also noticed something odd: I am testing on two datasets, one of which has longer sequences. When I run the model with LoRA on the shorter-sequence dataset, it is roughly 2 to 3 times slower.
