fix(training): lr scheduler doesn't work properly in distributed scenarios #8312
Conversation
I just have minor comments but this is very very nicely done. Thanks so much!
"The length of the 'train_dataloader' after 'accelerator.prepare' does not match " | ||
"the length that was expected when the learning rate scheduler was created. " | ||
"This inconsistency may result in the learning rate scheduler not functioning properly." |
Should we also include the values of "The length of the 'train_dataloader'" and "the length that was expected when the learning rate scheduler was created"?
Do we have drop_last settings or similar that may cause this to happen?
> Should we also include the values of "The length of the 'train_dataloader'" and "the length that was expected when the learning rate scheduler was created"?
Yep, the values are included!
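For concreteness, here is a minimal, self-contained sketch of a check that reports both values. The helper name is made up and it uses plain `warnings` rather than the script's own logger, so treat it as an illustration rather than the actual diff:

```python
import warnings


def warn_if_dataloader_length_changed(expected_len: int, actual_len: int) -> None:
    """Warn when the sharded dataloader length differs from the pre-`prepare` estimate."""
    if expected_len != actual_len:
        warnings.warn(
            f"The length of the 'train_dataloader' after 'accelerator.prepare' ({actual_len}) does not match "
            f"the length that was expected when the learning rate scheduler was created ({expected_len}). "
            "This inconsistency may result in the learning rate scheduler not functioning properly."
        )
```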
> Do we have drop_last settings or similar that may cause this to happen?
For all of the current training scripts, the answer is no. Our estimate of the length of the sharded dataloader is always correct, and this warning message is never triggered.
Sorry for pinging late, but @geniuspatrick could we keep the changes in this PR to a bare minimum, i.e., targeting a single script only, and then open the rest to the community? That will be easier to manage IMO.
OK, I'll change the script.
Hi, @sayakpaul. I think it's ready now. Any suggestions?
Thanks a ton!
Hi @sayakpaul, here's a TODO list for follow-up contributions from the community.

What should be changed

What should NOT be changed

Category 1: The script does not have the argument `--num_train_epochs`.

Category 2: Distributed dataset sharding is done by WebDataset, not `accelerator`.

BTW, if you need extra hands, I would like to help!
Great! Thank you so much, @geniuspatrick! I will create an issue similar to #6545 so that the community can easily pick them up.
Thank you @sayakpaul and @geniuspatrick for fixing this, much appreciated! But I have a quick question: why is `num_warmup_steps_for_scheduler` multiplied by `accelerator.num_processes`?

```python
# Line 1092
num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes
```

For example, if my dataloader length (i.e. steps per epoch) is 64, and for simplicity let's say I want to warm up for 32 steps. With 1 epoch and gradient accumulation steps = 1, the current code works correctly for 1 GPU:

```python
num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes  # 32 * 1 = 32
len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)  # 64 / 1 = 64
num_update_steps_per_epoch = math.ceil(len_train_dataloader_after_sharding / args.gradient_accumulation_steps)  # 64 / 1 = 64
num_training_steps_for_scheduler = (
    args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes  # 1 * 64 * 1 = 64
)
```

But with 2 GPUs, the numbers become:

```python
num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes  # 32 * 2 = 64
len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)  # 64 / 2 = 32
num_update_steps_per_epoch = math.ceil(len_train_dataloader_after_sharding / args.gradient_accumulation_steps)  # 32 / 1 = 32
num_training_steps_for_scheduler = (
    args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes  # 1 * 32 * 2 = 64
)
```

Shouldn't `num_warmup_steps_for_scheduler` stay at 32 here, so that the warmup still covers half of the scheduler's total steps?
In each gradient step, the lr scheduler is advanced by `num_processes` steps by Accelerate, if my memory serves me right. This is counterintuitive.
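To make the above concrete, here is a toy, single-process sketch in plain PyTorch with made-up numbers (it only mimics the bookkeeping and is not Accelerate's actual implementation): the scheduler is sized in `num_processes`-scaled ticks and advanced `num_processes` times per optimizer step.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

num_processes = 2          # pretend we are running on 2 GPUs
lr_warmup_steps = 32       # warmup measured in optimizer steps
num_update_steps = 64      # optimizer steps performed by each process

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# The scheduler is created in "ticks" = optimizer steps * num_processes.
warmup_ticks = lr_warmup_steps * num_processes
scheduler = LambdaLR(optimizer, lambda tick: min(1.0, tick / warmup_ticks))

for _ in range(num_update_steps):
    optimizer.step()
    # Mimic the prepared scheduler: one tick per process per optimizer step.
    for _ in range(num_processes):
        scheduler.step()

# The warmup (64 ticks) completes after 32 optimizer steps, matching the single-GPU case.
```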
@eliphatfs Right, but to clarify: regardless of 1, 2, or more GPUs, `num_training_steps_for_scheduler` ends up as 64 in the example above. So assuming I want to warm up for half of all steps, the warmup should cover 32 of those 64 scheduler steps. But the current code scales `num_warmup_steps_for_scheduler` by `num_processes`, so it grows to 64 as well.
this is actually incorrect... The disconnected line in the attached LR chart is the run with the multiplication: the learning rate is resumed at the point where it would have been at T=333. Removing the multiplication fixes the issue.
Perhaps you could provide a little more explanation here? From the snapshot, it's not immediately clear to me.

Update: I see what you mean. Yeah, IIUC, the multiplication (by `num_processes`) seems to be the culprit here.

Cc: @geniuspatrick
It might be a bug in Accelerate, actually: the LR scheduler state isn't being restored correctly when the run restarts. I then set it manually, but didn't multiply by `num_processes`, which led to a mismatched LR during resume. But I think when both are multiplied correctly, this issue does not manifest. I'm still trying to test it so I can assess the issue on Accelerate's side, since I shouldn't have to (and didn't use to have to) set `last_epoch` manually.
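For anyone trying to reproduce this, a minimal Accelerate checkpoint round-trip sketch (made-up model, scheduler, and output directory; not code from this PR) that the resume behaviour discussed here hinges on:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0)
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)

# save_state / load_state are expected to round-trip the prepared scheduler's state,
# so `last_epoch` should not need to be patched manually on resume.
accelerator.save_state("checkpoint-dir")   # hypothetical output directory
accelerator.load_state("checkpoint-dir")
```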
Cc: @muellerzr
@AbraarArique In the training scripts, we have two arguments that control the total number of training steps: `--num_train_epochs` and `--max_train_steps`. The argument [...] Hopefully the rather redundant explanation above will help your understanding.
@bghira Looks like this is about resuming training. Is the scheduler state not being saved and restored correctly? Could the removal of the multiplication be just a numerical coincidence? Is it possible to have the same problem with single-GPU training?
On my end the problem is that we upgraded from Accelerate v0.19 to v0.33, and the load/save state for Accelerate stopped writing the step count for the random states, or stopped restoring it. Either way, I'm on git main now and I have to check whether `lr_scheduler` has the attribute `last_epoch` and set it to `resume_step * num_processes`. That fixed my issue for single and multiple GPUs, but I have to leave `num_warmup_steps` multiplied too.
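A sketch of that workaround with illustrative names (`resume_step` being the optimizer step the run resumes from); this mirrors the comment above, not an official API:

```python
def realign_scheduler_on_resume(lr_scheduler, resume_step: int, num_processes: int) -> None:
    """Manually realign the LR scheduler after resuming, assuming the prepared
    scheduler advances `num_processes` ticks per optimizer step."""
    if hasattr(lr_scheduler, "last_epoch"):
        lr_scheduler.last_epoch = resume_step * num_processes
```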
Will dig into this @bghira
@muellerzr did you ever find anything? It might just be something I've been doing incorrectly, but I would like to align with best practices.
@geniuspatrick @AbraarArique I agree with your point of view; I don't think we should multiply by `accelerator.num_processes`, because `lr_warmup_steps / num_training_steps_for_scheduler` is actually a ratio. No matter how many GPUs we use, we want this ratio to remain constant. When we add GPUs, we are actually increasing the batch size: `num_training_steps_for_scheduler` is a constant, while `actual_training_steps` is reduced by a factor of the number of GPUs. If we don't multiply by `accelerator.num_processes`, we won't have to adjust `args.lr_warmup_steps`.
@Zephyrose I think there are 2 ways of looking at this:

The way it's done now does make sense logically. If 1 epoch is 32 steps with 1 GPU, then with 2 GPUs you're doubling the batch size and thus now have 16 update steps per epoch. So scaling the warmup by `num_processes` keeps it covering the same amount of data, and this works fine if you specify the warmup with that in mind.

If people care about the warmup-to-total-steps ratio more than a specific number of steps, perhaps it makes sense to have an argument that expresses the warmup as a ratio instead.
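A hypothetical helper sketching that suggestion (neither the ratio argument nor this helper exists in the scripts today):

```python
def warmup_steps_from_ratio(warmup_ratio: float, num_training_steps_for_scheduler: int) -> int:
    """Derive the scheduler's warmup ticks from a warmup-to-total ratio, so the ratio
    stays constant regardless of the number of processes."""
    return int(warmup_ratio * num_training_steps_for_scheduler)


# e.g. a 0.5 ratio over 64 scheduler ticks always gives 32 warmup ticks
assert warmup_steps_from_ratio(0.5, 64) == 32
```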
What does this PR do?
TL;DR
In a distributed training scenario, passing the argument `--num_train_epochs` to any of the training scripts disrupts the functioning of the learning rate scheduler. Essentially, the learning rate decays `num_processes` times slower than expected. Related issues #8236, #3954, and PR #3983 shed further light on this.

Explanation
In our training setup, we utilize `accelerator` instead of PyTorch's native DistributedSampler when creating the `train_dataloader`. This means we create the `train_dataloader` directly as if for standalone training and subsequently employ `accelerator.prepare` to shard the samples across different processes.

When referencing "step" in training scripts, such as `lr_warmup_steps`, `max_train_steps`, etc., we're indicating the optimizing step. In essence, each step consumes `num_processes * gradient_accumulation_steps` batches of data. In the script, the learning rate scheduler is initialized before `accelerator.prepare` is called. At this stage, the `train_dataloader` hasn't yet sharded the samples, specifically the batched samples.
To accurately calculate `num_update_steps_per_epoch`, we need the length of the `train_dataloader` after distributed sharding. How do we achieve this? Typically, `accelerator.prepare` replaces `train_dataloader.batch_sampler` with `BatchSamplerShard`. The length of the distributed-sharded `train_dataloader` (still a `DataLoader` instance) becomes the length of the `BatchSamplerShard`. Hence, we derive a formula for estimating the length of the sharded `train_dataloader`, which aligns with the current training scripts (where `accelerator.prepare` is called with no extra arguments).
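In code, that estimate boils down to the following stand-alone helper (an illustrative wrapper around the same expression used in the snippets earlier in this thread):

```python
import math


def estimate_sharded_dataloader_len(unsharded_len: int, num_processes: int) -> int:
    """Estimated length of the dataloader once `accelerator.prepare` wraps its batch
    sampler with `BatchSamplerShard` (default `prepare` arguments assumed)."""
    return math.ceil(unsharded_len / num_processes)
```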
As per `accelerator` principles, the prepared scheduler calls the `step()` of the unprepared scheduler `num_processes` times at each optimizing step (once gradient accumulation is completed). This necessitates dividing `num_*_steps_for_scheduler` by `gradient_accumulation_steps` and multiplying it by `num_processes`.
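Putting the pieces together, here is a condensed sketch of this bookkeeping as a single illustrative function (not the literal diff; the variable names follow the snippets discussed in the thread):

```python
import math


def scheduler_step_counts(
    unsharded_dataloader_len: int,      # len(train_dataloader) before accelerator.prepare
    num_processes: int,
    gradient_accumulation_steps: int,
    num_train_epochs: int,
    lr_warmup_steps: int,
) -> tuple[int, int]:
    """Return (num_training_steps_for_scheduler, num_warmup_steps_for_scheduler)."""
    # Estimated length of the train_dataloader after distributed sharding.
    len_train_dataloader_after_sharding = math.ceil(unsharded_dataloader_len / num_processes)
    # Optimizer (update) steps per epoch on each process.
    num_update_steps_per_epoch = math.ceil(
        len_train_dataloader_after_sharding / gradient_accumulation_steps
    )
    # The prepared scheduler ticks num_processes times per optimizer step,
    # so both totals are expressed in scheduler ticks.
    num_training_steps_for_scheduler = num_train_epochs * num_update_steps_per_epoch * num_processes
    num_warmup_steps_for_scheduler = lr_warmup_steps * num_processes
    return num_training_steps_for_scheduler, num_warmup_steps_for_scheduler
```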
Feeling a bit confused? Not to worry, let's visualize it.
Experiments
We utilize Fine-tuning for text2image with LoRA as an example. Below is the training command:
The hyper-parameters are:
Thus, `epochs=6` is equivalent to `steps=30`.

Additionally, introducing the argument `num_cycles=2` to the function `get_scheduler` exacerbates the error.

Before the PR

- `--num_train_epochs`: [learning rate curve]
- `--max_train_steps`: [learning rate curve]

After the PR

- `--num_train_epochs`: [learning rate curve]
- `--max_train_steps`: [learning rate curve]
Fixes #8236
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sayakpaul @eliphatfs