fix(training): lr scheduler doesn't work properly in distributed scenarios #8312
Conversation
I just have minor comments but this is very very nicely done. Thanks so much!
"The length of the 'train_dataloader' after 'accelerator.prepare' does not match " | ||
"the length that was expected when the learning rate scheduler was created. " | ||
"This inconsistency may result in the learning rate scheduler not functioning properly." |
Should we also include the values of "The length of the 'train_dataloader'" and "the length that was expected when the learning rate scheduler was created"?
Do we have drop_last settings or similar that may cause this to happen?
> Should we also include the values of "The length of the 'train_dataloader'" and "the length that was expected when the learning rate scheduler was created"?
Yep, the values are included!
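For concreteness, here is a minimal, self-contained sketch of a check that reports both values. The helper name is made up and it uses plain `warnings` rather than the script's own logger, so treat it as an illustration rather than the actual diff:

```python
import warnings


def warn_if_dataloader_length_changed(expected_len: int, actual_len: int) -> None:
    """Warn when the sharded dataloader length differs from the pre-`prepare` estimate."""
    if expected_len != actual_len:
        warnings.warn(
            f"The length of the 'train_dataloader' after 'accelerator.prepare' ({actual_len}) does not match "
            f"the length that was expected when the learning rate scheduler was created ({expected_len}). "
            "This inconsistency may result in the learning rate scheduler not functioning properly."
        )
```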
> Do we have drop_last settings or similar that may cause this to happen?
For all of the current training scripts, the answer is no. Our estimate of the length of the sharded dataloader is always correct, and this warning message is never triggered.
Sorry for pinging late, but @geniuspatrick could we keep the changes in this PR to a bare minimum, i.e., targeting a single script only, and then open the rest to the community? That will be easier to manage IMO.
OK, I'll change the script.
Hi, @sayakpaul. I think it's ready now. Any suggestions?
Thanks a ton!
Hi @sayakpaul, here's a TODO list for follow-up contributions from the community.

What should be changed

What should NOT be changed

Category 1: The script does not have the argument `--num_train_epochs`.

Category 2: Distributed dataset sharding is done by WebDataset, not `accelerator`.

BTW, if you need extra hands, I would like to help!
Great! Thank you so much, @geniuspatrick! I will create an issue similar to #6545 so that the community can easily pick them up.
Thank you @sayakpaul and @geniuspatrick for fixing this, much appreciated! But I have a quick question: why is `num_warmup_steps_for_scheduler` multiplied by `accelerator.num_processes`?

```python
# Line 1092
num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes
```

For example, if my dataloader length (i.e. steps per epoch) is 64, and for simplicity let's say I want to warm up for 32 steps. With 1 epoch and gradient accumulation steps = 1, the current code works correctly for 1 GPU:

```python
num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes  # 32 * 1 = 32
len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)  # 64 / 1 = 64
num_update_steps_per_epoch = math.ceil(len_train_dataloader_after_sharding / args.gradient_accumulation_steps)  # 64 / 1 = 64
num_training_steps_for_scheduler = (
    args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes  # 1 * 64 * 1 = 64
)
```

But with 2 GPUs, the numbers become:

```python
num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes  # 32 * 2 = 64
len_train_dataloader_after_sharding = math.ceil(len(train_dataloader) / accelerator.num_processes)  # 64 / 2 = 32
num_update_steps_per_epoch = math.ceil(len_train_dataloader_after_sharding / args.gradient_accumulation_steps)  # 32 / 1 = 32
num_training_steps_for_scheduler = (
    args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes  # 1 * 32 * 2 = 64
)
```

Shouldn't `num_warmup_steps_for_scheduler` stay at 32 here, so that the warmup still covers half of the scheduler's total steps?
In each gradient step, the lr scheduler is advanced by `num_processes` steps by Accelerate, if my memory serves me right. This is counterintuitive.
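To make the above concrete, here is a toy, single-process sketch in plain PyTorch with made-up numbers (it only mimics the bookkeeping and is not Accelerate's actual implementation): the scheduler is sized in `num_processes`-scaled ticks and advanced `num_processes` times per optimizer step.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

num_processes = 2          # pretend we are running on 2 GPUs
lr_warmup_steps = 32       # warmup measured in optimizer steps
num_update_steps = 64      # optimizer steps performed by each process

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# The scheduler is created in "ticks" = optimizer steps * num_processes.
warmup_ticks = lr_warmup_steps * num_processes
scheduler = LambdaLR(optimizer, lambda tick: min(1.0, tick / warmup_ticks))

for _ in range(num_update_steps):
    optimizer.step()
    # Mimic the prepared scheduler: one tick per process per optimizer step.
    for _ in range(num_processes):
        scheduler.step()

# The warmup (64 ticks) completes after 32 optimizer steps, matching the single-GPU case.
```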
@eliphatfs Right, but to clarify: regardless of 1, 2, or more GPUs, `num_training_steps_for_scheduler` ends up as 64 in the example above. So assuming I want to warm up for half of all steps, the warmup should cover 32 of those 64 scheduler steps. But the current code scales `num_warmup_steps_for_scheduler` by `num_processes`, so it grows to 64 as well.
this is actually incorrect... The disconnected line in the attached LR chart is the run with the multiplication: the learning rate is resumed at the point where it would have been at T=333. Removing the multiplication fixes the issue.
Perhaps you could provide a little more explanation here? From the snapshot, it's not immediately clear to me.

Update: I see what you mean. Yeah, IIUC, the multiplication (by `num_processes`) seems to be the culprit here.

Cc: @geniuspatrick
It might be a bug in Accelerate, actually: the LR scheduler state isn't being restored correctly when the run restarts. I then set it manually, but didn't multiply by `num_processes`, which led to a mismatched LR during resume. But I think when both are multiplied correctly, this issue does not manifest. I'm still trying to test it so I can assess the issue on Accelerate's side, since I shouldn't have to (and didn't use to have to) set `last_epoch` manually.
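For anyone trying to reproduce this, a minimal Accelerate checkpoint round-trip sketch (made-up model, scheduler, and output directory; not code from this PR) that the resume behaviour discussed here hinges on:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0)
model, optimizer, lr_scheduler = accelerator.prepare(model, optimizer, lr_scheduler)

# save_state / load_state are expected to round-trip the prepared scheduler's state,
# so `last_epoch` should not need to be patched manually on resume.
accelerator.save_state("checkpoint-dir")   # hypothetical output directory
accelerator.load_state("checkpoint-dir")
```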
Cc: @muellerzr
@AbraarArique In the training scripts, we have two arguments that control the total number of training steps: `--num_train_epochs` and `--max_train_steps`. The argument [...] Hopefully the rather redundant explanation above will help your understanding.
@bghira Looks like this is about resuming training. Is the scheduler state not being saved and restored correctly? Could the removal of the multiplication be just a numerical coincidence? Is it possible to have the same problem with single-GPU training?
On my end the problem is that we upgraded from Accelerate v0.19 to v0.33, and the load/save state for Accelerate stopped writing the step count for the random states, or stopped restoring it. Either way, I'm on git main now and I have to check whether `lr_scheduler` has the attribute `last_epoch` and set it to `resume_step * num_processes`. That fixed my issue for single and multiple GPUs, but I have to leave `num_warmup_steps` multiplied too.
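A sketch of that workaround with illustrative names (`resume_step` being the optimizer step the run resumes from); this mirrors the comment above, not an official API:

```python
def realign_scheduler_on_resume(lr_scheduler, resume_step: int, num_processes: int) -> None:
    """Manually realign the LR scheduler after resuming, assuming the prepared
    scheduler advances `num_processes` ticks per optimizer step."""
    if hasattr(lr_scheduler, "last_epoch"):
        lr_scheduler.last_epoch = resume_step * num_processes
```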
Will dig into this @bghira
@muellerzr did you ever find anything? It might just be something I've been doing incorrectly, but I would like to align with best practices.
@geniuspatrick @AbraarArique I agree with your point of view; I don't think we should multiply by `accelerator.num_processes`, because `lr_warmup_steps / num_training_steps_for_scheduler` is actually a ratio. No matter how many GPUs we use, we want this ratio to remain constant. When we add GPUs, we are actually increasing the batch size: `num_training_steps_for_scheduler` is a constant, while `actual_training_steps` is reduced by a factor of the number of GPUs. If we don't multiply by `accelerator.num_processes`, we won't have to adjust `args.lr_warmup_steps`.
@Zephyrose I think there are 2 ways of looking at this:

The way it's done now does make sense logically. If 1 epoch is 32 steps with 1 GPU, then with 2 GPUs you're doubling the batch size and thus now have 16 update steps per epoch. So scaling the warmup by `num_processes` keeps it covering the same amount of data, and this works fine if you specify the warmup with that in mind.

If people care about the warmup-to-total-steps ratio more than a specific number of steps, perhaps it makes sense to have an argument that expresses the warmup as a ratio instead.
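A hypothetical helper sketching that suggestion (neither the ratio argument nor this helper exists in the scripts today):

```python
def warmup_steps_from_ratio(warmup_ratio: float, num_training_steps_for_scheduler: int) -> int:
    """Derive the scheduler's warmup ticks from a warmup-to-total ratio, so the ratio
    stays constant regardless of the number of processes."""
    return int(warmup_ratio * num_training_steps_for_scheduler)


# e.g. a 0.5 ratio over 64 scheduler ticks always gives 32 warmup ticks
assert warmup_steps_from_ratio(0.5, 64) == 32
```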
What does this PR do?
TL;DR
In a distributed training scenario, passing the argument `--num_train_epochs` to any of the training scripts disrupts the functioning of the learning rate scheduler. Essentially, the learning rate decays `num_processes` times slower than expected. Related issues #8236, #3954, and PR #3983 shed further light on this.

Explanation
In our training setup, we utilize `accelerator` instead of PyTorch's native DistributedSampler when creating the `train_dataloader`. This means we create the `train_dataloader` directly as if for standalone training and subsequently employ `accelerator.prepare` to shard the samples across different processes.

When referencing "step" in training scripts, such as `lr_warmup_steps`, `max_train_steps`, etc., we're indicating the optimizing step. In essence, each step consumes `num_processes * gradient_accumulation_steps` batches of data. In the script, the learning rate scheduler is initialized before `accelerator.prepare` is called. At this stage, the `train_dataloader` hasn't yet sharded the samples, specifically the batched samples.
To accurately calculate `num_update_steps_per_epoch`, we need the length of the `train_dataloader` after distributed sharding. How do we achieve this? Typically, `accelerator.prepare` replaces `train_dataloader.batch_sampler` with `BatchSamplerShard`. The length of the distributed-sharded `train_dataloader` (still a `DataLoader` instance) becomes the length of the `BatchSamplerShard`. Hence, we derive a formula for estimating the length of the sharded `train_dataloader`, which aligns with the current training scripts (where `accelerator.prepare` is called with no extra arguments).
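In code, that estimate boils down to the following stand-alone helper (an illustrative wrapper around the same expression used in the snippets earlier in this thread):

```python
import math


def estimate_sharded_dataloader_len(unsharded_len: int, num_processes: int) -> int:
    """Estimated length of the dataloader once `accelerator.prepare` wraps its batch
    sampler with `BatchSamplerShard` (default `prepare` arguments assumed)."""
    return math.ceil(unsharded_len / num_processes)
```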
As per `accelerator` principles, the prepared scheduler calls the `step()` of the unprepared scheduler `num_processes` times at each optimizing step (once gradient accumulation is completed). This necessitates dividing `num_*_steps_for_scheduler` by `gradient_accumulation_steps` and multiplying it by `num_processes`.
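Putting the pieces together, here is a condensed sketch of this bookkeeping as a single illustrative function (not the literal diff; the variable names follow the snippets discussed in the thread):

```python
import math


def scheduler_step_counts(
    unsharded_dataloader_len: int,      # len(train_dataloader) before accelerator.prepare
    num_processes: int,
    gradient_accumulation_steps: int,
    num_train_epochs: int,
    lr_warmup_steps: int,
) -> tuple[int, int]:
    """Return (num_training_steps_for_scheduler, num_warmup_steps_for_scheduler)."""
    # Estimated length of the train_dataloader after distributed sharding.
    len_train_dataloader_after_sharding = math.ceil(unsharded_dataloader_len / num_processes)
    # Optimizer (update) steps per epoch on each process.
    num_update_steps_per_epoch = math.ceil(
        len_train_dataloader_after_sharding / gradient_accumulation_steps
    )
    # The prepared scheduler ticks num_processes times per optimizer step,
    # so both totals are expressed in scheduler ticks.
    num_training_steps_for_scheduler = num_train_epochs * num_update_steps_per_epoch * num_processes
    num_warmup_steps_for_scheduler = lr_warmup_steps * num_processes
    return num_training_steps_for_scheduler, num_warmup_steps_for_scheduler
```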
Feeling a bit confused? Not to worry, let's visualize it.
Experiments
We utilize Fine-tuning for text2image with LoRA as an example. Below is the training command:
The hyper-parameters are:
Thus, `epochs=6` is equivalent to `steps=30`.

Additionally, introducing the argument `num_cycles=2` to the function `get_scheduler` exacerbates the error.

Before the PR

- `--num_train_epochs`: [learning rate curve]
- `--max_train_steps`: [learning rate curve]

After the PR

- `--num_train_epochs`: [learning rate curve]
- `--max_train_steps`: [learning rate curve]
Fixes #8236
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sayakpaul @eliphatfs