-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
src: fix platform shutdown deadlock #56827
base: main
Are you sure you want to change the base?
Conversation
Each worker is signalling its own completion of tasks independently and so should only be signalling for one corresponding drain otherwise the count of outstanding tasks goes out of sync and the process will never stop waiting for tasks when it should be exiting. It just needs to be calling Signal rather than Broadcast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #56827 +/- ##
=======================================
Coverage 89.20% 89.20%
=======================================
Files 663 663
Lines 192012 192012
Branches 36929 36933 +4
=======================================
+ Hits 171286 171293 +7
- Misses 13582 13590 +8
+ Partials 7144 7129 -15
|
This seems to bring new flakyness to |
I think it may also have not actually fixed the issue, just made it more rare. I've been doing a very long run locally here. Before this I encountered the issue at 9.9k iterations, with this it eventually popped up at 117k iterations. |
Yes, unfortunately it does not seem to fix the issue:
|
Fixes #54918
Each worker is signalling its own completion of tasks independently and so should only be signalling for one corresponding drain otherwise the count of outstanding tasks goes out of sync and the process will never stop waiting for tasks when it should be exiting.
It just needs to be calling Signal rather than Broadcast.
Not sure if there was a reason for it to be a broadcast in the first place, but if so then the
outstanding_tasks_
count adjustment needs to factor that in properly.