Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: fix platform shutdown deadlock #56827

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Qard
Copy link
Member

@Qard Qard commented Jan 30, 2025

Fixes #54918

Each worker is signalling its own completion of tasks independently and so should only be signalling for one corresponding drain otherwise the count of outstanding tasks goes out of sync and the process will never stop waiting for tasks when it should be exiting.

It just needs to be calling Signal rather than Broadcast.

Not sure if there was a reason for it to be a broadcast in the first place, but if so then the outstanding_tasks_ count adjustment needs to factor that in properly.

Each worker is signalling its own completion of tasks independently
and so should only be signalling for one corresponding drain otherwise
the count of outstanding tasks goes out of sync and the process will
never stop waiting for tasks when it should be exiting.

It just needs to be calling Signal rather than Broadcast.
@Qard Qard added confirmed-bug Issues with confirmed bugs. c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. v8 platform Issues and PRs related to Node's v8::Platform implementation. labels Jan 30, 2025
@nodejs-github-bot nodejs-github-bot added the needs-ci PRs that need a full CI run. label Jan 30, 2025
@Qard Qard added the request-ci Add this label to start a Jenkins CI on a PR. label Jan 30, 2025
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Jan 30, 2025
@nodejs-github-bot
Copy link
Collaborator

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link

codecov bot commented Jan 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.20%. Comparing base (304bb9c) to head (d674a68).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #56827   +/-   ##
=======================================
  Coverage   89.20%   89.20%           
=======================================
  Files         663      663           
  Lines      192012   192012           
  Branches    36929    36933    +4     
=======================================
+ Hits       171286   171293    +7     
- Misses      13582    13590    +8     
+ Partials     7144     7129   -15     
Files with missing lines Coverage Δ
src/node_platform.cc 87.83% <100.00%> (ø)

... and 23 files with indirect coverage changes

@targos
Copy link
Member

targos commented Jan 30, 2025

This seems to bring new flakyness to --loader tests, which involve a worker.

@Qard
Copy link
Member Author

Qard commented Jan 30, 2025

I think it may also have not actually fixed the issue, just made it more rare. I've been doing a very long run locally here. Before this I encountered the issue at 9.9k iterations, with this it eventually popped up at 117k iterations.

@lpinca
Copy link
Member

lpinca commented Jan 30, 2025

Yes, unfortunately it does not seem to fix the issue:

$ python3 tools/test.py --repeat=1000 test/parallel/test-net-write-fully-async-hex-string.js
[00:35|%  98|+ 984|-   0]: release test-net-write-fully-async-hex-string
$ lldb -p 8578
(lldb) process attach --pid 8578
Process 8578 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007ff8111b1aaa libsystem_kernel.dylib`__psynch_cvwait + 10
libsystem_kernel.dylib`__psynch_cvwait:
->  0x7ff8111b1aaa <+10>: jae    0x7ff8111b1ab4 ; <+20>
    0x7ff8111b1aac <+12>: movq   %rax, %rdi
    0x7ff8111b1aaf <+15>: jmp    0x7ff8111af662 ; cerror_nocancel
    0x7ff8111b1ab4 <+20>: retq   
Target 0: (node) stopped.
Executable module set to "/Users/luigi/code/node/out/Release/node".
Architecture set to: x86_64-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007ff8111b1aaa libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007ff8111f07a8 libsystem_pthread.dylib`_pthread_cond_wait + 1193
    frame #2: 0x0000000104ffb893 node`uv_cond_wait(cond=<unavailable>, mutex=<unavailable>) at thread.c:819:7 [opt]
    frame #3: 0x00000001041148bb node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::LibuvMutexTraits::cond_wait(cond=0x00007fdaa3204e78, mutex=0x00007fdaa3204e08) at node_mutex.h:175:5 [opt]
    frame #4: 0x00000001041148a1 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::ConditionVariableBase<node::LibuvMutexTraits>::Wait(this=0x00007fdaa3204e78, scoped_lock=<unavailable>) at node_mutex.h:249:3 [opt]
    frame #5: 0x00000001041148a1 node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::TaskQueue<v8::Task>::BlockingDrain(this=0x00007fdaa3204e08) at node_platform.cc:640:20 [opt]
    frame #6: 0x000000010411488b node`node::NodePlatform::DrainTasks(v8::Isolate*) [inlined] node::WorkerThreadsTaskRunner::BlockingDrain(this=0x00007fdaa3204e08) at node_platform.cc:214:25 [opt]
    frame #7: 0x000000010411488b node`node::NodePlatform::DrainTasks(this=0x0000600002740000, isolate=<unavailable>) at node_platform.cc:467:33 [opt]
    frame #8: 0x0000000103f7ca60 node`node::SpinEventLoopInternal(env=0x00007fdaa5049c00) at embed_helpers.cc:44:17 [opt]
    frame #9: 0x00000001040d9b33 node`node::NodeMainInstance::Run() [inlined] node::NodeMainInstance::Run(this=<unavailable>, exit_code=<unavailable>, env=0x00007fdaa5049c00) at node_main_instance.cc:111:9 [opt]
    frame #10: 0x00000001040d9a99 node`node::NodeMainInstance::Run(this=<unavailable>) at node_main_instance.cc:100:3 [opt]
    frame #11: 0x00000001040414e6 node`node::Start(int, char**) [inlined] node::StartInternal(argc=<unavailable>, argv=<unavailable>) at node.cc:1488:24 [opt]
    frame #12: 0x0000000104041342 node`node::Start(argc=<unavailable>, argv=<unavailable>) at node.cc:1495:27 [opt]
    frame #13: 0x00007ff810e612cd dyld`start + 1805
(lldb)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. confirmed-bug Issues with confirmed bugs. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. v8 platform Issues and PRs related to Node's v8::Platform implementation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock at process shutdown
5 participants