
Print processes that ended in "unexpected final state" #3426

Open
stevenengler opened this issue Oct 21, 2024 · 5 comments
Labels
Type: Enhancement New functionality or improved design

Comments

@stevenengler
Contributor

When a Shadow simulation ends and not all processes are in their expected final state, Shadow prints:

    ** Shadow did not complete successfully: Failed to run the simulation
    **   9 managed processes in unexpected final state
    ** See the log for details

to stderr, and expects the user to look through possibly hundreds of MB of log messages to figure out which processes failed and what state they were actually in.

Shadow should list the processes here (in stderr) and at the end of the log output.
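For illustration, a minimal sketch of what that per-process summary could look like (the type and function names here are hypothetical, not Shadow's actual internals):

    /// Hypothetical record of a managed process that ended in an
    /// unexpected final state (not Shadow's real internal type).
    struct UnexpectedFinalState {
        process_name: String, // e.g. "testnode.test_sleep.1000"
        expected: String,     // e.g. "exited: 0"
        actual: String,       // e.g. "exited: 101"
    }

    /// Print a per-process summary to stderr alongside the existing
    /// count, so users don't have to grep the full log for failures.
    fn print_unexpected_final_states(failures: &[UnexpectedFinalState]) {
        eprintln!(
            "**   {} managed processes in unexpected final state:",
            failures.len()
        );
        for f in failures {
            eprintln!(
                "**     {} (expected {}, was {})",
                f.process_name, f.expected, f.actual
            );
        }
    }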

@stevenengler stevenengler added the Type: Bug Error or flaw producing unexpected results label Oct 21, 2024
@stevenengler stevenengler added Type: Enhancement New functionality or improved design and removed Type: Bug Error or flaw producing unexpected results labels Oct 21, 2024
@sporksmith
Contributor

> to stderr, and expects the user to look through possibly hundreds of MB of log messages to figure out which processes failed and what state they were actually in.

Grepping for ERROR-level logs should find the relevant lines identifying the process. I think we also try to write ERROR-level logs to stderr in addition to stdout, which is supposed to make them show up on the console when stdout is being redirected. However, I think we try not to do that when stdout and stderr are going to the same destination, and that check is incorrectly suppressing the stderr write in Tor's CI.

@stevenengler
Contributor Author

stevenengler commented Oct 21, 2024

> Grepping for ERROR-level logs should find the relevant lines identifying the process.

I think we should tell people exactly what to grep for in this case (just the "ERROR" string, I think?).

But it's not great for people using CI, since they may have to download, say, a 100 MB log artifact and grep it locally.

> I think we also try to write ERROR-level logs to stderr in addition to stdout, which is supposed to make them show up on the console when stdout is being redirected. However, I think we try not to do that when stdout and stderr are going to the same destination, and that check is incorrectly suppressing the stderr write in Tor's CI.

As far as I can tell, the logs are written to a file, so stdout and stderr should be pointing to different destinations. Have you looked into this before? (I think you mentioned it before?) Is this a bug in Shadow, or some weird GitLab thing?

@sporksmith
Contributor

> As far as I can tell, the logs are written to a file, so stdout and stderr should be pointing to different destinations. Have you looked into this before? (I think you mentioned it before?) Is this a bug in Shadow, or some weird GitLab thing?

Here it is:

https://github.com/shadow/shadow/blob/9bde9ad0fee46ab1e7136be485f1eaaf773049ad/src/main/shadow.rs#L116C23-L118

Ah right, we only do it when we detect that stdout is not a terminal and stderr is, to try to handle the common case where stdout is redirected to a file and stderr is the terminal. I think in the Tor (GitLab) CI we end up deciding it's not a terminal.
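The heuristic described above can be expressed in a few lines with std::io::IsTerminal; this is a sketch of the idea, not Shadow's exact code:

    use std::io::IsTerminal;

    // Mirror ERROR-level logs to stderr only when stdout looks redirected
    // (not a terminal) while stderr still looks like a terminal. This is
    // the heuristic that misfires in CI, where neither stream is a tty.
    fn should_mirror_errors_to_stderr() -> bool {
        !std::io::stdout().is_terminal() && std::io::stderr().is_terminal()
    }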

I'm not sure whether it's worth fixing the terminal detection logic, since I'm not sure in general what the "right" answer is as to whether the job output is a terminal in this case.

Maybe instead of trying to suppress the stderr outputs, we could add some prefix to distinguish them in the case where they're written to the same destination? I think in general stderr is supposed to be more "human readable". Maybe it could be something like "An error was logged: ".
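A sketch of that prefix idea (the function name and exact prefix string are just illustrative):

    use std::io::Write;

    // Write a human-readable duplicate of an already-formatted ERROR log
    // line to stderr. The prefix makes the duplication easy to recognize
    // if stdout and stderr end up merged into the same destination.
    fn mirror_error_to_stderr(formatted_record: &str) {
        let _ = writeln!(std::io::stderr(), "An error was logged: {formatted_record}");
    }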

@sporksmith
Contributor

(I think I was remembering the original idea to detect when the two descriptors are writing to the same file description, but that there's no general way to detect that as far as we could tell)

@sporksmith
Contributor

> (I think I was remembering the original idea to detect when the two descriptors are writing to the same file description, but that there's no general way to detect that as far as we could tell)

Actually, I think using fstat and comparing the device and inode would probably mostly work? But maybe we're trying to be too clever here and just prefixing the stderr lines would have less potential for weird edge cases and surprises.
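A sketch of that fstat comparison on Linux, assuming the libc crate; note it detects "same file", not "same file description", so two independent opens of the same log file would still compare equal:

    use std::io::Error;
    use std::mem::MaybeUninit;

    // Compare the device and inode of two file descriptors. If both
    // match, the descriptors almost certainly refer to the same file,
    // e.g. stdout and stderr both redirected to the same log file.
    fn same_dev_and_inode(fd_a: libc::c_int, fd_b: libc::c_int) -> std::io::Result<bool> {
        fn fstat(fd: libc::c_int) -> std::io::Result<libc::stat> {
            let mut st = MaybeUninit::<libc::stat>::uninit();
            if unsafe { libc::fstat(fd, st.as_mut_ptr()) } != 0 {
                return Err(Error::last_os_error());
            }
            Ok(unsafe { st.assume_init() })
        }
        let (a, b) = (fstat(fd_a)?, fstat(fd_b)?);
        Ok(a.st_dev == b.st_dev && a.st_ino == b.st_ino)
    }

Calling same_dev_and_inode(libc::STDOUT_FILENO, libc::STDERR_FILENO) at startup would cover both the `2>&1` case and two separate opens of the same file.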

sporksmith added a commit to sporksmith/shadow that referenced this issue Oct 22, 2024
Trying to detect whether stdout and stderr are terminals is tricky.  In
particular, in CI environments where job output is being logged but is
also displayed in some web-pseudo-terminal it's difficult to say what
the right behavior is.

It's probably better to have consistent behavior here that doesn't
quietly change under you based on what type of device we think stdout
and stderr are.

Effectively reverts 223aa4a.

Fixes shadow#3426.
sporksmith added a commit to sporksmith/shadow that referenced this issue Oct 23, 2024
Trying to detect whether stdout and stderr are terminals is tricky.  In
particular, in CI environments where job output is being logged but is
also displayed in some web-pseudo-terminal it's difficult to say what
the right behavior is.

It's probably better to have consistent behavior here that doesn't
quietly change under you based on what type of device we think stdout
and stderr are.

Effectively reverts 223aa4a.

Progress on shadow#3426.
sporksmith added a commit that referenced this issue Oct 23, 2024
Trying to detect whether stdout and stderr are terminals and
deduplicating based on that can be tricky, e.g. in CI environments.

Instead:

* Make the `stderr` version more of a human-consumable message instead
of the raw log line, which also makes the duplication easier to
understand in the case where stdout and stderr are merged.
* Disable this feature in our cmake tests, where stdout and stderr
are merged.

New output on a process failure in the unmerged case (where stdout is
written to a file and stderr is the terminal):

    ** Starting Shadow 3.2.0
    Error: process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
    Error: Failed to run the simulation
    Error:
    Error: Caused by:
    Error:     1 managed processes in unexpected final state
    ** Shadow did not complete successfully: Failed to run the simulation
    **   1 managed processes in unexpected final state
    ** See the log for details

The merged case is a little ugly, but at least it's harder to mistake
the duplicated message on stderr for a second distinct log line:

    ...
    00:00:00.016603 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [shadow_rs::host::process] process 'testnode.test_sleep.1000' has completed or is otherwise no longer running
    Error: process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
    00:00:00.016660 [507267:shadow-worker] 00:00:01.000000000 [ERROR] [testnode:11.0.0.1] [process.rs:1571] [shadow_rs::host::process] process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
    00:00:00.016710 [507267:shadow-worker] 00:00:01.000000000 [TRACE] [testnode:11.0.0.1] [process.rs:750] [shadow_rs::host::process] Not notifying parent of exit: no signal specified
    ...

Progress on #3426.