
Print processes that ended in "unexpected final state" #3426

Open
stevenengler opened this issue Oct 21, 2024 · 5 comments
Labels
Type: Enhancement New functionality or improved design

Comments

@stevenengler
Contributor

When a Shadow simulation ends and not all processes are in their expected final state, Shadow prints:

    ** Shadow did not complete successfully: Failed to run the simulation
    **   9 managed processes in unexpected final state
    ** See the log for details

to stderr, and expects the user to look through possibly hundreds of MB of log messages to figure out which processes failed and what state they were actually in.

Shadow should list the processes here (in stderr) and at the end of the log output.
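For illustration, a minimal sketch of what that per-process summary could look like (the type and function names here are hypothetical, not Shadow's actual internals):

    /// Hypothetical record of a managed process that ended in an
    /// unexpected final state (not Shadow's real internal type).
    struct UnexpectedFinalState {
        process_name: String, // e.g. "testnode.test_sleep.1000"
        expected: String,     // e.g. "exited: 0"
        actual: String,       // e.g. "exited: 101"
    }

    /// Print a per-process summary to stderr alongside the existing
    /// count, so users don't have to grep the full log for failures.
    fn print_unexpected_final_states(failures: &[UnexpectedFinalState]) {
        eprintln!(
            "**   {} managed processes in unexpected final state:",
            failures.len()
        );
        for f in failures {
            eprintln!(
                "**     {} (expected {}, was {})",
                f.process_name, f.expected, f.actual
            );
        }
    }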

@stevenengler stevenengler added the Type: Bug Error or flaw producing unexpected results label Oct 21, 2024
@stevenengler stevenengler added Type: Enhancement New functionality or improved design and removed Type: Bug Error or flaw producing unexpected results labels Oct 21, 2024
@sporksmith
Contributor

> to stderr, and expects the user to look through possibly hundreds of MB of log messages to figure out which processes failed and what state they were actually in.

Grepping for ERROR-level logs should find the relevant lines identifying the process. I think we also try to write ERROR-level logs to stderr in addition to stdout, which is supposed to make them show up on the console when stdout is being redirected. However, I think we try not to do that when stdout and stderr are going to the same destination, and that check is incorrectly suppressing the stderr write in Tor's CI.

@stevenengler
Contributor Author

stevenengler commented Oct 21, 2024

> Grepping for ERROR-level logs should find the relevant lines identifying the process.

I think we should tell people exactly what to grep for in this case (just the "ERROR" string, I think?).

But it's not great for people using CI, since they may have to download, say, a 100 MB log artifact and grep it locally.

> I think we also try to write ERROR-level logs to stderr in addition to stdout, which is supposed to make them show up on the console when stdout is being redirected. However, I think we try not to do that when stdout and stderr are going to the same destination, and that check is incorrectly suppressing the stderr write in Tor's CI.

As far as I can tell, the logs are written to a file, so stdout and stderr should be pointing to different destinations. Have you looked into this before? (I think you mentioned it before?) Is this a bug in Shadow, or some weird GitLab thing?

@sporksmith
Contributor

> As far as I can tell, the logs are written to a file, so stdout and stderr should be pointing to different destinations. Have you looked into this before? (I think you mentioned it before?) Is this a bug in Shadow, or some weird GitLab thing?

Here it is:

https://github.com/shadow/shadow/blob/9bde9ad0fee46ab1e7136be485f1eaaf773049ad/src/main/shadow.rs#L116C23-L118

Ah right, we only do it when we detect that stdout is not a terminal and stderr is, to try to handle the common case where stdout is redirected to a file and stderr is the terminal. I think in the Tor (GitLab) CI we end up deciding it's not a terminal.
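The heuristic described above can be expressed in a few lines with std::io::IsTerminal; this is a sketch of the idea, not Shadow's exact code:

    use std::io::IsTerminal;

    // Mirror ERROR-level logs to stderr only when stdout looks redirected
    // (not a terminal) while stderr still looks like a terminal. This is
    // the heuristic that misfires in CI, where neither stream is a tty.
    fn should_mirror_errors_to_stderr() -> bool {
        !std::io::stdout().is_terminal() && std::io::stderr().is_terminal()
    }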

I'm not sure whether it's worth fixing the terminal detection logic, since I'm not sure in general what the "right" answer is as to whether the job output is a terminal in this case.

Maybe instead of trying to suppress the stderr outputs, we could add some prefix to distinguish them in the case where they're written to the same destination? I think in general stderr is supposed to be more "human readable". Maybe it could be something like "An error was logged: ".
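A sketch of that prefix idea (the function name and exact prefix string are just illustrative):

    use std::io::Write;

    // Write a human-readable duplicate of an already-formatted ERROR log
    // line to stderr. The prefix makes the duplication easy to recognize
    // if stdout and stderr end up merged into the same destination.
    fn mirror_error_to_stderr(formatted_record: &str) {
        let _ = writeln!(std::io::stderr(), "An error was logged: {formatted_record}");
    }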

@sporksmith
Contributor

(I think I was remembering the original idea to detect when the two descriptors are writing to the same file description, but that there's no general way to detect that as far as we could tell)

@sporksmith
Contributor

> (I think I was remembering the original idea to detect when the two descriptors are writing to the same file description, but that there's no general way to detect that as far as we could tell)

Actually, I think using fstat and comparing the device and inode would probably mostly work? But maybe we're trying to be too clever here and just prefixing the stderr lines would have less potential for weird edge cases and surprises.
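A sketch of that fstat comparison on Linux, assuming the libc crate; note it detects "same file", not "same file description", so two independent opens of the same log file would still compare equal:

    use std::io::Error;
    use std::mem::MaybeUninit;

    // Compare the device and inode of two file descriptors. If both
    // match, the descriptors almost certainly refer to the same file,
    // e.g. stdout and stderr both redirected to the same log file.
    fn same_dev_and_inode(fd_a: libc::c_int, fd_b: libc::c_int) -> std::io::Result<bool> {
        fn fstat(fd: libc::c_int) -> std::io::Result<libc::stat> {
            let mut st = MaybeUninit::<libc::stat>::uninit();
            if unsafe { libc::fstat(fd, st.as_mut_ptr()) } != 0 {
                return Err(Error::last_os_error());
            }
            Ok(unsafe { st.assume_init() })
        }
        let (a, b) = (fstat(fd_a)?, fstat(fd_b)?);
        Ok(a.st_dev == b.st_dev && a.st_ino == b.st_ino)
    }

Calling same_dev_and_inode(libc::STDOUT_FILENO, libc::STDERR_FILENO) at startup would cover both the `2>&1` case and two separate opens of the same file.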

sporksmith added a commit to sporksmith/shadow that referenced this issue Oct 22, 2024
Trying to detect whether stdout and stderr are terminals is tricky.  In
particular, in CI environments where job output is being logged but is
also displayed in some web-pseudo-terminal it's difficult to say what
the right behavior is.

It's probably better to have consistent behavior here that doesn't
quietly change under you based on what type of device we think stdout
and stderr are.

Effectively reverts 223aa4a.

Fixes shadow#3426.
sporksmith added a commit to sporksmith/shadow that referenced this issue Oct 23, 2024
Trying to detect whether stdout and stderr are terminals is tricky.  In
particular, in CI environments where job output is being logged but is
also displayed in some web-pseudo-terminal it's difficult to say what
the right behavior is.

It's probably better to have consistent behavior here that doesn't
quietly change under you based on what type of device we think stdout
and stderr are.

Effectively reverts 223aa4a.

Progress on shadow#3426.
sporksmith added a commit that referenced this issue Oct 23, 2024
Trying to detect whether stdout and stderr are terminals and
deduplicating based on that can be tricky, e.g. in CI environments.

Instead:

* Make the `stderr` version more of a human-consumable message instead
of the raw log line, which also makes the duplication easier to
understand in the case where stdout and stderr are merged.
* Disable this feature in our cmake tests, where stdout and stderr
are merged.

New output on a process failure in the unmerged case (where stdout is
written to a file and stderr is the terminal):

    ** Starting Shadow 3.2.0
    Error: process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
    Error: Failed to run the simulation
    Error:
    Error: Caused by:
    Error:     1 managed processes in unexpected final state
    ** Shadow did not complete successfully: Failed to run the simulation
    **   1 managed processes in unexpected final state
    ** See the log for details

The merged case is a little ugly, but at least it's harder to mistake
the duplicated message on stderr for a second distinct log line:

    ...
    00:00:00.016603 [507267:shadow-worker] 00:00:01.000000000 [DEBUG] [testnode:11.0.0.1] [process.rs:1486] [shadow_rs::host::process] process 'testnode.test_sleep.1000' has completed or is otherwise no longer running
    Error: process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
    00:00:00.016660 [507267:shadow-worker] 00:00:01.000000000 [ERROR] [testnode:11.0.0.1] [process.rs:1571] [shadow_rs::host::process] process 'testnode.test_sleep.1000' exited with status Normal(101); expected end state was exited: 0 but was exited: 101
    00:00:00.016710 [507267:shadow-worker] 00:00:01.000000000 [TRACE] [testnode:11.0.0.1] [process.rs:750] [shadow_rs::host::process] Not notifying parent of exit: no signal specified
    ...

Progress on #3426.