Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove component status infrastructure in favor of reusing start/stop #436

Merged
merged 1 commit into from
Jul 9, 2024

Conversation

brandur
Copy link
Contributor

@brandur brandur commented Jul 8, 2024

We've made quite a few changes in recent months around winding
stop/start into everything to the point where all services are now
compliant with this concept, including the main client.

Quite a long time ago we put in a "component monitor" framework to help
tests wait for clients to successfully start up. It worked reasonably
well, but is a large amount of code, not very generic, and involves
winding callback functions into all components, which is a little
unsightly.

Here, I'm proposing that we replace the job component monitor was doing
with a more simple wait on the Started channel of the client:

startClient(ctx, t, client)
riversharedtest.WaitOrTimeout(t, client.baseStartStop.Started())

This is made possible because similar to how the component monitor
framework worked, the client doesn't signal that it's fully started
until it's successfully waited for all its subservices to start up:

func (c *Client[TTx]) Start(ctx context.Context) error {
    ...

    go func() {
        // Wait for all subservices to start up before signaling our own start.
        // This isn't strictly needed, but gives tests a way to fully confirm
        // that all goroutines for subservices are spun up before continuing.
        //
        // Stop also cancels the "started" channel, so in case of a context
        // cancellation, this statement will fall through. The client will
        // briefly start, but then immediately stop again.
        startstop.WaitAllStarted(append(
            c.services,
            producerServices..., // see comment on this variable
        )...)

        started()
        defer stopped()

So all in all, we should be able to get similar guarantees for test
purposes while needing less code and getting more reusability (in that
non-client services can also use exactly the same code to wait for start
up).

We've made quite a few changes in recent months around winding
stop/start into everything to the point where all services are now
compliant with this concept, including the main client.

Quite a long time ago we put in a "component monitor" framework to help
tests wait for clients to successfully start up. It worked reasonably
well, but is a large amount of code, not very generic, and involves
winding callback functions into all components, which is a little
unsightly.

Here, I'm proposing that we replace the job component monitor was doing
with a more simple wait on the `Started` channel of the client:

    startClient(ctx, t, client)
    riversharedtest.WaitOrTimeout(t, client.baseStartStop.Started())

This is made possible because similar to how the component monitor
framework worked, the client doesn't signal that it's fully started
until it's successfully waited for all its subservices to start up:

    func (c *Client[TTx]) Start(ctx context.Context) error {
        ...

        go func() {
            // Wait for all subservices to start up before signaling our own start.
            // This isn't strictly needed, but gives tests a way to fully confirm
            // that all goroutines for subservices are spun up before continuing.
            //
            // Stop also cancels the "started" channel, so in case of a context
            // cancellation, this statement will fall through. The client will
            // briefly start, but then immediately stop again.
            startstop.WaitAllStarted(append(
                c.services,
                producerServices..., // see comment on this variable
            )...)

            started()
            defer stopped()

So all in all, we should be able to get similar guarantees for test
purposes while needing less code and getting more reusability (in that
non-client services can also use exactly the same code to wait for start
up).
@brandur
Copy link
Contributor Author

brandur commented Jul 8, 2024

@bgentry I ran the test matrix 12x times for a total of 72 runs with zero failures. Pretty sure this works — what do you think?

@brandur brandur requested a review from bgentry July 8, 2024 05:32
@bgentry
Copy link
Contributor

bgentry commented Jul 8, 2024

The one thing that concerns me about this removal is that it feels like we're losing insight into what we're still waiting on, or which components may have failed and been restarted. For example, at any given moment, which components are healthy and which ones are having errors? Is there some kind of debug snapshot that could be printed or serialized with this info for all internal components?

I think we can probably find a way to work that into what we're replacing this with so long as we're mindful of it.

Copy link
Contributor

@bgentry bgentry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to what I mentioned in the last comment, another thing missing here is a way for users of the client to stay informed of the client's health so that they might decide whether to exit or restart the client. Again I think this is something we can find a way to work into the new model, so long as we're aware of it.

@brandur
Copy link
Contributor Author

brandur commented Jul 9, 2024

Awesome. Yeah I've got some ideas for that for sure — the first step will be to have services start to identify themselves so we can produce better error messages in case something failed to start or stop. We can go from there. Thanks!

@brandur brandur merged commit 868d569 into master Jul 9, 2024
10 checks passed
@brandur brandur deleted the brandur-remove-client-monitor branch July 9, 2024 00:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants