Remove component status infrastructure in favor of reusing start/stop #436

brandur · 2024-07-08T04:03:26Z

We've made quite a few changes in recent months around winding
stop/start into everything to the point where all services are now
compliant with this concept, including the main client.

Quite a long time ago we put in a "component monitor" framework to help
tests wait for clients to successfully start up. It worked reasonably
well, but is a large amount of code, not very generic, and involves
winding callback functions into all components, which is a little
unsightly.

Here, I'm proposing that we replace the job the component monitor was
doing with a simpler wait on the Started channel of the client:

startClient(ctx, t, client)
riversharedtest.WaitOrTimeout(t, client.baseStartStop.Started())

This is made possible because, similar to how the component monitor
framework worked, the client doesn't signal that it's fully started
until it has successfully waited for all its subservices to start up:

func (c *Client[TTx]) Start(ctx context.Context) error {
    ...

    go func() {
        // Wait for all subservices to start up before signaling our own start.
        // This isn't strictly needed, but gives tests a way to fully confirm
        // that all goroutines for subservices are spun up before continuing.
        //
        // Stop also cancels the "started" channel, so in case of a context
        // cancellation, this statement will fall through. The client will
        // briefly start, but then immediately stop again.
        startstop.WaitAllStarted(append(
            c.services,
            producerServices..., // see comment on this variable
        )...)

        started()
        defer stopped()

So all in all, we should be able to get similar guarantees for test
purposes while needing less code and getting more reusability (in that
non-client services can also use exactly the same code to wait for start
up).


brandur · 2024-07-08T05:32:24Z

@bgentry I ran the test matrix 12 times for a total of 72 runs with zero failures. Pretty sure this works. What do you think?

bgentry · 2024-07-08T14:32:08Z

The one thing that concerns me about this removal is that it feels like we're losing insight into what we're still waiting on, or which components may have failed and been restarted. For example, at any given moment, which components are healthy and which ones are having errors? Is there some kind of debug snapshot that could be printed or serialized with this info for all internal components?

I think we can probably find a way to work that into what we're replacing this with so long as we're mindful of it.

bgentry

In addition to what I mentioned in the last comment, another thing missing here is a way for users of the client to stay informed of the client's health so that they might decide whether to exit or restart the client. Again I think this is something we can find a way to work into the new model, so long as we're aware of it.

brandur · 2024-07-09T00:21:12Z

Awesome. Yeah I've got some ideas for that for sure — the first step will be to have services start to identify themselves so we can produce better error messages in case something failed to start or stop. We can go from there. Thanks!

brandur requested a review from bgentry July 8, 2024 05:32

bgentry approved these changes Jul 8, 2024

brandur merged commit 868d569 into master Jul 9, 2024
10 checks passed

brandur deleted the brandur-remove-client-monitor branch July 9, 2024 00:21
