
Problem: cannot monitor state of queues at runtime #3383

Merged: 9 commits merged into zeromq:master from the queue_monitor branch on Feb 11, 2019

Conversation

@bluca (Member) commented Feb 3, 2019

Solution: add API and event

Fixes #3403

uint64_t inbound_queue_count_,
endpoint_uri_pair_t *endpoint_pair_)
{
event (*endpoint_pair_, outbound_queue_count_, ZMQ_EVENT_PIPES_STATS);
bluca (Member Author):

TODO: refactor the event function to allow multiple values, so that a single event can be used
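A minimal sketch of what that refactor could look like; the signature below is an assumption for illustration, not necessarily the one that ended up in the final commits:

    //  Hypothetical sketch: let the event helper accept an array of values so a
    //  single ZMQ_EVENT_PIPES_STATS event can carry both queue counts at once.
    void event (const endpoint_uri_pair_t &endpoint_uri_pair_,
                uint64_t values_[],
                uint64_t values_count_,
                uint64_t type_);

    //  The call site above would then collapse the two per-direction events into:
    uint64_t values[2] = {outbound_queue_count_, inbound_queue_count_};
    event (*endpoint_pair_, values, 2, ZMQ_EVENT_PIPES_STATS);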

bluca (Member Author):

Another question here: by having the event raised from the application thread, we run into the issue that a dummy getsockopt or poll has to be done on the monitored socket for the event to be raised. This would be solved by raising it in the pipe class from the I/O thread, but then the I/O thread, rather than just processing a command and forwarding it, has to acquire the monitor mutex, which is most likely much worse for performance.

I'd be inclined to leave this to be raised on the application thread. Thoughts? @somdoron @sigiesec
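To illustrate the application-thread behaviour being discussed, a sketch of what a caller would have to do, assuming the draft zmq_socket_monitor_pipes_stats API from this PR; the zero-timeout poll is the "dummy" step mentioned above:

    /* Ask the monitored socket to emit ZMQ_EVENT_PIPES_STATS for its pipes. */
    zmq_socket_monitor_pipes_stats (socket);

    /* Because the event is raised from the application thread, the socket has
       to process its pending commands before the event shows up on the monitor;
       a zero-timeout poll (or any getsockopt) on the monitored socket does that. */
    zmq_pollitem_t items[1] = {{socket, 0, ZMQ_POLLIN, 0}};
    zmq_poll (items, 1, 0);

    /* The stats event can now be received from the monitor socket. */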

Member:

How often is the mutex taken? Is it also taken when monitoring is not enabled?

bluca (Member Author):

No, only when the request is triggered and being processed - so once per probe

Member:

How often does a probe happen? Is it on an interval, or on user request?

bluca (Member Author):

User requesting it, although I'm considering allowing an interval as well, as it might be useful

Member:

So yes, the application thread is better.
Why does the monitoring need a mutex?

sigiesec (Member) commented Feb 4, 2019:

The monitoring needs a mutex, since the stored monitor socket and the set of enabled events is accessed from both the application (through zmq_socket_monitor*) and the I/O thread. Maybe this could be solved without a mutex, but it has been like this for a long time. A command could be used to update the values instead? But the behaviour of zmq_socket_monitor should be synchronous, so we would need to wait for the command to be received.
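Roughly what that synchronisation looks like, as a simplified sketch (member names such as _monitor_sync and _monitor_events are assumptions based on socket_base_t, not the literal code):

    //  Simplified sketch, not the literal libzmq code.
    class socket_base_t
    {
        mutex_t _monitor_sync;    //  guards the two members below
        void *_monitor_socket;    //  written by zmq_socket_monitor* (application thread)
        int64_t _monitor_events;  //  read whenever an event is emitted (I/O or app thread)

        void event (const endpoint_uri_pair_t &endpoint_, uint64_t value_, int type_)
        {
            scoped_lock_t lock (_monitor_sync);
            if (0 == (_monitor_events & type_))
                return;
            //  ... build and send the event message on _monitor_socket ...
        }
    };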

bluca (Member Author):

I considered waiting for the command, but especially without unit tests for the commands pipe it would get tricky avoiding regressions.

@bluca (Member Author) commented Feb 3, 2019

output from the temporary test print:

addr tcp://127.0.0.1:40373 tcp://127.0.0.1:55920 1000
addr tcp://127.0.0.1:40373 tcp://127.0.0.1:55920 1
addr tcp://127.0.0.1:38895 tcp://127.0.0.1:56834 100
addr tcp://127.0.0.1:38895 tcp://127.0.0.1:56834 1
addr tcp://127.0.0.1:33077 tcp://127.0.0.1:40134 500
addr tcp://127.0.0.1:33077 tcp://127.0.0.1:40134 1
tests/test_hwm_pubsub.cpp:378:test_tcp:PASS
tests/test_hwm_pubsub.cpp:379:test_inproc:PASS
addr ipc:///tmp/tmpIuNx7l/socket ipc:// 500
addr ipc:///tmp/tmpIuNx7l/socket ipc:// 1
addr ipc:///tmp/tmpQEi6hE/socket ipc:// 100
addr ipc:///tmp/tmpQEi6hE/socket ipc:// 1
addr ipc:///tmp/tmpaZi66Y/socket ipc:// 500
addr ipc:///tmp/tmpaZi66Y/socket ipc:// 1
tests/test_hwm_pubsub.cpp:381:test_ipc:PASS
tests/test_hwm_pubsub.cpp:383:test_reset_hwm:PASS

-----------------------
4 Tests 0 Failures 0 Ignored 
OK

@bluca force-pushed the queue_monitor branch 2 times, most recently from bdfb50c to 5fbaab9 on February 4, 2019 08:40
@bluca (Member Author) commented Feb 4, 2019

Currently working on improving the tests, then I'll squash the double events into one

@bluca force-pushed the queue_monitor branch 4 times, most recently from ffa58f0 to fd043e8 on February 5, 2019 15:24
@bluca (Member Author) commented Feb 5, 2019

I've done the single-event change, and the test appears to work fine with multiple pipes, with different queue lengths

@bluca force-pushed the queue_monitor branch 4 times, most recently from 049a460 to 64d2681 on February 6, 2019 17:22
@bluca (Member Author) commented Feb 6, 2019

The valgrind build is still failing, locally too, most likely a race condition. Once that is solved, this should be ready

@sigiesec (Member) commented Feb 6, 2019

I ran with valgrind locally and could not get it to fail. But I saw that the basic v2 monitor test already fails sometimes on Travis with valgrind, where an unexpected event is received. Not yet sure if this is something the test should expect.

@bluca (Member Author) commented Feb 6, 2019

yeah, there are 2 extra events, a disconnect and a connect-retried, when running in valgrind - not sure why, probably the monitor terminates later due to the slowness

@bluca force-pushed the queue_monitor branch 2 times, most recently from 25da146 to bf45992 on February 6, 2019 23:12
bluca added 5 commits on February 7, 2019 16:24, including:
Solution: check for the available version instead
Solution: do it, as above 32 bytes they might be on the heap
@bluca (Member Author) commented Feb 7, 2019

and I can't reproduce it

@sigiesec (Member) commented Feb 8, 2019

I think the cause for the monitor test failure is the following:

  • while we initiate the closing of the client socket before the server socket, this is asynchronous, so the server socket might close first, causing the client to try to reconnect before being closed
  • this needs to be synchronized by first waiting for the ZMQ_EVENT_CLOSED on the client before proceeding to close the server socket (or generalizing the assertions by expecting this flow of events)

The same race condition probably already exists in the old test case, but happens less often.

@sigiesec (Member) commented Feb 8, 2019

Eh sorry, we don't get a ZMQ_EVENT_CLOSED on the client side at all. This is only emitted by the connecter at the moment, if the socket is closed before a connection is established. Not sure if this is really consistent.

We only get a ZMQ_EVENT_MONITOR_STOPPED at the moment.

@bluca (Member Author) commented Feb 8, 2019

The problem is that the test only enables the _STATS event, nothing else - so it can't be an issue of wrong event expected :-/

@sigiesec (Member) commented Feb 8, 2019

Ok, I think we are talking about two different things here. I will try to give the stats test a look later on.

@bluca (Member Author) commented Feb 8, 2019

Yeah, I fixed the other tests in a commit here - as you said, the events didn't match every now and then. Right now it's the IPv4 stats test that fails, with no discernible log message whatsoever, and it can't be reproduced locally :-/

@bluca (Member Author) commented Feb 8, 2019

It looks like it's timing out when doing the send in a loop; reducing the HWM seems to help, but now it breaks on OSX...

@bluca force-pushed the queue_monitor branch 4 times, most recently from 091e9c7 to c01eaec on February 8, 2019 23:21
@bluca (Member Author) commented Feb 8, 2019

Ok, I finally got it - it was a leftover from when the test was initially pub-sub and was busy-looping with _DONTWAIT. That was too much for valgrind, which just timed out. Changing to a blocking send, which is fine for push-pull, fixes the issue.

So, as far as I can see, this is finally ready
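For reference, roughly the shape of that change (a sketch with placeholder names such as push and buf, not the exact test code):

    //  Before (sketch): busy-looping on EAGAIN with ZMQ_DONTWAIT, a leftover
    //  from the pub-sub version of the test; too slow under valgrind.
    while (zmq_send (push, buf, sizeof buf, ZMQ_DONTWAIT) < 0 && zmq_errno () == EAGAIN)
        ;

    //  After (sketch): a plain blocking send, which is fine for PUSH/PULL.
    TEST_ASSERT_SUCCESS_ERRNO (zmq_send (push, buf, sizeof buf, 0));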

@bluca (Member Author) commented Feb 10, 2019

@sigiesec ping :-) does it look good now?

// The endpoints strings are not set on bind, set them here so that
// events can use them.
pipes[0]->set_endpoint_pair (engine_->get_endpoint ());
pipes[1]->set_endpoint_pair (engine_->get_endpoint ());
Member:

I am not sure; somehow it feels that the endpoint pair should be reversed on one of the pipes. But I can't see where this might have an effect.

bluca (Member Author):

I guess it depends on how one interprets the local and remote attributes - if the reference point is the application thread then yes, but if the reference point is the socket itself then no

Member:

Well yes, local and remote are the same wrt the socket. Still, the direction is different. Hm. Let's see when analysing the data from the monitoring :)

@sigiesec (Member) commented:
Yes, looks good to me. I added two minor comments, but these can be addressed later on :)
Hope I will be able to test this soon on our side :)

Solution: add API and ZMQ_EVENT_PIPES_STATS event which generates 2
values, one for the egress and one for the ingress pipes respectively.
Refactor the events code to be able to send multiple values.
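For later readers, a rough usage sketch of the result, assuming the draft names added in this PR (zmq_socket_monitor_versioned with event version 2 and zmq_socket_monitor_pipes_stats); error handling is omitted and names like ctx and socket are placeholders:

    void *monitor = zmq_socket (ctx, ZMQ_PAIR);

    //  Version-2 monitor events are needed, since ZMQ_EVENT_PIPES_STATS
    //  carries two values (egress and ingress queue counts).
    zmq_socket_monitor_versioned (socket, "inproc://monitor-pipes",
                                  ZMQ_EVENT_PIPES_STATS, 2, ZMQ_PAIR);
    zmq_connect (monitor, "inproc://monitor-pipes");

    //  Request the stats; a dummy poll/getsockopt on the monitored socket then
    //  lets the command be processed (see the discussion above), after which
    //  the event frames can be read from the monitor socket.
    zmq_socket_monitor_pipes_stats (socket);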
@bluca (Member Author) commented Feb 10, 2019

great - so can we merge now?

@sigiesec merged commit 119a258 into zeromq:master on Feb 11, 2019
@sigiesec deleted the queue_monitor branch on February 11, 2019 09:49
@sigiesec (Member) commented:
@bluca Merged it now, thanks a lot for working on this :)

@bluca (Member Author) commented Feb 11, 2019

thanks!

@viciious (Contributor) commented Nov 7, 2019

Sorry for the necro-post, but could someone please clarify the following: the docs say that "the pipe stats on the sender will not get updated until the receiver has processed at least lwm ((hwm + 1) / 2) messages". What happens if the receiver has been unavailable the whole time, e.g. the network has been down? Will the stats message report a queue length of 0?

@bluca (Member Author) commented Nov 7, 2019

yes

@viciious (Contributor) commented Nov 7, 2019

> yes

Oh well, this is quite useless for our monitoring purposes then...

What is "lwm ((hwm + 1) / 2)" anyway? Is it lwm multipled by ((hwm + 1) / 2) or something else?

@viciious (Contributor) commented Nov 7, 2019

> yes

Sorry, but looking at the code, it appears not to be the case.

    send_pipe_peer_stats (_peer, _msgs_written - _peers_msgs_read, socket_base_, ep);

We've been using a custom patched zeromq internally for quite a while, with similar queue-length monitoring logic. From what I can remember, _peers_msgs_read on the sender indeed only gets updated when the receiver has reached its lwm, but _msgs_written is incremented whenever the sender has queued another message. In case the receiver is offline, send_pipe_peer_stats will simply write _msgs_written to the stats message, as _peers_msgs_read is 0. Am I missing something here?
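A worked example of that reading, with hypothetical numbers:

    uint64_t msgs_written = 100;   //  sender queued 100 messages while the peer was offline
    uint64_t peers_msgs_read = 0;  //  receiver never came up, so it never reached its lwm
    uint64_t reported_queue = msgs_written - peers_msgs_read;  //  event reports 100, not 0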
