Problem: cannot monitor state of queues at runtime #3383
Conversation
src/socket_base.cpp (outdated diff):
uint64_t inbound_queue_count_,
endpoint_uri_pair_t *endpoint_pair_)
{
    event (*endpoint_pair_, outbound_queue_count_, ZMQ_EVENT_PIPES_STATS);
TODO: refactor the event function to allow multiple values, so that a single event can be used
Another question here: by having the event raised from the application thread, we run into the issue that a dummy getsockopt or poll has to be done on the monitored socket for the event to be raised. This would be solved by doing this in the pipe class from the I/O thread, but the issue then becomes that the I/O thread, rather than just processing a command and forwarding it, has to acquire the monitor mutex, which is most likely much worse for performance.
I'd be inclined to leave this to be raised on the application thread. Thoughts? @somdoron @sigiesec
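For illustration, a minimal sketch of what this looks like from the application's side, assuming the draft API names introduced by this PR (zmq_socket_monitor_pipes_stats and ZMQ_EVENT_PIPES_STATS, behind ZMQ_BUILD_DRAFT_API); error handling omitted:

//  Sketch only, not code from the PR; assumes the draft API is enabled.
#include <zmq.h>

void request_pipes_stats (void *monitored_socket)
{
    //  Ask the socket to emit one ZMQ_EVENT_PIPES_STATS per pipe.
    zmq_socket_monitor_pipes_stats (monitored_socket);

    //  Because the stats command is processed on the application thread, a
    //  dummy getsockopt (or a poll) on the monitored socket is needed before
    //  the event actually appears on the monitor socket.
    int events;
    size_t events_size = sizeof (events);
    zmq_getsockopt (monitored_socket, ZMQ_EVENTS, &events, &events_size);
}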
How often is the mutex acquired? Does it also happen when monitoring is not enabled?
No, only when the request is triggered and being processed - so once per probe
How often does a probe happen? Is it on an interval, or only when the user requests it?
The user requesting it, although I'm considering allowing an interval as well, as it might be useful.
So yes, application thread is better.
Why does the monitoring need a mutex?
The monitoring needs a mutex, since the stored monitor socket and the set of enabled events are accessed from both the application (through zmq_socket_monitor*) and the I/O thread. Maybe this could be solved without a mutex, but it has been like this for a long time. A command could be used to update the values instead, but the behaviour of zmq_socket_monitor should be synchronous, so we would need to wait for the command to be received.
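To illustrate the sharing described above, a small sketch (not the actual libzmq internals; the names are made up) of why a lock is involved: the monitor endpoint and the event mask are written by the application thread via zmq_socket_monitor* and read by whichever thread raises an event, so both sides take the mutex:

//  Illustrative sketch only, not libzmq code.
#include <mutex>
#include <cstdint>

struct monitor_state_t
{
    std::mutex sync;                 //  guards the two fields below
    void *monitor_socket = nullptr;  //  PAIR socket that events are pushed to
    uint64_t monitor_events = 0;     //  bitmask of enabled ZMQ_EVENT_* flags
};

//  Called from the application thread through zmq_socket_monitor ().
void enable_monitor (monitor_state_t &state_, void *socket_, uint64_t events_)
{
    std::lock_guard<std::mutex> lock (state_.sync);
    state_.monitor_socket = socket_;
    state_.monitor_events = events_;
}

//  Called from whichever thread raises an event (application or I/O thread).
bool should_emit (monitor_state_t &state_, uint64_t event_)
{
    std::lock_guard<std::mutex> lock (state_.sync);
    return state_.monitor_socket != nullptr
           && (state_.monitor_events & event_) != 0;
}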
I considered waiting for the command, but especially without unit tests for the commands pipe it would get tricky to avoid regressions.
output from the temporary test print:
(force-pushed from bdfb50c to 5fbaab9)
Currently working on improving the tests, then I'll squash the double events into one.
(force-pushed from ffa58f0 to fd043e8)
I've done the single-event change, and the test appears to work fine with multiple pipes, with different queue lengths.
(force-pushed from 049a460 to 64d2681)
The valgrind build is still failing, locally too, most likely a race condition. Once that is solved, this should be ready.
I ran with valgrind locally and did not manage to make it fail. But I saw that the basic v2 monitor test already fails sometimes on Travis with valgrind, where an unexpected event is received. Not yet sure if this is something to expect in the test.
Yeah, there are 2 extra events, a disconnect and a retried, when running in valgrind. Not sure why; probably the monitor terminates later due to the slowness.
(force-pushed from 25da146 to bf45992)
Solution: remove them for now
Solution: fix formatting
Solution: add it to doc/Makefile.am
Solution: check for the available version instead
Solution: do it, as above 32 bytes they might be on the heap
and I can't reproduce it.
I think the cause for the monitor test failure is the following:
The same race condition probably already exists in the old test case, but happens less often.
Eh sorry, we don't get a ZMQ_EVENT_CLOSED on the client side at all. This is only emitted by the connecter at the moment, if the socket is closed before a connection is established. Not sure if this is really consistent. We only get a ZMQ_EVENT_MONITOR_STOPPED at the moment.
The problem is that the test only enables the _STATS event, nothing else, so it can't be an issue of a wrong expected event :-/
Ok, I think we are talking about two different things here. I will try to give the stats test a look later on.
Yeah, I fixed the other tests in a commit here; as you said, the events didn't match every now and then. Right now it's the ipv4 stats test that fails, with no discernible log message whatsoever, and it can't be reproduced locally :-/
It looks like it's timing out when doing the send in a loop; reducing the HWM seems to help, but now it breaks on OS X...
(force-pushed from 091e9c7 to c01eaec)
Ok, I finally got it: it was a leftover from when the test was initially pub-sub and was busy-looping with _DONTWAIT. It was too much for valgrind, which just timed out. Changing to a blocking send, which is fine for push-pull, fixes the issue. So, as far as I can see, this is finally ready.
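For reference, a rough sketch of the change described above, with hypothetical names (test_send in particular is made up): the old version busy-looped on a non-blocking send, the fix is a plain blocking send, which is fine for push-pull because it simply waits once the HWM is reached:

//  Sketch only, not the actual test code.
#include <zmq.h>

void test_send (void *push_socket_, int count_)
{
    for (int i = 0; i < count_; ++i) {
        //  Old approach, too hot under valgrind:
        //    while (zmq_send (push_socket_, "a", 1, ZMQ_DONTWAIT) == -1)
        //        ;  //  spin until the pipe has room again
        //  New approach: block until the message can be queued.
        zmq_send (push_socket_, "a", 1, 0);
    }
}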
@sigiesec ping :-) does it look good now?
// The endpoints strings are not set on bind, set them here so that
// events can use them.
pipes[0]->set_endpoint_pair (engine_->get_endpoint ());
pipes[1]->set_endpoint_pair (engine_->get_endpoint ());
I am not sure; somehow it feels like the endpoint pair should be reversed on one of the pipes. But I can't see where this might have an effect.
I guess it depends on how one interprets the local and remote attributes - if the reference point is the application thread then yes, but if the reference point is the socket itself then no
Well yes, local and remote are the same with respect to the socket. Still, the direction is different. Hm. Let's see when analysing the data from the monitoring :)
Yes, looks good to me. I added two minor comments, but these can be addressed later on :)
Solution: add API and ZMQ_EVENT_PIPES_STATS event which generates 2 values, one for the egress and one for the ingress pipes respectively. Refactor the events code to be able to send multiple values.
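For anyone consuming the new event, a sketch of draining one version-2 monitor event, assuming the v2 frame layout (event number, value count, the values, then the local and remote endpoint strings, with all counters as 64-bit integers); for ZMQ_EVENT_PIPES_STATS the two values would be the outbound and inbound queue counts:

//  Sketch only; error handling omitted, frame layout as assumed above.
#include <zmq.h>
#include <cstdint>
#include <cstring>
#include <cstdio>

static uint64_t recv_u64 (void *monitor_)
{
    zmq_msg_t msg;
    zmq_msg_init (&msg);
    zmq_msg_recv (&msg, monitor_, 0);
    uint64_t value = 0;
    std::memcpy (&value, zmq_msg_data (&msg), sizeof value);
    zmq_msg_close (&msg);
    return value;
}

void read_pipes_stats_event (void *monitor_)
{
    const uint64_t event = recv_u64 (monitor_);     //  e.g. ZMQ_EVENT_PIPES_STATS
    const uint64_t n_values = recv_u64 (monitor_);  //  2 for the pipes-stats event
    for (uint64_t i = 0; i < n_values; ++i)
        std::printf ("value %llu of event %llu: %llu\n",
                     (unsigned long long) i, (unsigned long long) event,
                     (unsigned long long) recv_u64 (monitor_));
    //  The local and remote endpoint string frames follow; receiving them is
    //  left out of this sketch.
}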
great - so can we merge now?
@bluca Merged it now, thanks a lot for working on this :)
thanks!
Sorry for the necro-post, but could someone please clarify the following: the docs say that "the pipe stats on the sender will not get updated until the receiver has processed at least lwm ((hwm + 1) / 2) messages". What happens if the receiver has been unavailable the whole time, e.g. the network has been down? Will the stats message report a 0 queue length?
Yes.
Oh well, this is quite useless for our monitoring purposes then... What is "lwm ((hwm + 1) / 2)" anyway? Is it lwm multiplied by ((hwm + 1) / 2), or something else?
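As an aside on the formula itself: reading the quoted sentence, the parenthetical looks like the definition of lwm, i.e. the low-water mark, rather than a multiplier. A tiny worked example under that assumption:

//  Assumption based only on the quoted docs, not verified against the code:
//  lwm = (hwm + 1) / 2, so e.g. a HWM of 1000 gives a LWM of 500.
#include <cstdio>

int main ()
{
    const int hwm = 1000;           //  ZMQ_SNDHWM / ZMQ_RCVHWM value
    const int lwm = (hwm + 1) / 2;  //  low-water mark per the quoted docs
    std::printf ("hwm=%d -> lwm=%d\n", hwm, lwm);  //  prints: hwm=1000 -> lwm=500
    return 0;
}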
Sorry, but looking at the code, it appears not to be the case.
We've been using a custom patched zeromq internally for quite a while, with similar queue length monitoring logic. From what I can remember,
Solution: add API and event
Fixes #3403