
Problem: cannot monitor state of queues at runtime #3383

Merged: 9 commits merged into zeromq:master from the queue_monitor branch on Feb 11, 2019

Conversation

@bluca (Member) commented Feb 3, 2019

Solution: add API and event

Fixes #3403

uint64_t inbound_queue_count_,
endpoint_uri_pair_t *endpoint_pair_)
{
event (*endpoint_pair_, outbound_queue_count_, ZMQ_EVENT_PIPES_STATS);
bluca (Member Author):

TODO: refactor the event function to allow multiple values, so that a single event can be used
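A minimal sketch of what that refactor could look like; the signature below is an assumption for illustration, not necessarily the one that ended up in the final commits:

    //  Hypothetical sketch: let the event helper accept an array of values so a
    //  single ZMQ_EVENT_PIPES_STATS event can carry both queue counts at once.
    void event (const endpoint_uri_pair_t &endpoint_uri_pair_,
                uint64_t values_[],
                uint64_t values_count_,
                uint64_t type_);

    //  The call site above would then collapse the two per-direction events into:
    uint64_t values[2] = {outbound_queue_count_, inbound_queue_count_};
    event (*endpoint_pair_, values, 2, ZMQ_EVENT_PIPES_STATS);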

bluca (Member Author):

Another question here: by having the event raised from the application thread, we run into the issue that a dummy getsockopt or poll has to be done on the monitored socket for the event to be raised. This would be solved by raising it in the pipe class from the I/O thread, but then the I/O thread, rather than just processing a command and forwarding it, has to acquire the monitor mutex, which is most likely much worse for performance.

I'd be inclined to leave this to be raised on the application thread. Thoughts? @somdoron @sigiesec
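To illustrate the application-thread behaviour being discussed, a sketch of what a caller would have to do, assuming the draft zmq_socket_monitor_pipes_stats API from this PR; the zero-timeout poll is the "dummy" step mentioned above:

    /* Ask the monitored socket to emit ZMQ_EVENT_PIPES_STATS for its pipes. */
    zmq_socket_monitor_pipes_stats (socket);

    /* Because the event is raised from the application thread, the socket has
       to process its pending commands before the event shows up on the monitor;
       a zero-timeout poll (or any getsockopt) on the monitored socket does that. */
    zmq_pollitem_t items[1] = {{socket, 0, ZMQ_POLLIN, 0}};
    zmq_poll (items, 1, 0);

    /* The stats event can now be received from the monitor socket. */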

Member:

How often is the mutex taken? Is it also taken when monitoring is not enabled?

bluca (Member Author):

No, only when the request is triggered and being processed - so once per probe

Member:

How often does a probe happen? Is it on an interval, or on user request?

bluca (Member Author):

User requesting it, although I'm considering allowing an interval as well, as it might be useful

Member:

So yes, the application thread is better.
Why does the monitoring need a mutex?

sigiesec (Member) commented Feb 4, 2019:

The monitoring needs a mutex, since the stored monitor socket and the set of enabled events is accessed from both the application (through zmq_socket_monitor*) and the I/O thread. Maybe this could be solved without a mutex, but it has been like this for a long time. A command could be used to update the values instead? But the behaviour of zmq_socket_monitor should be synchronous, so we would need to wait for the command to be received.
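Roughly what that synchronisation looks like, as a simplified sketch (member names such as _monitor_sync and _monitor_events are assumptions based on socket_base_t, not the literal code):

    //  Simplified sketch, not the literal libzmq code.
    class socket_base_t
    {
        mutex_t _monitor_sync;    //  guards the two members below
        void *_monitor_socket;    //  written by zmq_socket_monitor* (application thread)
        int64_t _monitor_events;  //  read whenever an event is emitted (I/O or app thread)

        void event (const endpoint_uri_pair_t &endpoint_, uint64_t value_, int type_)
        {
            scoped_lock_t lock (_monitor_sync);
            if (0 == (_monitor_events & type_))
                return;
            //  ... build and send the event message on _monitor_socket ...
        }
    };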

bluca (Member Author):

I considered waiting for the command, but especially without unit tests for the commands pipe it would get tricky avoiding regressions.

@bluca (Member Author) commented Feb 3, 2019

output from the temporary test print:

addr tcp://127.0.0.1:40373 tcp://127.0.0.1:55920 1000
addr tcp://127.0.0.1:40373 tcp://127.0.0.1:55920 1
addr tcp://127.0.0.1:38895 tcp://127.0.0.1:56834 100
addr tcp://127.0.0.1:38895 tcp://127.0.0.1:56834 1
addr tcp://127.0.0.1:33077 tcp://127.0.0.1:40134 500
addr tcp://127.0.0.1:33077 tcp://127.0.0.1:40134 1
tests/test_hwm_pubsub.cpp:378:test_tcp:PASS
tests/test_hwm_pubsub.cpp:379:test_inproc:PASS
addr ipc:///tmp/tmpIuNx7l/socket ipc:// 500
addr ipc:///tmp/tmpIuNx7l/socket ipc:// 1
addr ipc:///tmp/tmpQEi6hE/socket ipc:// 100
addr ipc:///tmp/tmpQEi6hE/socket ipc:// 1
addr ipc:///tmp/tmpaZi66Y/socket ipc:// 500
addr ipc:///tmp/tmpaZi66Y/socket ipc:// 1
tests/test_hwm_pubsub.cpp:381:test_ipc:PASS
tests/test_hwm_pubsub.cpp:383:test_reset_hwm:PASS

-----------------------
4 Tests 0 Failures 0 Ignored 
OK

@bluca force-pushed the queue_monitor branch 2 times, most recently from bdfb50c to 5fbaab9 on February 4, 2019 08:40
@bluca (Member Author) commented Feb 4, 2019

Currently working on improving the tests, then I'll squash the double events into one

@bluca force-pushed the queue_monitor branch 4 times, most recently from ffa58f0 to fd043e8 on February 5, 2019 15:24
@bluca (Member Author) commented Feb 5, 2019

I've done the single-event change, and the test appears to work fine with multiple pipes, with different queue lengths

@bluca force-pushed the queue_monitor branch 4 times, most recently from 049a460 to 64d2681 on February 6, 2019 17:22
@bluca (Member Author) commented Feb 6, 2019

The valgrind build is still failing, locally too, most likely a race condition. Once that is solved, this should be ready

@sigiesec (Member) commented Feb 6, 2019

I ran with valgrind locally and could not get it to fail. But I saw that the basic v2 monitor test already fails sometimes on Travis with valgrind, where an unexpected event is received. Not yet sure if this is something the test should expect.

@bluca (Member Author) commented Feb 6, 2019

yeah, there are 2 extra events, a disconnect and a connect-retried, when running in valgrind - not sure why, probably the monitor terminates later due to the slowness

@bluca force-pushed the queue_monitor branch 2 times, most recently from 25da146 to bf45992 on February 6, 2019 23:12
bluca added 5 commits on February 7, 2019 16:24, including:
Solution: check for the available version instead
Solution: do it, as above 32 bytes they might be on the heap
@bluca (Member Author) commented Feb 7, 2019

and I can't reproduce it

@sigiesec (Member) commented Feb 8, 2019

I think the cause for the monitor test failure is the following:

  • while we initiate the closing of the client socket before the server socket, this is asynchronous, so the server socket might close first, causing the client to try to reconnect before being closed
  • this needs to be synchronized by first waiting for the ZMQ_EVENT_CLOSED on the client before proceeding to close the server socket (or generalizing the assertions by expecting this flow of events)

The same race condition probably already exists in the old test case, but happens less often.

@sigiesec (Member) commented Feb 8, 2019

Eh sorry, we don't get a ZMQ_EVENT_CLOSED on the client side at all. This is only emitted by the connecter at the moment, if the socket is closed before a connection is established. Not sure if this is really consistent.

We only get a ZMQ_EVENT_MONITOR_STOPPED at the moment.

@bluca (Member Author) commented Feb 8, 2019

The problem is that the test only enables the _STATS event, nothing else - so it can't be an issue of wrong event expected :-/

@sigiesec (Member) commented Feb 8, 2019

Ok, I think we are talking about two different things here. I will try to give the stats test a look later on.

@bluca (Member Author) commented Feb 8, 2019

Yeah, I fixed the other tests in a commit here - as you said, the events didn't match every now and then. Right now it's the IPv4 stats test that fails, with no discernible log message whatsoever, and it can't be reproduced locally :-/

@bluca (Member Author) commented Feb 8, 2019

It looks like it's timing out when doing the send in a loop; reducing the HWM seems to help, but now it breaks on OSX...

@bluca force-pushed the queue_monitor branch 4 times, most recently from 091e9c7 to c01eaec on February 8, 2019 23:21
@bluca (Member Author) commented Feb 8, 2019

Ok, I finally got it - it was a leftover from when the test was initially pub-sub and was busy-looping with _DONTWAIT. That was too much for valgrind, which just timed out. Changing to a blocking send, which is fine for push-pull, fixes the issue.

So, as far as I can see, this is finally ready
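For reference, roughly the shape of that change (a sketch with placeholder names such as push and buf, not the exact test code):

    //  Before (sketch): busy-looping on EAGAIN with ZMQ_DONTWAIT, a leftover
    //  from the pub-sub version of the test; too slow under valgrind.
    while (zmq_send (push, buf, sizeof buf, ZMQ_DONTWAIT) < 0 && zmq_errno () == EAGAIN)
        ;

    //  After (sketch): a plain blocking send, which is fine for PUSH/PULL.
    TEST_ASSERT_SUCCESS_ERRNO (zmq_send (push, buf, sizeof buf, 0));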

@bluca (Member Author) commented Feb 10, 2019

@sigiesec ping :-) does it look good now?

// The endpoints strings are not set on bind, set them here so that
// events can use them.
pipes[0]->set_endpoint_pair (engine_->get_endpoint ());
pipes[1]->set_endpoint_pair (engine_->get_endpoint ());
Member:

I am not sure; somehow it feels that the endpoint pair should be reversed on one of the pipes. But I can't see where this might have an effect.

bluca (Member Author):

I guess it depends on how one interprets the local and remote attributes - if the reference point is the application thread then yes, but if the reference point is the socket itself then no

Member:

Well yes, local and remote are the same wrt the socket. Still, the direction is different. Hm. Let's see when analysing the data from the monitoring :)

@sigiesec (Member) commented:
Yes, looks good to me. I added two minor comments, but these can be addressed later on :)
Hope I will be able to test this soon on our side :)

Solution: add API and ZMQ_EVENT_PIPES_STATS event which generates 2
values, one for the egress and one for the ingress pipes respectively.
Refactor the events code to be able to send multiple values.
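For later readers, a rough usage sketch of the result, assuming the draft names added in this PR (zmq_socket_monitor_versioned with event version 2 and zmq_socket_monitor_pipes_stats); error handling is omitted and names like ctx and socket are placeholders:

    void *monitor = zmq_socket (ctx, ZMQ_PAIR);

    //  Version-2 monitor events are needed, since ZMQ_EVENT_PIPES_STATS
    //  carries two values (egress and ingress queue counts).
    zmq_socket_monitor_versioned (socket, "inproc://monitor-pipes",
                                  ZMQ_EVENT_PIPES_STATS, 2, ZMQ_PAIR);
    zmq_connect (monitor, "inproc://monitor-pipes");

    //  Request the stats; a dummy poll/getsockopt on the monitored socket then
    //  lets the command be processed (see the discussion above), after which
    //  the event frames can be read from the monitor socket.
    zmq_socket_monitor_pipes_stats (socket);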
@bluca (Member Author) commented Feb 10, 2019

great - so can we merge now?

@sigiesec merged commit 119a258 into zeromq:master on Feb 11, 2019
@sigiesec deleted the queue_monitor branch on February 11, 2019 09:49
@sigiesec (Member) commented:
@bluca Merged it now, thanks a lot for working on this :)

@bluca (Member Author) commented Feb 11, 2019

thanks!

@viciious (Contributor) commented Nov 7, 2019

Sorry for the necro-post, but could someone please clarify the following: the docs say that "the pipe stats on the sender will not get updated until the receiver has processed at least lwm ((hwm + 1) / 2) messages". What happens if the receiver has been unavailable the whole time, e.g. the network has been down? Will the stats message report a queue length of 0?

@bluca (Member Author) commented Nov 7, 2019

yes

@viciious (Contributor) commented Nov 7, 2019

> yes

Oh well, this is quite useless for our monitoring purposes then...

What is "lwm ((hwm + 1) / 2)" anyway? Is it lwm multipled by ((hwm + 1) / 2) or something else?

@viciious (Contributor) commented Nov 7, 2019

> yes

Sorry, but looking at the code, it appears not to be the case.

    send_pipe_peer_stats (_peer, _msgs_written - _peers_msgs_read, socket_base_, ep);

We've been using a custom patched zeromq internally for quite a while, with similar queue-length monitoring logic. From what I can remember, _peers_msgs_read on the sender indeed only gets updated when the receiver has reached its lwm, but _msgs_written is incremented whenever the sender has queued another message. In case the receiver is offline, send_pipe_peer_stats will simply write _msgs_written to the stats message, as _peers_msgs_read is 0. Am I missing something here?
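A worked example of that reading, with hypothetical numbers:

    uint64_t msgs_written = 100;   //  sender queued 100 messages while the peer was offline
    uint64_t peers_msgs_read = 0;  //  receiver never came up, so it never reached its lwm
    uint64_t reported_queue = msgs_written - peers_msgs_read;  //  event reports 100, not 0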
