assertion failure in mailbox.cpp:82 #1108

Closed
hintjens opened this issue Jun 25, 2014 · 41 comments
@hintjens
Member

Environment

  • OS - MS Windows 7 Ultimate SP1,
  • Compiler - Visual Studio 2013
  • Version of ZeroMQ - 4.0.4, in both variants: linking the precompiled DLL and compiling the ZMQ source code.
  • We also checked the previous stable version 3.2.4 and the development master on GitHub - the same problem occurs.

How to reproduce

The C++ application runs for several hours, then zmq_assert raises an exception in zmq::mailbox_t::recv at line 82 and the application crashes. The time to failure isn't constant; it can be 4 hours or more. It occurs even if the application has not had any ZeroMQ TCP connections.

We found that ZeroMQ creates internal TCP connections (several sockets), apparently to pass signals between ZeroMQ threads. These are different sockets from the ones used for the publisher-subscriber connection.

The exception is raised when a disconnect event occurs on this internal TCP connection (we captured it with a network sniffer). We couldn't determine who initiated the disconnect, and we did not see ZeroMQ closing the socket beforehand.

The disconnect event changes the socket's status for the WinSock select() call, which then reports a read event:

void zmq::select_t::loop ()
{
    . . .
    if (FD_ISSET (fds [i].fd, &readfds))
        fds [i].events->in_event ();          //  <-- line 197
    . . .
}

http://msdn.microsoft.com/en-us/library/windows/desktop/ms740141(v=vs.85).aspx

readfds:

  • If listen has been called and a connection is pending, accept will succeed.
  • Data is available for reading (includes OOB data if SO_OOBINLINE is enabled).
  • Connection has been closed/reset/terminated.

Then ZeroMQ tries to read data from the socket:

void zmq::io_thread_t::in_event ()
{
    . . .
    command_t cmd;
    int rc = mailbox.recv (&cmd, 0);                  //  <-- line 69
    . . .
}

And the check after the read raises an exception because there is nothing to read:

int zmq::mailbox_t::recv (command_t *cmd_, int timeout_)
{
    . . .
    //  Get a command.
    errno_assert (rc == 0);
    bool ok = cpipe.read (cmd_);
    zmq_assert (ok);                                   //  <-- line 82 (crashed here)
    return 0;
}
@hurtonm hurtonm self-assigned this Jul 1, 2014
hurtonm added a commit to hurtonm/libzmq that referenced this issue Jul 6, 2014
- new code may help understand issue zeromq#1108
  (zeromq#1108)
- code cleanups
@hurtonm
Member

hurtonm commented Jul 7, 2014

Could you try to reproduce the bug with the current libzmq master? Thanks.

@aleksndr

It seems the problem is solved. We could not reproduce this bug with the current master. Thank you

@sercxanto

I experienced the same problem with this exception.

But after backporting the fix locally to 4.0.4, I now get a different assertion in signaler.cpp:

Assertion failed: pfd.revents & POLLIN
(.../src/signaler.cpp:193)

@hurtonm
Member

hurtonm commented Aug 15, 2014

Is this on Windows OS?

@sercxanto

No, it's on Ubuntu 12.04 LTS.

It just happened again; here is the call stack in the relevant ZeroMQ part:

#0  0xb779e424 in __kernel_vsyscall ()
#1  0xb55db20f in __GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2  0xb55de855 in __GI_abort () at abort.c:91
#3  0xb5900f80 in zmq::zmq_abort(char const*) ()
from .../sys/lib/libzmq.so
#4  0xb591a4e8 in zmq::signaler_t::wait(int) ()
from .../sys/lib/libzmq.so
#5  0xb5904ca3 in zmq::mailbox_t::recv(zmq::command_t*, int) ()
from .../sys/lib/libzmq.so
#6  0xb591dcb1 in zmq::socket_base_t::process_commands(int, bool) ()
from .../sys/lib/libzmq.so
#7  0xb591b891 in zmq::socket_base_t::getsockopt(int, void*, unsigned int*) ()
from .../sys/lib/libzmq.so
#8  0xb5935438 in zmq_getsockopt () from .../sys/lib/libzmq.so
#9  0xb593684d in zmq_poll () from .../sys/lib/libzmq.so
#10 0xb59763e4 in zloop_start () from .../sys/lib/libczmq.so

Unfortunately I don't know the exact use case. It just happens from time to time. libzmq is 4.0.4 and all sockets are inproc ZMQ_PAIR.

@hurtonm
Member

hurtonm commented Aug 15, 2014

How often does this happen? Could you run your application under strace so that I can look what's happening? Thanks.

@sercxanto

While investigating I found another bug in our own part of the software. That bug led to resource exhaustion; file handles were running out.

Once I fixed it I could no longer reproduce the assertion. Prior to my fix the assertion occurred every hour or so; now the software has been running for several hours.

So currently the software runs fine with the backported ZeroMQ fix and my own fix.

That was 10 days ago. As I have also heard no complaints from colleagues, I no longer consider this an issue.

@allendrennan

I can reproduce this error if I accidentally call zmsg_send against a socket from one thread while calling zmq_poll against the same socket from another thread. According to the ZMQ documentation, this is not allowed. It is an easy mistake to make if you expect your application to call send() from the primary application thread and have another asynchronous thread handling receive(). If you need asynchronous send and receive, you should perform both operations in the same thread: first zmq_poll() with a low timeout (5 or 10 ms), followed by all your pending zmsg_send operations in a single pass. The pending sends should live in a protected queue that the main application thread writes and the receive thread reads.

Additionally you need to make sure you don't create or destroy your sockets in the constructor for your class, but instead in the context of the thread. That way the socket will be entirely managed by the thread. I have found this approach eliminates the mailbox.cpp error.
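A minimal sketch of the pattern I'm describing (the queue, helper names and timeouts here are illustrative, not anything from libzmq): one thread owns the socket, polls with a low timeout, then drains a mutex-protected queue of pending sends.

#include <zmq.h>
#include <mutex>
#include <queue>
#include <string>

std::mutex g_send_mutex;
std::queue<std::string> g_send_queue;   //  written by the main application thread

//  Called from the main application thread; never touches the ZMQ socket.
void enqueue_send (const std::string &msg)
{
    std::lock_guard<std::mutex> lock (g_send_mutex);
    g_send_queue.push (msg);
}

//  Runs entirely inside the thread that created (and owns) the socket.
void io_loop (void *socket)
{
    zmq_pollitem_t items [] = { { socket, 0, ZMQ_POLLIN, 0 } };
    while (true) {
        //  Poll with a low timeout so pending sends are flushed promptly.
        zmq_poll (items, 1, 10);

        if (items [0].revents & ZMQ_POLLIN) {
            zmq_msg_t msg;
            zmq_msg_init (&msg);
            if (zmq_msg_recv (&msg, socket, 0) != -1) {
                //  ... handle the incoming message ...
            }
            zmq_msg_close (&msg);
        }

        //  Flush everything queued by the application thread since the last pass.
        std::lock_guard<std::mutex> lock (g_send_mutex);
        while (!g_send_queue.empty ()) {
            const std::string &m = g_send_queue.front ();
            zmq_send (socket, m.data (), m.size (), 0);
            g_send_queue.pop ();
        }
    }
}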

If anyone knows another model that works for asynchronous handling in ZeroMQ that is thread safe, I would love to hear it.

@sercxanto

If anyone knows another model that works for asynchronous handling in ZeroMQ that is thread safe, I would love to hear it.

I don't know about a better solution, but I had to solve the same problem when interfacing with a third-party library which makes use of callback functions. As I don't know in which thread context our callback functions are called, I use the following approach:

  • Thirdparty lib calls callback function
  • The callback function creates a struct with the request/response data on the heap and passes the pointer to a message pipe
  • A dedicated receive thread reads the message pipe in a loop and passes the heap pointer to a ZeroMQ socket
  • The ZeroMQ socket in the main thread reads the data and calls the struct's destructor.
  • As it all happens in the same thread, the request/response data has to be parsed only once and only the pointer is passed around.

So in fact I have a proxy thread in between the callback function and ZeroMQ. I use an internal pipe implementation but recently stumbled across pipe.c. Basically it is a thread-safe queue.
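Roughly, the pointer-passing step looks like this (the Request type and helper names are illustrative, not our real code; the pointer travels as the message body over an inproc PAIR socket):

#include <zmq.h>

struct Request { /* parsed request/response data */ };

//  Proxy thread: forward a heap-allocated request to the main thread by
//  sending the raw pointer value as the message body.
void forward (void *pair_socket, Request *req)
{
    zmq_send (pair_socket, &req, sizeof req, 0);
}

//  Main thread: receive the pointer, process the request, then free it exactly once.
void receive_one (void *pair_socket)
{
    Request *req = NULL;
    zmq_recv (pair_socket, &req, sizeof req, 0);
    //  ... handle *req ...
    delete req;
}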

@mdkcore0

mdkcore0 commented Jan 6, 2015

I got this error a few times on Mac OS X (Yosemite); zmq 4.0.4 is running under the Python bindings.

@rodgert
Contributor

rodgert commented Jan 6, 2015

Any sort of stack trace? Without it, there's essentially no way to diagnose.

@mdkcore0

mdkcore0 commented Jan 7, 2015

Yup, got this one right now: http://pastebin.com/fRH5paTm

@rodgert
Contributor

rodgert commented Jan 7, 2015

It seems most likely it's this call at line 81 triggering the assert (there seem to be a couple of different flavors of assert in this discussion, but this is consistent with hintjens' original bug) -

bool ok = cpipe.read (cmd_);
zmq_assert (ok);

A data race in ypipe_t<>?

Though I'm not sure what it would be, the x86 implementation of CAS seems fine -

        T *old;
        __asm__ volatile (
            "lock; cmpxchg %2, %3"
            : "=a" (old), "=m" (ptr)
            : "r" (val_), "m" (ptr), "0" (cmp_)
            : "cc");
        return old;

GCC emits "lock: cmpxchgq" (q for quad word) for C11/C++11 atomic cas on x86_64, but it would seem the resulting opcodes for cmpxchgq and this implementation are the same 'f0 48 0f b1' so I doubt this is a problem.

@allendrennan

As I mentioned above, this can easily be reproduced by using the same ZMQ socket from two different threads, which is not allowed. Could that be happening between Thread 0 and Thread 9 in your stack dump?

@giskook

giskook commented Jan 26, 2015

@hurtonm @allendrennan I use CentOS 6.6 with zmq-4.0.4. I have two threads and two sockets: one PUSH socket to send messages to the server, and one SUB socket to collect messages from the server. Each socket is only used in its own thread. Is that OK? My program crashes.
======= Backtrace: =========
/lib64/libc.so.6[0x38fb075e66]
/usr/local/lib/libzmq.so.4(+0x1d933)[0x7f8cd4a80933]
/usr/local/lib/libzmq.so.4(+0x3dfc9)[0x7f8cd4aa0fc9]
/usr/local/lib/libzmq.so.4(+0x324a3)[0x7f8cd4a954a3]
/usr/local/lib/libzmq.so.4(+0x29f88)[0x7f8cd4a8cf88]
/usr/local/lib/libzmq.so.4(+0x19384)[0x7f8cd4a7c384]
/usr/local/lib/libzmq.so.4(+0x1848e)[0x7f8cd4a7b48e]
/usr/local/lib/libzmq.so.4(+0x37a46)[0x7f8cd4a9aa46]

@wcs1only
Contributor

@rodgert I'm not sure that logic follows. Every existing CAS implementation for x86_64 that I have been able to find is using the quad word instruction. The assembler may emit the same opcode, but does it choose the same registers? If it chooses 32-bit registers to store/read the new value, that would be trouble. It might be making the assumption that it can safely use a 32-bit register for this 32-bit instruction. The effect of this, if we were truncating the CAS operation to 32-bits, would be to corrupt the pointer address if the pointer we were swapping in had a different set of 32 most significant bits. This would only appear in programs whose address space was >4G (or when swapping a stack pointer for a heap pointer or vice versa).

This is consistent with the context in which I encounter this particular abort. I hit it about once every two months on 500+ servers. Every time I do, the resulting core is >4G indicating that we may very well have pointers floating around with different most significant bits.

I'm preparing a patch that I believe should address the issue and I will make a pull request as soon as I am confident in the solution.

@rodgert
Contributor

rodgert commented Aug 25, 2015

Hmmm, perhaps. See -

http://goo.gl/tUFwPR

Playing around with -m32/-m64 certainly causes GCC to make different register choices, but the 64 bit variant still looks ok -

mov rdx, QWORD PTR [rsi]
mov rax, QWORD PTR [rsi+8]
mov QWORD PTR [rsp-8], rdx
lock; cmpxchg rdx, QWORD PTR [rsp-8]

@wcs1only
Contributor

Neat tool, I'll have to bookmark that one. So, there goes that theory, but I'm still not convinced that the cmpxchgq instruction is unneeded on 64-bit platforms. Why did they even bother to add the 'cmpxchgq' instruction if the existing instruction correctly does a 64-bit CAS? Let me do some more research and a test or two.

@rodgert
Contributor

rodgert commented Aug 25, 2015

Thanks, a friend of mine put it together, I'll let him know he's got another fan :)

FWIW, I put code in to use the GCC intrinsics instead of the hand-rolled x86 assembly if autoconf detects them (should work for recent-ish GCC and Clang). I can't recall offhand if that change made the 4.1 cutoff though and I'm not currently in a position to check.
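For illustration, the intrinsic-based version is roughly this (a sketch using the __atomic builtins, not a verbatim copy of libzmq's atomic_ptr.hpp):

//  Returns the previous value of *ptr_, like the hand-rolled asm quoted earlier.
template <typename T>
T *cas (T **ptr_, T *cmp_, T *val_)
{
    T *expected = cmp_;
    //  Compiles to "lock cmpxchg" with correctly sized registers on x86_64,
    //  so the 32-bit-register concern discussed above cannot arise here.
    __atomic_compare_exchange_n (ptr_, &expected, val_,
        false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}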


@wcs1only
Contributor

wcs1only commented Sep 7, 2015

So, I've tested this out every which way, and it appears that cmpxchg and cmpxchgq behave exactly the same way as you stated earlier. Forgive me for bringing you down this rabbit hole.

So back to square one on the assert. We're still hitting this one on 4.1.2. It is exceedingly rare, still once every couple of months or so.

The only other thing that comes to mind is the ABA problem. Have you any protections against that in your CAS implementation?

I'll post a fresh core the next time one comes along, though it may be a bit.

@rodgert
Contributor

rodgert commented Sep 8, 2015

It is just a bare CAS (ypipe_t::c and yqueue_t::spare_chunk), there are none of the "usual suspects" in the way of ABA prevention AFAICT in the code.

@pijyoi
Contributor

pijyoi commented Sep 8, 2015

Is the following situation possible:
2 signaler events are waiting and 2 cmds are in cpipe. We consume 1 signaler event and 2 cmds. Now the 2nd signaler event has no corresponding cmd in cpipe.

If yes, cpipe being empty is legitimate and we could just retry instead of asserting?
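Something like this, perhaps (an untested sketch of the retry idea, not an actual patch against mailbox.cpp):

int zmq::mailbox_t::recv (command_t *cmd_, int timeout_)
{
    //  ... wait on the signaler as before ...

    //  If the pipe turns out to be empty, treat the signal as spurious and
    //  report EAGAIN so the caller can retry, instead of asserting.
    const bool ok = cpipe.read (cmd_);
    if (!ok) {
        errno = EAGAIN;
        return -1;
    }
    return 0;
}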

@JamesMGreene

I am getting this same error on Windows 10 with libzmq 4.1.3:

Assertion failed: ok (..\..\..\src\mailbox.cpp:82)

@jimkeir

jimkeir commented May 15, 2016

Also receiving this fairly frequently on iOS. All zmq accesses are from the same thread.

@somdoron
Member

@jimkeir @JamesMGreene can you provide a stack trace?

@jimkeir

jimkeir commented May 15, 2016

Hi,

Thanks for the reply. Certainly can:

0  libsystem_kernel.dylib         0x2140ec5c __pthread_kill + 8
1  libsystem_pthread.dylib        0x214b8733 pthread_kill + 62
2  libsystem_c.dylib              0x213a30ad abort + 108
3  EVA                            0x11a09f (Missing)
4  EVA                            0x11d889 zmq::mailbox_t::recv(zmq::command_t*, int) (mailbox.cpp:82)
5  EVA                            0x12b8cb zmq::socket_base_t::process_commands(int, bool) (socket_base.cpp:914)
6  EVA                            0x12b7a9 zmq::socket_base_t::getsockopt(int, void*, unsigned long*) (socket_base.cpp:295)
7  EVA                            0x138329 zmq_poll (zmq.cpp:758)
8  EVA                            0xb8197 __28-[EDDNHandler startDownload]_block_invoke (EDDNHandler.m:264)

... and ...

Crashed: com.apple.root.utility-qos
0  libsystem_kernel.dylib         0x182d9011c __pthread_kill + 8
1  libsystem_pthread.dylib        0x182e5cef8 pthread_kill + 112
2  libsystem_c.dylib              0x182d01dac abort + 140
3  EVA                            0x1000f493c zmq::fq_t::fq_t() (fq.cpp:29)
4  EVA                            0x1000f8458 zmq::mailbox_t::recv(zmq::command_t*, int) (mailbox.cpp:82)
5  EVA                            0x10010973c zmq::socket_base_t::process_commands(int, bool) (socket_base.cpp:914)
6  EVA                            0x10010958c zmq::socket_base_t::getsockopt(int, void*, unsigned long*) (socket_base.cpp:295)
7  EVA                            0x10011622c zmq_poll (zmq.cpp:758)
8  EVA                            0x10008bf40 __28-[EDDNHandler startDownload]_block_invoke (EDDNHandler.m:264)

This is using the zeromq4-x branch. I've only ever seen this happening in release code (i.e. to other people), so I've not been able to do any real diagnostics myself. The relevant code is all running from a single thread, and I'm receiving messages from a public source so definitely not sending myself. This also means that if the issue is on the sending end, there's nowt I can do about it.

The section of code is:

            zmq_pollitem_t pollItems[] = {
                { zmq_Socket, 0, ZMQ_POLLIN, 0 },
                { NULL, zmq_AbortSocket[0], ZMQ_POLLIN, 0 },
            };

            while (1)
            {
                zmq_msg_t message;

                rc = zmq_poll(pollItems, 2, -1);

                if (pollItems[0].revents && ZMQ_POLLIN)
                {
                    rc = zmq_msg_init (&message);
                    assert (rc == 0);

                    if (-1 != zmq_msg_recv(&message, zmq_Socket, 0))
                    {

zmq_AbortSocket is a regular socket created using "socketpair", to signal exit from another thread.
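For completeness, the abort socket is set up roughly like this (a POSIX-only sketch; the helper function names are made up):

#include <sys/socket.h>
#include <unistd.h>

int zmq_AbortSocket[2];   //  [0] is polled as a raw fd alongside the ZMQ socket

void setup_abort_socket (void)
{
    socketpair (AF_UNIX, SOCK_STREAM, 0, zmq_AbortSocket);
}

//  Called from another thread to make zmq_poll () wake up so the loop can exit.
void request_abort (void)
{
    char byte = 1;
    write (zmq_AbortSocket[1], &byte, 1);
}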

Cheers,
Jim

@somdoron
Member

somdoron commented May 16, 2016

@jimkeir I'm actually interested to see the returned errno. Can you reproduce the issue? If you can, I will create a log message before the assert to see the errno. Can you compile from source?

@jimkeir

jimkeir commented May 16, 2016

Hi,

Yes, I'm compiling from source.

I'm actually interested to see the returned errno

The return code from zmq_poll is irrelevant because zmq aborts, and from what I can see in the zmq code in mailbox.cpp and ypipe.hpp, the failure is a true/false decision based on the contents of a queue rather than anything that would set errno. Still, I'm now adding errno to the crash reporter so if there is something there, it will be reported.

I've modified mailbox.cpp so that it returns -1 instead of using abort(), and added some logs to my code so that I can see how long it's been running to hopefully work out whether it's related to startup/shutdown. It's not something I can reproduce on demand though, so I won't get any more info until I publish a new version of the app. I'll feed back then. I won't be publishing for a little while, so if there's any more state you'd like captured, please let me know.

@schandok

Hitting the same issue with 4.0.4. @hurtonm can you please suggest the version that has the fix?

@jimkeir

jimkeir commented Jun 13, 2018

I've just pushed to here: https://github.com/jimkeir/zeromq4-x . It simply changes a zmq_assert to a normal error code return. If you follow the chain back up, comments for one function say that it's normal during shutdown for it to fail and I suspect this is what's happening.

@mxcoppell

In our case, the socket disconnect is definitely caused by Windows KeepAlive logic - many tests showed that it happened at 7200s into the traffic run.

We tried to set ZMQ_TCP_KEEPALIVE_IDLE (with a smaller value like 60 s) through the ZMQ socket's setsockopt to see if the crash interval changes - we traced into the ZMQ code and ZMQ does call the WinSock IOCtl function to set the parameter, but it didn't work.

Not sure if turning off Windows socket KeepAlive helps; we haven't tried setting ZMQ_TCP_KEEPALIVE to 0.

@jimkeir By replacing the zmq_assert with an error code return, does ZMQ handle the socket reconnect without crashing?

@bluca
Member

bluca commented Aug 7, 2018

Don't mess with asserts. If the TCP keepalive options don't work, try the heartbeat options - they are the equivalent at the ZMTP protocol level, so bad network stack implementations don't mess with them.
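For example (illustrative values; the ZMQ_HEARTBEAT_* options need a libzmq recent enough to ship them):

#include <zmq.h>

void enable_zmtp_heartbeats (void *socket)
{
    int ivl = 30000;      //  send a ZMTP PING every 30 s
    int timeout = 10000;  //  treat the peer as dead 10 s after an unanswered PING
    int ttl = 60000;      //  ask the peer to drop us if it hears nothing for 60 s
    zmq_setsockopt (socket, ZMQ_HEARTBEAT_IVL, &ivl, sizeof ivl);
    zmq_setsockopt (socket, ZMQ_HEARTBEAT_TIMEOUT, &timeout, sizeof timeout);
    zmq_setsockopt (socket, ZMQ_HEARTBEAT_TTL, &ttl, sizeof ttl);
}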

@mxcoppell

Thanks @bluca, but I don't think either the KEEPALIVE or the HEARTBEAT options for a ZMQ socket are related to the issue.

In our test, the payload socket is full of live data during the traffic run. The idle socket that gets killed after 7200 s (the Windows KeepAlive default idle time) is some internal socket used by ZMQ for commands or management, and it is not exposed to application code.

So, from application code, we tried to set the socket KEEPALIVE options; we can see they are applied by the WinSock IOCtl calls correctly, but it makes no difference. The "Reset by Peer" still happens after exactly 7200 s.

If you run TcpView and look at a running ZMQ application with the PUB-SUB model, a lot of socket connections with port numbers in the 5XXXX range are used by ZMQ. There is no interface for application code to set options for these sockets. And for us, changing the system-wide WinSock keepalive settings is not an option; these are end users' computers, not machines in a server farm.

We are trying to test REQ-REP model only (much simpler than PUB-SUB) to see if there is any difference.

@bluca
Member

bluca commented Aug 7, 2018

Right, that will be the internal pipes - then yeah that won't help.

@stale

stale bot commented Aug 7, 2019

This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions.

@stale stale bot added the stale label Aug 7, 2019
@yuekunjava

yuekunjava commented Aug 19, 2019

We encounter this issue on arm64 (ARMv8) with 4.2.5; the assert fails on cpipe.read (cmd_). It occurs repeatedly. Could anyone help check? Thanks.

@stale stale bot removed the stale label Aug 19, 2019
@yuekunjava

The call stack is as follows:
#0 0x0000ffff8ee0e36c in abort () from /lib64/libc.so.6
#1 0x0000ffff8faf0888 in zmq::zmq_abort (errmsg_=0xffff8fb54d60 "ok") at src/err.cpp:87
#2 0x0000ffff8faf677c in zmq::mailbox_t::recv (this=0xffff74001280, cmd_=0xffff7e7fb640, timeout_=0)
at src/mailbox.cpp:99
#3 0x0000ffff8fb23cd0 in zmq::socket_base_t::process_commands (this=0xffff74000b20, timeout_=0, throttle_=false)
at src/socket_base.cpp:1341
#4 0x0000ffff8fb207b4 in zmq::socket_base_t::getsockopt (this=0xffff74000b20, option_=15, optval_=0xffff7e7fb75c,
optvallen_=0xffff7e7fb760) at src/socket_base.cpp:413
#5 0x0000ffff8fb4cc78 in zmq::socket_poller_t::check_events (this=0xffff7e7fb880, events_=0xffff741208d0,
n_events_=24) at src/socket_poller.cpp:422
#6 0x0000ffff8fb4d1b4 in zmq::socket_poller_t::wait (this=0xffff7e7fb880, events_=0xffff741208d0, n_events_=24,
timeout_=-1) at src/socket_poller.cpp:588
#7 0x0000ffff8fb4a3fc in zmq_poller_wait_all (poller_=0xffff7e7fb880, events_=0xffff741208d0, n_events=24,
timeout_=-1) at src/zmq.cpp:1339
#8 0x0000ffff8fb4adb0 in zmq_poller_poll (items_=0xffff8fc57678 <g_astZMQItems>, nitems_=24, timeout_=-1)
at src/zmq.cpp:749

#9 0x0000ffff8fb49c3c in zmq_poll (items_=0xffff8fc57678 <g_astZMQItems>, nitems_=24, timeout_=-1) at src/zmq.cpp:796

@bluca
Member

bluca commented Aug 20, 2019

Are you referencing a socket from multiple threads?

@yuekunjava

yuekunjava commented Aug 20, 2019

Thanks for your reply.
I used it in one thread, and the same code worked fine on x86.
The macro "ZMQ_HAVE_ATOMIC_INTRINSICS" takes effect in the CAS mechanism.

@stale

stale bot commented Aug 21, 2020

This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions.

@stale stale bot added the stale label Aug 21, 2020
@stale stale bot closed this as completed Oct 23, 2020
@edwinvp

edwinvp commented Feb 4, 2021

I can reproduce the issue by closing those "local" TCP connections with TCPView with 4.0.10. But I have since upgraded to libzmq 4.3.x, in which the internal connections seem to be implemented differently (for the same program I now have far fewer TCP connections than before). Also, an external close on the remaining sockets no longer causes an assertion failure.

However, our application still crashed with 4.3.x. It turns out our code had a bug that was closing the same HANDLE twice (or more). And since Windows immediately reuses HANDLEs after you close them, anything that's opened after the close is at risk (and that includes ZeroMQ).
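To illustrate the failure mode (hypothetical handles, not our actual code):

#include <windows.h>

void double_close_hazard ()
{
    HANDLE h = CreateEventW (NULL, FALSE, FALSE, NULL);
    CloseHandle (h);
    //  ... some other component (libzmq included) opens a new handle; Windows
    //  may hand out the same numeric value that h used to hold ...
    CloseHandle (h);   //  BUG: closes whatever unrelated object now owns that value
}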
