assertion failure in mailbox.cpp:82 #1108
Comments
- new code may help understand issue zeromq#1108 (zeromq#1108) - code cleanups
Could you try to reproduce the bug with the current libzmq master? Thanks. |
It seems the problem is solved. We could not reproduce this bug with the current master. Thank you |
I experienced the same problem with this exception. But after backporting the fix locally to 4.0.4 I now get a different assertion in signaler.c :
|
Is this on Windows OS? |
No, it's on Ubuntu 12.04 LTS. It just happened again; here is the call stack in the relevant ZeroMQ part:
Unfortunately I don't know the exact use case. It just happens from time to time. libzmq is 4.0.4 and all sockets are inproc ZMQ_PAIR. |
How often does this happen? Could you run your application under strace so that I can look what's happening? Thanks. |
While investigating I found another bug in our own part of the software. The bug led to resource exhaustion: file handles were running out. Once I fixed that bug I could not reproduce the assertion. Prior to my fix the assertion occurred every hour or so; now the software has been running for several hours. So currently the software runs fine with the backported ZeroMQ and my own fix. This was 10 days ago. As I have also heard no complaints from colleagues, I consider this no longer an issue. |
I can reproduce this error if I accidentally call zmsg_send against a socket from one thread while calling zmq_poll against the same socket from another thread. According to the ZMQ documentation, this is not currently allowed. It is an easy mistake to make if you expect your application to call send() from the primary application thread and have another asynchronous thread handling receive(). If you need asynchronous send and receive, you should perform both operations in the same thread: first zmq_poll() with a low timeout (5 or 10 ms), followed by all your pending zmsg_send operations in a single pass. The pending sends should sit in a protected queue so that your main application thread can write it and the receive thread can read it. Additionally, make sure you don't create or destroy your sockets in the constructor for your class, but in the context of the thread instead; that way the socket is entirely managed by the thread. I have found this approach eliminates the mailbox.cpp error. If anyone knows another model that works for thread-safe asynchronous handling in ZeroMQ, I would love to hear it. |
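A minimal sketch of the single-thread loop described above, assuming a mutex-protected queue of outbound payloads and an inproc PAIR socket; names such as outbox, io_loop and the "inproc://io" endpoint are illustrative, not part of libzmq or the commenter's code:

```cpp
#include <zmq.h>
#include <atomic>
#include <mutex>
#include <queue>
#include <string>

std::atomic<bool> running {true};
std::mutex outbox_mutex;
std::queue<std::string> outbox;     //  written by the main application thread

void io_loop (void *context)
{
    //  Create (and later destroy) the socket inside the I/O thread itself,
    //  so it is never touched by any other thread.
    void *socket = zmq_socket (context, ZMQ_PAIR);
    zmq_connect (socket, "inproc://io");

    while (running) {
        zmq_pollitem_t items[] = {{socket, 0, ZMQ_POLLIN, 0}};
        zmq_poll (items, 1, 10);                        //  low timeout (10 ms)

        if (items[0].revents & ZMQ_POLLIN) {
            zmq_msg_t msg;
            zmq_msg_init (&msg);
            zmq_msg_recv (&msg, socket, 0);
            //  handle the received message here (application-specific)
            zmq_msg_close (&msg);
        }

        //  Drain all pending sends in a single pass, still on this thread.
        std::lock_guard<std::mutex> lock (outbox_mutex);
        while (!outbox.empty ()) {
            zmq_send (socket, outbox.front ().data (), outbox.front ().size (), 0);
            outbox.pop ();
        }
    }

    zmq_close (socket);
}
```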
I don't know about a better solution, but I had to solve the same problem when interfacing with a third-party library which makes use of callback functions. As I don't know in which thread context our callback functions are called, I use the following approach:
So in fact I have a proxy thread in between the callback function and ZeroMQ. I use an internal pipe implementation but recently stumbled across pipe.c. Basically it is a thread-safe queue. |
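A compact sketch of such a pipe, assuming a condition-variable-protected queue: the callbacks (unknown thread context) push, and a single proxy thread pops and is the only thread that ever touches the ZeroMQ socket. The class and names are illustrative, not the pipe.c mentioned above:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

class pipe_t
{
public:
    //  Called from the third-party callback, whatever thread that happens to be.
    void push (std::string msg)
    {
        {
            std::lock_guard<std::mutex> lock (mutex_);
            queue_.push (std::move (msg));
        }
        cond_.notify_one ();
    }

    //  Called only from the proxy thread, which then forwards over ZeroMQ.
    std::string pop ()
    {
        std::unique_lock<std::mutex> lock (mutex_);
        cond_.wait (lock, [this] { return !queue_.empty (); });
        std::string msg = std::move (queue_.front ());
        queue_.pop ();
        return msg;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::queue<std::string> queue_;
};
```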
I get this error sometimes on Mac OS X (Yosemite); zmq 4.0.4 running under the Python bindings. |
Any sort of stack trace? Without it, there's essentially no way to diagnose. |
Yup, got this one right now: http://pastebin.com/fRH5paTm |
It seems most likely it's this call at line 81, triggering the assert (there seem to be a couple of different flavors of assert in this discussion, but this is consistent with hintjens' original bug) -
A data race in ypipe_t<>? Though I'm not sure what it would be, the x86 implementation of CAS seems fine -
GCC emits "lock: cmpxchgq" (q for quad word) for C11/C++11 atomic cas on x86_64, but it would seem the resulting opcodes for cmpxchgq and this implementation are the same 'f0 48 0f b1' so I doubt this is a problem. |
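For reference, a generic sketch of the compiler-intrinsic CAS being compared against the hand-rolled assembly; on x86_64, GCC and Clang lower this to a lock-prefixed cmpxchg on the full 64-bit word. This is not libzmq's own atomic pointer implementation:

```cpp
#include <atomic>

//  Returns the value held in `target` before the operation, mirroring the
//  usual cas() contract: `expected` is left unchanged on success and updated
//  to the observed value on failure, so it always holds the previous value.
void *compare_and_swap (std::atomic<void *> &target, void *expected, void *desired)
{
    target.compare_exchange_strong (expected, desired);
    return expected;
}
```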
As I mentioned in the past (above), it can easily be reproduced by calling the same socket in ZMQ from 2 different threads. This is not allowed. Could that be happening between Thread 0 and Thread 9 in your stack dump? |
@hurtonm @allendrennan I use CentOS 6.6 with zmq-4.0.4. I have two threads and two sockets: one pushes messages to the server, the other collects messages from the server with SUB. The two sockets are only used in their own threads. Is that OK? My program crashes. |
@rodgert I'm not sure that logic follows. Every existing CAS implementation for x86_64 that I have been able to find is using the quad word instruction. The assembler may emit the same opcode, but does it choose the same registers? If it chooses 32-bit registers to store/read the new value, that would be trouble. It might be making the assumption that it can safely use a 32-bit register for this 32-bit instruction. The effect of this, if we were truncating the CAS operation to 32-bits, would be to corrupt the pointer address if the pointer we were swapping in had a different set of 32 most significant bits. This would only appear in programs whose address space was >4G (or when swapping a stack pointer for a heap pointer or vice versa). This is consistent with the context in which I encounter this particular abort. I hit it about once every two months on 500+ servers. Every time I do, the resulting core is >4G indicating that we may very well have pointers floating around with different most significant bits. I'm preparing a patch that I believe should address the issue and I will make a pull request as soon as I am confident in the solution. |
Hmmm, perhaps. See - Playing around with -m32/-m64 certainly causes GCC to make different register choices, but the 64 bit variant still looks ok -
|
Neat tool, I'll have to bookmark that one. So, there goes that theory, but I'm still not convinced that the cmpxchgq instruction is unneeded on 64-bit platforms. Why did they even bother to add the 'cmpxchgq' instruction if the existing instruction correctly does a 64-bit CAS? Let me do some more research and a test or two. |
Thanks, a friend of mine put it together; I'll let him know. FWIW, I put code in to use the GCC intrinsics instead of the hand-rolled CAS.
|
So, I've tested this out every which way, and it appears that cmpxchg and cmpxchgq behave exactly the same way as you stated earlier. Forgive me for bringing you down this rabbit hole. So back to square one on the assert. We're still hitting this one on 4.1.2. It is exceedingly rare, still once every couple of months or so. The only other thing that comes to mind is the ABA problem. Have you any protections against that in your CAS implementation? I'll post a fresh core the next time one comes along, though it may be a bit. |
It is just a bare CAS (ypipe_t::c and yqueue_t::spare_chunk), there are none of the "usual suspects" in the way of ABA prevention AFAICT in the code. |
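To make the hazard concrete, a sketch of a version-counter ("tag") guard, one of the usual ABA mitigations; the types and names are illustrative and this is not proposed libzmq code. Note that std::atomic on a 16-byte struct may fall back to a lock unless the target supports a double-width CAS:

```cpp
#include <atomic>
#include <cstdint>

struct node_t { node_t *next; };

struct tagged_ptr_t
{
    node_t *ptr;
    std::uint64_t tag;      //  bumped on every successful swap
};

//  ABA: thread A reads head == X and is preempted; thread B pops X and Y,
//  frees X, then pushes a recycled node that happens to live at address X.
//  A plain pointer CAS by thread A still succeeds although the structure has
//  changed underneath it. Comparing (ptr, tag) makes the stale CAS fail.
bool pop_head (std::atomic<tagged_ptr_t> &head)
{
    tagged_ptr_t old_head = head.load ();
    while (old_head.ptr != nullptr) {
        tagged_ptr_t new_head = {old_head.ptr->next, old_head.tag + 1};
        if (head.compare_exchange_weak (old_head, new_head))
            return true;    //  old_head.ptr may now be reclaimed
    }
    return false;
}
```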
Is the following situation possible: If yes, cpipe being empty is legitimate and we could just retry instead of asserting? |
I am getting this same error on Windows 10 with libzmq 4.1.3:
|
Also receiving this fairly frequently on iOS. All zmq accesses are from the same thread. |
@jimkeir @JamesMGreene can you provide a stack trace? |
Hi, Thanks for the reply. Certainly can:
... and ...
This is using the zeromq4-x branch. I've only ever seen this happening in release code (i.e. to other people), so I've not been able to do any real diagnostics myself. The relevant code is all running from a single thread, and I'm receiving messages from a public source so definitely not sending myself. This also means that if the issue is on the sending end, there's nowt I can do about it. The section of code is:
zmq_AbortSocket is a regular socket created using "socketpair", to signal exit from another thread. Cheers, |
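For context, a hedged sketch of the kind of loop described: a receive-only ZMQ socket polled together with one end of the socketpair used as an abort signal. The names (receive_loop, abort_fd) are illustrative and this is not the code that was elided above:

```cpp
#include <zmq.h>

void receive_loop (void *zmq_socket, int abort_fd)
{
    zmq_pollitem_t items[] = {
        {zmq_socket, 0, ZMQ_POLLIN, 0},     //  data from the public source
        {NULL, abort_fd, ZMQ_POLLIN, 0}     //  exit signal from another thread
    };

    while (true) {
        if (zmq_poll (items, 2, -1) < 0)
            break;                          //  interrupted or context terminated

        if (items[1].revents & ZMQ_POLLIN)
            break;                          //  abort requested

        if (items[0].revents & ZMQ_POLLIN) {
            zmq_msg_t msg;
            zmq_msg_init (&msg);
            if (zmq_msg_recv (&msg, zmq_socket, 0) >= 0) {
                //  handle the message (application-specific)
            }
            zmq_msg_close (&msg);
        }
    }
}
```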
@jimkeir I'm actually interested in seeing the returned errno. Can you reproduce the issue? If you can, I will add a log message before the assert to see the errno. Can you compile from source? |
Hi, Yes, I'm compiling from source.
The return code from zmq_poll is irrelevant because zmq aborts, and from what I can see in the zmq code in mailbox.cpp and ypipe.hpp, the failure is a true/false decision based on the contents of a queue rather than anything that would set errno. Still, I'm now adding errno to the crash reporter so if there is something there, it will be reported. I've modified mailbox.cpp so that it returns -1 instead of using abort(), and added some logs to my code so that I can see how long it's been running to hopefully work out whether it's related to startup/shutdown. It's not something I can reproduce on demand though, so I won't get any more info until I publish a new version of the app. I'll feed back then. I won't be publishing for a little while, so if there's any more state you'd like captured, please let me know. |
Hitting same issue w/ 4.0.4. @hurtonm can you please suggest the version that has the fix? |
I've just pushed to here: https://github.com/jimkeir/zeromq4-x . It simply changes a zmq_assert to a normal error code return. If you follow the chain back up, comments for one function say that it's normal during shutdown for it to fail and I suspect this is what's happening. |
In our case, the socket disconnect is definitely caused by Windows KeepAlive logic - many tests showed that it happened at 7200s into the traffic run. We tried setting ZMQ_TCP_KEEPALIVE_IDLE (with a smaller value like 60s) through the ZMQ socket "setsockopt" to see if the crash interval changes - we traced into the ZMQ code and ZMQ does call the WinSock IOCtl function to set the parameter, but it didn't work. Not sure if turning off Windows socket KeepAlive helps; we haven't tried setting ZMQ_TCP_KEEPALIVE to 0. @jimkeir By replacing the zmq_assert with an error code return, does ZMQ handle the socket reconnect without crashing? |
Don't mess with asserts. If the TCP keepalive options don't work, try with the heartbeat options - they are the equivalent but at the zmtp protocol level, so bad network stack implementations don't mess with them. |
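A sketch of the ZMTP-level heartbeat options referred to (available since libzmq 4.2); the interval and timeout values here are only examples:

```cpp
#include <zmq.h>

void enable_heartbeats (void *socket)
{
    int ivl = 30000;        //  send a PING every 30 s
    int timeout = 60000;    //  drop the connection 60 s after an unanswered PING
    int ttl = 60000;        //  ask the peer to drop us after 60 s of our silence
    zmq_setsockopt (socket, ZMQ_HEARTBEAT_IVL, &ivl, sizeof ivl);
    zmq_setsockopt (socket, ZMQ_HEARTBEAT_TIMEOUT, &timeout, sizeof timeout);
    zmq_setsockopt (socket, ZMQ_HEARTBEAT_TTL, &ttl, sizeof ttl);
}
```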
Thanks @bluca, but I don't think either the KEEPALIVE or HEARTBEAT options for the ZMQ socket are related to the issue. In our test, the payload socket is full of live data during the traffic run. The idle socket that gets killed after 7200s (the Windows KeepAlive default idle time) is some internal socket used by ZMQ for commanding or management that is not exposed to application code. So, from application code, we tried to set the socket KEEPALIVE options, and we can see they are passed to the WinSock IOCtl calls correctly, but it doesn't make any difference: the "Reset by Peer" still happens after exactly 7200s. If you run TcpView and look at a running ZMQ application with the PUB-SUB model, a lot of socket connections with port numbers in the 5XXXX range are used by ZMQ, and there is no interface for application code to set options on these sockets. For us, changing the system-wide WinSock KA settings is not an option; these are end-users' computers, not machines in the server farm. We are trying to test the REQ-REP model only (much simpler than PUB-SUB) to see if there is any difference. |
Right, that will be the internal pipes - then yeah that won't help. |
This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions. |
We encounter this issue on arm64 (armv8) with 4.2.5; the assert fails on cpipe.read (cmd_). It occurs repeatedly. Could anyone help check? Thanks. |
The call stack is as follows: #9 0x0000ffff8fb49c3c in zmq_poll (items_=0xffff8fc57678 <g_astZMQItems>, nitems_=24, timeout_=-1) at src/zmq.cpp:796 |
Are you referencing a socket from multiple threads? |
thanks for your reply; |
This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 56 days. Thank you for your contributions. |
I can reproduce the issue by closing those "local" TCP connections with TCPView with 4.0.10. But I have since upgraded to libzmq 4.3.x, in which the internal connections seem to be implemented differently (for the same program I now have far fewer TCP connections than before). Also, any external close on the remaining sockets no longer causes an assertion failure. However, our application still crashed with 4.3.x. It turned out our code had a bug that was closing the same HANDLE twice (or more). And since Windows immediately reuses HANDLEs after you close them, anything that's opened after the close action is at risk (and that includes ZeroMQ). |
Environment
How to reproduce
The C++ application runs and, after several hours, zmq_assert raises an exception in zmq::mailbox_t::recv line 82 and the application crashes. The time isn't constant; it can be 4 hours or more. It occurs even if the application has not had any ZeroMQ TCP connections.
We found that ZeroMQ creates an internal TCP connection (several sockets). It passes signals between ZeroMQ threads or something similar, but these are different sockets from the publisher-subscriber connection.
The exception is raised when a disconnect event occurs on this internal TCP connection (we caught it in a network sniffer). We did not find who initiated the disconnect, and we did not see ZeroMQ closing the socket beforehand.
The disconnect event changes the status of the socket for the WinSock select() method; it signals a read operation:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms740141(v=vs.85).aspx
readfds:
Then ZeroMQ tries to read data from the socket:
And it raises an exception on the checks after reading, because there is nothing to read:
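A standalone sketch of the WinSock behaviour being described (not the elided libzmq code): after the peer drops the connection, select() marks the socket readable, and the subsequent recv() returns 0 (graceful close) or SOCKET_ERROR with WSAECONNRESET, i.e. the socket is "readable" with nothing to read:

```cpp
#include <winsock2.h>

//  Returns true if the readable event on `s` was actually a disconnect.
bool readable_event_was_disconnect (SOCKET s)
{
    fd_set readfds;
    FD_ZERO (&readfds);
    FD_SET (s, &readfds);

    //  A closed or reset connection makes the socket readable.
    if (select (0, &readfds, NULL, NULL, NULL) > 0 && FD_ISSET (s, &readfds)) {
        char buf;
        int nbytes = recv (s, &buf, 1, MSG_PEEK);
        return nbytes == 0 || nbytes == SOCKET_ERROR;   //  nothing to read: peer is gone
    }
    return false;
}
```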