-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exception on shutdown #9
Comments
I am 99.9% certain this is in my changes to the RedisGraph dumping code and not the latest sc-driver, @yujenjuan & @mparivash1234 . The latest sc-driver recovers from this. I need to poke at more of the test suite and ensure it uses shutdown nosave or the Python equivalent. How to reproduce given the attached:
|
I spoke too soon. Something still is leaving this in the "application is already running" state. augh. I don't know how that didn't show up above... |
Please leave system in that hang state the next time you run into it so I can do a quick examination and figure out what the problem is. |
Ok, node 7 is in that state. |
I'd like to see when it gets into that state the first time, not recreated afterwards. Basically, the is a mutex that isn't given up when something crashes. |
I rebooted before recreating it. All I need for recreating it is the dump.rdb in /root/jriedy/hang-test . |
A simplified way to re-create this issue... From the directory in the attached file, |
When redis-server crashes w/ segfault, nothing is sent to sc-gc-api to tell sc-gc-api to shut things down. As a result, 2 named semaphores are left open in /dev/shm. I don't see a way for sc-gc-api to intercept such segfault signals and so I think you may have to clean up manually by doing |
It looks like the two semaphores in question could be replaced by files with locking. Those would clean themselves up on a crash. |
That's a good bit of low level surgery to do for sc-driver. I don't want to revisit this at this point with other priorities. |
If I find myself waiting on other things, I may poke at it. Assuming it's only used in sc-driver, there aren't many calls to sem_*. sem_wait becomes flock(fd, LOCK_EX) and sem_post becomes flock(fd, LOCK_UN). sem_trywait ors in LOCK_NB and checks errno. |
These two are not the semaphores used in sc-gc-api. These 2 are used at the lowest level of sc-driver to govern access to hardware. |
So what does deleting files in /dev/shm do? Those are only used in sc-driver. The x10_device one is low-level, I'm sure, but the gossamer64 one is what controls kernel launching. |
These 2 are not deleted by sc-driver programs at exit on purpose since I have no way of knowing whether there is another process that's running (for example, running emu_diagnostic_tool to peek & poke registers and memories while emu_handler_and_loader is running) and have these open. When everything behaves properly, the semaphores are closed on exit. What happens w/ redisgraph.so segfaulting is that the gossamer semaphore is open, but not closed (since it never got to run its exit function to close them. When you delete them from command line, you force them closed. Since there are no other processes using them, they get closed and deleted. These are then recreated and opened on next invocation of a sc-driver program. The x10_device semaphore governs individual register & memory accesses while the gossamer64 semaphore governs one level above it w/ gossamer interface. Both are used by sc-driver at low level and they are in many ways redundant, but since we are moving to pfi on x86 and a good amount of low level code will be completely different, I figure I'll think about whether we can do with just one. |
After examining the code further, I think the gossamer semaphore can be replaced with an advisory file lock easily, so I am going to make the change. |
Well, changing from semaphore to lock doesn't make it any more robust. When redis-server segfaults, there is always the possibility that sc-driver is in the middle of accessing a register on hardware and thus an x10 semaphore may not be given up. Since I've already done the work to convert gossamer semaphore to file locks, I am not reverting back. sc-driver is made to give you more clue that an interface isn't shutdown cleanly. You'll have to manually delete the lock file at /var/lock/gossamer64.lock and the semaphore /dev/shm/sem.x10_device_semaphore to clean up. If not, the next time you run, you could hang (but you can ctrl-C to get out) The bottom line is that the calling program has to be bug free. It should not be sc-driver or sc-gc-api's job to clean up when the caller crashes. |
We are sometimes seeing an exception on shutdown with the latest redisgraph. (@jasonriedy can chime in). For reference, I do not have this issue on condor6 with my old build.
condor6:n0: /root/skuntz/RG_bulk
redis-server v6.0.9
https://github.com/skuntz/RedisGraph/tree/skkBulkTiming
I'm not sure how to tell which sc-gc api version since it was compiled into the redisgraph.so May 26.
I tried two different LGB versions (May version, the one Minoo compiled a week or so ago) and both worked, so that doesn't appear to be the issue.
The text was updated successfully, but these errors were encountered: