Exception on shutdown #9

skuntz · 2021-11-11T22:45:43Z

We are sometimes seeing an exception on shutdown with the latest redisgraph. (@jasonriedy can chime in). For reference, I do not have this issue on condor6 with my old build.

condor6:n0: /root/skuntz/RG_bulk
redis-server v6.0.9
https://github.com/skuntz/RedisGraph/tree/skkBulkTiming
I'm not sure how to tell which sc-gc api version since it was compiled into the redisgraph.so May 26.
I tried two different LGB versions (May version, the one Minoo compiled a week or so ago) and both worked, so that doesn't appear to be the issue.

jasonriedy · 2021-11-15T13:50:51Z

I am 99.9% certain this is in my changes to the RedisGraph dumping code and not the latest sc-driver, @yujenjuan & @mparivash1234 . The latest sc-driver recovers from this. I need to poke at more of the test suite and ensure it uses shutdown nosave or the Python equivalent.

How to reproduce given the attached:
hang-test.tar.gz

. ./condor6-setup.sh

# rsync -a /var/log/emutechnology log-0
./redis-server --loadmodule module/redisgraph.so >server-log.1 2>&1 &
sleep 30
cat <<EOF | redis-cli # This works.
graph.query test "CREATE (a:person {name:'A'}), (b:person {name:'B'})"
shutdown
EOF
wait
ls -l dump.rdb 

# rsync -a /var/log/emutechnology log-1
./redis-server --loadmodule module/redisgraph.so >server-log.2 2>&1 &
sleep 30
cat <<EOF | redis-cli # This dies.
graph.query test "CREATE (a:person {name:'A'}), (b:person {name:'B'})"
graph.query test "MATCH (src:person) CREATE (src)-[e:knows]->(dest {name:'C'}) RETURN src,e,dest ORDER BY ID(src) DESC LIMIT 1"
shutdown
EOF
wait
rm -f dump.rdb

# rsync -a /var/log/emutechnology log-2
./redis-server --loadmodule module/redisgraph.so >server-log.3 2>&1 &
sleep 30
cat <<EOF | redis-cli # This also dies.
graph.query test "CREATE (a:person {name:'A'}), (b:person {name:'B'})"
graph.query test "MATCH (src:person) CREATE (src)-[e:knows]->(dest {name:'C'}) RETURN src,e,dest ORDER BY ID(src) DESC LIMIT 1"
shutdown
EOF
wait

./redis-server --loadmodule module/redisgraph.so 2>&1 | tee server-log.4 # crashes because dump.rdb is junk.

jasonriedy · 2021-11-15T14:31:37Z

I spoke too soon. Something still is leaving this in the "application is already running" state. augh. I don't know how that didn't show up above...
OH! That last segfault is after loading the redisgraph.so module, so after the mwx side is run. But then redis crashes. It's fancy ASCII graphic crash thingy runs, but it doesn't bother to unload the modules. So I need to fetch clean logs before and after that step if needed.
So sc-driver needs to track the process using it, or we need a SC-side wrapper like multinode_exec that handles things properly when the child dies.

yujenjuan · 2021-11-15T14:34:36Z

Please leave system in that hang state the next time you run into it so I can do a quick examination and figure out what the problem is.

jasonriedy · 2021-11-15T14:39:04Z

Ok, node 7 is in that state.

yujenjuan · 2021-11-15T14:42:03Z

I'd like to see when it gets into that state the first time, not recreated afterwards. Basically, the is a mutex that isn't given up when something crashes.

jasonriedy · 2021-11-15T14:43:01Z

I rebooted before recreating it. All I need for recreating it is the dump.rdb in /root/jriedy/hang-test .

jasonriedy · 2021-11-15T20:44:30Z

A simplified way to re-create this issue... From the directory in the attached file, ./redis-server --loadmodule module/redisgraph.so . That will try to load the broken dump file, will segfault, and will leave the SC/GC interface unable to run another file. Probably needs the LD_LIBRARY_PATH addition.
hang-test-simplified.tar.gz

yujenjuan · 2021-11-15T20:56:39Z

When redis-server crashes w/ segfault, nothing is sent to sc-gc-api to tell sc-gc-api to shut things down. As a result, 2 named semaphores are left open in /dev/shm. I don't see a way for sc-gc-api to intercept such segfault signals and so I think you may have to clean up manually by doing rm -f /dev/shm/*

jasonriedy · 2021-11-16T00:58:38Z

It looks like the two semaphores in question could be replaced by files with locking. Those would clean themselves up on a crash.

yujenjuan · 2021-11-16T01:09:59Z

That's a good bit of low level surgery to do for sc-driver. I don't want to revisit this at this point with other priorities.

jasonriedy · 2021-11-16T01:21:04Z

If I find myself waiting on other things, I may poke at it. Assuming it's only used in sc-driver, there aren't many calls to sem_*. sem_wait becomes flock(fd, LOCK_EX) and sem_post becomes flock(fd, LOCK_UN). sem_trywait ors in LOCK_NB and checks errno.

yujenjuan · 2021-11-16T01:23:54Z

These two are not the semaphores used in sc-gc-api. These 2 are used at the lowest level of sc-driver to govern access to hardware.

jasonriedy · 2021-11-16T01:28:32Z

So what does deleting files in /dev/shm do? Those are only used in sc-driver. The x10_device one is low-level, I'm sure, but the gossamer64 one is what controls kernel launching.

yujenjuan · 2021-11-16T02:01:38Z

These 2 are not deleted by sc-driver programs at exit on purpose since I have no way of knowing whether there is another process that's running (for example, running emu_diagnostic_tool to peek & poke registers and memories while emu_handler_and_loader is running) and have these open. When everything behaves properly, the semaphores are closed on exit.

What happens w/ redisgraph.so segfaulting is that the gossamer semaphore is open, but not closed (since it never got to run its exit function to close them. When you delete them from command line, you force them closed. Since there are no other processes using them, they get closed and deleted. These are then recreated and opened on next invocation of a sc-driver program.

The x10_device semaphore governs individual register & memory accesses while the gossamer64 semaphore governs one level above it w/ gossamer interface. Both are used by sc-driver at low level and they are in many ways redundant, but since we are moving to pfi on x86 and a good amount of low level code will be completely different, I figure I'll think about whether we can do with just one.

yujenjuan · 2021-11-16T13:09:05Z

After examining the code further, I think the gossamer semaphore can be replaced with an advisory file lock easily, so I am going to make the change.

yujenjuan · 2021-11-16T15:57:08Z

Well, changing from semaphore to lock doesn't make it any more robust. When redis-server segfaults, there is always the possibility that sc-driver is in the middle of accessing a register on hardware and thus an x10 semaphore may not be given up. Since I've already done the work to convert gossamer semaphore to file locks, I am not reverting back.

sc-driver is made to give you more clue that an interface isn't shutdown cleanly.

You'll have to manually delete the lock file at /var/lock/gossamer64.lock and the semaphore /dev/shm/sem.x10_device_semaphore to clean up. If not, the next time you run, you could hang (but you can ctrl-C to get out)

The bottom line is that the calling program has to be bug free. It should not be sc-driver or sc-gc-api's job to clean up when the caller crashes.

skuntz assigned yujenjuan and jasonriedy Nov 11, 2021

jasonriedy mentioned this issue Nov 15, 2021

Upon shutdown the Stated redis-serve with the redisgraph module, got EXCEPTION=0x5 #10

Open

yujenjuan added the wontfix This will not be worked on label Nov 16, 2021

jasonriedy removed their assignment Aug 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exception on shutdown #9

Exception on shutdown #9

skuntz commented Nov 11, 2021

jasonriedy commented Nov 15, 2021

jasonriedy commented Nov 15, 2021

yujenjuan commented Nov 15, 2021

jasonriedy commented Nov 15, 2021

yujenjuan commented Nov 15, 2021

jasonriedy commented Nov 15, 2021

jasonriedy commented Nov 15, 2021

yujenjuan commented Nov 15, 2021

jasonriedy commented Nov 16, 2021

yujenjuan commented Nov 16, 2021

jasonriedy commented Nov 16, 2021

yujenjuan commented Nov 16, 2021 •

edited

Loading

jasonriedy commented Nov 16, 2021

yujenjuan commented Nov 16, 2021

yujenjuan commented Nov 16, 2021

yujenjuan commented Nov 16, 2021

Exception on shutdown #9

Exception on shutdown #9

Comments

skuntz commented Nov 11, 2021

jasonriedy commented Nov 15, 2021

jasonriedy commented Nov 15, 2021

yujenjuan commented Nov 15, 2021

jasonriedy commented Nov 15, 2021

yujenjuan commented Nov 15, 2021

jasonriedy commented Nov 15, 2021

jasonriedy commented Nov 15, 2021

yujenjuan commented Nov 15, 2021

jasonriedy commented Nov 16, 2021

yujenjuan commented Nov 16, 2021

jasonriedy commented Nov 16, 2021

yujenjuan commented Nov 16, 2021 • edited Loading

jasonriedy commented Nov 16, 2021

yujenjuan commented Nov 16, 2021

yujenjuan commented Nov 16, 2021

yujenjuan commented Nov 16, 2021

yujenjuan commented Nov 16, 2021 •

edited

Loading