Assertion failure in zeromq #34

Closed
trws opened this issue Oct 6, 2014 · 3 comments

trws commented Oct 6, 2014

The actual error printed is:

Assertion failed: ok (mailbox.cpp:82)

After which the KAP benchmark test aborts during MPI_Init() and dumps core files.

Fortunately, the cmb also dumps core files, which are hopefully more useful. There seems to be just one core file per failure, and they all show the same backtrace (apart from specific memory addresses, etc.).

#0  0x00002aaaab9d8635 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00002aaaab9d9e15 in abort () at abort.c:92
#2  0x00002aaaab0f8239 in ?? () from /usr/lib64/libzmq.so.3
#3  0x00002aaaab0fb7a7 in ?? () from /usr/lib64/libzmq.so.3
#4  0x00002aaaab110c9b in ?? () from /usr/lib64/libzmq.so.3
#5  0x00002aaaab111078 in ?? () from /usr/lib64/libzmq.so.3
#6  0x00002aaaab1250fa in ?? () from /usr/lib64/libzmq.so.3
#7  0x00002aaaab3513c5 in zframe_send () from /usr/lib64/libczmq.so.1
#8  0x00002aaaab357e9d in zmsg_send () from /usr/lib64/libczmq.so.1
#9  0x0000000000408688 in cmb_pub_event (ctx=0x7fffffffcf40, event=0x7fffffffcea0) at ../../../flux-core/src/broker/cmbd.c:1420
#10 0x0000000000409e45 in hb_cb (zl=0x624410, timer_id=1, ctx=0x7fffffffcf40) at ../../../flux-core/src/broker/cmbd.c:1778
#11 0x00002aaaab355cad in zloop_start () from /usr/lib64/libczmq.so.1
#12 0x00000000004054b5 in main (argc=5, argv=0x7fffffffd218) at ../../../flux-core/src/broker/cmbd.c:477
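
Since the backtraces are said to match apart from memory addresses, one way to check that mechanically is to mask the addresses and compare what is left. This is only a minimal sketch: the helper names normalize_bt and same_bt are hypothetical, and it assumes each backtrace has already been saved to a text file (for example from gdb's bt output).

```sh
#!/bin/sh
# normalize_bt (hypothetical helper): replace every hex value such as
# 0x00002aaaab9d8635 with the token ADDR, so that frame addresses and
# pointer-valued arguments no longer differ between cores.
normalize_bt() {
  sed 's/0x[0-9a-fA-F][0-9a-fA-F]*/ADDR/g' "$@"
}

# same_bt (hypothetical helper): succeed only if two saved backtraces
# are identical once their addresses have been masked.
same_bt() {
  a=$(normalize_bt "$1")
  b=$(normalize_bt "$2")
  [ "$a" = "$b" ]
}
```

With two saved backtraces (file names assumed), `same_bt core-a.bt core-b.bt && echo "same backtrace"` prints the message only when the frames, functions, and source locations agree.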

I was running flux commit 3b67cb3 with Jim's additional patch to deal with the named socket issue when launching kap with srun; the patch basically removes the per-rank component of the socket's path.

The run configuration is a pre-allocated 32-node slurm job (so it could just as easily be launched as a batch job), running this flux command: ./flux start -M barrier -N 32 -s 32 <absolute path to run script>. The run script that launches KAP is below.

#! /bin/sh

MY_TCC=512
MY_N=32
MY_P=1
MY_C=512
MY_V=64
MY_K=/g/g12/scogland/projects/flux/build/src/test/kap/kap
MY_D=/g/g12/scogland/projects/flux/data/powers-of-two/test-32/T.512:P.1:C.512:V.64:A.1
MY_A=1
RUNS=1
VARY_SEQUENCE=1

cd $MY_D
t=$(date)
echo "$t\n"

sleep 3

for i in $(seq 1 $RUNS) ; do
  if [ $VARY_SEQUENCE -ne 0 ] ; then
    SEQUENCE_NUM=$i
  else
    SEQUENCE_NUM=0
  fi
  COMMAND=" srun -N$MY_N -n$MY_TCC --distribution=cyclic $MY_K --instance-num=$SEQUENCE_NUM -l --nproducers=$MY_P --nconsumers=$MY_C --value-size=$MY_V --cons-acc-count=$MY_A"
  mkdir run-$i
  pushd run-$i
  $COMMAND
  popd
done

t=$(date)
echo "$t \n"

garlick commented Oct 7, 2014

This has to be rank 0 because hb_cb() is only invoked there, from a zloop timer to generate heartbeat event messages.

This issue has been reported before by the zeromq community, though it's not clear to me whether the reported contexts are anything like ours (Windows OS, in the shutdown path, etc.):
zeromq/libzmq#1108
http://lists.zeromq.org/pipermail/zeromq-dev/2014-March/025705.html
zeromq/libzmq#1193

@allendrennan

See my comment here,
zeromq/libzmq#1108

garlick added the bug label Oct 16, 2014

garlick commented Dec 28, 2016

This is pretty ancient and I'm not aware that we're seeing this anymore, so closing. Reopen if it recurs.

garlick closed this as completed Dec 28, 2016
grondo added a commit to grondo/flux-core that referenced this issue Dec 12, 2019
libutil/sigcert: use kv serialization not JSON