After the error is printed, the KAP benchmark test aborts during MPI_Init() and dumps core files.
Fortunately, the cmb also dumps core files, which are hopefully more useful. There appears to be just one core file per failure, and they all show the same backtrace (apart from specific memory addresses, etc.):
#0 0x00002aaaab9d8635 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00002aaaab9d9e15 in abort () at abort.c:92
#2 0x00002aaaab0f8239 in ?? () from /usr/lib64/libzmq.so.3
#3 0x00002aaaab0fb7a7 in ?? () from /usr/lib64/libzmq.so.3
#4 0x00002aaaab110c9b in ?? () from /usr/lib64/libzmq.so.3
#5 0x00002aaaab111078 in ?? () from /usr/lib64/libzmq.so.3
#6 0x00002aaaab1250fa in ?? () from /usr/lib64/libzmq.so.3
#7 0x00002aaaab3513c5 in zframe_send () from /usr/lib64/libczmq.so.1
#8 0x00002aaaab357e9d in zmsg_send () from /usr/lib64/libczmq.so.1
#9 0x0000000000408688 in cmb_pub_event (ctx=0x7fffffffcf40, event=0x7fffffffcea0) at ../../../flux-core/src/broker/cmbd.c:1420
#10 0x0000000000409e45 in hb_cb (zl=0x624410, timer_id=1, ctx=0x7fffffffcf40) at ../../../flux-core/src/broker/cmbd.c:1778
#11 0x00002aaaab355cad in zloop_start () from /usr/lib64/libczmq.so.1
#12 0x00000000004054b5 in main (argc=5, argv=0x7fffffffd218) at ../../../flux-core/src/broker/cmbd.c:477
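For anyone who wants to confirm that several core files really do share this backtrace, a minimal sketch of the comparison follows. It masks hex values (return addresses and pointer arguments) before comparing frames. The helper names `normalize_frame` and `same_backtrace` are hypothetical, and the addresses in `bt2` are made up purely for illustration; only the frames in `bt1` are copied from the trace above.

```python
import re

# Matches hexadecimal values such as return addresses and pointer arguments.
HEX = re.compile(r"0x[0-9a-fA-F]+")

def normalize_frame(line):
    """Replace every hex value in a gdb frame line with a fixed placeholder."""
    return HEX.sub("0xADDR", line.strip())

def same_backtrace(a, b):
    """Return True if two backtraces match once hex values are masked."""
    return [normalize_frame(x) for x in a] == [normalize_frame(y) for y in b]

# Frames copied from the backtrace above.
bt1 = [
    "#0 0x00002aaaab9d8635 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64",
    "#1 0x00002aaaab9d9e15 in abort () at abort.c:92",
]
# Same frames with invented addresses, standing in for a second core file.
bt2 = [
    "#0 0x00002aaaab111111 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64",
    "#1 0x00002aaaab222222 in abort () at abort.c:92",
]

print(same_backtrace(bt1, bt2))  # True
```

Function names, source files, and line numbers are left intact, so two traces only compare equal when they pass through the same code path.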
I was running flux commit 3b67cb3 with Jim's additional patch for the named socket issue encountered when getting kap to launch with srun; the patch basically removes the per-rank component of the socket path.
The run configuration is a pre-allocated 32-node Slurm job (so it could just as easily be launched as a batch job), running this flux command: ./flux start -M barrier -N 32 -s 32 <absolute path to run script>. The run script launching KAP is below.