Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
Lustre 2.8.0
-
lola
build: tip of master (commit ae3a2891f10a19acf855a90337316dda704da5d)
-
3
-
Description
The error occurs during soak testing of build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214).
Approximately 10% of all batch jobs running simul crash with the following typical error message:
... ...
03:01:31: Running test #10(iter 42): mkdir, shared mode.
03:01:31: Running test #10(iter 43): mkdir, shared mode.
03:01:31: Process 0(lola-27.lola.whamcloud.com): FAILED in remove_dirs, rmdir failed: Input/output error
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--------------------------------------------------------------------------
In: PMI_Abort(1, N/A)
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
[lola-29][[616,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
srun: error: lola-32: task 3: Killed
srun: Terminating job step 393832.0
srun: error: lola-29: task 2: Killed
srun: error: lola-27: task 0: Exited with exit code 1
srun: error: lola-27: task 1: Killed
Each job crash correlates exactly in time with an error event on an MDS:
lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) Skipped 3 previous similar messages
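The time correlation described above can be checked mechanically. The following is a minimal sketch, not the procedure actually used in this ticket: it assumes the job output and the MDS syslog are available one message per line, that both cover the same day (syslog timestamps carry no year, so only the time of day is compared), and that the line formats match the excerpts quoted here. The names `SIMUL_FAIL`, `MDS_ERR`, `correlate` and the `tolerance_s` parameter are hypothetical.

```python
import re
from datetime import datetime

# simul failure line, e.g.
# "03:01:31: Process 0(lola-27...): FAILED in remove_dirs, rmdir failed: ..."
SIMUL_FAIL = re.compile(
    r"^(\d{2}:\d{2}:\d{2}): Process \d+\((\S+?)\): FAILED in (\w+)"
)

# MDS syslog line reporting out_tx_end() with a return code; an optional
# "file:" prefix (as produced by grep over several logs) is tolerated.
MDS_ERR = re.compile(
    r"^(?:\S+?:)?\w{3} +\d+ (\d{2}:\d{2}:\d{2}) (\S+) kernel: "
    r"LustreError: .*?out_tx_end\(\)\).*rc = (-?\d+)"
)


def _time_of_day(hms):
    """Parse HH:MM:SS; the date part is a dummy and only differences are used."""
    return datetime.strptime(hms, "%H:%M:%S")


def correlate(job_lines, mds_lines, tolerance_s=1):
    """Return (client, failed_function, mds_host, rc) for each simul failure
    that has an MDS out_tx_end() error within tolerance_s seconds."""
    failures = []
    for line in job_lines:
        m = SIMUL_FAIL.match(line)
        if m:
            failures.append((_time_of_day(m.group(1)), m.group(2), m.group(3)))

    events = []
    for line in mds_lines:
        m = MDS_ERR.match(line)
        if m:
            events.append((_time_of_day(m.group(1)), m.group(2), int(m.group(3))))

    matches = []
    for f_time, client, func in failures:
        for e_time, mds, rc in events:
            if abs((f_time - e_time).total_seconds()) <= tolerance_s:
                matches.append((client, func, mds, rc))
    return matches
```

Called with the simul output and the lola-10 log lines quoted above, `correlate` would pair the `remove_dirs` failure on lola-27 with the `out_tx_end()` error on lola-10. The "Skipped 3 previous similar messages" line carries no `rc` value and is ignored by this sketch.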