  Lustre / LU-7564

(out_handler.c:854:out_tx_end()) ... rc = -524

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.8.0
    • Fix Version: Lustre 2.8.0
    • Environment: lola
      build: tip of master (commit ae3a2891f10a19acf855a90337316dda704da5d)
    • Severity: 3

    Description

      The error happens during soak testing of build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214)

      Approximately 10% of the batch jobs using simul crash with the 'typical' error message:

      ...
      ...
      03:01:31: Running test #10(iter 42): mkdir, shared mode.
      03:01:31: Running test #10(iter 43): mkdir, shared mode.
      03:01:31: Process 0(lola-27.lola.whamcloud.com): FAILED in remove_dirs, rmdir failed: Input/output error
      --------------------------------------------------------------------------
      MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
      with errorcode 1.
      
      NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
      You may or may not see output from other processes, depending on
      exactly when Open MPI kills them.
      --------------------------------------------------------------------------
      In: PMI_Abort(1, N/A)
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      [lola-29][[616,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
      slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
      srun: error: lola-32: task 3: Killed
      srun: Terminating job step 393832.0
      srun: error: lola-29: task 2: Killed
      srun: error: lola-27: task 0: Exited with exit code 1
      srun: error: lola-27: task 1: Killed
      

      Each job crash correlates perfectly in time with an event on an MDS:

      lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
      lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) Skipped 3 previous similar messages
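
      For reference, the rc values that recur in this ticket decode to kernel errno values: -17 is the standard EEXIST, while -524 corresponds to ENOTSUPP, a kernel-internal errno (defined in include/linux/errno.h) that is not exported to userspace. A minimal decoder sketch; the K_ENOTSUPP constant is hard-coded here because userspace headers do not define it:

      ```c
      /* Decode the rc values seen in these logs.  -17 maps to the
       * standard EEXIST; 524 (ENOTSUPP) is a kernel-internal errno
       * from include/linux/errno.h, invisible to strerror(), so it
       * is handled explicitly. */
      #include <stdio.h>
      #include <string.h>

      #define K_ENOTSUPP 524 /* kernel-internal: operation is not supported */

      static const char *decode_rc(int rc)
      {
          if (-rc == K_ENOTSUPP)
              return "ENOTSUPP (kernel-internal)";
          return strerror(-rc);
      }

      int main(void)
      {
          printf("rc = -524 -> %s\n", decode_rc(-524));
          printf("rc = -17  -> %s\n", decode_rc(-17));
          return 0;
      }
      ```

      Compiled and run, this prints the ENOTSUPP decode for -524 and "File exists" for -17.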
      

      Attachments

        Activity


          jgmitter Joseph Gmitter (Inactive) added a comment -

          Patch has landed for 2.8

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18206/
          Subject: LU-7564 osp: Do not match the lock for OSP
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: beab72b475c6006f53d5cab628cfdbe6dca09b32


          wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18206
          Subject: LU-7564 osp: lock remote object exclusively
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 468a6d9ee4854740353cc41c04aa70ab6155e069


          heckes Frank Heckes (Inactive) added a comment -

          The error is present for build '20160126' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160126)

          The test was executed without any fault being injected (i.e. no MDS restart/failover, no OSS failover).

          After remounting the FS, the previously dangling files do not disappear (see list in attached file 'dangling-files-before-restart-build-20160126').
          New dangling files are also left over from job crashes (see list in attached file 'dangling-files-after-restart-build-20160126').
          Each job crash can be correlated with an rc == -524 event and, in the end, with the dangling file(s). E.g.:
          JOB 420993:
          From the job output file:

           
          01/27/2016 04:51:32: Process 5(lola-33.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
          01/27/2016 04:51:32: Process 3(lola-31.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
          01/27/2016 04:51:32: Process 2(lola-30.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
          slurmd[lola-34]: *** STEP 420993.0 KILLED AT 2016-01-27T04:51:32 WITH SIGNAL 9 ***
          srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
          slurmd[lola-32]: *** STEP 420993.0 KILLED AT 2016-01-27T04:51:32 WITH SIGNAL 9 ***
          

          Server error:

          lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:846:out_tx_end()) error during execution of #8 from /lbuilds/soak-builds/workspace/lustre-soaked-20160126/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.65/lustre/ptlrpc/../../lustre/target/out_handler.c:503: rc = -17
          lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:846:out_tx_end()) Skipped 4 previous similar messages
          lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:856:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20160126/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.65/lustre/ptlrpc/../../lustre/target/out_handler.c:387: rc = -524
          lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:856:out_tx_end()) Skipped 17 previous similar messages
          lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1453899092.15929
          lola-10.log:Jan 27 04:51:33 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1453899093.15184
          

          Dangling files in the FS:

          [root@lola-16 ~]# ll /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.0.0/
          ls: cannot access /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.0.0/dir.mdtest.0.61: No such file or directory
          total 0
          d????????? ? ? ? ?            ? dir.mdtest.0.61
          [root@lola-16 ~]# ll /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.5.0
          ls: cannot access /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.5.0/dir.mdtest.5.61: No such file or directory
          total 0
          d????????? ? ? ? ?            ? dir.mdtest.5.61
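
          The d????????? lines above are what ls prints when readdir() still returns a directory entry but stat() on the name fails, so every stat-derived field (mode, owner, size, date) is unknown. A minimal illustration of that mechanism (not the actual ls source; the question-mark formatting is imitated, not exact):

          ```c
          /* Sketch of why dangling entries render as "d????????? ? ? ...":
           * readdir() supplies the name, but if stat() fails there is
           * nothing to fill the mode/owner/size columns with. */
          #include <stdio.h>
          #include <dirent.h>
          #include <sys/stat.h>

          static void list_dir(const char *path)
          {
              DIR *d = opendir(path);
              struct dirent *e;
              char buf[4096];

              if (!d)
                  return;
              while ((e = readdir(d)) != NULL) {
                  struct stat st;

                  snprintf(buf, sizeof(buf), "%s/%s", path, e->d_name);
                  if (stat(buf, &st) < 0)
                      /* stat-derived fields unknown: print placeholders */
                      printf("?????????? ? ? ? ? %s  <- stat() failed\n",
                             e->d_name);
                  else
                      printf("mode=%06o size=%lld %s\n",
                             (unsigned)st.st_mode,
                             (long long)st.st_size, e->d_name);
              }
              closedir(d);
          }

          int main(void)
          {
              list_dir(".");
              return 0;
          }
          ```

          On a healthy directory every entry stats cleanly; against the dangling entries above, stat() would fail with ENOENT and the question-mark branch would fire, matching the listing.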
          

          The attached files contain the lists of dangling files and the debug log files mentioned in the MDT error message above.


          wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18165
          Subject: LU-7564 llog: separate llog creation with initialization
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 734c40e674eb14055648d26af99aca9de4d0fd4f

          di.wang Di Wang added a comment -

          Another possible reason is that the soak test is doing double MDT failover without COS, which might cause corruption during failover. Consider the following scenario:

          1. Client1 sends operation Op1 to MDT1; MDT1 distributes the updates of Op1 to MDT2 and, after finishing Op1, sends a reply to Client1.
          2. Client2 sends operation Op2 to MDT3, and MDT3 distributes the updates of Op2 to MDT2. (Note: Op2 depends on Op1.)
          3. Before Op1 is committed on MDT1 and MDT2, both MDT1 and MDT2 reboot.
          4. After MDT1 restarts, Client1 resends Op1 to MDT1, and MDT1 distributes the updates of Op1 to MDT2, but with a different xid and a transno of 0.
          5. After MDT2 restarts and recovers, it will of course ignore the updates of Op1 because of their 0 transno; instead it will receive the replayed updates of Op2 from MDT3, which will then of course fail.

          If we had COS here, Op1 would be committed before Op2 starts, which would help a lot. I included the COS patch in 16838; let's see how it goes.
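
          The failure mode in steps 1-5 can be sketched as a toy model (hypothetical names and structures, not Lustre code): during recovery MDT2 skips the resent Op1 because its transno is 0, so the dependent replay of Op2 finds its prerequisite missing:

          ```c
          /* Toy model of the replay ordering problem described above.
           * All names here are invented for illustration only. */
          #include <stdio.h>
          #include <stdbool.h>

          struct update {
              const char *name;
              long long   transno;    /* 0 => resent after restart, not a replay */
              const char *depends_on; /* NULL if independent */
          };

          int main(void)
          {
              /* What MDT2 sees during recovery in step 5: Op1 was resent
               * by MDT1 with transno 0, while Op2 is a genuine replay
               * from MDT3 that depends on Op1 having been applied. */
              struct update queue[] = {
                  { "Op1 (resent by MDT1)",   0,   NULL  },
                  { "Op2 (replayed by MDT3)", 101, "Op1" },
              };
              bool op1_applied = false;

              for (int i = 0; i < 2; i++) {
                  struct update *u = &queue[i];

                  if (u->transno == 0) {
                      /* Recovery ignores non-replay updates. */
                      printf("%s: ignored during recovery (transno 0)\n",
                             u->name);
                      continue;
                  }
                  if (u->depends_on && !op1_applied) {
                      printf("%s: FAILS, dependency %s missing\n",
                             u->name, u->depends_on);
                      continue;
                  }
                  printf("%s: applied\n", u->name);
              }
              return 0;
          }
          ```

          With COS, Op1 would have been forced to commit before Op2 started, so the resend in step 4 (and hence the skipped update) would never occur.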

          di.wang Di Wang added a comment -

          Pushed patch http://review.whamcloud.com/16838, which includes all of the DNE fixes plus some fixes to remote llog handling (in llog_cat_new_log()) that might otherwise cause recovery failure.

          Please try this one.


          heckes Frank Heckes (Inactive) added a comment -

          The debug mask is set to the default. Please let me know if you would like it set to another, more appropriate value.

          heckes Frank Heckes (Inactive) added a comment -

          Attached are the debug logs taken on the two MDSes after the event occurred. For the timing sequence, here are the
          error messages for each event and node:
          lola-10

          lola-10.log:Dec 17 00:29:51 lola-10 kernel: LustreError: 5454:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
          

          lola-11

          lola-11.log:Dec 17 01:22:34 lola-11 kernel: LustreError: 6293:0:(out_handler.c:854:out_tx_end()) soaked-MDT0007-osd: undo for /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6.6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
          

          heckes Frank Heckes (Inactive) added a comment -

          The errors happen during normal operation, i.e. at times when no fault was injected.

          People

            Assignee: di.wang Di Wang
            Reporter: heckes Frank Heckes (Inactive)
            Votes: 0
            Watchers: 8
