[LU-7564] (out_handler.c:854:out_tx_end()) ... rc = -524 Created: 16/Dec/15 Updated: 09/Sep/16 Resolved: 05/Feb/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne2, soak | ||
| Environment: |
lola |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The error happens during soak testing of build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214). Approximately 10% of the batch jobs using simul crash with a 'typical' error message:

...
03:01:31: Running test #10(iter 42): mkdir, shared mode.
03:01:31: Running test #10(iter 43): mkdir, shared mode.
03:01:31: Process 0(lola-27.lola.whamcloud.com): FAILED in remove_dirs, rmdir failed: Input/output error
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
In: PMI_Abort(1, N/A)
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
[lola-29][[616,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
srun: error: lola-32: task 3: Killed
srun: Terminating job step 393832.0
srun: error: lola-29: task 2: Killed
srun: error: lola-27: task 0: Exited with exit code 1
srun: error: lola-27: task 1: Killed

Each job crash correlates perfectly in time with an event on an MDS:

lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) Skipped 3 previous similar messages |
| Comments |
| Comment by Frank Heckes (Inactive) [ 16/Dec/15 ] |
|
The errors happen during normal operations, i.e. at times when no fault was injected. |
| Comment by Frank Heckes (Inactive) [ 17/Dec/15 ] |
|
Attached the debug logs taken on two MDSes after the event occurred. For the timing sequence:

lola-10.log:Dec 17 00:29:51 lola-10 kernel: LustreError: 5454:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
lola-11.log:Dec 17 01:22:34 lola-11 kernel: LustreError: 6293:0:(out_handler.c:854:out_tx_end()) soaked-MDT0007-osd: undo for /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6.6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524 |
| Comment by Frank Heckes (Inactive) [ 17/Dec/15 ] |
|
The debug mask is set to the default. Please let me know if you would like it set to another, more appropriate value. |
| Comment by Di Wang [ 23/Jan/16 ] |
|
Pushed a patch, http://review.whamcloud.com/16838, which includes all of the DNE fixes and some fixes for remote llog (in llog_cat_new_log()) issues that might cause recovery failure. Please try this one. |
| Comment by Di Wang [ 23/Jan/16 ] |
|
Another possible cause is that the soak test performs a double MDT failover without COS, which might corrupt state during failover. Consider the following scenario:

1. Client1 sends an operation (Op1) to MDT1, and MDT1 distributes the updates of Op1 to MDT2; after Op1 finishes, MDT1 sends the reply to Client1.

If we had COS here, Op1 would be committed before Op2 starts, which would help a lot. I included the COS patch in 16838; let's see how it goes. |
| Comment by Gerrit Updater [ 26/Jan/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18165 |
| Comment by Frank Heckes (Inactive) [ 27/Jan/16 ] |
|
The error is still present in build '20160126' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160126). The test was executed without any fault being injected (i.e. no MDS restart/failover and no OSS failover). Even after remounting the FS, the previously dangling files don't disappear (see the list in the attached file 'dangling-files-before-restart-build-20160126').

01/27/2016 04:51:32: Process 5(lola-33.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
01/27/2016 04:51:32: Process 3(lola-31.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
01/27/2016 04:51:32: Process 2(lola-30.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
slurmd[lola-34]: *** STEP 420993.0 KILLED AT 2016-01-27T04:51:32 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[lola-32]: *** STEP 420993.0 KILLED AT 2016-01-27T04:51:32 WITH SIGNAL 9 ***

Server error:

lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:846:out_tx_end()) error during execution of #8 from /lbuilds/soak-builds/workspace/lustre-soaked-20160126/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.65/lustre/ptlrpc/../../lustre/target/out_handler.c:503: rc = -17
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:846:out_tx_end()) Skipped 4 previous similar messages
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:856:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20160126/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.65/lustre/ptlrpc/../../lustre/target/out_handler.c:387: rc = -524
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:856:out_tx_end()) Skipped 17 previous similar messages
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1453899092.15929
lola-10.log:Jan 27 04:51:33 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1453899093.15184

Dangling files in the FS:

[root@lola-16 ~]# ll /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.0.0/
ls: cannot access /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.0.0/dir.mdtest.0.61: No such file or directory
total 0
d????????? ? ? ? ? ? dir.mdtest.0.61
[root@lola-16 ~]# ll /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.5.0
ls: cannot access /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.5.0/dir.mdtest.5.61: No such file or directory
total 0
d????????? ? ? ? ? ? dir.mdtest.5.61

Attached are files containing the lists of dangling files and the debug log files mentioned in the MDT error messages above. |
| Comment by Gerrit Updater [ 29/Jan/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18206 |
| Comment by Gerrit Updater [ 05/Feb/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18206/ |
| Comment by Joseph Gmitter (Inactive) [ 05/Feb/16 ] |
|
The patch has landed for 2.8. |