[LU-7564] (out_handler.c:854:out_tx_end()) ... rc = -524 Created: 16/Dec/15  Updated: 09/Sep/16  Resolved: 05/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: dne2, soak
Environment:

lola
build: tip of master (commit ae3a2891f10a19acf855a90337316dda704da5d)


Attachments: HTML File dangling-files-after-restart-build-20160126
             HTML File dangling-files-before-restart-build-20160126
             File lola-10-lustre-log.for-LU-7564.20151217T0035.bz2
             Text File lola-11-lustre-log-LU-7565-20151217-0125.bz2
             File lustre-log.1453899092.15929.bz2
             File lustre-log.1453899093.15184.bz2
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The error happens during soak testing of build '20151214' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20151214)

Approximately 10% of all batch jobs using simul crash with the typical error message:

...
...
03:01:31: Running test #10(iter 42): mkdir, shared mode.
03:01:31: Running test #10(iter 43): mkdir, shared mode.
03:01:31: Process 0(lola-27.lola.whamcloud.com): FAILED in remove_dirs, rmdir failed: Input/output error
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
In: PMI_Abort(1, N/A)
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
[lola-29][[616,1],2][btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-32]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-29]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
slurmd[lola-27]: *** STEP 393832.0 KILLED AT 2015-12-16T03:01:31 WITH SIGNAL 9 ***
srun: error: lola-32: task 3: Killed
srun: Terminating job step 393832.0
srun: error: lola-29: task 2: Killed
srun: error: lola-27: task 0: Exited with exit code 1
srun: error: lola-27: task 1: Killed

Each job crash correlates perfectly in time with an event on an MDS:

lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
lola-10.log:Dec 16 03:01:31 lola-10 kernel: LustreError: 5589:0:(out_handler.c:854:out_tx_end()) Skipped 3 previous similar messages
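
For context: rc = -524 is the Linux-kernel-internal errno ENOTSUPP. The message means that while rolling back a partially executed distributed transaction, at least one already-applied sub-update had no undo handler. A minimal sketch of that pattern in C, with hypothetical names (not the actual out_handler.c code):

#define ENOTSUPP 524   /* kernel-internal errno, never seen by userspace */

struct tx_arg;                                /* one sub-update of the transaction */
typedef int (*tx_undo_cb_t)(struct tx_arg *);

struct tx_arg {
	tx_undo_cb_t	ta_undo;              /* NULL if the update is irreversible */
};

/* hypothetical helper: roll back updates [0, last_executed] in reverse order */
static int tx_undo_executed(struct tx_arg **args, int last_executed)
{
	int i, rc = 0;

	for (i = last_executed; i >= 0; i--) {
		if (args[i]->ta_undo == NULL) {
			rc = -ENOTSUPP;       /* reported as rc = -524 above */
			continue;
		}
		rc = args[i]->ta_undo(args[i]);
	}
	return rc;
}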


 Comments   
Comment by Frank Heckes (Inactive) [ 16/Dec/15 ]

The errors happen during normal operations, i.e. at times when no fault was injected.

Comment by Frank Heckes (Inactive) [ 17/Dec/15 ]

Attached are the debug logs taken on two MDSes after the event occurred. To show the timing sequence, here are the
error messages for each event and node:
lola-10

lola-10.log:Dec 17 00:29:51 lola-10 kernel: LustreError: 5454:0:(out_handler.c:854:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20151214/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524

lola-11

lola-11.log:Dec 17 01:22:34 lola-11 kernel: LustreError: 6293:0:(out_handler.c:854:out_tx_end()) soaked-MDT0007-osd: undo for /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6.6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.64/lustre/ptlrpc/../../lustre/target/out_handler.c:385: rc = -524
Comment by Frank Heckes (Inactive) [ 17/Dec/15 ]

The debug mask is set to the default. Please let me know if you would like it set to another, more appropriate value.

Comment by Di Wang [ 23/Jan/16 ]

Pushed a patch, http://review.whamcloud.com/16838, which includes all of the DNE fixes and some fixes to remote llog handling (in llog_cat_new_log()) that might otherwise cause recovery failure.

Please try this one.

Comment by Di Wang [ 23/Jan/16 ]

Another possible reason is that soak-test is doing double MDT failover without COS, which might cause corruption during failover, considering the following scenario:

1. Client1 sends an operation (Op1) to MDT1; MDT1 distributes the updates of Op1 to MDT2, and after finishing Op1, MDT1 sends the reply to Client1.
2. Client2 sends an operation (Op2) to MDT3, and MDT3 distributes the updates of Op2 to MDT2. (Note: Op2 depends on Op1.)
3. Before Op1 is committed on MDT1 and MDT2, both MDT1 and MDT2 reboot.
4. After MDT1 restarts, Client1 resends Op1 to MDT1, and MDT1 distributes the updates of Op1 to MDT2, but with a different xid and a transno of 0.
5. After MDT2 restarts and recovers, it will of course ignore the updates of Op1 because of their 0 transno; instead it will receive the replay updates from MDT3 (Op2), which will then of course fail.

With COS, Op1 would be committed before Op2 starts, which would help a lot here. I included the COS patch in 16838; let's see how it goes.
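
To make the COS (Commit on Share) idea concrete, here is a minimal sketch with hypothetical names (not the Lustre implementation): before a dependent operation starts, any uncommitted transaction it depends on is forced to stable storage, so a double failover can no longer replay Op1 with a zero transno while Op2 is replayed from MDT3.

struct mdt_txn {
	unsigned long long	t_transno;
	int			t_committed;	/* 1 once on stable storage */
};

/* hypothetical hook: flush the journal and wait until the txn is stable */
static void txn_commit_sync(struct mdt_txn *txn)
{
	/* ... force the journal to disk here ... */
	txn->t_committed = 1;
}

/* called when Op2 is found to depend on Op1's still-open transaction */
static void cos_resolve_dependency(struct mdt_txn *op1_txn)
{
	if (!op1_txn->t_committed)
		txn_commit_sync(op1_txn);	/* COS: commit before sharing */
}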

Comment by Gerrit Updater [ 26/Jan/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18165
Subject: LU-7564 llog: separate llog creation with initialization
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 734c40e674eb14055648d26af99aca9de4d0fd4f

Comment by Frank Heckes (Inactive) [ 27/Jan/16 ]

The error is still present in build '20160126' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160126)

The test was executed without any fault being injected (i.e. no MDS restart/failover or OSS failover).

Even after remounting the FS, the previously dangling files do not disappear (see the list in the attached file 'dangling-files-before-restart-build-20160126').
New dangling files are also left over from job crashes (see the list in the attached file 'dangling-files-after-restart-build-20160126').
Here the job crash can be correlated to the rc == -524 event and, in the end, to the dangling file(s). E.g.
JOB 420993:
from job output file

 
01/27/2016 04:51:32: Process 5(lola-33.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
01/27/2016 04:51:32: Process 3(lola-31.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
01/27/2016 04:51:32: Process 2(lola-30.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
slurmd[lola-34]: *** STEP 420993.0 KILLED AT 2016-01-27T04:51:32 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[lola-32]: *** STEP 420993.0 KILLED AT 2016-01-27T04:51:32 WITH SIGNAL 9 ***

server error (rc = -17 is -EEXIST; the -524 that follows is the kernel-internal ENOTSUPP from the undo path, as above):

lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:846:out_tx_end()) error during execution of #8 from /lbuilds/soak-builds/workspace/lustre-soaked-20160126/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.65/lustre/ptlrpc/../../lustre/target/out_handler.c:503: rc = -17
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:846:out_tx_end()) Skipped 4 previous similar messages
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:856:out_tx_end()) soaked-MDT0004-osd: undo for /lbuilds/soak-builds/workspace/lustre-soaked-20160126/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.65/lustre/ptlrpc/../../lustre/target/out_handler.c:387: rc = -524
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: 15929:0:(out_handler.c:856:out_tx_end()) Skipped 17 previous similar messages
lola-10.log:Jan 27 04:51:32 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1453899092.15929
lola-10.log:Jan 27 04:51:33 lola-10 kernel: LustreError: dumping log to /tmp/lustre-log.1453899093.15184

dangling files in FS

[root@lola-16 ~]# ll /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.0.0/
ls: cannot access /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.0.0/dir.mdtest.0.61: No such file or directory
total 0
d????????? ? ? ? ?            ? dir.mdtest.0.61
[root@lola-16 ~]# ll /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.5.0
ls: cannot access /mnt/soaked//soaktest/test/mdtestfpp/420993/#test-dir.1/mdtest_tree.5.0/dir.mdtest.5.61: No such file or directory
total 0
d????????? ? ? ? ?            ? dir.mdtest.5.61

Attached are the files containing the lists of dangling files and the debug logs mentioned in the MDT error messages above.
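
As an aside, dangling entries like the d????????? lines above can be detected with plain POSIX calls: the name is returned by readdir(3) on the parent directory, but lstat(2) on the full path fails with ENOENT because the remote MDT object is gone. A small standalone sketch (the scanned path is an example):

#include <stdio.h>
#include <dirent.h>
#include <sys/stat.h>
#include <errno.h>

int main(void)
{
	const char *dir = "/mnt/soaked/soaktest/test";	/* example path */
	char path[4096];
	struct dirent *de;
	struct stat st;
	DIR *dp = opendir(dir);

	if (dp == NULL)
		return 1;
	while ((de = readdir(dp)) != NULL) {
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		/* listed by the parent but not stat-able: a dangling entry */
		if (lstat(path, &st) < 0 && errno == ENOENT)
			printf("dangling: %s\n", path);
	}
	closedir(dp);
	return 0;
}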

Comment by Gerrit Updater [ 29/Jan/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18206
Subject: LU-7564 osp: lock remote object exclusively
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 468a6d9ee4854740353cc41c04aa70ab6155e069

Comment by Gerrit Updater [ 05/Feb/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18206/
Subject: LU-7564 osp: Do not match the lock for OSP
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: beab72b475c6006f53d5cab628cfdbe6dca09b32

Comment by Joseph Gmitter (Inactive) [ 05/Feb/16 ]

The patch has landed for 2.8.
