[LU-11765] during failover test run, mdtest job fails, numerous stat failures 'No such file or directory' Created: 12/Dec/18  Updated: 29/Jun/22  Resolved: 17/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Major
Reporter: Sergey Cheremencev Assignee: Sergey Cheremencev
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running a failover test (random failover of OSSs plus mdtest), an mdtest job failed, reporting stat failures. This looked similar to LU-11760, except that this time all the files were actually present and valid after the job failure.

V-1: Entering create_remove_items_helper...
V-1: Entering unique_dir_access...
V-1: Entering mdtest_stat...
08/19/2018 07:15:43: Process 10(nid00265): FAILED in mdtest_stat, unable to stat file: No such file or directory
08/19/2018 07:15:43: Process 15(nid00279): FAILED in mdtest_stat, unable to stat file: No such file or directory
Rank 10 [Sun Aug 19 07:15:43 2018] [c1-0c0s4n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 10
Rank 15 [Sun Aug 19 07:15:43 2018] [c1-0c0s4n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15

Details of the failure reason can be found in the patch commit message. I will upload it shortly.



 Comments   
Comment by Gerrit Updater [ 12/Dec/18 ]

Sergey Cheremencev (c17829@cray.com) uploaded a new patch: https://review.whamcloud.com/33836
Subject: LU-11765 ofd: return EAGAIN during 1st CLEANUP_ORPHAN
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c13e591a85bfab7df8a341137a160e5690175220

Comment by Sergey Cheremencev [ 12/Dec/18 ]

Below is an example of how to run replay-ost-single test 12 on local virtual machines. Note that OST1 should be on a separate node.

LOAD_MODULES_REMOTE=true \
PDSH="/usr/local/bin/pdsh -S -R ssh -w" \
POWER_UP="" \
FAILURE_MODE=HARD \
POWER_DOWN="pdsh -S -R ssh -w dhcppc4 echo c > /proc/sysrq-trigger&" \
NOFORMAT=yes \
mds_HOST=dhcppc3 mgs_HOST=dhcppc3 \
OSTCOUNT=1 ost1_HOST=dhcppc4 ost2_HOST=dhcppc3 \
ONLY=12 bash /root/src/lustre-wc-rel/lustre/tests/replay-ost-single.sh

Apply the following test-framework patch if the modules on the second node do not start automatically:

[root@dhcppc3 tests]# git diff test-framework.sh 
diff --git a/lustre/tests/test-framework.sh b/lustre/tests/test-framework.sh
index b42bc9c..2c2daae 100755
--- a/lustre/tests/test-framework.sh
+++ b/lustre/tests/test-framework.sh
@@ -1902,7 +1902,7 @@ mount_facet() {
        local devicelabel
        local dm_dev=${!dev}
 
-       module_loaded lustre || load_modules
+       load_modules
 
        case $fstype in
Comment by Gerrit Updater [ 15/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33836/
Subject: LU-11765 ofd: return EAGAIN during 1st CLEANUP_ORPHAN
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dc52a88cde1e7cea093b25fc9a15509fe0ac527a

Comment by Cory Spitz [ 17/Apr/19 ]

What work remains here?

Comment by Cory Spitz [ 17/Apr/19 ]

This is resolved with Lustre 2.13.0.

Generated at Sat Feb 10 02:46:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.