[LU-11765] during failover test run, mdtest job fails, numerous stat failures 'No such file or directory'
Created: 12/Dec/18 Updated: 29/Jun/22 Resolved: 17/Apr/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sergey Cheremencev | Assignee: | Sergey Cheremencev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch | ||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Running a failover test (random failover of OSSs + mdtest), an mdtest job failed, reporting stat failures. This looked similar to
V-1: Entering create_remove_items_helper...
V-1: Entering unique_dir_access...
V-1: Entering mdtest_stat...
08/19/2018 07:15:43: Process 10(nid00265): FAILED in mdtest_stat, unable to stat file: No such file or directory
08/19/2018 07:15:43: Process 15(nid00279): FAILED in mdtest_stat, unable to stat file: No such file or directory
Rank 10 [Sun Aug 19 07:15:43 2018] [c1-0c0s4n1] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 10
Rank 15 [Sun Aug 19 07:15:43 2018] [c1-0c0s4n3] application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
Details of the failure can be found in the patch commit message. I will upload the patch shortly. |
| Comments |
| Comment by Gerrit Updater [ 12/Dec/18 ] |
|
Sergey Cheremencev (c17829@cray.com) uploaded a new patch: https://review.whamcloud.com/33836 |
| Comment by Sergey Cheremencev [ 12/Dec/18 ] |
|
Below is an example of how to start replay-ost-single_12 on local virtual machines. Note that OST1 should be on a separate node.
LOAD_MODULES_REMOTE=true \
PDSH="/usr/local/bin/pdsh -S -R ssh -w" \
POWER_UP="" \
FAILURE_MODE=HARD \
POWER_DOWN="pdsh -S -R ssh -w dhcppc4 echo c > /proc/sysrq-trigger&" \
NOFORMAT=yes \
mds_HOST=dhcppc3 mgs_HOST=dhcppc3 \
OSTCOUNT=1 ost1_HOST=dhcppc4 ost2_HOST=dhcppc3 \
ONLY=12 bash /root/src/lustre-wc-rel/lustre/tests/replay-ost-single.sh
If modules on the second node are not loaded automatically, apply the following test-framework.sh (t-f) patch:
[root@dhcppc3 tests]# git diff test-framework.sh
diff --git a/lustre/tests/test-framework.sh b/lustre/tests/test-framework.sh
index b42bc9c..2c2daae 100755
--- a/lustre/tests/test-framework.sh
+++ b/lustre/tests/test-framework.sh
@@ -1902,7 +1902,7 @@ mount_facet() {
  local devicelabel
  local dm_dev=${!dev}
- module_loaded lustre || load_modules
+ load_modules
  case $fstype in
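
The one-line change above can be illustrated with stand-in functions. This is only a sketch: the bodies of `module_loaded` and `load_modules` below are hypothetical stubs, not the real test-framework.sh helpers, and it assumes (not stated in this ticket) that `load_modules` is also the step that loads modules on remote nodes when LOAD_MODULES_REMOTE=true. Under that assumption, the `||` short-circuit would skip the remote load whenever the local node already has the lustre module loaded.

```shell
#!/bin/bash
# Hypothetical stubs standing in for the test-framework.sh helpers.
calls=""

module_loaded() {
	# Pretend the requested module is already loaded on the local node.
	return 0
}

load_modules() {
	# Record the call instead of loading anything.
	calls="${calls}load_modules "
}

# Original line: load_modules is skipped because module_loaded succeeds.
module_loaded lustre || load_modules
before_patch="$calls"

# Patched line: load_modules always runs.
calls=""
load_modules
after_patch="$calls"

echo "before patch: '${before_patch}'"
echo "after patch:  '${after_patch}'"
```

With the stubs as written, the original form never reaches `load_modules`, while the patched form calls it on every `mount_facet`.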
|
| Comment by Gerrit Updater [ 15/Mar/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33836/ |
| Comment by Cory Spitz [ 17/Apr/19 ] |
|
What work remains here? |
| Comment by Cory Spitz [ 17/Apr/19 ] |
|
This is resolved in Lustre 2.13.0. |