[LU-11265] recovery-mds-scale test failover_ost fails with 'test_failover_ost returned 1' due to mkdir failure Created: 17/Aug/18  Updated: 19/Nov/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5, Lustre 2.12.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

recovery-mds-scale test_failover_ost fails without failing over any OSTs because mkdir fails on the client. From the failover test session at https://testing.whamcloud.com/test_sets/46c80c12-a1f1-11e8-a5f2-52540065bddc, we see the following at the end of the test_log:

2018-08-15 23:39:25 Terminating clients loads ...
Duration:               86400
Server failover period: 1200 seconds
Exited after:           0 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 0 times
Status: FAIL: rc=1

From the suite log, we see that the client load failed during the first OSS failover:

Started lustre-OST0006
==== Checking the clients loads AFTER failover -- failure NOT OK
Client load failed on node trevis-3vm7, rc=1
Client load failed during failover. Exiting...

On vm7, looking at the run_dd log, we can see that dd fails because the target path already exists but is no longer a directory:

2018-08-15 23:37:37: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
mkdir: cannot create directory ‘/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com’: File exists
+ /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
+ cd /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
/usr/lib64/lustre/tests/run_dd.sh: line 34: cd: /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com: Not a directory
+ sync
++ df -P /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
++ awk '/:/ { print $4 }'
+ FREE_SPACE=13349248
+ BLKS=1501790
+ echoerr 'Total free disk space is 13349248, 4k blocks to dd is 1501790'
+ echo 'Total free disk space is 13349248, 4k blocks to dd is 1501790'
Total free disk space is 13349248, 4k blocks to dd is 1501790
+ df /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
+ dd bs=4k count=1501790 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com/dd-file
dd: failed to open ‘/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com/dd-file’: Not a directory
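
For reference, the failing sequence in run_dd.sh reduces to "mkdir -p, lfs setstripe, cd". A minimal defensive sketch (not the actual run_dd.sh code; TESTDIR is a hypothetical stand-in for the per-client directory name) would clear any stale leftover before recreating the directory:

# Hypothetical sketch, not the shipped run_dd.sh: clear a stale path
# left behind by an interrupted previous run before recreating it.
TESTDIR=${TESTDIR:-/mnt/lustre/d0.dd-$(hostname)}
if [ -e "$TESTDIR" ] && [ ! -d "$TESTDIR" ]; then
    # leftover from a killed rm -rf; remove it so mkdir can succeed
    rm -rf "$TESTDIR"
fi
mkdir -p "$TESTDIR" || exit 1
/usr/bin/lfs setstripe -c -1 "$TESTDIR"
cd "$TESTDIR" || exit 1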

Looking at the run_dd log from the previous test, failover_mds, we see that the directory d0.dd-trevis-* is created and the client dd load is signaled to terminate right after the command to remove the directory is issued:

+ echo '2018-08-15 23:37:07: dd succeeded'
2018-08-15 23:37:07: dd succeeded
+ cd /tmp
+ rm -rf /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
++ signaled
+++ date '+%F %H:%M:%S'
++ echoerr '2018-08-15 23:37:41: client load was signaled to terminate'
++ echo '2018-08-15 23:37:41: client load was signaled to terminate'
2018-08-15 23:37:41: client load was signaled to terminate
+++ ps -eo '%c %p %r'
+++ awk '/ 18302 / {print $3}'
++ local PGID=18241
++ kill -TERM -18241
++ sleep 5
++ kill -KILL -18241

Perhaps the load was killed before the rm -rf could finish: the framework sends kill -TERM to the process group and escalates to kill -KILL five seconds later, which would leave a partially removed path behind for the next test to trip over.
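
If so, one hedged way to harden the load scripts (a sketch only, assuming the loads are plain bash scripts like run_dd.sh; the cleanup function name is ours) is to run the removal from a TERM trap, so the directory is gone before the framework escalates to kill -KILL:

# Hypothetical sketch: remove the working directory from a TERM trap,
# inside the ~5 s window before the framework sends kill -KILL.
TESTDIR=/mnt/lustre/d0.dd-$(hostname)
cleanup() {
    cd /tmp
    rm -rf "$TESTDIR"
}
trap 'cleanup; exit 0' TERM

mkdir -p "$TESTDIR"
# ... dd load loop runs here ...
cleanup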



 Comments   
Comment by James Nunez (Inactive) [ 26/Sep/19 ]

We see this issue with recent versions of 2.12: https://testing.whamcloud.com/test_sets/4d26e47c-dfec-11e9-a0ba-52540065bddc . When this happens, the test suite that runs after recovery-mds-scale fails, even when all of its own tests pass, because it cannot remove the contents of the test directory during cleanup.

For example (see test session https://testing.whamcloud.com/test_sessions/40ec2f5f-9e8e-45a7-97c8-a84baf2e4b55), recovery-random-scale and recovery-double-scale fail with:

== recovery-random-scale test complete, duration 85619 sec =========================================== 11:09:20 (1569409760)
rm: cannot remove '/mnt/lustre/d0.tar-trevis-40vm9.trevis.whamcloud.com/etc': Directory not empty
 recovery-random-scale test_fail_client_mds: @@@@@@ FAIL: remove sub-test dirs failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5829:error()
  = /usr/lib64/lustre/tests/test-framework.sh:5316:check_and_cleanup_lustre()
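
Until the root cause is fixed, a hedged workaround in the cleanup path (a sketch only, not the actual check_and_cleanup_lustre() code) would be to retry the removal, since the first rm -rf can race with a client load that is still being torn down:

# Hypothetical sketch: retry removing the sub-test directories instead
# of failing the whole suite on the first "Directory not empty".
for i in 1 2 3; do
    rm -rf /mnt/lustre/d0.* && break
    sleep 5
done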
Comment by Elena Gryaznova [ 19/Nov/21 ]

One more instance:
https://testing.whamcloud.com/test_sets/03b6d0ca-04e6-4426-a4fc-a701404fc4f7
https://testing.whamcloud.com/test_logs/ae021887-19cd-45a7-818a-07cfc12c0644/show_text

+ /usr/bin/lfs mkdir -i1 -c1 /mnt/lustre/d0.tar-onyx-72vm14.onyx.whamcloud.com
lfs mkdir: dirstripe error on '/mnt/lustre/d0.tar-onyx-72vm14.onyx.whamcloud.com': stripe already set
lfs setdirstripe: cannot create dir '/mnt/lustre/d0.tar-onyx-72vm14.onyx.whamcloud.com': File exists
+ return 1
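
Here the tar load hits the same stale-path problem through lfs mkdir, which, unlike mkdir -p, refuses to reuse an existing directory. A hedged idempotent variant (a sketch, not the actual run_tar.sh code; DIR is our placeholder):

# Hypothetical sketch: remove a stale leftover before creating the
# striped directory, since lfs mkdir fails with EEXIST.
DIR=/mnt/lustre/d0.tar-$(hostname)
[ -e "$DIR" ] && rm -rf "$DIR"
/usr/bin/lfs mkdir -i1 -c1 "$DIR" || exit 1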