[LU-11265] recovery-mds-scale test failover_ost fails with 'test_failover_ost returned 1' due to mkdir failure Created: 17/Aug/18 Updated: 19/Nov/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.5, Lustre 2.12.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
recovery-mds-scale test_failover_ost fails without failing over any OSTs due to mkdir failing. From the failover test session at https://testing.whamcloud.com/test_sets/46c80c12-a1f1-11e8-a5f2-52540065bddc, we see the following at the end of the test_log:

2018-08-15 23:39:25 Terminating clients loads ...
Duration:               86400
Server failover period: 1200 seconds
Exited after:           0 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 0 times
Status: FAIL: rc=1

From the suite log, we see that the client job failed during the first OSS failover:

Started lustre-OST0006
==== Checking the clients loads AFTER failover -- failure NOT OK
Client load failed on node trevis-3vm7, rc=1
Client load failed during failover. Exiting...

On vm7, looking at the run_dd log, we can see that dd fails because a directory already exists:

2018-08-15 23:37:37: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
mkdir: cannot create directory ‘/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com’: File exists
+ /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
+ cd /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
/usr/lib64/lustre/tests/run_dd.sh: line 34: cd: /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com: Not a directory
+ sync
++ df -P /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
++ awk '/:/ { print $4 }'
+ FREE_SPACE=13349248
+ BLKS=1501790
+ echoerr 'Total free disk space is 13349248, 4k blocks to dd is 1501790'
+ echo 'Total free disk space is 13349248, 4k blocks to dd is 1501790'
Total free disk space is 13349248, 4k blocks to dd is 1501790
+ df /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
+ dd bs=4k count=1501790 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com/dd-file
dd: failed to open ‘/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com/dd-file’: Not a directory
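
Note that 'mkdir -p' succeeds silently when the target already exists as a directory, so the "File exists" error above, together with the "Not a directory" errors from cd and dd, suggests the leftover d0.dd-* entry was not a plain directory at that point. A minimal local reproduction of the same coreutils behaviour (the /tmp path below is only an example, not taken from the test):

# Hypothetical reproduction on a local filesystem; the path is illustrative.
touch /tmp/d0.dd-example        # leave a non-directory entry behind
mkdir -p /tmp/d0.dd-example     # "mkdir: cannot create directory ...: File exists"
cd /tmp/d0.dd-example           # "cd: /tmp/d0.dd-example: Not a directory"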
Looking at the run_dd log from the previous test, failover_mds, we do see that the directory d0.dd-trevis-* is created and the client load is signaled to terminate right after the call to remove the directory is issued:

+ echo '2018-08-15 23:37:07: dd succeeded'
2018-08-15 23:37:07: dd succeeded
+ cd /tmp
+ rm -rf /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
++ signaled
+++ date '+%F %H:%M:%S'
++ echoerr '2018-08-15 23:37:41: client load was signaled to terminate'
++ echo '2018-08-15 23:37:41: client load was signaled to terminate'
2018-08-15 23:37:41: client load was signaled to terminate
+++ ps -eo '%c %p %r'
+++ awk '/ 18302 / {print $3}'
++ local PGID=18241
++ kill -TERM -18241
++ sleep 5
++ kill -KILL -18241
Maybe the job was killed before the directory could be removed?
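
If so, one way to make the client load tolerant of an interrupted cleanup would be to remove any stale entry before recreating the working directory and to clean up on termination. The following is only a sketch of that idea, not the actual run_dd.sh code; TESTDIR mirrors the d0.dd-<hostname> naming seen in the logs and is illustrative:

# Sketch only: pre-clean a possibly stale entry left by a previous, interrupted run.
TESTDIR=${TESTDIR:-/mnt/lustre/d0.dd-$(hostname)}

# A leftover non-directory entry would make "mkdir -p" fail with "File exists",
# so remove it before recreating the directory.
if [ -e "$TESTDIR" ] && [ ! -d "$TESTDIR" ]; then
        rm -f "$TESTDIR"
fi
mkdir -p "$TESTDIR" || exit 1

# Remove the directory when the script exits; route SIGTERM through exit so the
# EXIT trap runs when the harness signals the load.
trap 'cd /tmp && rm -rf "$TESTDIR"' EXIT
trap 'exit 143' TERM

Even with such a guard, the window can only be narrowed, not closed: the harness follows SIGTERM with SIGKILL after 5 seconds, and SIGKILL cannot be trapped, so a partial removal is still possible if the cleanup does not finish within that grace period. |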
| Comments |
| Comment by James Nunez (Inactive) [ 26/Sep/19 ] |
|
We see this issue for recent versions of 2.12: https://testing.whamcloud.com/test_sets/4d26e47c-dfec-11e9-a0ba-52540065bddc . When this happens, the test suite following recovery-mds-scale will fail, even if all of its own tests pass, because it cannot remove the contents of the test directory. For example (see test session https://testing.whamcloud.com/test_sessions/40ec2f5f-9e8e-45a7-97c8-a84baf2e4b55), we see recovery-random-scale and recovery-double-scale fail due to:

== recovery-random-scale test complete, duration 85619 sec =========================================== 11:09:20 (1569409760)
rm: cannot remove '/mnt/lustre/d0.tar-trevis-40vm9.trevis.whamcloud.com/etc': Directory not empty
 recovery-random-scale test_fail_client_mds: @@@@@@ FAIL: remove sub-test dirs failed
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:5829:error()
  = /usr/lib64/lustre/tests/test-framework.sh:5316:check_and_cleanup_lustre()
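
The "Directory not empty" failure is hard to act on from this log alone. As a purely illustrative sketch (not test-framework.sh code), the cleanup could retry once and list whatever survives, which would at least show whether a client load was still writing into the directory; the paths and retry delay below are assumptions:

# Sketch only: retry removal of shared test directories and report survivors.
for dir in /mnt/lustre/d0.*; do
        [ -e "$dir" ] || continue
        rm -rf "$dir" 2>/dev/null && continue
        sleep 5                          # give any lingering client load time to stop
        rm -rf "$dir" || { echo "leftover entries under $dir:"; find "$dir"; }
done

This is only meant to make the failure above easier to diagnose, not a proposed fix. |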
| Comment by Elena Gryaznova [ 19/Nov/21 ] |
|
one more:

+ /usr/bin/lfs mkdir -i1 -c1 /mnt/lustre/d0.tar-onyx-72vm14.onyx.whamcloud.com
lfs mkdir: dirstripe error on '/mnt/lustre/d0.tar-onyx-72vm14.onyx.whamcloud.com': stripe already set
lfs setdirstripe: cannot create dir '/mnt/lustre/d0.tar-onyx-72vm14.onyx.whamcloud.com': File exists
+ return 1 |