Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Affects Version/s: Lustre 2.10.5, Lustre 2.12.3
Severity: 3
Description
recovery-mds-scale test_failover_ost fails without failing over any OSTs because mkdir fails on a client. In the failover test session at https://testing.whamcloud.com/test_sets/46c80c12-a1f1-11e8-a5f2-52540065bddc, the end of the test_log shows:
2018-08-15 23:39:25 Terminating clients loads ...
Duration: 86400
Server failover period: 1200 seconds
Exited after: 0 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 0 times
ost2: 0 times
ost3: 0 times
ost4: 0 times
ost5: 0 times
ost6: 0 times
ost7: 0 times
Status: FAIL: rc=1
From the suite log, we see that the client job failed during the first OSS failover
Started lustre-OST0006
==== Checking the clients loads AFTER failover -- failure NOT OK
Client load failed on node trevis-3vm7, rc=1
Client load failed during failover. Exiting...
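On vm7, the run_dd log shows that dd fails because the test directory path already exists and is not a directory: mkdir reports "File exists", and the subsequent cd and dd both report "Not a directory".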
2018-08-15 23:37:37: dd run starting
+ mkdir -p /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
mkdir: cannot create directory ‘/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com’: File exists
+ /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
+ cd /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
/usr/lib64/lustre/tests/run_dd.sh: line 34: cd: /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com: Not a directory
+ sync
++ df -P /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
++ awk '/:/ { print $4 }'
+ FREE_SPACE=13349248
+ BLKS=1501790
+ echoerr 'Total free disk space is 13349248, 4k blocks to dd is 1501790'
+ echo 'Total free disk space is 13349248, 4k blocks to dd is 1501790'
Total free disk space is 13349248, 4k blocks to dd is 1501790
+ df /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
+ dd bs=4k count=1501790 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com/dd-file
dd: failed to open ‘/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com/dd-file’: Not a directory
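The trace above shows the script continuing after mkdir fails, so the real error only surfaces later at dd. A minimal sketch of a guard that could make this failure mode explicit is given here. The function name `prepare_testdir` is hypothetical and is not part of run_dd.sh; whether silently removing a stale entry is acceptable for this test is also an assumption.

```shell
#!/bin/bash
# Hypothetical helper, not taken from run_dd.sh.
# Ensures "$1" is a usable directory before the load starts.
prepare_testdir() {
    local dir=$1

    # A leftover non-directory entry makes "mkdir -p" fail with
    # "File exists" and every later cd/dd fail with "Not a directory".
    if [ -e "$dir" ] && [ ! -d "$dir" ]; then
        echo "stale non-directory entry at $dir, removing it" >&2
        rm -f "$dir" || return 1
    fi

    # Stop here if the directory still cannot be created, instead of
    # carrying on to dd with a broken path.
    mkdir -p "$dir" || return 1
}
```

Calling something like `prepare_testdir "$TESTDIR" || exit 1` (with `TESTDIR` standing in for whatever variable the script uses) would turn the late dd error into an early, clearly attributed one.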
In the run_dd log from the previous test, failover_mds, we see that d0.dd-trevis-* is created and that the client load is signaled to terminate right after the rm -rf on it is issued:
+ echo '2018-08-15 23:37:07: dd succeeded'
2018-08-15 23:37:07: dd succeeded
+ cd /tmp
+ rm -rf /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
++ signaled
+++ date '+%F %H:%M:%S'
++ echoerr '2018-08-15 23:37:41: client load was signaled to terminate'
++ echo '2018-08-15 23:37:41: client load was signaled to terminate'
2018-08-15 23:37:41: client load was signaled to terminate
+++ ps -eo '%c %p %r'
+++ awk '/ 18302 / {print $3}'
++ local PGID=18241
++ kill -TERM -18241
++ sleep 5
++ kill -KILL -18241
Maybe the job was killed before the file could be removed?
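One way to check this hypothesis would be to have the load script verify the removal and log what is left behind. The sketch below assumes GNU coreutils `stat` on the client; `remove_testdir` is a hypothetical name, not an existing function in the test framework.

```shell
#!/bin/bash
# Hypothetical helper, not taken from run_dd.sh.
# Removes "$1" and reports whether anything survived the removal.
remove_testdir() {
    local dir=$1

    rm -rf "$dir"

    # -L catches a dangling symlink, which -e alone would miss.
    if [ -e "$dir" ] || [ -L "$dir" ]; then
        # "stat -c %F" (GNU coreutils) prints the file type, which
        # tells a leftover directory apart from a regular file.
        echo "$(date '+%F %H:%M:%S'): $dir still present after rm:" \
             "$(stat -c %F "$dir" 2>/dev/null)" >&2
        return 1
    fi
    return 0
}
```

If the load is killed while rm is still running, this check never gets to run, so the absence of any message in the log between the rm and the signal would itself be consistent with the guess above.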