Lustre / LU-11265

recovery-mds-scale test failover_ost fails with 'test_failover_ost returned 1' due to mkdir failure


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.5, Lustre 2.12.3

    Description

      recovery-mds-scale test_failover_ost fails, without failing over any OSTs, because mkdir fails. From the failover test session at https://testing.whamcloud.com/test_sets/46c80c12-a1f1-11e8-a5f2-52540065bddc, we see the following at the end of the test_log:

      2018-08-15 23:39:25 Terminating clients loads ...
      Duration:               86400
      Server failover period: 1200 seconds
      Exited after:           0 seconds
      Number of failovers before exit:
      mds1: 0 times
      ost1: 0 times
      ost2: 0 times
      ost3: 0 times
      ost4: 0 times
      ost5: 0 times
      ost6: 0 times
      ost7: 0 times
      Status: FAIL: rc=1
      

      From the suite log, we see that the client load failed during the first OSS failover:

      Started lustre-OST0006
      ==== Checking the clients loads AFTER failover -- failure NOT OK
      Client load failed on node trevis-3vm7, rc=1
      Client load failed during failover. Exiting...
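
      For reference, the failover test keeps a load script running on every client and aborts the whole run as soon as any load exits. A rough sketch of that pattern is below; it is not the actual recovery-mds-scale code, and failover_random_server / load_still_running are stand-in names for the real framework logic.

      # Rough sketch of the failover loop, NOT the actual recovery-mds-scale code.
      # DURATION and FAILOVER_PERIOD match the values reported in the test_log above;
      # the helper functions and client list are stand-ins.
      DURATION=86400
      FAILOVER_PERIOD=1200
      CLIENTS="trevis-3vm7"                 # illustrative; the real run has several clients

      failover_random_server() { :; }       # stand-in: fail over a random MDS/OSS
      load_still_running() { true; }        # stand-in: check the load status on client $1

      for ((elapsed = 0; elapsed < DURATION; elapsed += FAILOVER_PERIOD)); do
          failover_random_server
          for client in $CLIENTS; do
              if ! load_still_running "$client"; then
                  echo "Client load failed on node $client, rc=1"
                  echo "Client load failed during failover. Exiting..."
                  exit 1
              fi
          done
          sleep "$FAILOVER_PERIOD"
      done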
      

      On vm7, looking at the run_dd log, we can see that the load fails because the d0.dd-* pathname already exists but is apparently no longer a directory, so mkdir -p, cd, and dd all fail:

      2018-08-15 23:37:37: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      mkdir: cannot create directory ‘/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com’: File exists
      + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      + cd /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      /usr/lib64/lustre/tests/run_dd.sh: line 34: cd: /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com: Not a directory
      + sync
      ++ df -P /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      ++ awk '/:/ { print $4 }'
      + FREE_SPACE=13349248
      + BLKS=1501790
      + echoerr 'Total free disk space is 13349248, 4k blocks to dd is 1501790'
      + echo 'Total free disk space is 13349248, 4k blocks to dd is 1501790'
      Total free disk space is 13349248, 4k blocks to dd is 1501790
      + df /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      + dd bs=4k count=1501790 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com/dd-file
      dd: failed to open ‘/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com/dd-file’: Not a directory
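
      The "File exists" from mkdir -p together with the later "Not a directory" errors indicates that the pathname is present but is not a directory: mkdir -p tolerates an existing directory, but fails on any other kind of entry. A minimal reproduction of the same error pattern, using a hypothetical /tmp path instead of the Lustre mount, is:

      # Minimal reproduction sketch (hypothetical /tmp path, not from this run):
      # a leftover non-directory entry makes 'mkdir -p' fail, and every later
      # directory operation on the path fails with "Not a directory".
      touch /tmp/d0.dd-example
      mkdir -p /tmp/d0.dd-example    # mkdir: cannot create directory '/tmp/d0.dd-example': File exists
      cd /tmp/d0.dd-example          # cd: /tmp/d0.dd-example: Not a directory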
      

      Looking at the run_dd log from the previous test, failover_mds, we see that the d0.dd-trevis-* directory was in use and that the client load was signaled to terminate right after the call to remove that directory was issued:

      + echo '2018-08-15 23:37:07: dd succeeded'
      2018-08-15 23:37:07: dd succeeded
      + cd /tmp
      + rm -rf /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      ++ signaled
      +++ date '+%F %H:%M:%S'
      ++ echoerr '2018-08-15 23:37:41: client load was signaled to terminate'
      ++ echo '2018-08-15 23:37:41: client load was signaled to terminate'
      2018-08-15 23:37:41: client load was signaled to terminate
      +++ ps -eo '%c %p %r'
      +++ awk '/ 18302 / {print $3}'
      ++ local PGID=18241
      ++ kill -TERM -18241
      ++ sleep 5
      ++ kill -KILL -18241
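
      The kill -TERM to the process group is followed only five seconds later by kill -KILL, so any cleanup still running at that point is cut short. A hedged sketch of one way to narrow that window, by running the cleanup from a signal trap so a TERM delivered mid-run still removes the working directory, is below; this is not the actual run_dd.sh code, and the variable names are assumptions.

      #!/bin/bash
      # Hedged sketch only, NOT the actual run_dd.sh: run the cleanup from a trap
      # so a TERM delivered mid-run still removes the working directory before exit.
      # TESTDIR mirrors the naming seen in the log; the variable name is an assumption.
      TESTDIR=/mnt/lustre/d0.dd-$(hostname -f)

      cleanup() {
          cd /tmp
          rm -rf "$TESTDIR"
      }
      trap 'cleanup; exit 0' TERM INT
      trap cleanup EXIT

      mkdir -p "$TESTDIR" || exit 1
      cd "$TESTDIR" || exit 1
      while true; do
          # keep writing until the test framework signals the load to stop
          dd bs=4k count=1000 status=noxfer if=/dev/zero of=dd-file || exit 1
      done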
      

      Maybe the client load was killed before the directory could be removed?
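
      If that is what happened, the load scripts could also tolerate leftovers from an interrupted previous run by clearing any stale entry before re-creating their working directory. A sketch of that defensive change (an assumption, not a patch from this ticket):

      # Defensive sketch (assumption, not a patch from this ticket): if a stale,
      # non-directory entry was left behind by an interrupted run, remove it
      # before re-creating the per-client working directory.
      TESTDIR=/mnt/lustre/d0.dd-$(hostname -f)    # same assumed variable as above
      if [ -e "$TESTDIR" ] && [ ! -d "$TESTDIR" ]; then
          rm -f "$TESTDIR"
      fi
      mkdir -p "$TESTDIR" || { echo "mkdir $TESTDIR failed" >&2; exit 1; }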

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (Inactive)