Lustre / LU-11265

recovery-mds-scale test failover_ost fails with 'test_failover_ost returned 1' due to mkdir failure


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.10.5, Lustre 2.12.3

    Description

      recovery-mds-scale test_failover_ost fails, without failing over any OSTs, because mkdir fails. From the failover test session at https://testing.whamcloud.com/test_sets/46c80c12-a1f1-11e8-a5f2-52540065bddc, we see the following at the end of the test_log:

      2018-08-15 23:39:25 Terminating clients loads ...
      Duration:               86400
      Server failover period: 1200 seconds
      Exited after:           0 seconds
      Number of failovers before exit:
      mds1: 0 times
      ost1: 0 times
      ost2: 0 times
      ost3: 0 times
      ost4: 0 times
      ost5: 0 times
      ost6: 0 times
      ost7: 0 times
      Status: FAIL: rc=1
      

      From the suite log, we see that the client load failed during the first OSS failover:

      Started lustre-OST0006
      ==== Checking the clients loads AFTER failover -- failure NOT OK
      Client load failed on node trevis-3vm7, rc=1
      Client load failed during failover. Exiting...
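
      For reference, the failover test keeps a load script running on every client and aborts the whole run as soon as any load exits. A rough sketch of that pattern is below; it is not the actual recovery-mds-scale code, and failover_random_server / load_still_running are stand-in names for the real framework logic.

      # Rough sketch of the failover loop, NOT the actual recovery-mds-scale code.
      # DURATION and FAILOVER_PERIOD match the values reported in the test_log above;
      # the helper functions and client list are stand-ins.
      DURATION=86400
      FAILOVER_PERIOD=1200
      CLIENTS="trevis-3vm7"                 # illustrative; the real run has several clients

      failover_random_server() { :; }       # stand-in: fail over a random MDS/OSS
      load_still_running() { true; }        # stand-in: check the load status on client $1

      for ((elapsed = 0; elapsed < DURATION; elapsed += FAILOVER_PERIOD)); do
          failover_random_server
          for client in $CLIENTS; do
              if ! load_still_running "$client"; then
                  echo "Client load failed on node $client, rc=1"
                  echo "Client load failed during failover. Exiting..."
                  exit 1
              fi
          done
          sleep "$FAILOVER_PERIOD"
      done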
      

      On vm7, looking at the run_dd log, we can see that the load fails because the d0.dd-* pathname already exists but is apparently no longer a directory, so mkdir -p, cd, and dd all fail:

      2018-08-15 23:37:37: dd run starting
      + mkdir -p /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      mkdir: cannot create directory ‘/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com’: File exists
      + /usr/bin/lfs setstripe -c -1 /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      + cd /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      /usr/lib64/lustre/tests/run_dd.sh: line 34: cd: /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com: Not a directory
      + sync
      ++ df -P /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      ++ awk '/:/ { print $4 }'
      + FREE_SPACE=13349248
      + BLKS=1501790
      + echoerr 'Total free disk space is 13349248, 4k blocks to dd is 1501790'
      + echo 'Total free disk space is 13349248, 4k blocks to dd is 1501790'
      Total free disk space is 13349248, 4k blocks to dd is 1501790
      + df /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      + dd bs=4k count=1501790 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com/dd-file
      dd: failed to open ‘/mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com/dd-file’: Not a directory
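
      The "File exists" from mkdir -p together with the later "Not a directory" errors indicates that the pathname is present but is not a directory: mkdir -p tolerates an existing directory, but fails on any other kind of entry. A minimal reproduction of the same error pattern, using a hypothetical /tmp path instead of the Lustre mount, is:

      # Minimal reproduction sketch (hypothetical /tmp path, not from this run):
      # a leftover non-directory entry makes 'mkdir -p' fail, and every later
      # directory operation on the path fails with "Not a directory".
      touch /tmp/d0.dd-example
      mkdir -p /tmp/d0.dd-example    # mkdir: cannot create directory '/tmp/d0.dd-example': File exists
      cd /tmp/d0.dd-example          # cd: /tmp/d0.dd-example: Not a directory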
      

      Looking at the run_dd log from the previous test, failover_mds, we see that the d0.dd-trevis-* directory was in use and that the client load was signaled to terminate right after the call to remove that directory was issued:

      + echo '2018-08-15 23:37:07: dd succeeded'
      2018-08-15 23:37:07: dd succeeded
      + cd /tmp
      + rm -rf /mnt/lustre/d0.dd-trevis-3vm7.trevis.whamcloud.com
      ++ signaled
      +++ date '+%F %H:%M:%S'
      ++ echoerr '2018-08-15 23:37:41: client load was signaled to terminate'
      ++ echo '2018-08-15 23:37:41: client load was signaled to terminate'
      2018-08-15 23:37:41: client load was signaled to terminate
      +++ ps -eo '%c %p %r'
      +++ awk '/ 18302 / {print $3}'
      ++ local PGID=18241
      ++ kill -TERM -18241
      ++ sleep 5
      ++ kill -KILL -18241
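
      The kill -TERM to the process group is followed only five seconds later by kill -KILL, so any cleanup still running at that point is cut short. A hedged sketch of one way to narrow that window, by running the cleanup from a signal trap so a TERM delivered mid-run still removes the working directory, is below; this is not the actual run_dd.sh code, and the variable names are assumptions.

      #!/bin/bash
      # Hedged sketch only, NOT the actual run_dd.sh: run the cleanup from a trap
      # so a TERM delivered mid-run still removes the working directory before exit.
      # TESTDIR mirrors the naming seen in the log; the variable name is an assumption.
      TESTDIR=/mnt/lustre/d0.dd-$(hostname -f)

      cleanup() {
          cd /tmp
          rm -rf "$TESTDIR"
      }
      trap 'cleanup; exit 0' TERM INT
      trap cleanup EXIT

      mkdir -p "$TESTDIR" || exit 1
      cd "$TESTDIR" || exit 1
      while true; do
          # keep writing until the test framework signals the load to stop
          dd bs=4k count=1000 status=noxfer if=/dev/zero of=dd-file || exit 1
      done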
      

      Maybe the client load was killed before the directory could be removed?
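
      If that is what happened, the load scripts could also tolerate leftovers from an interrupted previous run by clearing any stale entry before re-creating their working directory. A sketch of that defensive change (an assumption, not a patch from this ticket):

      # Defensive sketch (assumption, not a patch from this ticket): if a stale,
      # non-directory entry was left behind by an interrupted run, remove it
      # before re-creating the per-client working directory.
      TESTDIR=/mnt/lustre/d0.dd-$(hostname -f)    # same assumed variable as above
      if [ -e "$TESTDIR" ] && [ ! -d "$TESTDIR" ]; then
          rm -f "$TESTDIR"
      fi
      mkdir -p "$TESTDIR" || { echo "mkdir $TESTDIR failed" >&2; exit 1; }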

            People

              Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (Inactive)