Lustre / LU-10479

recovery-mds-scale test failover_mds fails with 'test_failover_mds returned 4'


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.1, Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.6, Lustre 2.10.7
    • Labels: None
    • Severity: 3

    Description

      recovery-mds-scale is failing in test_failover_mds. From the test_log on the client, we can see that the clients exited immediately and no failovers took place:

      Client load failed on node trevis-10vm4, rc=1
      2018-01-08 21:34:46 Terminating clients loads ...
      Duration:               86400
      Server failover period: 1200 seconds
      Exited after:           0 seconds
      Number of failovers before exit:
      mds1: 0 times
      ost1: 0 times
      ost2: 0 times
      ost3: 0 times
      ost4: 0 times
      ost5: 0 times
      ost6: 0 times
      ost7: 0 times
      Status: FAIL: rc=4
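
      The zero counters follow from how the test is structured. The sketch below is a paraphrase of the recovery-mds-scale loop, not the verbatim test-framework code: the per-server failover counters are only bumped after each sleep/failover cycle, so a client load that dies right away short-circuits the loop with rc=4 before any server has been failed over:

      ELAPSED=0
      while [ $ELAPSED -lt $DURATION ]; do          # Duration: 86400 above
          # If any client load (e.g. run_tar.sh) has died, give up immediately.
          if ! check_client_loads $NODES_TO_USE; then
              rc=4
              break
          fi
          sleep $SERVER_FAILOVER_PERIOD             # failover period: 1200 seconds above
          fail $SERVER                              # fail over one MDS/OST and bump its counter
          ELAPSED=$((ELAPSED + SERVER_FAILOVER_PERIOD))
      done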
      

The client load jobs are terminated because tar fails with insufficient (possibly no) free space. Strangely, the free-space value in the run_tar_debug log is blank:

      ++ du -s /etc
      ++ awk '{print $1}'
      + USAGE=30784
      + /usr/sbin/lctl set_param 'llite.*.lazystatfs=0'
      + df /mnt/lustre/d0.tar-trevis-10vm4.trevis.hpdd.intel.com
      + sleep 2
      ++ df /mnt/lustre/d0.tar-trevis-10vm4.trevis.hpdd.intel.com
      ++ awk '/:/ { print $4 }'
      + FREE_SPACE=
      + AVAIL=0
      + '[' 0 -lt 30784 ']'
      + echoerr 'no enough free disk space: need 30784, avail 0'
      + echo 'no enough free disk space: need 30784, avail 0'
      no enough free disk space: need 30784, avail 0
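
      The awk pattern above only prints field 4 of df lines containing ':' (the Lustre device, e.g. mgs@tcp:/fsname). If df wraps its output because the device name is long, the ':' line has no fourth field and FREE_SPACE comes back empty; an empty string then evaluates to 0 in shell arithmetic, which matches the "avail 0" seen here. A more defensive version of the check (an assumption on my part, not the upstream script's actual fix) would use POSIX output so each filesystem stays on one line:

      # df -P guarantees one output line per filesystem, so field 4 is always
      # the Available column; 'END' skips the header without grepping for ':'.
      # TESTDIR stands in for the load directory, e.g.
      # /mnt/lustre/d0.tar-trevis-10vm4.trevis.hpdd.intel.com (hypothetical name).
      FREE_SPACE=$(df -P $TESTDIR | awk 'END { print $4 }')
      AVAIL=$((${FREE_SPACE:-0} * 9 / 10))   # illustrative 10% safety margin
      if [ $AVAIL -lt $USAGE ]; then
          echoerr "not enough free disk space: need $USAGE, avail $AVAIL"
          exit 1
      fi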
      

      There is nothing obviously wrong in the console and dmesg logs.

So far, this failure has only been seen on b2_10, but across several builds. Here are the b2_10 build numbers and links to logs for some of the failures:
build #17 CentOS 6.9 clients / CentOS 6.9 ldiskfs servers - https://testing.hpdd.intel.com/test_sets/72399370-881b-11e7-b3ca-5254006e85c2
      build #19 CentOS 6.9 clients / CentOS 6.9 ZFS servers - https://testing.hpdd.intel.com/test_sets/c4fabe7e-937c-11e7-b722-5254006e85c2
      build #30 CentOS 6.9 clients / CentOS 7 ldiskfs servers - https://testing.hpdd.intel.com/test_sets/abaa60d2-a862-11e7-bb19-5254006e85c2
      build #45 CentOS 6.9 clients / CentOS 7 ldiskfs servers - https://testing.hpdd.intel.com/test_sets/29beec4e-caa1-11e7-9840-52540065bddc
      build #52 CentOS 6.9 clients / CentOS 7 ldiskfs servers - https://testing.hpdd.intel.com/test_sets/76d51ac8-df3f-11e7-8027-52540065bddc
      build #68 CentOS 6.9 clients / CentOS 7 ldiskfs servers - https://testing.hpdd.intel.com/test_sets/e5214052-f52d-11e7-8c23-52540065bddc


            People

Assignee: WC Triage (wc-triage)
              Reporter: James Nunez (jamesanunez) (Inactive)