
replay-dual test_21b: Restart of mds0 failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.7.0
    • Labels: None
    • Environment: client and server: lustre-master build #2690
    • Severity: 3
    • Rank: 16158

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/aff6c30c-5427-11e4-abcf-5254006e85c2.

      The sub-test test_21b failed with the following error:

      Restart of mds0 failed!
      
      08:51:44:Lustre: DEBUG MARKER: == replay-dual test 21b: commit on sharing, two clients == 08:51:21 (1413301881)
      08:51:44:LustreError: 21059:0:(qsd_reint.c:54:qsd_reint_completion()) lustre-MDT0001: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
      08:51:44:LustreError: 21059:0:(qsd_reint.c:54:qsd_reint_completion()) Skipped 3 previous similar messages
      08:51:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-dual test_21b: @@@@@@ FAIL: Restart of mds0 failed! 
      08:51:44:Lustre: DEBUG MARKER: replay-dual test_21b: @@@@@@ FAIL: Restart of mds0 failed!
      

      Info required for matching: replay-dual 21b

          Activity

            [LU-5759] replay-dual test_21b: Restart of mds0 failed

            jlevi Jodi Levi (Inactive) added a comment -

            Patch landed to Master.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12363/
            Subject: LU-5759 tests: use lfs getstripe -M instead of get_mds_num
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 9804f872d28315a7ddfe835138bb33e2206ffe52

            yujian Jian Yu added a comment -

            We need to add replay-dual back into the patch review test group. However, this failure is preventing replay-dual from passing on the master branch, so I am raising the priority of this ticket to blocker.

            tappro Mikhail Pershin added a comment -

            http://review.whamcloud.com/12363 - this patch was derived from Andreas' fix, but it eliminates get_mds_num() entirely and uses "lfs getstripe -M" in the tests instead.

            adilger Andreas Dilger added a comment -

            See also http://review.whamcloud.com/12149 to clean up this code a bit more.

            tappro Mikhail Pershin added a comment - - edited

            The failure is caused by a wrong MDS index in the test:

            Starting mds0:    /mnt/mds0
            CMD: onyx-45vm3 mkdir -p /mnt/mds0; mount -t lustre   		                    /mnt/mds0
            onyx-45vm3: Usage: mount -V                 : print version
            onyx-45vm3:        mount -h                 : print this help
            onyx-45vm3:        mount                    : list mounted filesystems
            onyx-45vm3:        mount -l                 : idem, including volume labels
            onyx-45vm3: So far the informational part. Next the mounting.
            

            Further investigation shows that the get_mds_dir() function in test-framework.sh was corrupted by commit 745c19c70319. Unfortunately, regular testing didn't reveal that regression. It affects replay-dual.sh test_21b and several tests in the test_27 group of sanity.sh.
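
            The log above shows the restart being attempted for "mds0" with an empty device argument. A minimal sketch of the index-to-facet mapping follows, assuming that "lfs getstripe -M" prints a 0-based MDT index and that MDS facets are numbered from 1 (mds1, mds2, ...). The helper name mdt_index_to_facet is hypothetical and is not taken from test-framework.sh or from the landed patch.

```shell
#!/bin/bash
# Hypothetical helper (not part of test-framework.sh): convert a 0-based
# MDT index into a 1-based facet name, rejecting anything that is not a
# plain non-negative integer so a bad index fails early instead of
# producing a facet with no device behind it.
mdt_index_to_facet() {
    local idx=$1
    case "$idx" in
        ''|*[!0-9]*|0?*)
            echo "invalid MDT index: '$idx'" >&2
            return 1
            ;;
    esac
    echo "mds$((idx + 1))"
}

# Intended use on a Lustre client (the file path is a placeholder):
#   idx=$(lfs getstripe -M "$file") && facet=$(mdt_index_to_facet "$idx")
```

            With this mapping, index 0 corresponds to facet mds1, so a facet named mds0 is never produced.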


            adilger Andreas Dilger added a comment -

            It looks like this started failing 100% on 2014-10-06, but wasn't noticed because replay-dual doesn't run as part of the per-patch review tests. According to the maloo test results the last passing test was commit 0b4b33592c09 "LU-5613 lustre: unused variable in tgt_brw_read()" and the first failing test was commit 6039fc8fd47, but that commit and the previous three don't appear to have anything to do with the failing test. The most likely source of this problem is commit e2677595ab7ff "LU-5003 llog: do not fix remote llogs". It would make sense to start by reverting this patch with a commit comment using Test-Parameters: testlist=replay-dual,replay-dual,replay-dual (or testing this locally) to verify if this fixes the problem.
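
            The commit comment suggested above could be assembled roughly as follows. This is only a sketch: the function name revert_commit_message and the "Revert" subject layout are assumptions for illustration, and only the Test-Parameters line itself comes from the comment.

```shell
#!/bin/bash
# Hypothetical helper: print a commit message for a revert that asks the
# test system to run one suite several times via a Test-Parameters line.
#   $1 - subject of the commit being reverted
#   $2 - test suite name
#   $3 - how many times the suite should appear in the testlist
revert_commit_message() {
    local subject=$1 suite=$2 repeat=$3
    local testlist=$suite i=1
    # Append the suite name until it appears $repeat times.
    while [ "$i" -lt "$repeat" ]; do
        testlist="$testlist,$suite"
        i=$((i + 1))
    done
    printf 'Revert "%s"\n\nTest-Parameters: testlist=%s\n' \
        "$subject" "$testlist"
}

revert_commit_message "LU-5003 llog: do not fix remote llogs" replay-dual 3
```

            Listing the suite three times runs it three times in the same test session, which helps separate a consistent failure from an intermittent one.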

            jlevi Jodi Levi (Inactive) added a comment -

            Mike,
            Could you please have a look at this one and comment?
            Thank you!

            People

              Assignee: tappro Mikhail Pershin
              Reporter: maloo Maloo