[LU-8287] sanity-sec test_16: mgs and c0 idmap mismatch, 10 attempts

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • None
    • None
    • 3

    Description

      This issue was created by maloo for Andreas Dilger <andreas.dilger@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/c7f684c6-2a8e-11e6-acf3-5254006e85c2.

      This has failed quite a number of times, but was overshadowed by LU-8279 also causing a large number of sanity-sec failures.

      The sub-test test_16 failed with the following error:

      mgs and c0 idmap mismatch, 10 attempts
      

      It looks like the OST rebooted for some reason, since the modules are not loaded and the dmesg log is empty:

      trevis-54vm8: opening /dev/lnet failed: No such device
      trevis-54vm8: hint: the kernel modules may not be loaded
      trevis-54vm8: IOC_LIBCFS_GET_NI error 19: No such device
      /usr/lib64/lustre/tests/sanity-sec.sh: line 981: [: ==: unary operator expected
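The last log line is the usual symptom of an unquoted, empty expansion inside a `[` test: if the remote query prints nothing because the modules are gone, the comparison collapses to `[ == value ]`. A minimal sketch of that effect, using hypothetical variable names (this is not the actual code at line 981 of sanity-sec.sh):

```shell
#!/bin/bash
# Hypothetical stand-ins; not the actual code at sanity-sec.sh line 981.
mgs_idmap="uid 60003 60000"
client_idmap=""   # empty, as when the remote command prints nothing on stdout

# Unquoted: expands to `[ == "uid 60003 60000" ]`, which bash rejects with
# "[: ==: unary operator expected" (exit status 2).
unquoted_status=0
[ $client_idmap == "$mgs_idmap" ] 2>/dev/null || unquoted_status=$?

# Quoted: the empty string is still one argument, so the test is a plain
# mismatch (exit status 1) rather than a syntax error.
quoted_status=0
[ "$client_idmap" == "$mgs_idmap" ] || quoted_status=$?

echo "unquoted=$unquoted_status quoted=$quoted_status"
```

Quoting would only turn the shell error into a clean mismatch; it would not address why the node stopped answering.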
      

      Info required for matching: sanity-sec 16

          Activity

            kit.westneat Kit Westneat (Inactive) added a comment -

            Hi Jian,

            Patch 23778 fixes an error handling bug, but the original error on ZFS still exists. The original nodemap on ZFS bug that caused this test failure is fixed in change 23849:
            http://review.whamcloud.com/#/c/23849/

            yujian Jian Yu added a comment -

            While testing patch http://review.whamcloud.com/23778 for LU-8824 on master branch, sanity-sec test 16 failed:

            nodemap.c0.idmap= [ { idtype: uid, client_id: 60003, fs_id: 60000 }, { idtype: uid, client_id: 60004, fs_id: 60002 }, { idtype: gid, client_id: 60003, fs_id: 60000 }, { idtype: gid, client_id: 60004, fs_id: 60002 } ]
            OTHER - IP: 10.2.4.167
            
             sanity-sec test_16: @@@@@@ FAIL: mgs and c0 idmap mismatch, 10 attempts 
            

            Is this a regression from that patch, or was the failure never resolved on the master branch?
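Judging by the failure message, the test gives up after a bounded number of comparisons between the idmap on the MGS and the one seen on another node. A minimal sketch of that retry pattern, assuming hypothetical helpers `get_mgs_idmap` and `get_node_idmap` (stubbed here so the snippet runs standalone; the real test would query `nodemap.c0.idmap` on each node):

```shell
#!/bin/bash
# Hypothetical stubs standing in for remote idmap queries.
get_mgs_idmap() { echo "uid 60003 60000"; }
get_node_idmap() {
    # Pretend the node only catches up on the third attempt.
    if [ "$attempt" -ge 3 ]; then
        echo "uid 60003 60000"
    fi
}

max_attempts=10
synced=no
for attempt in $(seq 1 "$max_attempts"); do
    mgs=$(get_mgs_idmap)
    node=$(get_node_idmap)
    # Quote both sides so an empty reply is a mismatch, not a shell error.
    if [ -n "$node" ] && [ "$node" == "$mgs" ]; then
        synced=yes
        break
    fi
    # A real test would sleep here before retrying.
done

if [ "$synced" == yes ]; then
    echo "idmap synced after $attempt attempts"
else
    echo "mgs and c0 idmap mismatch, $max_attempts attempts"
fi
```

With a node that never answers, the loop runs all 10 attempts and reports the mismatch.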

            utopiabound Nathaniel Clark added a comment - - edited

            For each ZFS run of this test, it looks as though the OST VM was rebooted during an earlier test in the run, because all of the dmesg logs come from a clean system that has not had any autotest tests run on it at all (no lctl mark statements).

            All ZFS in last 4 weeks (all failures):
            https://testing.hpdd.intel.com/test_sets/d0448e1a-93df-11e6-91aa-5254006e85c2 <-- LU-8824 (uncaught)
            https://testing.hpdd.intel.com/test_sets/969f1d60-9a60-11e6-a546-5254006e85c2 <-- LU-8824 (uncaught)
            https://testing.hpdd.intel.com/test_sets/b06c944a-9a63-11e6-a5e5-5254006e85c2 <-- LU-8824
            https://testing.hpdd.intel.com/test_sets/b3a81182-9f06-11e6-a747-5254006e85c2 <-- LU-8824 (uncaught)
            https://testing.hpdd.intel.com/test_sets/e2e0de84-a258-11e6-bf05-5254006e85c2 - no OST logs for test_9
            https://testing.hpdd.intel.com/test_sets/abdb13dc-a627-11e6-964e-5254006e85c2 <-- LU-8824
            https://testing.hpdd.intel.com/test_sets/6196c4ec-a63b-11e6-bf77-5254006e85c2 <-- LU-8824 (uncaught)

            LU-8824 is an LBUG in test 9 that doesn't cause a TIMEOUT; it is just marked as a test failure. (Is it possible that the same LBUG goes uncaught in the other cases?)
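The "clean dmesg" check described in this comment can be sketched as a count of marker lines in a saved log. The sample text below is made up for illustration, and the `DEBUG MARKER` string is an assumption about how `lctl mark` entries appear in the kernel log:

```shell
#!/bin/bash
# Count marker lines on stdin; grep -c exits non-zero on zero matches,
# so `|| true` keeps the pipeline from failing.
count_markers() { grep -c 'DEBUG MARKER' || true; }

# Made-up sample logs, not taken from any real test run.
clean_log='Linux version 3.10.0
Booting paravirtualized kernel'
used_log='Lustre: DEBUG MARKER: == sanity-sec test 9 ==
Lustre: DEBUG MARKER: == sanity-sec test 16 =='

clean_count=$(printf '%s\n' "$clean_log" | count_markers)
used_count=$(printf '%s\n' "$used_log" | count_markers)
echo "clean=$clean_count used=$used_count"
```

A count of zero on a node that should have run earlier subtests suggests it rebooted partway through the run.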

            standan Saurabh Tandan (Inactive) added a comment - - edited Reopening as this issue is still seen on master. https://testing.hpdd.intel.com/test_sets/abdb13dc-a627-11e6-964e-5254006e85c2

            kit.westneat Kit Westneat (Inactive) added a comment -

            The root cause in that case is the OSS kernel panicking; you can see that the Lustre modules are no longer loaded:
            onyx-35vm4: opening /dev/lnet failed: No such device
            onyx-35vm4: hint: the kernel modules may not be loaded
            onyx-35vm4: IOC_LIBCFS_GET_NI error 19: No such device

            Unfortunately, the console logs don't appear to have been saved. Is there a way to make sure these are saved?

            standan Saurabh Tandan (Inactive) added a comment -

            This issue is still seen on master, build #3468, for Full EL7.2 Server/EL7.2 Client - ZFS, causing the following tests to fail:
            https://testing.hpdd.intel.com/test_sets/e2e0de84-a258-11e6-bf05-5254006e85c2

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20954/
            Subject: LU-8287 nodemap: don't stop config lock when target stops
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 45ce1044cd7b94621e1161cd23c600f8e1c18317

            yujian Jian Yu added a comment - One more failure instance on master: https://testing.hpdd.intel.com/test_sets/7d5b308e-594f-11e6-b5b1-5254006e85c2
            yong.fan nasf (Inactive) added a comment - Another failure instance on master: https://testing.hpdd.intel.com/test_sets/aa77a6ac-53c2-11e6-a39e-5254006e85c2

            People

              kit.westneat Kit Westneat (Inactive)
              maloo Maloo