Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • None
    • Lustre 2.6.0
    • 3
    • 12103

    Description

      This issue was created by maloo for wangdi <di.wang@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/e08778b8-6a67-11e3-9248-52540035b04c.

      The sub-test test_10 failed with the following error:

      Lustre: DEBUG MARKER: /usr/sbin/lctl mark == insanity test 10: Tenth Failure Mode: MDT0\/OST\/MDT1 Fri Dec 20 15:09:13 PST 2013 == 15:09:13 (1387580953)
      Lustre: DEBUG MARKER: == insanity test 10: Tenth Failure Mode: MDT0/OST/MDT1 Fri Dec 20 15:09:13 PST 2013 == 15:09:13 (1387580953)
      Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
      Lustre: DEBUG MARKER: umount -d /mnt/mds1
      Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      Lustre: DEBUG MARKER: hostname
      Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
      Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
      Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl /dev/lvm-Role_MDS/P1 /mnt/mds1
      LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts:
      LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.16.180@tcp, operation ost_connect failed with -16.
      LustreError: Skipped 4 previous similar messages
      Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
      LNet: 20318:0:(debug.c:218:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
      LNet: 20318:0:(debug.c:218:libcfs_debug_str2mask()) Skipped 3 previous similar messages
      Lustre: DEBUG MARKER: e2label /dev/lvm-Role_MDS/P1 2>/dev/null
      Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 5 clients reconnect
      Lustre: 7793:0:(client.c:1903:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1387581110/real 1387581110] req@ffff88005dd0ec00 x1454975111842816/t0(0) o38->lustre-MDT0001-osp-MDT0000@10.10.16.183@tcp:24/4 lens 400/544 e 0 to 1 dl 1387581115 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      Lustre: 7793:0:(client.c:1903:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
      Lustre: lustre-MDT0000: Recovery over after 0:21, of 5 clients 5 recovered and 0 were evicted.
      LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.16.180@tcp, operation ost_connect failed with -16.
      LustreError: Skipped 122 previous similar messages
      LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.16.180@tcp, operation ost_connect failed with -16.
      LustreError: Skipped 120 previous similar messages
      LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.16.180@tcp, operation ost_connect failed with -16.
      LustreError: Skipped 120 previous similar messages
      LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.16.180@tcp, operation ost_connect failed with -16.
      LustreError: Skipped 120 previous similar messages
      LustreError: 11-0: lustre-OST0000-osc-MDT0000: Communicating with 10.10.16.180@tcp, operation ost_connect failed with -16.
      LustreError: Skipped 119 previous similar messages
      SysRq : Show State

      Info required for matching: insanity 10

          Activity

            [LU-4409] insanity test_10 (MDT0/OST/MDT1)

            jamesanunez James Nunez (Inactive) added a comment -

            TEI-3188 requests insanity test 10 be re-enabled in autotest.

            adilger Andreas Dilger added a comment -

            Closer investigation shows that insanity test_10 is still being skipped for all recent lustre-review test runs, with either:

            skipping ALWAYS excluded test 10
            needs >= 2 MDTs

            Since it is currently passing on full test runs, it seems safe enough to file a TEI ticket to request that test_10 be removed from the autotest exception list.

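
The two skip messages quoted above correspond to two separate checks. A minimal sketch of that logic, assuming the exclusion list is a space-separated string; `skip_reason` and its arguments are hypothetical names for illustration, and the real autotest exception list is not shown in this ticket:

```shell
#!/bin/bash
# Sketch only: skip_reason is a hypothetical helper, not framework code.
# Arguments: test number, space-separated exclusion list, number of MDTs.
skip_reason() {
    local testnum=$1 excluded_tests=$2 mdt_count=$3
    local t

    # Case 1: the test number is on the exclusion list.
    for t in $excluded_tests; do
        if [ "$t" = "$testnum" ]; then
            echo "skipping ALWAYS excluded test $testnum"
            return 0
        fi
    done

    # Case 2: test_10 needs at least two MDTs.
    if [ "$mdt_count" -lt 2 ]; then
        echo "needs >= 2 MDTs"
        return 0
    fi

    return 1   # not skipped
}

skip_reason 10 "10" 2                            # excluded by list
skip_reason 10 "" 1                              # too few MDTs
skip_reason 10 "" 2 || echo "test 10 would run"  # neither check applies
```

Removing test 10 from the exclusion list only clears the first check; a single-MDT configuration would still skip the test.
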
            di.wang Di Wang added a comment -

            Test_10 has been re-enabled by http://review.whamcloud.com/10311. So I guess this ticket has been fixed by LU-2059.

            adilger Andreas Dilger added a comment -

            I'm happy if you can re-enable the test in autotest.

            jamesanunez James Nunez (Inactive) added a comment -

            Is this ticket still an open problem? A patch for LU-2059, http://review.whamcloud.com/#/c/10311, re-enabled insanity test 10 for all cases except when there are fewer than two MDTs. Yet insanity test 10 is still skipped because it is disabled in autotest; see TEI-1312.

            Insanity test 10 is passing in recent full test sessions:
            https://testing.hpdd.intel.com/test_sessions/d7711864-b904-11e4-a983-5254006e85c2
            https://testing.hpdd.intel.com/test_sessions/63504c66-b8d7-11e4-a983-5254006e85c2
            https://testing.hpdd.intel.com/test_sessions/dba0429c-b590-11e4-9366-5254006e85c2

            adilger Andreas Dilger added a comment -

            This subtest is currently disabled at the autotest level, so a regular test run will skip it. We need to explicitly request testing on the patch - hopefully Test-Parameters works.

            di.wang Di Wang added a comment -

            I just updated the patch 8650 to fix the problem. http://review.whamcloud.com/#/c/8650/

            di.wang Di Wang added a comment -

            Hmm, I checked the debug log. It seems that in this test the OST was started before the MGS (MDT0), so the OST used its local config log to set up the MDT and then missed the recovery process. I do not know why this only started happening recently. I think we have two options here: either always start MDT0 (MGS) during the test, or use "-o abort_recov" to start the MDT/OST in this test.
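
For illustration, the second option amounts to adding abort_recov to the mount options seen in the log at the top of this ticket. A minimal sketch, assuming the same device and mount point as that log; `build_mount_cmd` is a hypothetical helper and only prints the command rather than running it:

```shell
#!/bin/bash
# Sketch only: build_mount_cmd is a hypothetical helper, not part of the
# Lustre test framework. It prints the mount command instead of running it.
build_mount_cmd() {
    local dev=$1 mnt=$2 abort_recovery=${3:-no}
    local opts="user_xattr,acl"   # options copied from the mount line in the log

    # abort_recov asks the target to skip recovery when it starts.
    if [ "$abort_recovery" = "yes" ]; then
        opts="$opts,abort_recov"
    fi
    echo "mount -t lustre -o $opts $dev $mnt"
}

# Device and mount point are the ones that appear in the log.
build_mount_cmd /dev/lvm-Role_MDS/P1 /mnt/mds1 yes
```

Whether the test should pass this option for every target or only for the OST is a decision the comment above leaves open.
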

            adilger Andreas Dilger added a comment -

            The first failure of this test was on 2013-12-07 in:
            https://maloo.whamcloud.com/test_sets/6eadd65c-5f5d-11e3-85c5-52540035b04c
            which is a test run for a patch that hasn't landed yet. There is a later test run for that patch which did not hit this problem, and the patch has not yet landed, so it cannot be the cause of the failures on master.

            The test failures on master only started on 2013-12-18 and have been hit on several different patches (none of them landed), so it is likely that a regression landed on master. While it failed the 3 most recent tests, there are quite a number of passing runs for insanity test_10() over the past few weeks.

            di.wang Di Wang added a comment -

            Disabling the test for now in http://review.whamcloud.com/8650; will fix this after feature freeze.

            People

              wc-triage WC Triage
              maloo Maloo