replay-single test_0c: post-failover df failed

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.2, Lustre 2.12.5, Lustre 2.12.6
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/3f06735e-47a3-11e8-960d-52540065bddc

      test_0c failed with the following error:

      post-failover df failed
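
      For context, replay-single test_0c sets a replay barrier on the MDS, fails the MDS over, and then checks that clients still answer a df/stat; the "post-failover df failed" message fires when that post-failover client check does not come back. A rough sketch of the flow, paraphrased rather than quoted from replay-single.sh (the helper names are the usual lustre/tests/test-framework.sh ones):

      # sketch only - paraphrased, not the actual test script
      test_0c_sketch() {
          replay_barrier $SINGLEMDS      # stop committing on the MDS so requests must be replayed
          facet_failover $SINGLEMDS      # restart the MDS facet (simulated crash/failover)
          clients_up || error "post-failover df failed"   # roughly: df/stat against each client mount
      }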
      

      MDS console

      [54749.278470] Lustre: DEBUG MARKER: dmesg
      [54750.063053] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 0c: check replay-barrier ======================================================= 12:47:53 \(1524487673\)
      [54750.285298] Lustre: DEBUG MARKER: == replay-single test 0c: check replay-barrier ======================================================= 12:47:53 (1524487673)
      [54771.807392] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-3vm4.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [54772.067740] Lustre: DEBUG MARKER: trevis-3vm4.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
      [54774.441472] Lustre: Evicted from MGS (at 10.9.4.19@tcp) after server handle changed from 0x3be7d72ea5628d3a to 0x3be7d72ea5629b9c
      [54774.447132] LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      [54774.451853] LustreError: Skipped 1 previous similar message
      [54811.662136] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11
      [54811.665037] LustreError: Skipped 1 previous similar message
      [54871.629282] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11
      [54931.629799] LustreError: 11-0: lustre-MDT0000-lwp-MDT0001: operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11
      [54953.483279] LustreError: 11-0: lustre-MDT0000-osp-MDT0001: operation ldlm_enqueue to node 10.9.4.19@tcp failed: rc = -107
      [54954.441836] LustreError: 167-0: lustre-MDT0000-osp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      [54954.447019] LustreError: Skipped 1 previous similar message
      [54955.493750] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_0c: @@@@@@ FAIL: post-failover df failed 
      [54955.689555] Lustre: DEBUG MARKER: replay-single test_0c: @@@@@@ FAIL: post-failover df failed
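
      For reference, the rc values in these console messages are negative errno codes: -11 is -EAGAIN (the quota slave on MDT0001 keeps retrying quota_acquire while MDT0000 is still recovering) and -107 is -ENOTCONN (the OSP request went out on a connection that had already been evicted). A quick way to confirm the mapping on any of the test nodes, assuming the kernel headers are installed:

      grep -w 11  /usr/include/asm-generic/errno-base.h   # EAGAIN   "Try again"
      grep -w 107 /usr/include/asm-generic/errno.h        # ENOTCONN "Transport endpoint is not connected"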
      
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      replay-single test_0c - post-failover df failed

          Activity

            [LU-10950] replay-single test_0c: post-failover df failed

            It looks like we are still seeing this issue on the b2_12 branch (2.12.5 RC1) at https://testing.whamcloud.com/test_sets/b1f9a784-03da-4f5d-8586-8f22b0ec1803. In the MDS 2, 4 console, we don't see the "operation quota_acquire to node 10.9.4.19@tcp failed: rc = -11" messages, but we do see the ldlm_enqueue error:

            [68319.207306] Lustre: DEBUG MARKER: == replay-single test 0c: check replay-barrier ======================================================= 16:40:20 (1590856820)
            [68341.027124] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 16:40:42 \(1590856842\)
            [68341.238283] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 16:40:42 (1590856842)
            [68341.608159] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-57vm4.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
            [68341.811858] Lustre: DEBUG MARKER: trevis-57vm4.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
            [68342.615134] LustreError: 167-0: lustre-MDT0000-lwp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            [68342.617641] LustreError: Skipped 1 previous similar message
            [68342.618663] Lustre: Evicted from MGS (at 10.9.1.247@tcp) after server handle changed from 0x6113ba905ec7fb33 to 0x6113ba905ec808b5
            [68521.699520] LustreError: 11-0: lustre-MDT0000-osp-MDT0003: operation ldlm_enqueue to node 10.9.1.247@tcp failed: rc = -107
            [68522.897958] LustreError: 167-0: lustre-MDT0000-osp-MDT0001: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
            [68522.900333] LustreError: Skipped 1 previous similar message
            [68527.244961] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_0c: @@@@@@ FAIL: post-failover df failed 
            [68527.479305] Lustre: DEBUG MARKER: replay-single test_0c: @@@@@@ FAIL: post-failover df failed
            

            On the MDS1, 3 console, we see

            [68493.204660] Lustre: lustre-MDT0000: Denying connection for new client 00c98053-78c2-d29f-2efa-cfe5ab2d84b0 (at 10.9.1.244@tcp), waiting for 5 known clients (0 recovered, 4 in progress, and 0 evicted) to recover in 0:28
            [68493.208019] Lustre: Skipped 12 previous similar messages
            [68521.668216] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
            [68521.669631] Lustre: lustre-MDT0000: disconnecting 1 stale clients
            [68521.670781] Lustre: 824:0:(ldlm_lib.c:1782:extend_recovery_timer()) lustre-MDT0000: extended recovery timer reached hard limit: 180, extend: 1
            [68521.673037] Lustre: 824:0:(ldlm_lib.c:2063:target_recovery_overseer()) lustre-MDT0000 recovery is aborted by hard timeout
            [68521.674849] Lustre: 824:0:(ldlm_lib.c:2073:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            [68521.676769] LustreError: 824:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 0 != fo_tot_granted 2097152
            [68521.678875] Lustre: 824:0:(ldlm_lib.c:1616:abort_req_replay_queue()) @@@ aborted:  req@ffff8a16576ce880 x1668123554855936/t0(8589934655) o36->c051fb07-a55c-c0a1-229e-7286a90a1044@10.9.1.245@tcp:309/0 lens 536/0 e 2 to 0 dl 1590857034 ref 1 fl Complete:/4/ffffffff rc 0/-1
            [68521.683316] LustreError: 824:0:(ldlm_lib.c:1637:abort_lock_replay_queue()) @@@ aborted:  req@ffff8a164e504050 x1668118635732736/t0(0) o101->lustre-MDT0003-mdtlov_UUID@10.9.1.253@tcp:312/0 lens 328/0 e 6 to 0 dl 1590857037 ref 1 fl Complete:/40/ffffffff rc 0/-1
            [68521.687295] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation ldlm_enqueue to node 0@lo failed: rc = -107
            

            Is this the same issue as described in this ticket?

            jamesanunez James Nunez (Inactive) added a comment
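
            The MDS1, 3 console above shows recovery on lustre-MDT0000 hitting its 180 second hard limit and being aborted, which is what evicts the other MDTs and makes the post-failover df fail. A sketch of how one might watch that window from the MDS while the test runs; recovery_status is the standard per-target parameter, while the recovery_time_soft/hard names are my assumption for the tunables behind the "hard limit: 180" message and may differ between Lustre versions:

            # on the MDS hosting MDT0000, during recovery:
            lctl get_param mdt.lustre-MDT0000.recovery_status        # status, clients connected/recovered/evicted, time remaining
            # tunables bounding the recovery window (names assumed; check your version):
            lctl get_param -n mdt.lustre-MDT0000.recovery_time_soft 2>/dev/null
            lctl get_param -n mdt.lustre-MDT0000.recovery_time_hard 2>/dev/null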
            simmonsja James A Simmons added a comment - Can you try  https://review.whamcloud.com/#/c/35627.
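
            For anyone reproducing, a sketch of how that Gerrit change can be pulled into a lustre-release working tree for testing, assuming the fs/lustre-release project path on review.whamcloud.com; the /1 patch set suffix is only a placeholder:

            # Gerrit refs are refs/changes/<last two digits of change>/<change number>/<patch set>
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/27/35627/1 && \
                git cherry-pick FETCH_HEAD    # patch set 1 assumed; pick the one you want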


            jamesanunez James Nunez (Inactive) added a comment - It looks like we are still seeing this issue even after patch https://review.whamcloud.com/36274/ landed to master. Please see https://testing.whamcloud.com/test_sets/42920818-fe12-11e9-8e77-52540065bddc for a recent failure.


            simmonsja James A Simmons added a comment - This should be fixed by  https://review.whamcloud.com/#/c/36274.  If not we can reopen.


            simmonsja James A Simmons added a comment - Potential fix for LU-12769 should resolve this
            yujian Jian Yu added a comment - +1 on master branch: https://testing.whamcloud.com/test_sets/a167544c-dff0-11e8-a251-52540065bddc

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 4
