Lustre / LU-6084

Tests fail due to 'recovery is aborted by hard timeout'


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.7.0
    • Fix Version/s: Lustre 2.7.0
    • Labels:
    • Environment:
      lustre-master build #2770
    • Severity:
      3
    • Rank (Obsolete):
      16931

      Description

      Many recovery tests have started to fail because recovery is unexpectedly aborted by the hard timeout.

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/722432b2-80fa-11e4-9c9a-5254006e85c2.

      The sub-test test_4k failed with the following error:

      onyx-35vm1.onyx.hpdd.intel.com evicted
      

      MDS dmesg

      Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0000-lwp-OST0000_UUID (at 10.2.4.141@tcp), waiting for all 6 known clients (0 recovered, 5 in progress, and 0 evicted) to recover in 0:25
      Lustre: Skipped 90 previous similar messages
      INFO: task tgt_recov:2119 blocked for more than 120 seconds.
            Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      tgt_recov     D 0000000000000000     0  2119      2 0x00000080
       ffff88006fb2fda0 0000000000000046 0000000000000000 ffff880002316880
       ffff88006fb2fd10 ffffffff81030b59 ffff88006fb2fd20 ffffffff810554f8
       ffff88006faad058 ffff88006fb2ffd8 000000000000fbc8 ffff88006faad058
      Call Trace:
       [<ffffffff81030b59>] ? native_smp_send_reschedule+0x49/0x60
       [<ffffffff810554f8>] ? resched_task+0x68/0x80
       [<ffffffff8109b2ce>] ? prepare_to_wait+0x4e/0x80
       [<ffffffffa080d9c0>] ? check_for_clients+0x0/0x70 [ptlrpc]
       [<ffffffffa080ef2d>] target_recovery_overseer+0xad/0x2d0 [ptlrpc]
       [<ffffffffa080d610>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
       [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0815850>] ? target_recovery_thread+0x0/0x1a20 [ptlrpc]
       [<ffffffffa0815f34>] target_recovery_thread+0x6e4/0x1a20 [ptlrpc]
       [<ffffffff81061d12>] ? default_wake_function+0x12/0x20
       [<ffffffffa0815850>] ? target_recovery_thread+0x0/0x1a20 [ptlrpc]
       [<ffffffff8109abf6>] kthread+0x96/0xa0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109ab60>] ? kthread+0x0/0xa0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
      Lustre: lustre-MDT0000: disconnecting 1 stale clients
      Lustre: 2119:0:(ldlm_lib.c:1767:target_recovery_overseer()) recovery is aborted by hard timeout
      Lustre: 2119:0:(ldlm_lib.c:1773:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      Lustre: 2119:0:(ldlm_lib.c:1415:abort_req_replay_queue()) @@@ aborted:  req@ffff880079bf6980 x1487142925659804/t0(38654705688) o36->c0baea22-119d-b8af-1550-c0592a66b0c4@10.2.4.138@tcp:277/0 lens 520/0 e 0 to 0 dl 1418252677 ref 1 fl Complete:/4/ffffffff rc 0/-1
      Lustre: lustre-MDT0000: Recovery over after 3:00, of 6 clients 0 recovered and 6 were evicted.
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-vbr test_4k: @@@@@@ FAIL: onyx-35vm1.onyx.hpdd.intel.com evicted 
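
      When triaging many failed runs for this symptom, it can help to pull the recovery summary out of the MDS dmesg automatically. The following is a minimal sketch, not part of Lustre or the test framework: the function names `parse_recovery_summary` and `hard_timeout_aborts` are hypothetical, and the regular expression is modeled only on the "Recovery over after ..." line quoted above (the singular "was evicted" variant is an assumption).

```python
import re

# Modeled on the summary line in the dmesg excerpt above, e.g.
# "Lustre: lustre-MDT0000: Recovery over after 3:00, of 6 clients
#  0 recovered and 6 were evicted."
# Accepting "was" as well as "were" is an assumption, not taken from the log.
SUMMARY_RE = re.compile(
    r"Lustre: (?P<target>\S+): Recovery over after (?P<min>\d+):(?P<sec>\d+), "
    r"of (?P<total>\d+) clients (?P<recovered>\d+) recovered "
    r"and (?P<evicted>\d+) (?:was|were) evicted"
)


def parse_recovery_summary(line):
    """Return the fields of a recovery summary line, or None if no match."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    return {
        "target": m.group("target"),
        "duration_s": int(m.group("min")) * 60 + int(m.group("sec")),
        "total": int(m.group("total")),
        "recovered": int(m.group("recovered")),
        "evicted": int(m.group("evicted")),
    }


def hard_timeout_aborts(lines):
    """Return the log lines that report an abort by the hard timeout."""
    return [ln for ln in lines if "recovery is aborted by hard timeout" in ln]
```

      Applied to the excerpt above, the summary line yields target `lustre-MDT0000`, a duration of 180 seconds, and 6 of 6 clients evicted, and `hard_timeout_aborts` returns the `target_recovery_overseer()` line from `ldlm_lib.c:1767`.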
      

              People

              Assignee:
              tappro Mikhail Pershin
              Reporter:
              maloo Maloo
