Lustre / LU-10251

MDS hangs in recovery and cannot be aborted; recovery timer is bogus

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • None
    • Affects Version: Lustre 2.11.0
    • Environment: soak performance cluster
    • 3

    Description

      The MDS was rebooted (single MDS, no DNE).
      On restart the MDS enters recovery and reports a bogus value for the recovery timer:

      soak-8 login: [ 1393.056450] Lustre: soaked-MDT0000: Denying connection for new client 7af6eae0-3527-5481-d01d-161d271e4510(at 192.168.1.142@o2ib), waiting for 29 known clients (6 recovered, 21 in progress, and 2 evicted) to recover in 71565:2
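
To make the "bogus" timer concrete, the sketch below pulls the client counts and the remaining-time field out of the console line above. It assumes the trailing `71565:2` is formatted as minutes:seconds; that reading is an assumption, not something the report states. The names `PATTERN` and `parse_recovery_line` are made up for this illustration.

```python
import re

# Console message copied from the report (kernel timestamp and login prompt dropped).
LINE = ("Lustre: soaked-MDT0000: Denying connection for new client "
        "7af6eae0-3527-5481-d01d-161d271e4510(at 192.168.1.142@o2ib), "
        "waiting for 29 known clients (6 recovered, 21 in progress, "
        "and 2 evicted) to recover in 71565:2")

PATTERN = re.compile(
    r"waiting for (?P<known>\d+) known clients "
    r"\((?P<recovered>\d+) recovered, (?P<in_progress>\d+) in progress, "
    r"and (?P<evicted>\d+) evicted\) to recover in (?P<min>\d+):(?P<sec>\d+)"
)


def parse_recovery_line(line):
    """Return the numeric fields of a 'Denying connection' message, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    fields = {k: int(v) for k, v in m.groupdict().items()}
    # Assumption: the last field is minutes:seconds until recovery times out.
    fields["remaining_s"] = fields.pop("min") * 60 + fields.pop("sec")
    return fields


info = parse_recovery_line(LINE)
print(info)
print("days remaining: %.1f" % (info["remaining_s"] / 86400))
# prints: days remaining: 49.7
```

Under that assumption the timer claims roughly 49.7 days remain, even though the console timestamp is only about 1393 seconds after boot. The client counts themselves are consistent: 6 + 21 + 2 = 29.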
      

      The MDS never exits recovery, and clients get -EBUSY.
      Attempting abort_recovery times out (lctl is reported as a hung task, trace below), and the system remains wedged.

      [ 1681.193209] INFO: task lctl:2555 blocked for more than 120 seconds.
      [ 1681.271617] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1681.368730] lctl            D ffff8803f1ecd400     0  2555   2526 0x00000084
      [ 1681.456456]  ffff880413a5bc10 0000000000000082 ffff8803f1826eb0 ffff880413a5bfd8
      [ 1681.548186]  ffff880413a5bfd8 ffff880413a5bfd8 ffff8803f1826eb0 ffff8808195014d0
      [ 1681.639847]  7fffffffffffffff ffff8808195014c8 ffff8803f1826eb0 ffff8803f1ecd400
      [ 1681.731520] Call Trace:
      [ 1681.763370]  [<ffffffff816a9589>] schedule+0x29/0x70
      [ 1681.826052]  [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
      [ 1681.899089]  [<ffffffff816a993d>] wait_for_completion+0xfd/0x140
      [ 1681.974192]  [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
      [ 1682.044159]  [<ffffffffc10f5a5d>] target_stop_recovery_thread.part.16+0x3d/0xd0 [ptlrpc]
      [ 1682.144235]  [<ffffffffc10f5b08>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
      [ 1682.235915]  [<ffffffffc15935d0>] mdt_iocontrol+0x550/0xaf0 [mdt]
      [ 1682.312024]  [<ffffffffc0ef3bd9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 1682.400553]  [<ffffffffc0edebb3>] class_handle_ioctl+0x1913/0x1da0 [obdclass]
      [ 1682.488997]  [<ffffffff812b1a98>] ? security_capable+0x18/0x20
      [ 1682.561806]  [<ffffffffc0ec4602>] obd_class_ioctl+0xd2/0x170 [obdclass]
      [ 1682.643909]  [<ffffffff812151bd>] do_vfs_ioctl+0x33d/0x540
      [ 1682.712431]  [<ffffffff816b0091>] ? __do_page_fault+0x171/0x450
      [ 1682.786103]  [<ffffffff81215461>] SyS_ioctl+0xa1/0xc0
      [ 1682.849308]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
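
The trace shows lctl sleeping in wait_for_completion, called from target_stop_recovery_thread. For triage across a longer console log, a small parser along the following lines could extract each hung-task report and its frames. This is a rough sketch: `hung_tasks` and the two regular expressions are hypothetical helpers written for this illustration, and the sample holds only a subset of the lines above.

```python
import re

# Subset of the console output from the report.
SAMPLE = """\
[ 1681.193209] INFO: task lctl:2555 blocked for more than 120 seconds.
[ 1681.731520] Call Trace:
[ 1681.763370]  [<ffffffff816a9589>] schedule+0x29/0x70
[ 1681.826052]  [<ffffffff816a7099>] schedule_timeout+0x239/0x2c0
[ 1681.899089]  [<ffffffff816a993d>] wait_for_completion+0xfd/0x140
[ 1681.974192]  [<ffffffff810c4820>] ? wake_up_state+0x20/0x20
[ 1682.044159]  [<ffffffffc10f5a5d>] target_stop_recovery_thread.part.16+0x3d/0xd0 [ptlrpc]
[ 1682.144235]  [<ffffffffc10f5b08>] target_stop_recovery_thread+0x18/0x20 [ptlrpc]
[ 1682.235915]  [<ffffffffc15935d0>] mdt_iocontrol+0x550/0xaf0 [mdt]
"""

HUNG = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
FRAME = re.compile(
    r"\[<[0-9a-f]+>\] (?P<unsure>\? )?(?P<sym>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<mod>\w+)\])?"
)


def hung_tasks(log_text):
    """Collect hung-task reports and the stack frames that follow each one."""
    tasks = []
    for line in log_text.splitlines():
        m = HUNG.search(line)
        if m:
            tasks.append({"comm": m["comm"], "pid": int(m["pid"]),
                          "secs": int(m["secs"]), "frames": []})
            continue
        f = FRAME.search(line)
        # Frames marked '?' are unreliable stack entries, so skip them.
        if f and tasks and not f["unsure"]:
            tasks[-1]["frames"].append((f["sym"], f["mod"]))
    return tasks


task = hung_tasks(SAMPLE)[0]
module_frames = [(sym, mod) for sym, mod in task["frames"] if mod]
print(task["comm"], task["pid"], task["secs"])
for sym, mod in module_frames:
    print("  %s [%s]" % (sym, mod))
```

Filtering on frames that carry a module name keeps only the ptlrpc and mdt entries from this sample, which is where the abort request is stuck.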
      

      The Lustre log and stack traces are attached; we are currently forcing a kernel dump.

            People

              Assignee: Lai Siyao (laisiyao)
              Reporter: Cliff White (cliffw, inactive)
              Votes: 0
              Watchers: 3
