Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8815

MDS fails to complete recovery

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.9.0
    • Soak test cluster, tip of lustre-master lustre: 2.8.59_79_gb8811a0
    • 3
    • 9223372036854775807

    Description

      After previous failover, MDT0003 is failed back to lola-11:
      Nov 8 22:10:14 lola-10 kernel: Lustre: server umount soaked-MDT0003 complete

      MDS failover fault induced, lola-11 fails MDT0003 to lola-10:

      016-11-08 22:24:25,959:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0003 ...
      2016-11-08 22:24:25,959:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on lola-10 ...
      

      Recovery starts on MDS: Nov 8 22:25:29 lola-10 kernel: Lustre: soaked-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450

      Task block on lola-10:

      ov  8 22:29:17 lola-10 kernel: INFO: task mdt_out01_007:6305 blocked for more than 120 seconds.
      Nov  8 22:29:17 lola-10 kernel:      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
      Nov  8 22:29:17 lola-10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Nov  8 22:29:17 lola-10 kernel: mdt_out01_007 D 0000000000000018     0  6305      2 0x00000080
      Nov  8 22:29:17 lola-10 kernel: ffff8808104b7b40 0000000000000046 0000000000000000 ffff8808104b7d00
      Nov  8 22:29:17 lola-10 kernel: ffff8804337ea800 ffff880824300118 000014206fe1f7c1 ffff880824300118
      Nov  8 22:29:17 lola-10 kernel: ffff8808104b7b40 00000001014d3a26 ffff880834add068 ffff8808104b7fd8
      Nov  8 22:29:17 lola-10 kernel: Call Trace:
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa07bca3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09dd442>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09b2af0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa07bd7cc>] dt_locate_at+0x1c/0xa0 [obdclass]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa0a24147>] out_handle+0x1067/0x18d0 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa067ac8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa0a142ca>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa0a1b4bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09c7bc1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09c6e90>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      Nov  8 22:29:18 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Second blocked task:

      Nov  8 22:29:32 lola-10 kernel: Call Trace:
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa07bca3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09dd442>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09b2af0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa07bd7cc>] dt_locate_at+0x1c/0xa0 [obdclass]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa0a24147>] out_handle+0x1067/0x18d0 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa067ac8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa0a142ca>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa0a1b4bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09c7bc1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09c6e90>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      

      Recovery timer competes, but recovery does not:

      mdt.soaked-MDT0003.recovery_status=
      status: RECOVERING
      recovery_start: 1478672734
      time_remaining: 0
      connected_clients: 21/21
      req_replay_clients: 4
      lock_repay_clients: 5
      completed_clients: 16
      evicted_clients: 0
      replayed_requests: 7
      queued_requests: 4
      next_transno: 77309522798
      

      System finally notices:

      Nov  9 08:22:35 lola-10 kernel: Lustre: soaked-MDT0003: Recovery already passed deadline 588:42, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      

      Attached lustre-log dumps from lola-10

      Attachments

        Issue Links

          Activity

            People

              yong.fan nasf (Inactive)
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: