Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Environment: Soak test cluster, tip of lustre-master lustre: 2.8.59_79_gb8811a0
    • Severity: 3
    • 9223372036854775807

    Description

After a previous failover, MDT0003 is failed back to lola-11:
      Nov 8 22:10:14 lola-10 kernel: Lustre: server umount soaked-MDT0003 complete

An MDS failover fault is induced, and lola-11 fails MDT0003 over to lola-10:

2016-11-08 22:24:25,959:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0003 ...
      2016-11-08 22:24:25,959:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0003 on lola-10 ...
      

Recovery starts on the MDS:

      Nov 8 22:25:29 lola-10 kernel: Lustre: soaked-MDT0003: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
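
      For reference: Imperative Recovery registers the restarted target with the MGS so clients are notified of the failover immediately, rather than waiting out their import timeouts, which is why the server can run with a smaller recovery window. A minimal sketch of the relationship, assuming the factor-of-two shrink shown in the log line above; this is illustrative C, not the Lustre implementation:

      /* Illustrative only -- not the Lustre implementation.  With
       * Imperative Recovery (IR) clients learn of the restart at once,
       * so the recovery window can shrink; the log above shows 300-900s
       * halved to 150-450s, which this sketch assumes is a factor of two. */
      struct recovery_window { int min_secs; int max_secs; };

      static struct recovery_window shrink_for_ir(struct recovery_window w)
      {
          w.min_secs /= 2;    /* 300 -> 150 in the log above */
          w.max_secs /= 2;    /* 900 -> 450 */
          return w;
      }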

Tasks block on lola-10:

Nov  8 22:29:17 lola-10 kernel: INFO: task mdt_out01_007:6305 blocked for more than 120 seconds.
      Nov  8 22:29:17 lola-10 kernel:      Tainted: P           -- ------------    2.6.32-573.26.1.el6_lustre.x86_64 #1
      Nov  8 22:29:17 lola-10 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Nov  8 22:29:17 lola-10 kernel: mdt_out01_007 D 0000000000000018     0  6305      2 0x00000080
      Nov  8 22:29:17 lola-10 kernel: ffff8808104b7b40 0000000000000046 0000000000000000 ffff8808104b7d00
      Nov  8 22:29:17 lola-10 kernel: ffff8804337ea800 ffff880824300118 000014206fe1f7c1 ffff880824300118
      Nov  8 22:29:17 lola-10 kernel: ffff8808104b7b40 00000001014d3a26 ffff880834add068 ffff8808104b7fd8
      Nov  8 22:29:17 lola-10 kernel: Call Trace:
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa07bca3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09dd442>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09b2af0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa07bd7cc>] dt_locate_at+0x1c/0xa0 [obdclass]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa0a24147>] out_handle+0x1067/0x18d0 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa067ac8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa0a142ca>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa0a1b4bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09c7bc1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
      Nov  8 22:29:17 lola-10 kernel: [<ffffffffa09c6e90>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov  8 22:29:17 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      Nov  8 22:29:18 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
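
      The "blocked for more than 120 seconds" report above comes from the kernel's hung-task watchdog. Roughly (a simplified sketch, not the actual kernel source): khungtaskd periodically scans tasks, and one sitting in uninterruptible ("D") sleep whose context-switch counters have not moved since the previous scan has been blocked for at least hung_task_timeout_secs (120s by default) and is reported:

      /* Simplified sketch of the kernel hung-task heuristic -- not the
       * actual kernel source. */
      struct task_sample {
          unsigned long switch_count;    /* voluntary + involuntary switches */
          int           uninterruptible; /* in "D" state at scan time */
      };

      static int looks_hung(struct task_sample prev, struct task_sample cur)
      {
          /* Never rescheduled between scans while stuck in D state. */
          return cur.uninterruptible && cur.switch_count == prev.switch_count;
      }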
      

A second task is blocked with an identical trace:

      Nov  8 22:29:32 lola-10 kernel: Call Trace:
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa07bca3d>] lu_object_find_at+0x3d/0xe0 [obdclass]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09dd442>] ? __req_capsule_get+0x162/0x6e0 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff81067650>] ? default_wake_function+0x0/0x20
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09b2af0>] ? lustre_swab_object_update_reply+0x0/0xc0 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa07bd7cc>] dt_locate_at+0x1c/0xa0 [obdclass]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa0a24147>] out_handle+0x1067/0x18d0 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8105e9b6>] ? enqueue_task+0x66/0x80
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8105ab8d>] ? check_preempt_curr+0x6d/0x90
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa067ac8a>] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8153afce>] ? mutex_lock+0x1e/0x50
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa0a142ca>] ? req_can_reconstruct+0x6a/0x120 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa0a1b4bc>] tgt_request_handle+0x8ec/0x1440 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09c7bc1>] ptlrpc_main+0xd31/0x1800 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff81539b0e>] ? thread_return+0x4e/0x7d0
      Nov  8 22:29:32 lola-10 kernel: [<ffffffffa09c6e90>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff810a138e>] kthread+0x9e/0xc0
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      Nov  8 22:29:32 lola-10 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
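
      Both backtraces show the same thing: OUT handler threads stuck under lu_object_find_at() while locating an object. As an illustrative sketch of the suspected pattern (not the actual obdclass code): a lookup that finds a cached object already being destroyed waits for it to drop out of the cache before retrying, so a destroy path that never unhashes the object leaves the waiter blocked forever:

      #include <pthread.h>
      #include <stddef.h>

      /* Illustrative sketch of the lookup-side wait, NOT the real
       * obdclass code; all names here are hypothetical. */
      struct cached_obj {
          int dying;   /* destruction in progress */
          int hashed;  /* still visible in the object cache */
      };

      static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  cache_cv   = PTHREAD_COND_INITIALIZER;

      /* Returns NULL once the dying object is gone so the caller can
       * allocate a fresh one; blocks indefinitely if nothing ever clears
       * obj->hashed -- the hang captured in the traces above. */
      static struct cached_obj *wait_if_dying(struct cached_obj *obj)
      {
          pthread_mutex_lock(&cache_lock);
          while (obj->hashed && obj->dying)
              pthread_cond_wait(&cache_cv, &cache_lock);
          pthread_mutex_unlock(&cache_lock);
          return obj->hashed ? obj : NULL;
      }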
      

The recovery timer completes, but recovery does not:

      mdt.soaked-MDT0003.recovery_status=
      status: RECOVERING
      recovery_start: 1478672734
      time_remaining: 0
      connected_clients: 21/21
      req_replay_clients: 4
lock_replay_clients: 5
      completed_clients: 16
      evicted_clients: 0
      replayed_requests: 7
      queued_requests: 4
      next_transno: 77309522798
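
      The status decodes the stall: all 21 clients reconnected, but 4 are still in request replay and recovery cannot advance past next_transno, since replay requests are dispatched in strict transaction-number order. A simplified sketch of that gating (not the actual target recovery code):

      #include <stddef.h>

      /* Simplified sketch of transno-ordered replay gating -- not the
       * actual Lustre target recovery code.  If the request carrying
       * next_transno is missing or its handler is blocked (as in the
       * stack traces above), everything queued behind it stalls. */
      struct replay_req {
          unsigned long long transno;
          struct replay_req *next;   /* queue kept sorted by transno */
      };

      /* Pop the head only if it carries exactly the transno recovery is
       * waiting for; otherwise keep waiting. */
      static struct replay_req *next_replayable(struct replay_req **queue,
                                                unsigned long long next_transno)
      {
          struct replay_req *req = *queue;

          if (req == NULL || req->transno != next_transno)
              return NULL;           /* gap or blocked: recovery stalls */
          *queue = req->next;
          return req;
      }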
      

The system finally notices, nearly ten hours later:

      Nov  9 08:22:35 lola-10 kernel: Lustre: soaked-MDT0003: Recovery already passed deadline 588:42, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      

lustre-log dumps from lola-10 are attached.

          Activity

            [LU-8815] MDS fails to complete recovery

            yong.fan nasf (Inactive) added a comment -

            Another instance of LU-9049.
            pjones Peter Jones added a comment -

OK, I will remove the fix version for the time being and we can continue to monitor.


            cliffw Cliff White (Inactive) added a comment -

            We have not seen a recovery failure since that time. However, we've had other hard bugs that reduced runtime in some cases.

            adilger Andreas Dilger added a comment -

            It may be that this is fixed with patch https://review.whamcloud.com/26965 "LU-9049 obdclass: unhash object when destroying object".
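
            For illustration only (the real change is in the review above): the destroy-side counterpart to the lookup sketch in the description, reusing that sketch's hypothetical cached_obj, cache_lock, and cache_cv. Unhashing at destroy time and waking waiters means wait_if_dying() can no longer block on an object that will never be freed through the cache:

            /* Destroy-side counterpart to the lookup sketch above --
             * illustrative only, not the actual patch. */
            static void destroy_obj(struct cached_obj *obj)
            {
                pthread_mutex_lock(&cache_lock);
                obj->dying  = 1;
                obj->hashed = 0;                   /* the crucial unhash step */
                pthread_cond_broadcast(&cache_cv); /* wake blocked lookups */
                pthread_mutex_unlock(&cache_lock);
            }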
            yong.fan nasf (Inactive) added a comment -

            New build: https://build.hpdd.intel.com/job/lustre-reviews/46754/
            cliffw Cliff White (Inactive) added a comment - - edited

We have also moved to RHEL 7 with the new soak cluster. Can you re-trigger your build? Jenkins no longer has the bits; I get RPMs from Jenkins.


            yong.fan nasf (Inactive) added a comment -

            https://review.whamcloud.com/#/c/25260/
            set 4 build status:
            Overall Build Status: SUCCESS
            Builder CentOS 6.7 x86_64 (BUILD) succeeded (build successful) - http://build.lustre.org/builders/CentOS%206.7%20x86_64%20%28BUILD%29/builds/7691
            Builder CentOS 6.8 x86_64 (BUILD) succeeded (build successful) - http://build.lustre.org/builders/CentOS%206.8%20x86_64%20%28BUILD%29/builds/3827
            Builder CentOS 7.2 x86_64 (BUILD) succeeded (build successful) - http://build.lustre.org/builders/CentOS%207.2%20x86_64%20%28BUILD%29/builds/7677
            Builder Ubuntu 14.04 x86_64 (BUILD) succeeded (build successful) - http://build.lustre.org/builders/Ubuntu%2014.04%20x86_64%20%28BUILD%29/builds/7637


            cliffw Cliff White (Inactive) added a comment -

            We have been busy with the move, but have not seen any recovery issues. I should be able to try the patch this week; is there a build?

            yong.fan nasf (Inactive) added a comment -

            Any update with the patch applied? Thanks!

            cliffw Cliff White (Inactive) added a comment -

            We are moving soak hardware this week; will add that to the list.

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 9

              Dates

                Created:
                Updated:
                Resolved: