LU-5720: lustre client hangs on possible imp_lock deadlock

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.2
    • Environment: Sequoia, 2.6.32-431.23.3.1chaos, github.com/chaos/lustre
    • Severity: 3
    • 16049

    Description

      The node becomes unresponsive to users and the lustre client appears to be hung after being evicted by the MDT. The node remains responsive to SysRq. After crashing the node, it boots and mounts lustre successfully.

      The symptoms develop as follows:

      First the node starts reporting connection lost/connection restored
      notices for an OST (same one repeatedly). Then the node reports it has
      been evicted by the MDT. There are then a series of failure messages
      that appear to be the normal consequence of the eviction.

      We then start seeing "spinning too long" messages from
      ptlrpc_check_set() within the ptlrpcd_rcv task, and the kernel starts
      reporting soft lockups on tasks ptlrpcd* and ll_ping. The node becomes
      unresponsive to everything other than SysRq. The operators then crash
      the node, and it comes up and mounts lustre successfully.
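
      To illustrate the suspected failure mode (my assumption; this is not Lustre source), here is a
      minimal user-space sketch of a non-recursive spinlock being re-acquired by the thread that
      already holds it. In the kernel, the analogous mistake on a lock such as imp->imp_lock would
      leave the CPU spinning and show up as exactly this kind of soft lockup. All names below are
      hypothetical.

      /* self_deadlock_demo.c - hypothetical illustration, not Lustre code. */
      #include <pthread.h>
      #include <stdio.h>

      static pthread_spinlock_t lock;    /* stands in for a lock like imp->imp_lock */

      int main(void)
      {
          pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

          pthread_spin_lock(&lock);       /* first acquisition succeeds */

          /* A blocking spin_lock() here would spin forever; trylock lets the
           * demo report the self-deadlock instead of hanging. */
          if (pthread_spin_trylock(&lock) != 0)
              printf("lock already held by this thread: a blocking acquire would spin forever\n");
          else
              pthread_spin_unlock(&lock);

          pthread_spin_unlock(&lock);
          return 0;
      }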

      Attachments

        Activity

          pjones Peter Jones added a comment -

          ok. Thanks Olaf.

          ofaaland Olaf Faaland added a comment - - edited

          After rebuilding the kernel with CONFIG_DEBUG_SPINLOCK enabled, we stopped seeing the problem. You can close it. We won't have any additional information to help track it down.

          ofaaland Olaf Faaland added a comment -

          I will check and get back to you.


          niu Niu Yawei (Inactive) added a comment -

          Any further report on this issue? Can we close it?
          green Oleg Drokin added a comment -

          Ok.
          Last time the problem did go away magically, I believe.
          Please keep us informed about occurrences (or lack of them) with the spinlock-debug-enabled kernel.

          ofaaland Olaf Faaland added a comment - - edited

          This is a current problem. We are hitting it intermittently. We have 6 lac nodes on sequoia, most recent incident was last Tuesday.

          At the end of today we plan to install the kernel that has spinlock debug turned on, so we will have more information next time it happens (if it doesn't magically go away with the kernel change).

          green Oleg Drokin added a comment -

          Hm, I see. This is not really promising then.

          Is this still something that hits you regularly now, or did it, as in the past, hit a few times and then stop?


          ofaaland Olaf Faaland added a comment -

          No, 110 tasks were reported as last running on cpu 10.

          There were 10 tasks that last ran on cpu 10 with stacks like this (comm=ldlm_bl_*):
          schedule
          cfs_waitq_wait
          ldlm_bl_thread_main

          One task had a stack like this (comm=ldlm_cb02_007):
          cfs_waitq_wait
          ptlrpc_wait_event
          ptlrpc_main

          The other 98 tasks that had last run on cpu 10 had no lustre functions in their stacks.

          green Oleg Drokin added a comment -

          Just to confirm, is this the only task that is reported as last run on cpu 10?

          ofaaland Olaf Faaland added a comment -

          Oleg, we found one task that ps indicates last ran on CPU #10, and was sleeping at the time the node was crashed. Task 15059, comm=srun. Its backtrace is:

          schedule
          __mutex_lock_slowpath
          mutex_lock
          mdc_enqueue
          lmv_enqueue
          ll_layout_refresh
          vvp_io_init
          cl_io_init0
          ll_file_io_generic
          ll_file_aio_write
          do_sync_readv_writev
          do_readv_writev
          sys_writev

          I don't see that it would be holding an import lock, but I may have missed something.

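          To make the check Olaf describes concrete, here is a minimal user-space sketch (hypothetical
          names, not Lustre code) of the hazard being ruled out: a thread that blocks on a mutex while
          still holding a spinlock leaves every other thread spinning on that lock, which in the kernel
          is reported as a soft lockup on the spinning CPUs.

          /* hold_and_sleep_demo.c - hypothetical illustration, not Lustre code. */
          #include <pthread.h>
          #include <stdio.h>
          #include <unistd.h>

          static pthread_spinlock_t demo_imp_lock;   /* stands in for a lock like imp->imp_lock */
          static pthread_mutex_t slow_mutex = PTHREAD_MUTEX_INITIALIZER;

          static void *blocked_holder(void *arg)
          {
              (void)arg;
              pthread_spin_lock(&demo_imp_lock);   /* takes the "import lock"... */
              pthread_mutex_lock(&slow_mutex);     /* ...then sleeps on a contended mutex */
              pthread_mutex_unlock(&slow_mutex);
              pthread_spin_unlock(&demo_imp_lock);
              return NULL;
          }

          static void *spinner(void *arg)
          {
              unsigned long loops = 0;

              (void)arg;
              /* Bounded trylock loop so the demo terminates; a real spin_lock()
               * would simply spin here until a watchdog complained. */
              while (pthread_spin_trylock(&demo_imp_lock) != 0) {
                  if (++loops == 100000000UL) {
                      printf("spinner: lockup suspected, lock never released\n");
                      return NULL;
                  }
              }
              pthread_spin_unlock(&demo_imp_lock);
              printf("spinner: got the lock\n");
              return NULL;
          }

          int main(void)
          {
              pthread_t holder, waiter;

              pthread_spin_init(&demo_imp_lock, PTHREAD_PROCESS_PRIVATE);
              pthread_mutex_lock(&slow_mutex);      /* main holds the mutex so the holder blocks */

              pthread_create(&holder, NULL, blocked_holder, NULL);
              sleep(1);                             /* let the holder grab the spinlock and block */
              pthread_create(&waiter, NULL, spinner, NULL);
              pthread_join(waiter, NULL);           /* spinner gives up and reports the lockup */

              pthread_mutex_unlock(&slow_mutex);    /* release so the holder can finish cleanly */
              pthread_join(holder, NULL);
              return 0;
          }

          The backtrace above does not show that pattern, which is consistent with Olaf's reading that
          the srun task is not holding the import lock while it sleeps.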

          People

            Assignee: niu Niu Yawei (Inactive)
            Reporter: ofaaland Olaf Faaland
            Votes: 0
            Watchers: 7
