Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.8.0
    • Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0
    • 3
    • 16220

    Description

      In our testing, a Lustre client hangs forever after OST failover. The client has 64 RAM and was running a parallel file test app over a set of 1G files. Right before the complete hang, the ptlrpcd_rcv thread was consuming 100% CPU.

      A crashdump was taken. It shows the following stack trace for ptlrpcd_rcv:

      crash> foreach ptlrpcd_rcv bt
      PID: 17113  TASK: ffff8806323f8ae0  CPU: 22  COMMAND: "ptlrpcd_rcv"
       #0 [ffff880655547e90] crash_nmi_callback at ffffffff8102fee6
       #1 [ffff880655547ea0] notifier_call_chain at ffffffff8152a965
       #2 [ffff880655547ee0] atomic_notifier_call_chain at ffffffff8152a9ca
       #3 [ffff880655547ef0] notify_die at ffffffff810a12be
       #4 [ffff880655547f20] do_nmi at ffffffff8152862b
       #5 [ffff880655547f50] nmi at ffffffff81527ef0
          [exception RIP: cl_page_put+426]
          RIP: ffffffffa05f428a  RSP: ffff8805d85d9990  RFLAGS: 00000246
          RAX: 0000000000000000  RBX: ffff88047c921800  RCX: ffff8805d85d9ab8
          RDX: 0000000000000000  RSI: ffff88047c921800  RDI: ffff880bc515ee60
          RBP: ffff8805d85d99d0   R8: 0000000000000040   R9: ffff8805d85d99a0
          R10: 0000000000009500  R11: ffff880303aac8f0  R12: ffff880bc515ee60
          R13: 000000000000004a  R14: ffff880bc51d2e90  R15: ffff880bc515ee60
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #6 [ffff8805d85d9990] cl_page_put at ffffffffa05f428a [obdclass]
       #7 [ffff8805d85d99d8] osc_page_gang_lookup at ffffffffa0ce3fac [osc]
       #8 [ffff8805d85d9a78] osc_ldlm_weigh_ast at ffffffffa0cdb1b7 [osc]
       #9 [ffff8805d85d9af8] osc_cancel_weight at ffffffffa0cc1a3b [osc]
      #10 [ffff8805d85d9b08] ldlm_cancel_no_wait_policy at ffffffffa07ecec1 [ptlrpc]
      #11 [ffff8805d85d9b28] ldlm_prepare_lru_list at ffffffffa07f0f0b [ptlrpc]
      #12 [ffff8805d85d9ba8] ldlm_cancel_lru_local at ffffffffa07f1324 [ptlrpc]
      #13 [ffff8805d85d9bc8] ldlm_replay_locks at ffffffffa07f14ac [ptlrpc]
      #14 [ffff8805d85d9c48] ptlrpc_import_recovery_state_machine at ffffffffa083cb67 [ptlrpc]
      #15 [ffff8805d85d9ca8] ptlrpc_replay_interpret at ffffffffa0811a84 [ptlrpc]
      #16 [ffff8805d85d9cd8] ptlrpc_check_set at ffffffffa08130ec [ptlrpc]
      #17 [ffff8805d85d9d78] ptlrpcd_check at ffffffffa08404db [ptlrpc]
      #18 [ffff8805d85d9dd8] ptlrpcd at ffffffffa0840b2b [ptlrpc]
      #19 [ffff8805d85d9ee8] kthread at ffffffff8109ac66
      #20 [ffff8805d85d9f48] kernel_thread at ffffffff8100c20a
      crash> 
      

      The hang is caused by an endless loop in osc_lock_weight(): when osc_page_gang_lookup() returns CLP_GANG_RESCHED, the page scan is restarted from the very beginning. On the next round, CLP_GANG_RESCHED is likely to be returned again if the number of scanned pages is large enough.

      This is confirmed by setting ldlm.cancel_unused_locks_before_replay=0 on the client before OST failover, which cures the hang.

      Moreover, I have a patch that makes osc_lock_weight() restart from the last processed page index + 1, so an endless loop is impossible. The patch helps too.

          Activity

            [LU-5781] endless loop in osc_lock_weight()

            gerrit Gerrit Updater added a comment -

            Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15800
            Subject: LU-5781 ldlm: Solve a race for LRU lock cancel
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: fa78e0ef004d4569fb5e5bc08ecbe3580b42f1e1

            jaylan Jay Lan (Inactive) added a comment -

            Is there a b2_5 port of http://review.whamcloud.com/12603 ?
            I have a conflict in lustre/ldlm/ldlm_request.c

            pjones Peter Jones added a comment -

            Thanks Vitaly. I have closed this ticket and any further work can be tracked under a new ticket

            vitaly_fertman Vitaly Fertman added a comment -

            Those 2 patches turned out not to be enough, since even new RA pages could still be under a lock sitting in the LRU. Another patch would be needed in addition to these two: an rb tree for read extents. Such a patch would be over 1k LOC, and its benefits over the current fix are not clear. Thus, the ticket could be closed. Thx!

            spitzcor Cory Spitz added a comment -

            Peter, we think so, but we should ask Vitaly to be sure he agrees.

            pjones Peter Jones added a comment -

            Cory

            There are still a couple of patches tracked under this ticket: http://review.whamcloud.com/#/c/12819/ and http://review.whamcloud.com/#/c/12820/1. Can these two patches be abandoned?

            Peter

            spitzcor Cory Spitz added a comment -

            With the landing of http://review.whamcloud.com/12603, it seems that this ticket should be resolved.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12603/
            Subject: LU-5781 ldlm: Solve a race for LRU lock cancel
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8ed3105b261fcc0816b064d6308356f645c9e12b

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12362/
            Subject: LU-5781 osc: osc_lock_weight endless loop fix
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 74dfa6c3f6111750c773e2484b65302026af6a53

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12859/
            Subject: LU-5781 osc: osc_lock_weight endless loop fix
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: ef1758a0d3cb9ac3abbe0e60ac689cf3b2aa3a6e

            People

              jay Jinshan Xiong (Inactive)
              zam Alexander Zarochentsev
              Votes: 0
              Watchers: 22
