[LU-5644] CPU stalls/heartbeat loss in cl_locks_prune

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.5.5
    • Affects Version/s: Lustre 2.5.0
    • Environment: Client: SLES11SP2/SP3 Lustre 2.5.0/2.5.1
    • Severity: 3
    • 15816

    Description

      Over the last several months we've had intermittent reports of CPU stalls and heartbeat failures where the offending process was executing somewhere in cl_locks_prune. Given enough time, the stuck process eventually terminated. We finally captured a dump where the node was STONITH'd at the time of heartbeat loss. I think the dump explains the stalls.

      The system as a whole was experiencing network problems that prevented the client node from connecting to an OSS. At the time of the dump, the job had completed and the only thing running was a node health test against the Lustre file system. ps output shows the test had been on the runqueue for an excessive amount of time.

      > crash> bt
      > PID: 34321  TASK: ffff880fe1d517f0  CPU: 15  COMMAND: "nhc_fs_test"
      >  #0 [ffff880fe1e2d6d8] schedule at ffffffff8141d9f7
      >  #1 [ffff880fe1e2d6e0] cl_env_info at ffffffffa0387b55 [obdclass]
      >  #2 [ffff880fe1e2d6f0] cl_lock_counters at ffffffffa038fa66 [obdclass]
      >  #3 [ffff880fe1e2d710] cl_lock_mutex_tail at ffffffffa038fae1 [obdclass]
      >  #4 [ffff880fe1e2d730] cl_lock_mutex_get at ffffffffa0390732 [obdclass]
      >  #5 [ffff880fe1e2d760] cl_locks_prune at ffffffffa0392001 [obdclass]
      >  #6 [ffff880fe1e2d800] lov_delete_raid0 at ffffffffa07a3d31 [lov]
      >  #7 [ffff880fe1e2d8b0] lov_object_delete at ffffffffa07a30f9 [lov]
      >  #8 [ffff880fe1e2d8d0] lu_object_free at ffffffffa037fd75 [obdclass]
      >  #9 [ffff880fe1e2d940] lu_object_put at ffffffffa03802ae [obdclass]
      > #10 [ffff880fe1e2d9a0] cl_object_put at ffffffffa03890ae [obdclass]
      > #11 [ffff880fe1e2d9b0] cl_inode_fini at ffffffffa087b56d [lustre]
      > #12 [ffff880fe1e2da80] ll_clear_inode at ffffffffa0842210 [lustre]
      > #13 [ffff880fe1e2dab0] ll_delete_inode at ffffffffa08428fd [lustre]
      > #14 [ffff880fe1e2dae0] evict at ffffffff81171c11
      > #15 [ffff880fe1e2db10] iput at ffffffff81172082
      > #16 [ffff880fe1e2db40] ll_d_iput at ffffffffa0812af6 [lustre]
      > #17 [ffff880fe1e2db80] d_kill at ffffffff8116c96b
      > #18 [ffff880fe1e2dbb0] dput at ffffffff8116fd21
      > #19 [ffff880fe1e2dbe0] fput at ffffffff8115a5a7
      > #20 [ffff880fe1e2dc20] filp_close at ffffffff811563f3
      > #21 [ffff880fe1e2dc50] put_files_struct at ffffffff81051e24
      > #22 [ffff880fe1e2dc90] exit_files at ffffffff81051ed3
      > #23 [ffff880fe1e2dcc0] do_exit at ffffffff810538ec
      > #24 [ffff880fe1e2dd60] do_group_exit at ffffffff810540dc
      > #25 [ffff880fe1e2dda0] get_signal_to_deliver at ffffffff81065463
      > #26 [ffff880fe1e2de40] do_notify_resume at ffffffff81002360
      > #27 [ffff880fe1e2df50] int_signal at ffffffff81427ff0
      >     RIP: 00007ffff7b33300  RSP: 00007fffffffe908  RFLAGS: 00000246
      >     RAX: fffffffffffffffc  RBX: 0000000000000000  RCX: ffffffffffffffff
      >     RDX: 0000000000000060  RSI: 00000000004037c0  RDI: 0000000000000003
      >     RBP: 0000000000000000   R8: 000000000060aaa0   R9: 746c61656865646f
      >     R10: 00007fffffffe6b0  R11: 0000000000000246  R12: 0000000000000003
      >     R13: 000000000060aaa0  R14: 00000000ffffffff  R15: 00000000ffffffff
      >     ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
      
      > crash> ps -l | grep '\^'
      > ^[47137739341345] [IN]  PID: 459    TASK: ffff8810335b67f0  CPU: 12  COMMAND: "kworker/12:1"
      > ^[47137739340142] [IN]  PID: 458    TASK: ffff8810335b6040  CPU: 13  COMMAND: "kworker/13:1"
      [skip]
      > ^[47136767282677] [IN]  PID: 462    TASK: ffff88083366e040  CPU: 9   COMMAND: "kworker/9:1"
      > ^[47093087128394] [RU]  PID: 34321  TASK: ffff880fe1d517f0  CPU: 15  COMMAND: "nhc_fs_test"
      

      Dump diving yields these values for various variables:

      cl_env = ffff8810372a1e58
      cl_object = ffff881037443f50
      cl_object_header = ffff881037443ed0
      cl_lock = ffff880fe3733df8
      osc_lock = ffff8810395dcef8
      cancel = 1
      

      Of particular interest are the cll_holds and cll_flags fields.

      crash> cl_lock.cll_flags ffff880fe3733df8
         cll_flags = 6      /* 6 = CLF_CANCELPEND | CLF_DOOMED */
      crash> cl_lock.cll_holds ffff880fe3733df8
         cll_holds = 1
      

      CLF_CANCELPEND is set by cl_lock_cancel when cll_holds != 0.
      CLF_DOOMED is set by cl_lock_delete when cll_holds != 0.
      Because both flags are set, we can infer that the while loop in cl_locks_prune has executed more than once for this particular lock. cl_locks_prune won't exit the loop until cll_holds becomes 0, and looping here for an extended period would certainly explain the heartbeat fault since nothing in the loop yields the CPU.
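
      For reference, here is a simplified sketch of that loop (paraphrased from the 2.5-era cl_lock code, not quoted verbatim). A lock with cll_holds != 0 is only flagged, not removed from coh_locks, so the outer loop picks the same lock up again immediately without ever yielding the CPU:

      /* Simplified sketch of the cl_locks_prune() loop -- paraphrased, not the
       * verbatim 2.5 source. */
      spin_lock(&head->coh_lock_guard);
      while (!list_empty(&head->coh_locks)) {
              lock = container_of(head->coh_locks.next,
                                  struct cl_lock, cll_linkage);
              cl_lock_get_trust(lock);            /* take a reference */
              spin_unlock(&head->coh_lock_guard);
      again:
              cl_lock_mutex_get(env, lock);
              if (lock->cll_state < CLS_FREEING) {
                      if (unlikely(lock->cll_users > 0)) {
                              struct l_wait_info lwi = { 0 };

                              /* waits for cll_users == 0, but not cll_holds */
                              cl_lock_mutex_put(env, lock);
                              l_wait_event(lock->cll_wq,
                                           lock->cll_users == 0, &lwi);
                              goto again;
                      }
                      if (cancel)
                              cl_lock_cancel(env, lock); /* sets CLF_CANCELPEND while held */
                      cl_lock_delete(env, lock);         /* sets CLF_DOOMED while held */
              }
              cl_lock_mutex_put(env, lock);
              cl_lock_put(env, lock);
              spin_lock(&head->coh_lock_guard);    /* re-scan coh_locks */
      }
      spin_unlock(&head->coh_lock_guard);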

      cll_holds is incremented in osc_lock_enqueue (among other places). Scanning the ptlrpc_request_set lists shows there is an LDLM_ENQUEUE RPC waiting to be sent (Lustre is waiting for the connection to recover after the network errors).

      > crash> load ptlrpc.sial
      > crash> ptlrpc
      > 
      > Sent RPCS: ptlrpc_request_set.set_requests->rq_set_chain
      > thread        ptlrpc_request      pid xid                   nid                opc  phase  bulk  sent/deadline
      > ===============================================================================================
      > ptlrpcd_1:    ffff881038cfdc00      0 x1475337000695248     10.10.100.15@o2ib1    6 RPC    0:0 0/0
      > ptlrpcd_29:   ffff880fe380e000      0 x1475337000694880     10.10.100.15@o2ib1  101 RPC    0:0 0/0
      > ===============================================================================================
      opcode 101 == LDLM_ENQUEUE
      
      > crash> ptlrpc_request.rq_async_args ffff880fe380e000
      >   rq_async_args = {
      >     pointer_arg = {0xffff88103bbffc00, 0xffff8810395dcf60, 0xffffffffa0716bc0 <osc_lock_upcall>, 0xffff8810395dcef8, 0xffff8810395dcf28, 0xffff8810395dcf68, 0xffff8810395dcf70, 0x0, 0x0, 0x0, 0x0}, 
      >     space = {18446612202036132864, 18446612201996144480, 18446744072106372032, 18446612201996144376, 18446612201996144424, 18446612201996144488, 18446612201996144496}
      

      Note that the osc_lock argument following the upcall argument in the LDLM_ENQUEUE request matches the osc_lock referenced by the cl_lock in cl_locks_prune. So cl_locks_prune is looping on the CPU, waiting for an RPC to complete and release its hold on the cl_lock.

      I think the solution is to simply call cond_resched() in cl_locks_prune when cll_holds != 0. Patch in progress.
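
      A minimal sketch of where such a call could sit in the loop above (illustrative only, not the actual patch, which was still in progress when this was written):

              if (lock->cll_holds != 0) {
                      /* still held, e.g. by the pending LDLM_ENQUEUE: give up
                       * the CPU before this lock is examined again instead of
                       * spinning */
                      cond_resched();
              }
              cl_lock_mutex_put(env, lock);
              cl_lock_put(env, lock);
              spin_lock(&head->coh_lock_guard);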

      Attachments

        Activity

          [LU-5644] CPU stalls/heartbeat loss in cl_locks_prune

          amk Ann Koehler (Inactive) added a comment - This bug can be closed. The patch fixes the problem in b2_5 and is not needed in master.

          jay Jinshan Xiong (Inactive) added a comment - This patch is not needed in master due to the cl_lock refactoring patch.

          adilger Andreas Dilger added a comment - Jinshan, can you please make a master version of http://review.whamcloud.com/12080 since it doesn't cherry-pick cleanly from b2_5.

          amk Ann Koehler (Inactive) added a comment - I abandoned 12023. We're going with the 12080 patch by itself.

          adilger Andreas Dilger added a comment - Should http://review.whamcloud.com/12023 be abandoned in light of http://review.whamcloud.com/12080 or are they both needed?

          amk Ann Koehler (Inactive) added a comment -

          Hi Jinshan,

          Thanks for the patch. I appreciate the help. I'll get it added to our builds so it can be tested but I think it should fix the problem.

          jay Jinshan Xiong (Inactive) added a comment -

          Hi Ann,

          It's fine to wait for cll_holds to become zero, and it won't reference a freed cl_lock because cl_locks_prune() holds a refcount on the cl_lock.

          Please try patch: http://review.whamcloud.com/12080
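
          For context, a minimal sketch of the wait-for-holds idea, assuming the loop shape shown in the description (this is not the contents of the 12080 patch itself):

          if (lock->cll_users != 0 || lock->cll_holds != 0) {
                  struct l_wait_info lwi = { 0 };

                  /* The reference taken by cl_lock_get_trust() keeps the
                   * cl_lock structure alive while sleeping here, so waiting
                   * for cll_holds is safe. */
                  cl_lock_mutex_put(env, lock);
                  l_wait_event(lock->cll_wq,
                               lock->cll_users == 0 && lock->cll_holds == 0,
                               &lwi);
                  goto again;   /* re-take the mutex and re-check the state */
          }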

          amk Ann Koehler (Inactive) added a comment - Adding a cll_holds condition to the l_wait_event doesn't work because cl_lock_hold_release deletes the cl_lock when it decrements the hold count. If cl_locks_prune were woken up at this point, it would reference a freed cl_lock structure.
          amk Ann Koehler (Inactive) added a comment - edited

          Patch: http://review.whamcloud.com/12023 (since abandoned)

          Instead of doing a cond_resched when cll_holds != 0, I added the cll_holds condition to the l_wait_event already in cl_locks_prune. There's no point trying to delete the lock until both cll_users and cll_holds are 0.

          But I wonder, is the real bug that cll_users is 0 when cll_holds is not?

          amk Ann Koehler (Inactive) added a comment -

          Forgot to mention that cll_users == 0 so l_wait_event is not being executed.

          crash> cl_lock.cll_users ffff880fe3733df8
          cll_users = 0

          People

            Assignee: jay Jinshan Xiong (Inactive)
            Reporter: amk Ann Koehler (Inactive)
            Votes: 0
            Watchers: 5
