Lustre / LU-5644

CPU stalls/heartbeat loss in cl_locks_prune

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: Lustre 2.5.0
    • Fix Version/s: Lustre 2.5.5
    • Environment:
      Client: SLES11SP2/SP3, Lustre 2.5.0/2.5.1
    • Severity: 3
    • Rank (Obsolete): 15816

      Description

      Over the last several months we've had intermittent reports of CPU stalls and heartbeat failures where the offending process was executing somewhere in cl_locks_prune. Given enough time, the stuck process eventually terminated. We finally captured a dump where the node was STONITH'd at the time of heartbeat loss. I think the dump explains the stalls.

      The system as a whole was experiencing network problems that prevented the client node from connecting to an OSS. At the time of the dump, the job had completed and the only thing running was a node health test against the Lustre file system. ps output shows the test had been on the runqueue for an excessive amount of time.

      > crash> bt
      > PID: 34321  TASK: ffff880fe1d517f0  CPU: 15  COMMAND: "nhc_fs_test"
      >  #0 [ffff880fe1e2d6d8] schedule at ffffffff8141d9f7
      >  #1 [ffff880fe1e2d6e0] cl_env_info at ffffffffa0387b55 [obdclass]
      >  #2 [ffff880fe1e2d6f0] cl_lock_counters at ffffffffa038fa66 [obdclass]
      >  #3 [ffff880fe1e2d710] cl_lock_mutex_tail at ffffffffa038fae1 [obdclass]
      >  #4 [ffff880fe1e2d730] cl_lock_mutex_get at ffffffffa0390732 [obdclass]
      >  #5 [ffff880fe1e2d760] cl_locks_prune at ffffffffa0392001 [obdclass]
      >  #6 [ffff880fe1e2d800] lov_delete_raid0 at ffffffffa07a3d31 [lov]
      >  #7 [ffff880fe1e2d8b0] lov_object_delete at ffffffffa07a30f9 [lov]
      >  #8 [ffff880fe1e2d8d0] lu_object_free at ffffffffa037fd75 [obdclass]
      >  #9 [ffff880fe1e2d940] lu_object_put at ffffffffa03802ae [obdclass]
      > #10 [ffff880fe1e2d9a0] cl_object_put at ffffffffa03890ae [obdclass]
      > #11 [ffff880fe1e2d9b0] cl_inode_fini at ffffffffa087b56d [lustre]
      > #12 [ffff880fe1e2da80] ll_clear_inode at ffffffffa0842210 [lustre]
      > #13 [ffff880fe1e2dab0] ll_delete_inode at ffffffffa08428fd [lustre]
      > #14 [ffff880fe1e2dae0] evict at ffffffff81171c11
      > #15 [ffff880fe1e2db10] iput at ffffffff81172082
      > #16 [ffff880fe1e2db40] ll_d_iput at ffffffffa0812af6 [lustre]
      > #17 [ffff880fe1e2db80] d_kill at ffffffff8116c96b
      > #18 [ffff880fe1e2dbb0] dput at ffffffff8116fd21
      > #19 [ffff880fe1e2dbe0] fput at ffffffff8115a5a7
      > #20 [ffff880fe1e2dc20] filp_close at ffffffff811563f3
      > #21 [ffff880fe1e2dc50] put_files_struct at ffffffff81051e24
      > #22 [ffff880fe1e2dc90] exit_files at ffffffff81051ed3
      > #23 [ffff880fe1e2dcc0] do_exit at ffffffff810538ec
      > #24 [ffff880fe1e2dd60] do_group_exit at ffffffff810540dc
      > #25 [ffff880fe1e2dda0] get_signal_to_deliver at ffffffff81065463
      > #26 [ffff880fe1e2de40] do_notify_resume at ffffffff81002360
      > #27 [ffff880fe1e2df50] int_signal at ffffffff81427ff0
      >     RIP: 00007ffff7b33300  RSP: 00007fffffffe908  RFLAGS: 00000246
      >     RAX: fffffffffffffffc  RBX: 0000000000000000  RCX: ffffffffffffffff
      >     RDX: 0000000000000060  RSI: 00000000004037c0  RDI: 0000000000000003
      >     RBP: 0000000000000000   R8: 000000000060aaa0   R9: 746c61656865646f
      >     R10: 00007fffffffe6b0  R11: 0000000000000246  R12: 0000000000000003
      >     R13: 000000000060aaa0  R14: 00000000ffffffff  R15: 00000000ffffffff
      >     ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
      
      > crash> ps -l | grep '\^'
      > ^[47137739341345] [IN]  PID: 459    TASK: ffff8810335b67f0  CPU: 12  COMMAND: "kworker/12:1"
      > ^[47137739340142] [IN]  PID: 458    TASK: ffff8810335b6040  CPU: 13  COMMAND: "kworker/13:1"
      [skip]
      > ^[47136767282677] [IN]  PID: 462    TASK: ffff88083366e040  CPU: 9   COMMAND: "kworker/9:1"
      > ^[47093087128394] [RU]  PID: 34321  TASK: ffff880fe1d517f0  CPU: 15  COMMAND: "nhc_fs_test"
      

      Digging through the dump yields these values for the relevant variables:

      cl_env = ffff8810372a1e58
      cl_object = ffff881037443f50
      cl_object_header = ffff881037443ed0
      cl_lock = ffff880fe3733df8
      osc_lock = ffff8810395dcef8
      cancel = 1
      

      Of particular interest are the cll_holds and cll_flags fields.

      crash> cl_lock.cll_flags ffff880fe3733df8
         cll_flags = 6      /* 6 = CLF_CANCELPEND | CLF_DOOMED */
      crash> cl_lock.cll_holds ffff880fe3733df8
         cll_holds = 1
      

      CLF_CANCELPEND is set by cl_lock_cancel when cll_holds != 0.
      CLF_DOOMED is set by cl_lock_delete when cll_holds != 0.
      Because both flags are set, we can infer that the while loop in cl_locks_prune has executed more than once for this particular lock. cl_locks_prune won't exit the loop until cll_holds drops to 0. Looping here for an extended period would certainly explain the heartbeat fault, since nothing in the loop yields the CPU.

      cll_holds is incremented in osc_lock_enqueue (among other places). Scanning the ptlrpc_request_set lists shows there is an LDLM_ENQUEUE RPC waiting to be sent (Lustre is waiting for the connection to recover after the network errors).

      > crash> load ptlrpc.sial
      > crash> ptlrpc
      > 
      > Sent RPCS: ptlrpc_request_set.set_requests->rq_set_chain
      > thread        ptlrpc_request      pid xid                   nid                opc  phase  bulk  sent/deadline
      > ===============================================================================================
      > ptlrpcd_1:    ffff881038cfdc00      0 x1475337000695248     10.10.100.15@o2ib1    6 RPC    0:0 0/0
      > ptlrpcd_29:   ffff880fe380e000      0 x1475337000694880     10.10.100.15@o2ib1  101 RPC    0:0 0/0
      > ===============================================================================================
      opcode 101 == LDLM_ENQUEUE
      
      > crash> ptlrpc_request.rq_async_args ffff880fe380e000
      >   rq_async_args = {
      >     pointer_arg = {0xffff88103bbffc00, 0xffff8810395dcf60, 0xffffffffa0716bc0 <osc_lock_upcall>, 0xffff8810395dcef8, 0xffff8810395dcf28, 0xffff8810395dcf68, 0xffff8810395dcf70, 0x0, 0x0, 0x0, 0x0}, 
      >     space = {18446612202036132864, 18446612201996144480, 18446744072106372032, 18446612201996144376, 18446612201996144424, 18446612201996144488, 18446612201996144496}
      

      Note that the osc_lock argument following the upcall argument in the LDLM_ENQUEUE request matches the osc_lock referenced by the cl_lock in cl_locks_prune. So cl_locks_prune is looping on the CPU, waiting for an RPC to complete and release its hold on the cl_lock.

      I think the solution is simply to call cond_resched in cl_locks_prune when cll_holds != 0. Patch in progress.
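      For illustration, a minimal sketch of where the reschedule would go, assuming the loop structure of cl_locks_prune in cl_lock.c (not the verbatim source or the final patch):

```c
/* Sketch only: pruning loop for a cl_object's lock list. */
spin_lock(&head->coh_lock_guard);
while (!list_empty(&head->coh_locks)) {
	lock = container_of(head->coh_locks.next,
			    struct cl_lock, cll_linkage);
	spin_unlock(&head->coh_lock_guard);

	cl_lock_mutex_get(env, lock);
	if (cancel)
		cl_lock_cancel(env, lock); /* sets CLF_CANCELPEND if held */
	cl_lock_delete(env, lock);         /* sets CLF_DOOMED if held */
	cl_lock_mutex_put(env, lock);

	/* Proposed fix: if the lock is still held, e.g. by the pending
	 * LDLM_ENQUEUE RPC, yield the CPU instead of re-entering the
	 * loop in a tight spin until the hold is released. */
	if (lock->cll_holds != 0)
		cond_resched();

	spin_lock(&head->coh_lock_guard);
}
spin_unlock(&head->coh_lock_guard);
```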


            People

            • Assignee: Jinshan Xiong (Inactive)
            • Reporter: Ann Koehler (Inactive)
            • Votes: 0
            • Watchers: 5
