[LU-5644] CPU stalls/heartbeat loss in cl_locks_prune Created: 19/Sep/14  Updated: 22/Dec/15  Resolved: 17/Jun/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.5

Type: Bug Priority: Minor
Reporter: Ann Koehler (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: patch
Environment:

Client: SLES11SP2/SP3 Lustre 2.5.0/2.5.1


Severity: 3
Rank (Obsolete): 15816

 Description   

Over the last several months we've had intermittent reports of CPU stalls and heartbeat failures where the offending process was executing somewhere in cl_locks_prune. Given enough time, the stuck process eventually terminated. We finally captured a dump where the node was STONITH'd at the time of heartbeat loss. I think the dump explains the stalls.

The system as a whole was experiencing network problems that prevented the client node from connecting to an OSS. At the time of the dump, the job had completed and the only thing running was a node health test against the Lustre file system. ps output shows the test had been on the runqueue for an excessive amount of time.

> crash> bt
> PID: 34321  TASK: ffff880fe1d517f0  CPU: 15  COMMAND: "nhc_fs_test"
>  #0 [ffff880fe1e2d6d8] schedule at ffffffff8141d9f7
>  #1 [ffff880fe1e2d6e0] cl_env_info at ffffffffa0387b55 [obdclass]
>  #2 [ffff880fe1e2d6f0] cl_lock_counters at ffffffffa038fa66 [obdclass]
>  #3 [ffff880fe1e2d710] cl_lock_mutex_tail at ffffffffa038fae1 [obdclass]
>  #4 [ffff880fe1e2d730] cl_lock_mutex_get at ffffffffa0390732 [obdclass]
>  #5 [ffff880fe1e2d760] cl_locks_prune at ffffffffa0392001 [obdclass]
>  #6 [ffff880fe1e2d800] lov_delete_raid0 at ffffffffa07a3d31 [lov]
>  #7 [ffff880fe1e2d8b0] lov_object_delete at ffffffffa07a30f9 [lov]
>  #8 [ffff880fe1e2d8d0] lu_object_free at ffffffffa037fd75 [obdclass]
>  #9 [ffff880fe1e2d940] lu_object_put at ffffffffa03802ae [obdclass]
> #10 [ffff880fe1e2d9a0] cl_object_put at ffffffffa03890ae [obdclass]
> #11 [ffff880fe1e2d9b0] cl_inode_fini at ffffffffa087b56d [lustre]
> #12 [ffff880fe1e2da80] ll_clear_inode at ffffffffa0842210 [lustre]
> #13 [ffff880fe1e2dab0] ll_delete_inode at ffffffffa08428fd [lustre]
> #14 [ffff880fe1e2dae0] evict at ffffffff81171c11
> #15 [ffff880fe1e2db10] iput at ffffffff81172082
> #16 [ffff880fe1e2db40] ll_d_iput at ffffffffa0812af6 [lustre]
> #17 [ffff880fe1e2db80] d_kill at ffffffff8116c96b
> #18 [ffff880fe1e2dbb0] dput at ffffffff8116fd21
> #19 [ffff880fe1e2dbe0] fput at ffffffff8115a5a7
> #20 [ffff880fe1e2dc20] filp_close at ffffffff811563f3
> #21 [ffff880fe1e2dc50] put_files_struct at ffffffff81051e24
> #22 [ffff880fe1e2dc90] exit_files at ffffffff81051ed3
> #23 [ffff880fe1e2dcc0] do_exit at ffffffff810538ec
> #24 [ffff880fe1e2dd60] do_group_exit at ffffffff810540dc
> #25 [ffff880fe1e2dda0] get_signal_to_deliver at ffffffff81065463
> #26 [ffff880fe1e2de40] do_notify_resume at ffffffff81002360
> #27 [ffff880fe1e2df50] int_signal at ffffffff81427ff0
>     RIP: 00007ffff7b33300  RSP: 00007fffffffe908  RFLAGS: 00000246
>     RAX: fffffffffffffffc  RBX: 0000000000000000  RCX: ffffffffffffffff
>     RDX: 0000000000000060  RSI: 00000000004037c0  RDI: 0000000000000003
>     RBP: 0000000000000000   R8: 000000000060aaa0   R9: 746c61656865646f
>     R10: 00007fffffffe6b0  R11: 0000000000000246  R12: 0000000000000003
>     R13: 000000000060aaa0  R14: 00000000ffffffff  R15: 00000000ffffffff
>     ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

> crash> ps -l | grep '\^'
> ^[47137739341345] [IN]  PID: 459    TASK: ffff8810335b67f0  CPU: 12  COMMAND: "kworker/12:1"
> ^[47137739340142] [IN]  PID: 458    TASK: ffff8810335b6040  CPU: 13  COMMAND: "kworker/13:1"
[skip]
> ^[47136767282677] [IN]  PID: 462    TASK: ffff88083366e040  CPU: 9   COMMAND: "kworker/9:1"
> ^[47093087128394] [RU]  PID: 34321  TASK: ffff880fe1d517f0  CPU: 15  COMMAND: "nhc_fs_test"

Digging through the dump yields the following values for the relevant structures:

cl_env = ffff8810372a1e58
cl_object = ffff881037443f50
cl_object_header = ffff881037443ed0
cl_lock = ffff880fe3733df8
osc_lock = ffff8810395dcef8
cancel = 1

Of particular interest are the cll_holds and cll_flags fields.

crash> cl_lock.cll_flags ffff880fe3733df8
   cll_flags = 6      /* 6 == CLF_CANCELPEND | CLF_DOOMED */
crash> cl_lock.cll_holds ffff880fe3733df8
   cll_holds = 1

CLF_CANCELPEND is set by cl_lock_cancel when cll_holds != 0.
CLF_DOOMED is set by cl_lock_delete when cll_holds != 0.
Because these flags are set, we can infer that the while loop in cl_locks_prune has executed more than once for this particular lock. cl_locks_prune won't exit the loop until cll_holds becomes 0. Looping here for an extended period would certainly explain the heartbeat fault since nothing in the loop yields the CPU.
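
For reference, the loop in question looks roughly like the sketch below. This is a simplified illustration based on the backtrace and the behavior described above, not verbatim 2.5 source; helper and field names are taken from the trace and may differ slightly between versions.

    /* Simplified sketch of the cl_locks_prune() loop (illustrative, not verbatim) */
    spin_lock(&head->coh_lock_guard);
    while (!list_empty(&head->coh_locks)) {
            lock = container_of(head->coh_locks.next, struct cl_lock, cll_linkage);
            cl_lock_get_trust(lock);                /* prune's own reference */
            spin_unlock(&head->coh_lock_guard);

            cl_lock_mutex_get(env, lock);           /* frame #4 in the backtrace */
            if (lock->cll_state < CLS_FREEING) {
                    if (lock->cll_users > 0) {
                            /* the only place this path sleeps: it waits for
                             * cll_users to drop, not cll_holds */
                            ...
                    }
                    if (cancel)
                            cl_lock_cancel(env, lock);  /* sets CLF_CANCELPEND if cll_holds != 0 */
                    cl_lock_delete(env, lock);          /* sets CLF_DOOMED if cll_holds != 0 */
            }
            cl_lock_mutex_put(env, lock);
            cl_lock_put(env, lock);

            /* while cll_holds != 0 the lock stays on coh_locks, so the next
             * iteration picks up the same lock again -- and nothing above
             * yields the CPU in that case */
            spin_lock(&head->coh_lock_guard);
    }
    spin_unlock(&head->coh_lock_guard);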

cll_holds is incremented in osc_lock_enqueue (among other places). Scanning the ptlrpc_request_set lists shows there's an LDLM_ENQUEUE RPC waiting to be sent (Lustre is waiting for the connection to recover after the network errors).

> crash> load ptlrpc.sial
> crash> ptlrpc
> 
> Sent RPCS: ptlrpc_request_set.set_requests->rq_set_chain
> thread        ptlrpc_request      pid xid                   nid                opc  phase  bulk  sent/deadline
> ===============================================================================================
> ptlrpcd_1:    ffff881038cfdc00      0 x1475337000695248     10.10.100.15@o2ib1    6 RPC    0:0 0/0
> ptlrpcd_29:   ffff880fe380e000      0 x1475337000694880     10.10.100.15@o2ib1  101 RPC    0:0 0/0
> ===============================================================================================
opcode 101 == LDLM_ENQUEUE

> crash> ptlrpc_request.rq_async_args ffff880fe380e000
>   rq_async_args = {
>     pointer_arg = {0xffff88103bbffc00, 0xffff8810395dcf60, 0xffffffffa0716bc0 <osc_lock_upcall>, 0xffff8810395dcef8, 0xffff8810395dcf28, 0xffff8810395dcf68, 0xffff8810395dcf70, 0x0, 0x0, 0x0, 0x0}, 
>     space = {18446612202036132864, 18446612201996144480, 18446744072106372032, 18446612201996144376, 18446612201996144424, 18446612201996144488, 18446612201996144496}

Note that the osc_lock argument following the upcall argument in the LDLM_ENQUEUE request matches the osc_lock referenced by the cl_lock in cl_locks_prune. So cl_locks_prune is looping on the CPU, waiting for an RPC that cannot currently be sent to complete and release its hold on the cl_lock.

I think the solution is simply to call cond_resched in cl_locks_prune when cll_holds != 0. Patch in progress.
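
Roughly, the idea would be something like the following. This is an illustrative sketch only, not the eventual patch; the "again" retry label is hypothetical.

    /* Hypothetical placement of cond_resched() in the loop above (illustration only) */
    if (lock->cll_holds != 0) {
            /* an outstanding enqueue still holds the lock; yield the CPU
             * instead of immediately retrying */
            cl_lock_mutex_put(env, lock);
            cond_resched();
            goto again;     /* hypothetical retry label */
    }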



 Comments   
Comment by Ann Koehler (Inactive) [ 19/Sep/14 ]

Dump uploaded to ftp.whamcloud.com:/uploads/LU-5644/LU-5644_cl_locks_prune.tgz

Comment by Ann Koehler (Inactive) [ 22/Sep/14 ]

Forgot to mention that cll_users == 0, so the l_wait_event in cl_locks_prune is not being executed.

crash> cl_lock.cll_users ffff880fe3733df8
cll_users = 0

Comment by Ann Koehler (Inactive) [ 23/Sep/14 ]

Patch: http://review.whamcloud.com/12023 (abandoned)

Instead of doing a cond_resched when cll_holds != 0, I added the cll_holds condition to the l_wait_event already in cl_locks_prune. No point trying to delete the lock until both cll_users and cll_holds == 0.

But I wonder, is the real bug that cll_users is 0 when cll_holds is not?

Comment by Ann Koehler (Inactive) [ 24/Sep/14 ]

Adding a cll_holds condition to the l_wait_event doesn't work because cl_lock_hold_release deletes the cl_lock when it decrements the hold count. If cl_locks_prune were woken up at this point, it would reference a freed cl_lock structure.

Comment by Jinshan Xiong (Inactive) [ 26/Sep/14 ]

Hi Ann,

It's fine to wait for cll_holds to become zero, and it won't reference a freed cl_lock because cl_locks_prune() holds a refcount on the cl_lock.

Please try this patch: http://review.whamcloud.com/12080
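
For context, the approach described here amounts to something like the sketch below. This is illustrative only, not the actual change in 12080, and it assumes the hold-release path wakes cll_wq so the wait can terminate.

    /* Illustrative sketch of waiting for both users and holds (not the 12080 patch).
     * Safe against use-after-free because cl_locks_prune() still holds the
     * reference it took with cl_lock_get_trust(), so the lock cannot be freed
     * underneath the waiter. Assumes the hold-release path wakes cll_wq. */
    if (lock->cll_users > 0 || lock->cll_holds > 0) {
            cl_lock_mutex_put(env, lock);
            l_wait_event(lock->cll_wq,
                         lock->cll_users == 0 && lock->cll_holds == 0,
                         &lwi);
            goto again;     /* hypothetical retry label, as in the earlier sketches */
    }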

Comment by Ann Koehler (Inactive) [ 29/Sep/14 ]

Hi Jinshan,

Thanks for the patch. I appreciate the help. I'll get it added to our builds so it can be tested, but I think it should fix the problem.

Comment by Andreas Dilger [ 01/Oct/14 ]

Should http://review.whamcloud.com/12023 be abandoned in light of http://review.whamcloud.com/12080 or are they both needed?

Comment by Ann Koehler (Inactive) [ 01/Oct/14 ]

I abandoned 12023. We're going with the 12080 patch by itself.

Comment by Andreas Dilger [ 01/Dec/14 ]

Jinshan, can you please make a master version of http://review.whamcloud.com/12080, since it doesn't cherry-pick cleanly from b2_5?

Comment by Jinshan Xiong (Inactive) [ 02/Dec/14 ]

Master doesn't need this patch, because of the cl_lock refactoring there.

Comment by Ann Koehler (Inactive) [ 17/Jun/15 ]

This bug can be closed. The patch fixes the problem in b2_5 and is not needed in master.
