Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version: Lustre 2.5.0
- Environment: Client: SLES11SP2/SP3, Lustre 2.5.0/2.5.1
- 3
- 15816
Description
Over the last several months we've had intermittent reports of CPU stalls and heartbeat failures where the offending process was executing somewhere in cl_locks_prune. Given enough time, the stuck process eventually terminated. We finally captured a dump where the node was STONITH'd at the time of heartbeat loss. I think the dump explains the stalls.
The system as a whole was experiencing network problems that prevented the client node from connecting to an OSS. At the time of the dump, the job had completed and the only thing running was a node health test against the Lustre file system. ps output shows the test had been on the runqueue for an excessive amount of time.
> crash> bt
> PID: 34321  TASK: ffff880fe1d517f0  CPU: 15  COMMAND: "nhc_fs_test"
>  #0 [ffff880fe1e2d6d8] schedule at ffffffff8141d9f7
>  #1 [ffff880fe1e2d6e0] cl_env_info at ffffffffa0387b55 [obdclass]
>  #2 [ffff880fe1e2d6f0] cl_lock_counters at ffffffffa038fa66 [obdclass]
>  #3 [ffff880fe1e2d710] cl_lock_mutex_tail at ffffffffa038fae1 [obdclass]
>  #4 [ffff880fe1e2d730] cl_lock_mutex_get at ffffffffa0390732 [obdclass]
>  #5 [ffff880fe1e2d760] cl_locks_prune at ffffffffa0392001 [obdclass]
>  #6 [ffff880fe1e2d800] lov_delete_raid0 at ffffffffa07a3d31 [lov]
>  #7 [ffff880fe1e2d8b0] lov_object_delete at ffffffffa07a30f9 [lov]
>  #8 [ffff880fe1e2d8d0] lu_object_free at ffffffffa037fd75 [obdclass]
>  #9 [ffff880fe1e2d940] lu_object_put at ffffffffa03802ae [obdclass]
> #10 [ffff880fe1e2d9a0] cl_object_put at ffffffffa03890ae [obdclass]
> #11 [ffff880fe1e2d9b0] cl_inode_fini at ffffffffa087b56d [lustre]
> #12 [ffff880fe1e2da80] ll_clear_inode at ffffffffa0842210 [lustre]
> #13 [ffff880fe1e2dab0] ll_delete_inode at ffffffffa08428fd [lustre]
> #14 [ffff880fe1e2dae0] evict at ffffffff81171c11
> #15 [ffff880fe1e2db10] iput at ffffffff81172082
> #16 [ffff880fe1e2db40] ll_d_iput at ffffffffa0812af6 [lustre]
> #17 [ffff880fe1e2db80] d_kill at ffffffff8116c96b
> #18 [ffff880fe1e2dbb0] dput at ffffffff8116fd21
> #19 [ffff880fe1e2dbe0] fput at ffffffff8115a5a7
> #20 [ffff880fe1e2dc20] filp_close at ffffffff811563f3
> #21 [ffff880fe1e2dc50] put_files_struct at ffffffff81051e24
> #22 [ffff880fe1e2dc90] exit_files at ffffffff81051ed3
> #23 [ffff880fe1e2dcc0] do_exit at ffffffff810538ec
> #24 [ffff880fe1e2dd60] do_group_exit at ffffffff810540dc
> #25 [ffff880fe1e2dda0] get_signal_to_deliver at ffffffff81065463
> #26 [ffff880fe1e2de40] do_notify_resume at ffffffff81002360
> #27 [ffff880fe1e2df50] int_signal at ffffffff81427ff0
>     RIP: 00007ffff7b33300  RSP: 00007fffffffe908  RFLAGS: 00000246
>     RAX: fffffffffffffffc  RBX: 0000000000000000  RCX: ffffffffffffffff
>     RDX: 0000000000000060  RSI: 00000000004037c0  RDI: 0000000000000003
>     RBP: 0000000000000000  R8:  000000000060aaa0  R9:  746c61656865646f
>     R10: 00007fffffffe6b0  R11: 0000000000000246  R12: 0000000000000003
>     R13: 000000000060aaa0  R14: 00000000ffffffff  R15: 00000000ffffffff
>     ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
>
> crash> ps -l | grep '\^'
> ^[47137739341345] [IN] PID: 459    TASK: ffff8810335b67f0  CPU: 12  COMMAND: "kworker/12:1"
> ^[47137739340142] [IN] PID: 458    TASK: ffff8810335b6040  CPU: 13  COMMAND: "kworker/13:1"
> [skip]
> ^[47136767282677] [IN] PID: 462    TASK: ffff88083366e040  CPU: 9   COMMAND: "kworker/9:1"
> ^[47093087128394] [RU] PID: 34321  TASK: ffff880fe1d517f0  CPU: 15  COMMAND: "nhc_fs_test"
Digging through the dump yields these values for the relevant variables:
cl_env           = ffff8810372a1e58
cl_object        = ffff881037443f50
cl_object_header = ffff881037443ed0
cl_lock          = ffff880fe3733df8
osc_lock         = ffff8810395dcef8
cancel           = 1
Of particular interest are the cll_holds and cll_flags fields.
crash> cl_lock.cll_flags ffff880fe3733df8
  cll_flags = 6    /* 6 = CLF_CANCELPEND | CLF_DOOMED */
crash> cl_lock.cll_holds ffff880fe3733df8
  cll_holds = 1
CLF_CANCELPEND is set by cl_lock_cancel() when cll_holds != 0.
CLF_DOOMED is set by cl_lock_delete() when cll_holds != 0.
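For reference, here is how the observed cll_flags = 6 decodes against the cl_lock flag bits. The enum below is reproduced from memory from the 2.5-era lustre/include/cl_object.h, so treat it as approximate rather than verbatim; the bit values are consistent with the value seen in the dump.

/* cl_lock flag bits (2.5-era enum cl_lock_flags, lustre/include/cl_object.h,
 * reproduced from memory -- approximate, but consistent with cll_flags = 6): */
enum cl_lock_flags {
        CLF_CANCELLED  = 1 << 0,  /* 0x1: lock has been cancelled  */
        CLF_CANCELPEND = 1 << 1,  /* 0x2: cancellation is pending  */
        CLF_DOOMED     = 1 << 2,  /* 0x4: destruction is pending   */
};

/* cll_flags = 6 = 0x2 | 0x4 = CLF_CANCELPEND | CLF_DOOMED: both
 * cl_lock_cancel() and cl_lock_delete() have already run against this lock,
 * but each only set its "pending" bit because cll_holds != 0. */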
Because these flags are set, we can infer that the while loop in cl_locks_prune has executed more than once for this particular lock. cl_locks_prune won't exit the loop until cll_holds becomes 0. Looping here for an extended period would certainly explain the heartbeat fault since nothing in the loop yields the CPU.
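For context, the pruning loop has roughly the following shape. This is a simplified, from-memory paraphrase of cl_locks_prune() in lustre/obdclass/cl_lock.c, not the exact source; reference counting and the cll_users wait are omitted.

/* Simplified paraphrase of the cl_locks_prune() loop (2.5-era
 * lustre/obdclass/cl_lock.c); reference handling omitted. */
spin_lock(&head->coh_lock_guard);
while (!list_empty(&head->coh_locks)) {
        lock = container_of(head->coh_locks.next,
                            struct cl_lock, cll_linkage);
        spin_unlock(&head->coh_lock_guard);

        cl_lock_mutex_get(env, lock);
        if (lock->cll_state < CLS_FREEING) {
                if (cancel)
                        /* only sets CLF_CANCELPEND while cll_holds != 0 */
                        cl_lock_cancel(env, lock);
                /* only sets CLF_DOOMED while cll_holds != 0 */
                cl_lock_delete(env, lock);
        }
        cl_lock_mutex_put(env, lock);

        /* While cll_holds != 0 the lock is never removed from coh_locks,
         * so the next iteration finds the same lock again.  Nothing in
         * this path sleeps or calls schedule(), hence the CPU stall and
         * the lost heartbeat. */
        spin_lock(&head->coh_lock_guard);
}
spin_unlock(&head->coh_lock_guard);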
cll_holds is incremented in osc_lock_enqueue (among other places). Scanning the ptlrpc_request_set lists shows there's an LDLM_ENQUEUE RPC waiting to be sent (Lustre is waiting for the connection to recover after the network errors).
> crash> load ptlrpc.sial
> crash> ptlrpc
>
> Sent RPCS: ptlrpc_request_set.set_requests->rq_set_chain
> thread       ptlrpc_request    pid  xid                 nid                 opc  phase  bulk  sent/deadline
> ===============================================================================================
> ptlrpcd_1:   ffff881038cfdc00  0    x1475337000695248   10.10.100.15@o2ib1  6    RPC    0:0   0/0
> ptlrpcd_29:  ffff880fe380e000  0    x1475337000694880   10.10.100.15@o2ib1  101  RPC    0:0   0/0
> ===============================================================================================

opcode 101 == LDLM_ENQUEUE

> crash> ptlrpc_request.rq_async_args ffff880fe380e000
>   rq_async_args = {
>     pointer_arg = {0xffff88103bbffc00, 0xffff8810395dcf60, 0xffffffffa0716bc0 <osc_lock_upcall>, 0xffff8810395dcef8, 0xffff8810395dcf28, 0xffff8810395dcf68, 0xffff8810395dcf70, 0x0, 0x0, 0x0, 0x0},
>     space = {18446612202036132864, 18446612201996144480, 18446744072106372032, 18446612201996144376, 18446612201996144424, 18446612201996144488, 18446612201996144496}
Note that the osc_lock argument following the osc_lock_upcall pointer in the LDLM_ENQUEUE request's rq_async_args (0xffff8810395dcef8) matches the osc_lock referenced by the cl_lock in cl_locks_prune. So cl_locks_prune is looping on the CPU, waiting for an RPC that cannot currently be sent to complete and release its hold on the cl_lock.
I think the solution is simply to call cond_resched() in cl_locks_prune when cll_holds != 0. Patch in progress.
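A minimal sketch of the idea, assuming the check lands inside the cl_locks_prune() while loop after the cancel/delete attempt (the actual patch may place it differently or add a backoff):

/* Illustrative sketch only -- not the final patch.  In the cl_locks_prune()
 * loop, note whether the lock is still held while we have its mutex, and
 * yield the CPU after dropping it: */
int still_held;

cl_lock_mutex_get(env, lock);
/* ... existing cancel/delete attempt ... */
still_held = lock->cll_holds != 0;
cl_lock_mutex_put(env, lock);

if (still_held)
        /* The hold (here, the pending LDLM_ENQUEUE RPC) keeps the lock on
         * coh_locks, so the loop will see it again.  Give ptlrpcd and the
         * watchdog/heartbeat threads a chance to run instead of
         * busy-looping until the hold is released. */
        cond_resched();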