LU-13201

soft lockup on ldlm_reprocess_all


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version/s: Lustre 2.12.3

    Description

      We recently upgraded to 2.12.3 and have been seeing bad soft lockups, sometimes escalating to hard lockups, caused by ldlm_reprocess_all in two different patterns:

      On the OSS, all active tasks are in one of these two stacks:

       #6 [ffff881c93bafbd8] _raw_spin_lock at ffffffff816b6b70
       #7 [ffff881c93bafbe8] __ldlm_reprocess_all at ffffffffc0e0b79a [ptlrpc]
       #8 [ffff881c93bafc38] ldlm_reprocess_all at ffffffffc0e0ba30 [ptlrpc]
       #9 [ffff881c93bafc48] ldlm_handle_enqueue0 at ffffffffc0e33314 [ptlrpc]
      #10 [ffff881c93bafcd8] tgt_enqueue at ffffffffc0eb8572 [ptlrpc]
      #11 [ffff881c93bafcf8] tgt_request_handle at ffffffffc0ebe8ba [ptlrpc]
      #12 [ffff881c93bafd40] ptlrpc_server_handle_request at ffffffffc0e63f13 [ptlrpc]
      #13 [ffff881c93bafde0] ptlrpc_main at ffffffffc0e67862 [ptlrpc]
      #14 [ffff881c93bafec8] kthread at ffffffff810b4031
      #15 [ffff881c93baff50] ret_from_fork at ffffffff816c155d
      
      //
      
       #4 [ffff881c92687c08] native_queued_spin_lock_slowpath at ffffffff810fdfc2
       #5 [ffff881c92687c10] queued_spin_lock_slowpath at ffffffff816a8ff4
       #6 [ffff881c92687c20] _raw_spin_lock at ffffffff816b6b70
       #7 [ffff881c92687c30] lock_res_and_lock at ffffffffc0e0302c [ptlrpc]
       #8 [ffff881c92687c48] ldlm_handle_enqueue0 at ffffffffc0e339a8 [ptlrpc]
      (the one at this comment:
      /* We never send a blocking AST until the lock is granted, but
        * we can tell it right now */
      )
       #9 [ffff881c92687cd8] tgt_enqueue at ffffffffc0eb8572 [ptlrpc]
      #10 [ffff881c92687cf8] tgt_request_handle at ffffffffc0ebe8ba [ptlrpc]
      #11 [ffff881c92687d40] ptlrpc_server_handle_request at ffffffffc0e63f13 [ptlrpc]
      #12 [ffff881c92687de0] ptlrpc_main at ffffffffc0e67862 [ptlrpc]
      #13 [ffff881c92687ec8] kthread at ffffffff810b4031
      #14 [ffff881c92687f50] ret_from_fork at ffffffff816c155d
      

      On the MGS, active tasks are in:

       #4 [ffff881d33ff3b80] native_queued_spin_lock_slowpath at ffffffff810fdfc2
       #5 [ffff881d33ff3b88] queued_spin_lock_slowpath at ffffffff816a8ff4
       #6 [ffff881d33ff3b98] _raw_spin_lock at ffffffff816b6b70
       #7 [ffff881d33ff3ba8] ldlm_handle_conflict_lock at ffffffffc0d8fa08 [ptlrpc]
       #8 [ffff881d33ff3be0] ldlm_lock_enqueue at ffffffffc0d8ff23 [ptlrpc]
       #9 [ffff881d33ff3c48] ldlm_handle_enqueue0 at ffffffffc0db8843 [ptlrpc]
      #10 [ffff881d33ff3cd8] tgt_enqueue at ffffffffc0e3d572 [ptlrpc]
      #11 [ffff881d33ff3cf8] tgt_request_handle at ffffffffc0e438ba [ptlrpc]
      #12 [ffff881d33ff3d40] ptlrpc_server_handle_request at ffffffffc0de8f13 [ptlrpc]
      #13 [ffff881d33ff3de0] ptlrpc_main at ffffffffc0dec862 [ptlrpc]
      #14 [ffff881d33ff3ec8] kthread at ffffffff810b4031
      #15 [ffff881d33ff3f50] ret_from_fork at ffffffff816c155d
      

      In the OSS case the resource was an object, so we could easily go back to the file from lr_name by looking up the oid.
      The resource had huge lists in lr_granted (23668 locks) and lr_waiting (70074), and I've seen resources with even more on other crashes (this one happened quite a few times). All of these locks are l_req_mode = LCK_PR and in this case correspond to mmap'd libraries (e.g. MPI objects dlopen'd by jobs). They were all cancelled by an rsync updating the mtime on the inode (utimensat): the source of the rsync has sub-second timestamp precision and Lustre does not, so rsync faithfully tries to set the timestamps back on every run.
      For now I'm making sure these rsyncs no longer touch the files, but that might not always be possible.
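
      To illustrate the precision mismatch, here is a minimal standalone sketch (my own illustration, not rsync's or Lustre's code; the file name and timestamp values are arbitrary): it sets a nanosecond-precision mtime with utimensat() and reads it back. On a filesystem that only stores whole seconds the nanosecond part is lost, so rsync's timestamp comparison against a sub-second-precision source fails on every run and it re-applies the mtime each time, cancelling all those PR locks again.

      /*
       * Minimal illustration (not rsync's code): set a sub-second mtime
       * and read it back.  The target file must already exist.  On a
       * filesystem that only stores whole seconds the nanosecond part
       * comes back as 0, so a comparison against a sub-second-precision
       * source mismatches on every run.
       */
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/stat.h>
      #include <time.h>

      int main(int argc, char **argv)
      {
              const char *path = argc > 1 ? argv[1] : "testfile";
              struct timespec times[2] = {
                      { .tv_nsec = UTIME_OMIT },      /* leave atime alone */
                      { .tv_sec = 1580000000, .tv_nsec = 123456789 },
              };
              struct stat st;

              if (utimensat(AT_FDCWD, path, times, 0) < 0) {
                      perror("utimensat");
                      return 1;
              }
              if (stat(path, &st) < 0) {
                      perror("stat");
                      return 1;
              }
              /* a whole-second filesystem reports the .123456789 part as 0 */
              printf("stored mtime: %lld.%09ld\n",
                     (long long)st.st_mtim.tv_sec, (long)st.st_mtim.tv_nsec);
              return 0;
      }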

      In the MGS case, I've seen a filesystem config lock (lr_name = fsname in hex) with either CONFIG_T_RECOVER (0x2) when some OSS reconnected, or CONFIG_T_CONFIG (0x0) when updating some OST pool (pool_add/remove); from memory we had around 300k entries in the list. In the crash I'm looking at there is only one entry in lr_granted and 383193 in lr_waiting; all are LCK_CR except the first lock in lr_waiting.

      I've looked at the last granted lock; it is associated with a thread waiting on an allocation:

      PID: 22186  TASK: ffff881f4a6f6eb0  CPU: 12  COMMAND: "ll_mgs_0003"
       #0 [ffff881f1f433a18] __schedule at ffffffff816b3de4
       #1 [ffff881f1f433aa8] __cond_resched at ffffffff810c4bd6
       #2 [ffff881f1f433ac0] _cond_resched at ffffffff816b46aa
       #3 [ffff881f1f433ad0] kmem_cache_alloc at ffffffff811e3f15
       #4 [ffff881f1f433b10] LNetMDBind at ffffffffc0b3005c [lnet]
       #5 [ffff881f1f433b50] ptl_send_buf at ffffffffc0dd3c6f [ptlrpc]
       #6 [ffff881f1f433c08] ptlrpc_send_reply at ffffffffc0dd705b [ptlrpc]
       #7 [ffff881f1f433c80] target_send_reply_msg at ffffffffc0d9854e [ptlrpc]
       #8 [ffff881f1f433ca0] target_send_reply at ffffffffc0da2a5e [ptlrpc]
       #9 [ffff881f1f433cf8] tgt_request_handle at ffffffffc0e43527 [ptlrpc]
      #10 [ffff881f1f433d40] ptlrpc_server_handle_request at ffffffffc0de8f13 [ptlrpc]
      

      It is just waiting to be scheduled, which is not easy when spinlocks are hogging all the cores. It doesn't even look like a client problem here, but at this point the server was already slow to respond, causing more OSTs to reconnect, causing more recoveries, causing more of these... so it is hard to tell.


      In both cases the root problem looks the same to me: why are we iterating over this list under a spinlock so often?! We shouldn't be using a spinlock for list traversal in the first place.
      I think our setup is especially bad because we have mixed interconnects with different latencies: by the time the first of the slower half of the clients agrees to cancel its lock, the faster half of the clients have already all tried to reconnect, so lr_waiting is huge and all the remaining accepted cancels are horribly slow (because it is each cancel being acknowledged that triggers the list traversal to check whether some waiting locks can now be granted).
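
      To make the cost concrete, here is a deliberately simplified sketch of that pattern (my own illustration, not the actual ldlm_reprocess_all code; reprocess_all, lock_compatible and the structures below are placeholders, with a pthread spinlock standing in for lr_lock): every acknowledged cancel takes the resource spinlock and rescans the whole waiting list, so the total work is quadratic in the queue length. With 383193 waiters that is on the order of 10^11 list-node visits, all done while every other server thread spins on the same lock.

      /*
       * Simplified sketch of the pattern, NOT the real Lustre code:
       * every time a granted lock goes away, the whole waiting list is
       * rescanned under the resource spinlock to see what can now be
       * granted.  With N waiters drained one cancel at a time this is
       * O(N^2) work done while other threads spin on res_lock.
       */
      #include <pthread.h>
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      struct waiting_lock {
              struct waiting_lock *next;
              bool granted;
      };

      struct resource {
              pthread_spinlock_t res_lock;    /* stands in for lr_lock */
              struct waiting_lock *waiting;   /* stands in for lr_waiting */
      };

      static unsigned long long visits;       /* counts list-node visits */

      /* placeholder compatibility check; its details don't matter for the cost */
      static bool lock_compatible(const struct waiting_lock *lck)
      {
              (void)lck;
              return false;
      }

      /* called once per acknowledged cancel: walks the whole waiting list */
      static void reprocess_all(struct resource *res)
      {
              pthread_spin_lock(&res->res_lock);
              for (struct waiting_lock *lck = res->waiting; lck; lck = lck->next) {
                      visits++;
                      if (!lck->granted && lock_compatible(lck))
                              lck->granted = true;    /* would send a completion AST */
              }
              pthread_spin_unlock(&res->res_lock);
      }

      int main(void)
      {
              enum { N = 20000 };             /* the real crashes had 70k-380k waiters */
              static struct waiting_lock locks[N];
              struct resource res;
              int i;

              pthread_spin_init(&res.res_lock, PTHREAD_PROCESS_PRIVATE);
              for (i = 0; i < N; i++)
                      locks[i].next = i + 1 < N ? &locks[i + 1] : NULL;
              res.waiting = locks;

              /* one full rescan per acknowledged cancel */
              for (i = 0; i < N; i++)
                      reprocess_all(&res);

              printf("%d cancels over %d waiters -> %llu list-node visits\n",
                     N, N, visits);
              return 0;
      }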

      Anyway, this has become quite problematic for us lately, and it doesn't look much better on master, so I would be more than open to workarounds or ideas to try out.

          People

            Assignee: Peter Jones (pjones)
            Reporter: Dominique Martinet (martinetd) (Inactive)
            Votes: 0
            Watchers: 4
