[LU-13413] Lustre soft lockups with peer credit exhaustion Created: 03/Apr/20 Updated: 11/Jul/22 Resolved: 02/Dec/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | James A Simmons | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ORNL | ||
| Environment: |
RHEL7.7 running Lustre 2.12 LTS |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Currently our production file system is experiencing peer credit exhaustion, which is leading to soft lockups. The backtraces are attached. Instead of soft lockups we should be seeing evictions.

[2642432.333239] [<ffffffff9477544a>] queued_spin_lock_slowpath+0xb/0xf
[2642432.333241] [<ffffffff94783330>] _raw_spin_lock+0x20/0x30
[2642432.333258] [<ffffffffc146202c>] lock_res_and_lock+0x2c/0x50 [ptlrpc]
[2642432.333273] [<ffffffffc1469c61>] ldlm_lock_enqueue+0x1b1/0xa20 [ptlrpc]
[2642432.333294] [<ffffffffc14b9891>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[2642432.333311] [<ffffffffc1492506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[2642432.333332] [<ffffffffc14bb300>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[2642432.333357] [<ffffffffc151acf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[2642432.333381] [<ffffffffc1521b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[2642432.333404] [<ffffffffc14fb021>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[2642432.333408] [<ffffffffc103fbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[2642432.333428] [<ffffffffc14c646b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2642432.333449] [<ffffffffc14c3285>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[2642432.333451] [<ffffffff940d3903>] ? __wake_up+0x13/0x20
[2642432.333471] [<ffffffffc14c9dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
|
| Comments |
| Comment by Andreas Dilger [ 03/Apr/20 ] |
|
Hi James, any changes to the system that would have triggered this to start happening (upgraded Lustre, new patch, new clients, new workload, etc.)? |
| Comment by James A Simmons [ 06/Apr/20 ] |
|
I talked to our admin team and they stated it has been the same problem since the upgrade to 2.12 LTS in October. Do you think tuning the lru_resize and lru_size settings might help? The current values are lru_size=2000 and lru_resize=3900000 |
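For reference, a minimal sketch of how the current LRU settings can be inspected with lctl on a client (the parameter globs below are illustrative; note that lru_size=0 is what enables the dynamic LRU-resize behaviour, while a non-zero value caps the number of unused locks the client caches per namespace):

# per-namespace LRU settings on a client (one line per OST/MDT namespace)
lctl get_param ldlm.namespaces.*.lru_size
lctl get_param ldlm.namespaces.*.lru_max_age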
| Comment by Andreas Dilger [ 06/Apr/20 ] |
|
James, how many locks are actually being held by the clients, and how many are on the servers? Running "lctl get_param ldlm.namespaces.*.lock_count" on some clients and servers would give a good idea of where things stand. There were a couple of bugs related to not canceling the DLM locks on the clients, even above the LRU size, so it would be good to know if this is the problem you are seeing. Also, are you running any patches on top of 2.12.4? |
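A rough illustration of gathering those counts, assuming shell access to the nodes (the awk summation is just a convenience for a quick per-node total; fan-out with ssh/pdsh is left out):

# per-namespace lock counts on a client or server
lctl get_param ldlm.namespaces.*.lock_count
# quick per-node total across all namespaces
lctl get_param -n ldlm.namespaces.*.lock_count | awk '{sum += $1} END {print sum}'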
| Comment by James A Simmons [ 06/Apr/20 ] |
|
The only patch we are running outside of 2.12 is a port of
f2-oss1c1: ldlm.namespaces.filter-f2-OST0002_UUID.lock_count=13071 |
| Comment by Andreas Dilger [ 06/Apr/20 ] |
OK, so this is totally reasonable, even low by my expectation, so it doesn't seem like the spinlock hold times are driven by having too many DLM locks to process, unless there is a sudden spike of locks at the time the problem occurs? According to the stack traces, there are only really a few "interesting" processes; the rest are just blocked on (likely) the same spinlock in either ldlm_lock_enqueue() or ofd_lvbo_fill(). In both cases, it looks like the ldlm_cn threads are processing a potentially long list of locks on a single resource:

[2642432.309284] CPU: 8 PID: 57948 Comm: ldlm_cn00_044 3.10.0-1062.9.1.el7.x86_64
[2642432.309314] Call Trace:
[2642432.309331] [<ffffffffc14808e3>] ldlm_process_extent_lock+0x213/0x490 [ptlrpc]
[2642432.309362] [<ffffffffc146959e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
[2642432.309383] [<ffffffffc146a5d2>] __ldlm_reprocess_all+0x102/0x360 [ptlrpc]
[2642432.309397] [<ffffffffc146a843>] ldlm_reprocess_all+0x13/0x20 [ptlrpc]
[2642432.309414] [<ffffffffc148f4df>] ldlm_request_cancel+0x42f/0x780 [ptlrpc]
[2642432.309431] [<ffffffffc14937a2>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
[2642432.309447] [<ffffffffc1493978>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[2642432.309467] [<ffffffffc14c646b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2642432.309509] [<ffffffffc14c9dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[2642432.309530] [<ffffffff940c61f1>] kthread+0xd1/0xe0

[2644718.775432] CPU: 15 PID: 52157 Comm: ldlm_cn01_021 3.10.0-1062.9.1.el7.x86_64
[2644718.775459] Call Trace:
[2644718.775477] [<ffffffffc14808e3>] ldlm_process_extent_lock+0x213/0x490 [ptlrpc]
[2644718.775508] [<ffffffffc146959e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
[2644718.775528] [<ffffffffc146a5d2>] __ldlm_reprocess_all+0x102/0x360 [ptlrpc]
[2644718.775543] [<ffffffffc146a843>] ldlm_reprocess_all+0x13/0x20 [ptlrpc]
[2644718.775559] [<ffffffffc148f4df>] ldlm_request_cancel+0x42f/0x780 [ptlrpc]
[2644718.775576] [<ffffffffc14937a2>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
[2644718.775593] [<ffffffffc1493978>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[2644718.775613] [<ffffffffc14c646b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2644718.775656] [<ffffffffc14c9dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[2644718.775677] [<ffffffff940c61f1>] kthread+0xd1/0xe0

I would guess there is some application which is "piling on" a single object with lots of conflicting lock requests (e.g. high-IOPS small/unaligned writes to one file on an HDD OST), so the lock queue is repeatedly reprocessed each time a new lock is granted or released. I was wondering whether patch https://review.whamcloud.com/33221 "LU-11085 ldlm: simplify use of interval-tree" had already landed and might be causing this, but I see it has not. It seems like the interval tree should protect us against scanning a lot of non-overlapping locks, so possibly there are a lot of overlapping lock requests from the application? |
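A hypothetical sketch of the kind of workload being described here, i.e. many clients issuing small, unaligned, overlapping writes against a single shared file. The path, process count, and offsets are made up for illustration; this is not a confirmed reproducer for the lockups in this ticket:

#!/bin/bash
# Run concurrently from many client nodes against the same Lustre file to
# generate many conflicting extent lock requests on a single OST object.
FILE=/lustre/f2/scratch/shared_object    # assumed path, illustration only
for i in $(seq 1 16); do
  (
    for j in $(seq 1 1000); do
      # 1-byte writes at pseudo-random, unaligned offsets; each write needs an
      # extent lock that conflicts with locks held by writers on other nodes.
      dd if=/dev/zero of="$FILE" bs=1 count=1 seek=$((RANDOM % 1048576)) \
         conv=notrunc status=none
    done
  ) &
done
wait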
| Comment by Dustin Leverman [ 30/Apr/20 ] |
|
Andreas, I wanted to follow up with you on this since it has been a while. Generally the workload that triggers this happens during the first week of the month (a specific code is run by a subset of users on a set schedule). Based on your input about conflicting locks, we changed the lru_size configuration on the compute nodes to a fixed limit of 200 instead of the dynamic setting (see the sketch after this comment). We were thinking that this would limit the number of metadata transactions a single compute node could issue concurrently to the MDS. After this change we have not seen any more crashes, but it is possible this will crop up again here in the next week.
Thanks, Dustin |
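A minimal sketch of how such an lru_size cap might be applied on the compute nodes, assuming direct lctl access on the clients (lctl set_param is not persistent across remounts, so in practice it would typically be applied from a boot or mount script):

# cap the client DLM lock LRU at 200 unused locks per namespace,
# replacing the dynamic LRU-resize behaviour (lru_size=0)
lctl set_param ldlm.namespaces.*.lru_size=200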
| Comment by Andreas Dilger [ 25/Feb/21 ] |
|
Dustin, James, it looks like this problem has not been hit again since the lru_size=200 change was made back in April. Were you ever able to track down if there was a specific workload that was causing the high lock traffic? |