[LU-13413] Lustre soft lockups with peer credit exhaustion Created: 03/Apr/20  Updated: 11/Jul/22  Resolved: 02/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James A Simmons Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: ORNL
Environment:

RHEL7.7 running Lustre 2.12 LTS


Attachments: Text File vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-12600 Lustre tgt_brw_write() bug Resolved
is related to LU-14221 Client hangs when using DoM with a fi... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Currently our production file system is experiencing peer credit exhaustion, which is leading to soft lockups. The backtraces are attached. Instead of soft lockups, we should be seeing evictions.

[2642432.333239]  [<ffffffff9477544a>] queued_spin_lock_slowpath+0xb/0xf
[2642432.333241]  [<ffffffff94783330>] _raw_spin_lock+0x20/0x30
[2642432.333258]  [<ffffffffc146202c>] lock_res_and_lock+0x2c/0x50 [ptlrpc]
[2642432.333273]  [<ffffffffc1469c61>] ldlm_lock_enqueue+0x1b1/0xa20 [ptlrpc]
[2642432.333294]  [<ffffffffc14b9891>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[2642432.333311]  [<ffffffffc1492506>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[2642432.333332]  [<ffffffffc14bb300>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[2642432.333357]  [<ffffffffc151acf2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[2642432.333381]  [<ffffffffc1521b0a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[2642432.333404]  [<ffffffffc14fb021>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[2642432.333408]  [<ffffffffc103fbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[2642432.333428]  [<ffffffffc14c646b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2642432.333449]  [<ffffffffc14c3285>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[2642432.333451]  [<ffffffff940d3903>] ? __wake_up+0x13/0x20
[2642432.333471]  [<ffffffffc14c9dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
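
For reference, one way to confirm on the affected nodes that LNet peer credits are actually exhausted (a hedged sketch; the field names are from 2.12-era lnetctl output and may differ slightly by version):

# Per-peer credit usage; "available_tx_credits" at or below zero and a
# strongly negative "min_tx_credits" indicate senders queueing behind
# exhausted peer credits.
lnetctl peer show -v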

 Comments   
Comment by Andreas Dilger [ 03/Apr/20 ]

Hi James, any changes to the system that would have triggered this to start happening (upgraded Lustre, new patch, new clients, new workload, etc.)?

Comment by James A Simmons [ 06/Apr/20 ]

I talked to our admin team and they stated it's the same problem we have seen since October, when we upgraded to 2.12 LTS. The LU-12600 fix did reduce the problem, but it still exists. This problem shows up when clients hit the servers with large numbers of small I/Os.

Do you think tuning the lru_resize and lru_size might help? The current values are

lru_size=2000 and lru_resize=3900000
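
For reference, the current values can be read on a client with lctl (a sketch using the standard 2.12 ldlm namespace parameters; lru_max_age is assumed to be the parameter behind the "lru_resize" value quoted above):

# Per-namespace LRU limit (0 means dynamic LRU resizing) and lock age limit
lctl get_param ldlm.namespaces.*.lru_size ldlm.namespaces.*.lru_max_age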

Comment by Andreas Dilger [ 06/Apr/20 ]

James, how many locks are actually being held by the clients and resident on the servers? Running "lctl get_param ldlm.namespaces.*.lock_count" on some clients and servers would give a good idea of where things stand. There were a couple of bugs related to clients not cancelling DLM locks, even above the LRU size, so it would be good to know if this is the problem you are seeing.

Also, are there any patches applied on top of 2.12.4?

Comment by James A Simmons [ 06/Apr/20 ]

The only patch we are running outside of 2.12 is a port of LU-13101.

f2-oss1c1: ldlm.namespaces.filter-f2-OST0002_UUID.lock_count=13071
f2-oss1c1: ldlm.namespaces.filter-f2-OST0026_UUID.lock_count=15371
f2-oss1f5: ldlm.namespaces.filter-f2-OST001d_UUID.lock_count=9772
f2-oss1f5: ldlm.namespaces.filter-f2-OST0041_UUID.lock_count=23522
f2-oss1f4: ldlm.namespaces.filter-f2-OST0017_UUID.lock_count=6410
f2-oss1f4: ldlm.namespaces.filter-f2-OST003b_UUID.lock_count=10967
f2-oss1e2: ldlm.namespaces.filter-f2-OST000a_UUID.lock_count=9424
f2-oss1e2: ldlm.namespaces.filter-f2-OST002e_UUID.lock_count=14095
f2-oss1e6: ldlm.namespaces.filter-f2-OST0022_UUID.lock_count=9126
f2-oss1e6: ldlm.namespaces.filter-f2-OST0046_UUID.lock_count=14720
f2-oss1e1: ldlm.namespaces.filter-f2-OST0004_UUID.lock_count=5522
f2-oss1e1: ldlm.namespaces.filter-f2-OST0028_UUID.lock_count=13113
f2-oss1c4: ldlm.namespaces.filter-f2-OST0014_UUID.lock_count=31000
f2-oss1c4: ldlm.namespaces.filter-f2-OST0038_UUID.lock_count=8096
f2-oss1a1: ldlm.namespaces.filter-f2-OST0000_UUID.lock_count=9730
f2-oss1a1: ldlm.namespaces.filter-f2-OST0024_UUID.lock_count=10386
f2-oss1a2: ldlm.namespaces.filter-f2-OST0006_UUID.lock_count=6744
f2-oss1a2: ldlm.namespaces.filter-f2-OST002a_UUID.lock_count=17169
f2-oss1b2: ldlm.namespaces.filter-f2-OST0007_UUID.lock_count=7458
f2-oss1b2: ldlm.namespaces.filter-f2-OST002b_UUID.lock_count=12837
f2-oss1e3: ldlm.namespaces.filter-f2-OST0010_UUID.lock_count=17431
f2-oss1e3: ldlm.namespaces.filter-f2-OST0034_UUID.lock_count=11464
f2-oss1d4: ldlm.namespaces.filter-f2-OST0015_UUID.lock_count=9836
f2-oss1d4: ldlm.namespaces.filter-f2-OST0039_UUID.lock_count=14168
f2-oss1d3: ldlm.namespaces.filter-f2-OST000f_UUID.lock_count=11082
f2-oss1d3: ldlm.namespaces.filter-f2-OST0033_UUID.lock_count=9801
f2-oss1f6: ldlm.namespaces.filter-f2-OST0023_UUID.lock_count=7558
f2-oss1f6: ldlm.namespaces.filter-f2-OST0047_UUID.lock_count=10884
f2-oss1c6: ldlm.namespaces.filter-f2-OST0020_UUID.lock_count=11754
f2-oss1c6: ldlm.namespaces.filter-f2-OST0044_UUID.lock_count=21609
f2-oss1f2: ldlm.namespaces.filter-f2-OST000b_UUID.lock_count=7273
f2-oss1f2: ldlm.namespaces.filter-f2-OST002f_UUID.lock_count=16546
f2-oss1d2: ldlm.namespaces.filter-f2-OST0009_UUID.lock_count=5351
f2-oss1d2: ldlm.namespaces.filter-f2-OST002d_UUID.lock_count=11648
f2-oss1b4: ldlm.namespaces.filter-f2-OST0013_UUID.lock_count=7091
f2-oss1b4: ldlm.namespaces.filter-f2-OST0037_UUID.lock_count=17672
f2-oss1d1: ldlm.namespaces.filter-f2-OST0003_UUID.lock_count=10786
f2-oss1d1: ldlm.namespaces.filter-f2-OST0027_UUID.lock_count=12821
f2-oss1a3: ldlm.namespaces.filter-f2-OST000c_UUID.lock_count=8583
f2-oss1a3: ldlm.namespaces.filter-f2-OST0030_UUID.lock_count=10223
f2-oss1b1: ldlm.namespaces.filter-f2-OST0001_UUID.lock_count=19223
f2-oss1b1: ldlm.namespaces.filter-f2-OST0025_UUID.lock_count=12262
f2-oss1c3: ldlm.namespaces.filter-f2-OST000e_UUID.lock_count=19796
f2-oss1c3: ldlm.namespaces.filter-f2-OST0032_UUID.lock_count=19187
f2-oss1c2: ldlm.namespaces.filter-f2-OST0008_UUID.lock_count=22208
f2-oss1c2: ldlm.namespaces.filter-f2-OST002c_UUID.lock_count=17602
f2-oss1e5: ldlm.namespaces.filter-f2-OST001c_UUID.lock_count=10177
f2-oss1e5: ldlm.namespaces.filter-f2-OST0040_UUID.lock_count=18898
f2-oss1d6: ldlm.namespaces.filter-f2-OST0021_UUID.lock_count=9237
f2-oss1d6: ldlm.namespaces.filter-f2-OST0045_UUID.lock_count=23232
f2-oss1b6: ldlm.namespaces.filter-f2-OST001f_UUID.lock_count=8040
f2-oss1b6: ldlm.namespaces.filter-f2-OST0043_UUID.lock_count=22877
f2-oss1f1: ldlm.namespaces.filter-f2-OST0005_UUID.lock_count=8117
f2-oss1f1: ldlm.namespaces.filter-f2-OST0029_UUID.lock_count=9207
f2-oss1b5: ldlm.namespaces.filter-f2-OST0019_UUID.lock_count=12895
f2-oss1b5: ldlm.namespaces.filter-f2-OST003d_UUID.lock_count=25789

Comment by Andreas Dilger [ 06/Apr/20 ]

f2-oss1f1: ldlm.namespaces.filter-f2-OST0029_UUID.lock_count=9207
f2-oss1b5: ldlm.namespaces.filter-f2-OST0019_UUID.lock_count=12895
f2-oss1b5: ldlm.namespaces.filter-f2-OST003d_UUID.lock_count=25789

OK, so these counts are totally reasonable, even low by my expectations, so it doesn't seem like the spinlock hold times are driven by having too many DLM locks to process, unless there is a sudden spike of locks at the time the problem occurs?

According to the stack traces, there are only really a few "interesting" processes; the rest are just blocked on (likely) the same spinlock in either ldlm_lock_enqueue() or ofd_lvbo_fill(). In both cases, it looks like the ldlm_cn threads are processing a potentially long list of locks on a single resource:

[2642432.309284] CPU: 8 PID: 57948 Comm: ldlm_cn00_044  3.10.0-1062.9.1.el7.x86_64
[2642432.309314] Call Trace:
[2642432.309331]  [<ffffffffc14808e3>] ldlm_process_extent_lock+0x213/0x490 [ptlrpc]
[2642432.309362]  [<ffffffffc146959e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
[2642432.309383]  [<ffffffffc146a5d2>] __ldlm_reprocess_all+0x102/0x360 [ptlrpc]
[2642432.309397]  [<ffffffffc146a843>] ldlm_reprocess_all+0x13/0x20 [ptlrpc]
[2642432.309414]  [<ffffffffc148f4df>] ldlm_request_cancel+0x42f/0x780 [ptlrpc]
[2642432.309431]  [<ffffffffc14937a2>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
[2642432.309447]  [<ffffffffc1493978>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[2642432.309467]  [<ffffffffc14c646b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2642432.309509]  [<ffffffffc14c9dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[2642432.309530]  [<ffffffff940c61f1>] kthread+0xd1/0xe0

[2644718.775432] CPU: 15 PID: 52157 Comm: ldlm_cn01_021  3.10.0-1062.9.1.el7.x86_64
[2644718.775459] Call Trace:
[2644718.775477]  [<ffffffffc14808e3>] ldlm_process_extent_lock+0x213/0x490 [ptlrpc]
[2644718.775508]  [<ffffffffc146959e>] ldlm_reprocess_queue+0x1be/0x3f0 [ptlrpc]
[2644718.775528]  [<ffffffffc146a5d2>] __ldlm_reprocess_all+0x102/0x360 [ptlrpc]
[2644718.775543]  [<ffffffffc146a843>] ldlm_reprocess_all+0x13/0x20 [ptlrpc]
[2644718.775559]  [<ffffffffc148f4df>] ldlm_request_cancel+0x42f/0x780 [ptlrpc]
[2644718.775576]  [<ffffffffc14937a2>] ldlm_handle_cancel+0x232/0x2b0 [ptlrpc]
[2644718.775593]  [<ffffffffc1493978>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
[2644718.775613]  [<ffffffffc14c646b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2644718.775656]  [<ffffffffc14c9dd4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[2644718.775677]  [<ffffffff940c61f1>] kthread+0xd1/0xe0

I would guess there is some application that is "piling on" a single object (e.g. high-IOPS small/unaligned writes on an HDD OST) and generating a lot of conflicting locks, so the resource is repeatedly reprocessed each time a lock is granted or released. I was wondering whether patch https://review.whamcloud.com/33221 "LU-11085 ldlm: simplify use of interval-tree" had already landed and might be causing this, but I see it has not. It seems like the interval tree should protect us against scanning a lot of non-overlapping locks, so possibly there are a lot of overlapping lock requests from the application?
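
One way to check this theory the next time the workload runs (a sketch; these are standard server-side ldlm parameters in 2.12, though the exact set can vary):

# A single hot resource shows up as lock_count growing much faster than
# resource_count on one OST namespace.
lctl get_param ldlm.namespaces.filter-*.lock_count ldlm.namespaces.filter-*.resource_count

# Sustained grant/cancel churn during the soft lockups would point at
# heavy reprocessing of conflicting extent locks.
lctl get_param ldlm.namespaces.filter-*.pool.grant_rate ldlm.namespaces.filter-*.pool.cancel_rate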

Comment by Dustin Leverman [ 30/Apr/20 ]

Andreas,

I wanted to follow up with you on this since it has been a while. Generally the workload that triggers this happens during the first week of the month (a specific code is run by a subset of users on a set schedule). Based on your input about conflicting locks, we changed the lru_size configuration on the compute nodes to a limit of 200 instead of the dynamic setting. We were thinking this would limit the number of metadata transactions a single compute node could issue concurrently to the MDS. After this change we have not seen more crashes, but it is possible this will crop up again in the next week.
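
For the record, a limit like this is typically applied on the clients along the lines of the following (a sketch; the exact namespace glob depends on the filesystem name, and setting lru_size back to 0 restores dynamic LRU resizing):

# On each compute node, cap the per-namespace lock LRU at 200
lctl set_param ldlm.namespaces.*.lru_size=200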


Thanks,

Dustin

Comment by Andreas Dilger [ 25/Feb/21 ]

Dustin, James, it looks like this problem has not been hit again since the lru_size=200 change was made back in April. Were you ever able to track down whether there was a specific workload causing the high lock traffic?
