[LU-15821] Server driven blocking callbacks can wait behind general lru_size management
Created: 05/May/22  Updated: 02/Aug/23  Resolved: 03/Aug/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.4

Type: Bug
Priority: Minor
Reporter: Patrick Farrell
Assignee: Patrick Farrell
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16285 Prolong the lock BL timeout Resolved
is related to LU-15822 debug in lock_matches Open
is related to LU-15915 /bin/rm: fts_read failed: Cannot send... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The current code places bl_ast lock callbacks at the end of the global BL callback queue.  This is bad because urgent requests from the server then wait behind the non-urgent cleanup work that keeps lru_size at the right level.

This can lead to evictions: if the global queue holds a large number of items, the callback is not serviced in a timely manner.

Put bl_ast callbacks on the priority queue so they do not wait behind the background traffic.
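
As a rough illustration of the idea (a toy model only, not the Lustre implementation), the sketch below keeps two queues and always drains the urgent one first. The class name, method names, and work-item strings are all hypothetical and chosen for this example.

```python
from collections import deque


class BlockingCallbackPool:
    """Toy two-level work pool. Hypothetical names; not Lustre code."""

    def __init__(self):
        self.prio = deque()    # urgent: server-driven blocking callbacks
        self.normal = deque()  # background: cancellations that trim the LRU

    def submit(self, item, urgent=False):
        # Urgent items go to the priority queue instead of the shared tail.
        (self.prio if urgent else self.normal).append(item)

    def next_work(self):
        # Always serve the priority queue before background work.
        if self.prio:
            return self.prio.popleft()
        if self.normal:
            return self.normal.popleft()
        return None


pool = BlockingCallbackPool()
for i in range(3):
    pool.submit(f"lru-cancel-{i}")          # background work queued first
pool.submit("bl_ast-lock-A", urgent=True)   # server callback arrives last

order = []
while (work := pool.next_work()) is not None:
    order.append(work)
print(order)
```

Even though the blocking callback is submitted last, it is serviced first. A real pool would also need some way to keep background work from starving under a steady stream of urgent items; this sketch leaves that out.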



 Comments   
Comment by Gerrit Updater [ 05/May/22 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47215
Subject: LU-15821 ldlm: Prioritize blocking callbacks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 65a5b8d27e6d6a0acf8bc87458b8837509e60b23

Comment by Gerrit Updater [ 03/Aug/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47215/
Subject: LU-15821 ldlm: Prioritize blocking callbacks
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2d59294d52b696125acc464e5910c893d9aef237

Comment by Peter Jones [ 03/Aug/22 ]

Landed for 2.16

Comment by Gerrit Updater [ 04/Aug/22 ]

"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/48122
Subject: LU-15821 ldlm: Fix unsafe blwi access
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 97326667eb8f960fa2996099a6bd2b96496d026e
(Patch not needed)

Comment by Gerrit Updater [ 05/Oct/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48764
Subject: LU-15821 ldlm: Prioritize blocking callbacks
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 74666c9fff24126922e5635fbaf2394bf7eda118

Comment by Stephane Thiell [ 12/Jan/23 ]

It would be nice to have this patch backported to 2.15.x, we have been running it for a while on 2.15.1 clients with good results.

Comment by Gerrit Updater [ 12/Jan/23 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49610
Subject: LU-15821 ldlm: Prioritize blocking callbacks
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: b3e9e0eeadd783d22065871df28ea32f2d3c6934

Comment by Andreas Dilger [ 12/Jan/23 ]

"we have been running it for a while on 2.15.1 clients with good results."

Stephane, by "good results" do you mean "it doesn't cause problems" or "it visibly improved/removed some problem that you were seeing with client evictions"? In the use case that drove the initial development of this patch it didn't totally solve the issue. Yang Sheng also just developed patch https://review.whamcloud.com/49527 "LU-16285 ldlm: improvement of bl lock queue" to further improve the handling of highly-contended DLM locks.

Comment by Stephane Thiell [ 13/Jan/23 ]

Andreas, since we applied this patch in October 2022, we have not seen problems again from the two workloads that were previously causing trouble:

  • GNU parallel with --tmpdir
  • sort with --temporary-directory

In both use cases, files are created in a temporary directory and are used while unlinked (not visible in the directory but still open). This may have triggered some sort of contention in Lustre, leading to evictions.

But at the same time, we have also tried to redirect our users to local scratch filesystems to avoid further issues, as a parallel filesystem was not really needed. So I can't tell you for sure that this patch resolves these issues, but at least it didn't introduce anything bad and we would like to keep it for now. It would be convenient for us if it was added to 2.15, but otherwise, I will just continue to backport it. I hope that the context helps a bit.

Thanks also for the pointer to the other patch from Yang Sheng.

Comment by Gerrit Updater [ 02/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49610/
Subject: LU-15821 ldlm: Prioritize blocking callbacks
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: 8ca1186151faa778edd5abd361e92fcd5d8ff56b
