[LU-15821] Server driven blocking callbacks can wait behind general lru_size management Created: 05/May/22 Updated: 02/Aug/23 Resolved: 03/Aug/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Patrick Farrell | Assignee: | Patrick Farrell |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The current code places bl_ast lock callbacks at the end of the global BL callback queue. This is bad because urgent requests from the server end up waiting behind non-urgent cleanup work whose only purpose is to keep lru_size at the right level. If the global queue is long, the callback is not serviced in a timely manner and the client can be evicted. Put bl_ast callbacks on the priority queue so they do not wait behind this background traffic. |
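A minimal sketch of the queueing idea described above, assuming a simplified two-list work pool; the type, field, and function names (bl_pool, prio_list, bl_pool_add, etc.) are illustrative and are not the actual Lustre ldlm code or the patch itself:

```c
/*
 * Illustrative sketch only, NOT the Lustre patch: server-driven blocking
 * callbacks (bl_ast) go on a priority list that the worker drains before
 * the regular list used for background lru_size cancellations.
 */
#include <stdio.h>
#include <sys/queue.h>

enum work_kind {
	WORK_LRU_CANCEL,	/* background: keep lru_size at its target */
	WORK_BL_AST,		/* urgent: server asked the client to drop a lock */
};

struct bl_work_item {
	enum work_kind kind;
	int lock_id;
	TAILQ_ENTRY(bl_work_item) link;
};

TAILQ_HEAD(bl_list, bl_work_item);

struct bl_pool {
	struct bl_list prio_list;	/* bl_ast callbacks */
	struct bl_list regular_list;	/* lru_size housekeeping */
};

static void bl_pool_init(struct bl_pool *p)
{
	TAILQ_INIT(&p->prio_list);
	TAILQ_INIT(&p->regular_list);
}

/* Enqueue: urgent server callbacks skip ahead of background cleanup. */
static void bl_pool_add(struct bl_pool *p, struct bl_work_item *w)
{
	if (w->kind == WORK_BL_AST)
		TAILQ_INSERT_TAIL(&p->prio_list, w, link);
	else
		TAILQ_INSERT_TAIL(&p->regular_list, w, link);
}

/* Dequeue: always service the priority list first. */
static struct bl_work_item *bl_pool_get(struct bl_pool *p)
{
	struct bl_list *l = !TAILQ_EMPTY(&p->prio_list) ?
			    &p->prio_list : &p->regular_list;
	struct bl_work_item *w = TAILQ_FIRST(l);

	if (w != NULL)
		TAILQ_REMOVE(l, w, link);
	return w;
}

int main(void)
{
	struct bl_pool pool;
	struct bl_work_item lru1 = { WORK_LRU_CANCEL, 1 };
	struct bl_work_item lru2 = { WORK_LRU_CANCEL, 2 };
	struct bl_work_item ast  = { WORK_BL_AST, 3 };
	struct bl_work_item *w;

	bl_pool_init(&pool);
	bl_pool_add(&pool, &lru1);
	bl_pool_add(&pool, &lru2);
	bl_pool_add(&pool, &ast);	/* arrives last ... */

	while ((w = bl_pool_get(&pool)) != NULL)
		printf("servicing lock %d (%s)\n", w->lock_id,
		       w->kind == WORK_BL_AST ? "bl_ast" : "lru cancel");
	/* ... but lock 3 (the bl_ast) is serviced first */
	return 0;
}
```

With the pre-patch behavior, the bl_ast item would instead be appended to the single global list and only serviced after every queued lru_size cancellation, which is what allowed the callback timer on the server to expire and evict the client.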
| Comments |
| Comment by Gerrit Updater [ 05/May/22 ] |
|
"Patrick Farrell <pfarrell@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47215 |
| Comment by Gerrit Updater [ 03/Aug/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47215/ |
| Comment by Peter Jones [ 03/Aug/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 04/Aug/22 ] |
|
|
| Comment by Gerrit Updater [ 05/Oct/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48764 |
| Comment by Stephane Thiell [ 12/Jan/23 ] |
|
It would be nice to have this patch backported to 2.15.x; we have been running it for a while on 2.15.1 clients with good results. |
| Comment by Gerrit Updater [ 12/Jan/23 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49610 |
| Comment by Andreas Dilger [ 12/Jan/23 ] |
|
Stephane, by "good results" do you mean "it doesn't cause problems" or "it visibly improved/removed some problem that you were seeing with client evictions"? In the use case that drove the initial development of this patch, it didn't totally solve the issue. Yang Sheng also just developed patch https://review.whamcloud.com/49527. |
| Comment by Stephane Thiell [ 13/Jan/23 ] |
|
Andreas, since we applied this patch last October (2022), we have not seen the following problems again from two workloads that were previously causing trouble:
In both use cases, files are created in a temporary directory and are used while unlinked: they are no longer visible in the directory but are still open, which may have triggered some sort of contention in Lustre leading to evictions. At the same time, we have also tried to redirect our users to local scratch filesystems to avoid further issues, as a parallel filesystem was not really needed. So I can't tell you for sure that this patch resolves these issues, but at least it didn't introduce anything bad, and we would like to keep it for now. It would be convenient for us if it were added to 2.15, but otherwise I will just continue to backport it. I hope the context helps a bit. Thanks also for the pointer to the other patch from Yang Sheng. |
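For reference, the open-then-unlink pattern described in the comment above looks roughly like the generic POSIX sketch below; the file name and I/O are hypothetical and are not taken from the actual workloads:

```c
/*
 * Generic POSIX sketch of the "used while unlinked" pattern: the name is
 * removed right after creation, so the file is invisible in the directory,
 * but the open fd keeps the inode alive until close().
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "./scratch-tmpfile";	/* hypothetical name */
	int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Remove the directory entry immediately; the fd still works. */
	if (unlink(path) < 0)
		perror("unlink");

	/* ... temporary I/O through fd; data vanishes when fd is closed ... */
	if (write(fd, "tmp", 3) != 3)
		perror("write");

	close(fd);
	return 0;
}
```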
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49610/ |