Details
- Type: Improvement
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.15.5
- Environment: RHEL 8.9 client running lustre 2.15.5
- Severity: 3
Description
When running a single-shared-file IOR workload on a compute node with a large number of cores, it is possible to trigger soft lockups. Applying LU-17630 helps but does not entirely resolve the issue. The stack traces logged by the soft lockup watchdog indicate the cause is heavy contention on the page cache spin lock in delete_from_page_cache().
RIP: 0010:delete_from_page_cache+0x52/0x70
[ 9375.915829] generic_error_remove_page+0x36/0x60
[ 9375.915837] cl_page_discard+0x47/0x80 [obdclass]
[ 9375.915883] discard_pagevec+0x7d/0x150 [osc]
[ 9375.915900] osc_lru_shrink+0x87f/0x8b0 [osc]
[ 9375.915913] lru_queue_work+0xfd/0x230 [osc]
[ 9375.915925] work_interpreter+0x32/0x110 [ptlrpc]
[ 9375.915992] ptlrpc_check_set+0x5cf/0x1fc0 [ptlrpc]
[ 9375.916052] ptlrpcd+0x6df/0xa70 [ptlrpc]
[ 9375.916176] kthread+0x14c/0x170
It looks like this is possible because:
1. Multiple callers pass 'force=1' to osc_lru_shrink(), allowing multiple threads to run concurrently. lru_queue_work() does use 'force=0', which is good.
2. There is no per-filesystem or per-node limit on how many threads can run osc_lru_shrink(); it is only limited per client_obd via the 'cl_lru_shrinkers' atomic (see the sketch after this list).
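To make the second point concrete, here is a simplified sketch (not the actual Lustre code; the struct and function bodies are invented for illustration, only the 'cl_lru_shrinkers' name comes from the ticket) of how a per-client_obd gate of this kind behaves: non-forced callers back off if another thread is already shrinking, while forced callers skip the check, so several of them can end up discarding pages and taking the same page cache locks at once.

#include <linux/atomic.h>

struct client_obd_sketch {
	atomic_t cl_lru_shrinkers;	/* threads currently shrinking this OSC's LRU */
};

static long osc_lru_shrink_sketch(struct client_obd_sketch *cli,
				  long target, bool force)
{
	long count = 0;

	/* Non-forced callers yield if a shrinker is already running... */
	if (!force && atomic_read(&cli->cl_lru_shrinkers) > 0)
		return 0;

	/* ...but forced callers proceed unconditionally. */
	atomic_inc(&cli->cl_lru_shrinkers);
	/* walk the per-OSC LRU, discard pages, contend on mapping locks */
	atomic_dec(&cli->cl_lru_shrinkers);

	return count;
}

Note this counter is per client_obd, i.e. per OSC, so on a client with many OSCs it places no bound on the total number of shrinker threads hitting the page cache.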
I'll push a patch for review which adds a per-filesystem limit. Interestingly, it looks like portions of this may have been implemented long ago but never completed. The proposed patch still needs to be tested on a system with a large number of OSCs, but I wanted to post it for initial feedback. A rough sketch of the intended behaviour follows.
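The sketch below is only an illustration of the idea, not the proposed patch itself; the field name, cap value, and helper functions are hypothetical. The point is that a shared counter on the filesystem-wide client cache object caps how many threads may discard pages concurrently, regardless of which OSC they entered through, and that even forced callers respect the cap.

#include <linux/atomic.h>

#define LRU_SHRINKERS_MAX	2	/* hypothetical per-filesystem cap */

struct client_cache_sketch {
	atomic_t ccc_lru_shrinkers;	/* active shrinker threads, fs-wide */
};

static bool lru_shrink_trylock(struct client_cache_sketch *cache)
{
	/* Claim a slot; back out if the filesystem-wide cap is exceeded. */
	if (atomic_inc_return(&cache->ccc_lru_shrinkers) <= LRU_SHRINKERS_MAX)
		return true;

	atomic_dec(&cache->ccc_lru_shrinkers);
	return false;
}

static void lru_shrink_unlock(struct client_cache_sketch *cache)
{
	atomic_dec(&cache->ccc_lru_shrinkers);
}

With a gate like this, a caller would take a slot before walking the LRU (and either skip or retry the shrink when none is available), which bounds the number of threads simultaneously spinning on the page cache lock in delete_from_page_cache().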
Attachments
Issue Links
- is related to LU-17630 osc_lru_shrink() should not block scheduling for long (Resolved)