LU-9114: Make MDS (And other server threads?) hog CPU less

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0

    Description

      It is seen fairly widely in various logs that pacemaker complains its thread was not scheduled for tens of seconds, which is far too long.
      The MDS is certainly CPU-hungry, but we need to ensure we insert enough scheduling points that other processes get a shot at the CPU too.

      There are also some band-aids discussed, like using NUMA settings to cordon off one CPU from use by Lustre, but those are just that - band-aids.

      We can probably play with the various debug settings that warn about this and lower their timeouts to try to catch more of the offenders. We likely have a bunch in the flock code with its double loops.
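
      To make "scheduling points" concrete, here is a minimal sketch (not one of the actual LU-9114 patches) of the usual fix: a long server-side loop that calls the kernel's cond_resched() so other runnable tasks, including userspace daemons like pacemaker, get a chance at the CPU. The queue and element names are purely illustrative.

      #include <linux/list.h>
      #include <linux/sched.h>

      struct example_lock {                   /* hypothetical queue element */
              struct list_head el_list;
      };

      static void example_scan_locks(struct list_head *queue)
      {
              struct example_lock *lck;

              list_for_each_entry(lck, queue, el_list) {
                      /* process_one_lock(lck);  -- hypothetical per-item work */

                      /* scheduling point: yield if another task needs the CPU */
                      cond_resched();
              }
      }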

    Attachments

    Issue Links

    Activity

            [LU-9114] Make MDS (And other server threads?) hog CPU less
            pjones Peter Jones added a comment -

            Landed for 2.14


            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39435/
            Subject: LU-9114 ldlm: don't compute sumsq for pool stats
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 966f6bb550be52e4bf5dd7fd38a0d707fe2a5072
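
            To illustrate what "don't compute sumsq" means, here is a minimal, hypothetical counter sketch (illustrative names only, not Lustre's lprocfs API): keeping a sum of squares for standard deviation costs a 64-bit multiply on every sample on a hot path, while count/sum/min/max remain cheap to maintain.

            #include <linux/types.h>

            struct hot_path_counter {
                    u64 c_count;
                    u64 c_sum;
                    u64 c_min;
                    u64 c_max;
                    /* u64 c_sumsquare;  -- omitted: not worth the per-sample cost */
            };

            static inline void counter_add(struct hot_path_counter *c, u64 val)
            {
                    c->c_count++;
                    c->c_sum += val;
                    if (val < c->c_min)
                            c->c_min = val;
                    if (val > c->c_max)
                            c->c_max = val;
                    /* c->c_sumsquare += val * val;  -- the multiply being avoided */
            }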


            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39435
            Subject: LU-9114 ldlm: don't compute sumsq for pool stats
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bbe08711a531af70404ef1ba5ffe17815cede034

            pjones Peter Jones added a comment -

            Dmitry

            Can you please investigate this area as a longer-term task for 2018?

            Peter


            In addition to checking whether any one MDS thread is running too long without scheduling, it may also be that the many MDS kernel threads are scheduled at a higher priority and prevent the userspace threads from being run. I think for pacemaker and such, it makes sense to mlock() the heartbeat daemons into memory (so they aren't swapped out) and run them with realtime priority (or something like nice -15) so that they can always get CPU time even when all of the MDS threads are running.
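
            As a sketch of that suggestion (assumptions: run once at daemon startup, with CAP_IPC_LOCK and CAP_SYS_NICE; the priority value is arbitrary), the heartbeat daemon would pin itself in RAM and switch to realtime scheduling roughly like this:

            #include <sched.h>
            #include <stdio.h>
            #include <sys/mman.h>

            int main(void)
            {
                    struct sched_param sp = { .sched_priority = 10 };

                    /* keep current and future pages resident so the daemon never has to page in */
                    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                            perror("mlockall");

                    /* realtime scheduling; nice(-15) would be the softer alternative mentioned above */
                    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
                            perror("sched_setscheduler");

                    /* ... daemon main loop would follow ... */
                    return 0;
            }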


            People

              Assignee: adilger Andreas Dilger
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: