[LU-9114] Make MDS (And other server threads?) hog CPU less Created: 14/Feb/17  Updated: 17/Dec/20  Resolved: 17/Dec/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Minor
Reporter: Oleg Drokin Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-4423 Tracking of patches from upstream ker... Resolved

 Description   

Various logs fairly commonly show pacemaker complaining that its thread was not scheduled for tens of seconds, which is far too long.
The MDS is indeed pretty CPU hungry, but we need to ensure we insert enough schedule points so that other processes get a shot at the CPU too.

Some band-aids have also been discussed, such as using NUMA settings to cordon off one CPU from use by Lustre, but those are just that: band-aids.

We can probably play with the various debug settings that warn about this and lower the timeouts to try to catch more of the offenders. There are likely a bunch in the flock code with its double loops.
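For illustration only, here is a minimal userspace sketch of what a schedule point inside a double loop looks like. It is not Lustre code: `count_conflicts`, `maybe_yield`, and `YIELD_INTERVAL` are hypothetical names, and `sched_yield()` merely stands in for the kernel's `cond_resched()`, which is what in-kernel code would normally call.

```c
#include <sched.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many inner iterations between yields. */
#define YIELD_INTERVAL 1024

/* Userspace stand-in for a kernel schedule point such as cond_resched(). */
static inline void maybe_yield(void)
{
    sched_yield();
}

/*
 * Compare every element of a[] against every element of b[] (a double
 * loop), offering the CPU to other runnable tasks every YIELD_INTERVAL
 * iterations. If yields is non-NULL, it is incremented once per yield.
 */
size_t count_conflicts(const int *a, size_t na,
                       const int *b, size_t nb, size_t *yields)
{
    size_t conflicts = 0;
    size_t work = 0;

    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            if (a[i] == b[j])
                conflicts++;
            /* Schedule point: bound the time spent without yielding. */
            if (++work % YIELD_INTERVAL == 0) {
                maybe_yield();
                if (yields != NULL)
                    (*yields)++;
            }
        }
    }
    return conflicts;
}
```

Yielding every fixed number of iterations, rather than on every pass, keeps the overhead of the schedule point small while still bounding how long the loop can run uninterrupted.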



 Comments   
Comment by Andreas Dilger [ 20/Apr/17 ]

In addition to checking whether any one MDS thread is running too long without scheduling, it may also be that the many MDS kernel threads are scheduled with a higher priority and prevent the userspace threads from being run. I think for pacemaker and such, it makes sense to mlock() the heartbeat daemons into memory (so they aren't swapped) and run them with realtime priority (or something like nice -15) so that they can always get CPU time even when all of the MDS threads are running.
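A rough sketch of that suggestion, assuming a Linux daemon that can call `mlockall()` and `sched_setscheduler()` on itself at startup. The helper names `clamp_rt_priority` and `protect_heartbeat_process` are hypothetical and are not taken from pacemaker or Lustre. Both calls normally need elevated privileges (or suitable rlimits), so the sketch reports failure rather than aborting.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Clamp a requested priority into the valid SCHED_RR range. */
int clamp_rt_priority(int prio)
{
    int lo = sched_get_priority_min(SCHED_RR);
    int hi = sched_get_priority_max(SCHED_RR);

    if (prio < lo)
        return lo;
    if (prio > hi)
        return hi;
    return prio;
}

/*
 * Lock the calling process into RAM and switch it to realtime
 * round-robin scheduling. Returns 0 if both steps succeed and -1 if
 * either fails (typically for lack of privilege).
 */
int protect_heartbeat_process(int rt_priority)
{
    struct sched_param sp;
    int rc = 0;

    /* Keep current and future pages resident so they are never swapped. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        fprintf(stderr, "mlockall: %s\n", strerror(errno));
        rc = -1;
    }

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = clamp_rt_priority(rt_priority);

    /* pid 0 means "the calling process". */
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        rc = -1;
    }
    return rc;
}
```

A daemon would call `protect_heartbeat_process()` once, early in startup. Where realtime scheduling is not wanted, the same effect can be approximated from outside the process with a negative nice value.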

Comment by Peter Jones [ 14/Dec/17 ]

Dmitry

Can you please investigate this area as a longer-term task for 2018?

Peter

Comment by Gerrit Updater [ 17/Jul/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39435
Subject: LU-9114 ldlm: don't compute sumsq for pool stats
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bbe08711a531af70404ef1ba5ffe17815cede034

Comment by Gerrit Updater [ 17/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39435/
Subject: LU-9114 ldlm: don't compute sumsq for pool stats
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 966f6bb550be52e4bf5dd7fd38a0d707fe2a5072

Comment by Peter Jones [ 17/Dec/20 ]

Landed for 2.14

Generated at Sat Feb 10 02:23:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.