[LU-9441] Use kernel threads in predictable fashion to confine OS noise Created: 03/May/17  Updated: 17/Feb/21  Resolved: 17/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Improvement Priority: Major
Reporter: Larry Meadows (Inactive) Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9660 reduce ptlrpcd wakeups on idle system Resolved
is related to LU-4423 Tracking of patches from upstream ker... Resolved
is related to LU-7236 connections on demand Resolved
is related to LU-13258 Bind linux workqueues to specific core Resolved
Epic/Theme: Performance, jitter
Rank (Obsolete): 9223372036854775807

 Description   

During benchmarking at large scale on a KNL + Omni-Path system (8k nodes), we saw periodic OS noise from Lustre kernel threads that greatly affects the performance of small-message MPI collectives (MPI_Barrier, MPI_Allreduce with small data sizes, etc.).

The request is for Lustre to use kernel threads in a deterministic manner when the load is low. In this case the ideal behavior would have been to use only the thread(s) in CPT 0, which can be set up to run only on KNL tile 0 (cores 0 and 1). Then, when no significant I/O is going on, a benchmark thread can be bound to a tile other than 0 and not see any Lustre noise.
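For illustration, Lustre's CPU partition tables are configured through the libcfs module parameters cpu_npartitions and cpu_pattern; a minimal sketch of confining the CPTs to the tile-0 cores described above might look like the following (the values are assumptions for this KNL layout, and the exact pattern syntax depends on the Lustre version):

    # /etc/modprobe.d/lustre.conf (illustrative)
    # One CPU partition (CPT 0) containing only cores 0 and 1, so the
    # CPT-bound Lustre service threads stay on KNL tile 0.
    options libcfs cpu_pattern="0[0,1]"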

This is more than just a benchmarking concern: allreduce is commonly a bottleneck, especially at scale, and HPC applications usually have long phases with no I/O.



 Comments   
Comment by Joseph Gmitter (Inactive) [ 04/May/17 ]

Hi Dmitry,

Can you please investigate this report?

Thanks.
Joe

Comment by Larry Meadows (Inactive) [ 10/May/17 ]

Looking at the Lustre code for ptlrpcd, it appears that every ptlrpcd thread wakes up at least once per second regardless of activity.
I have profiles from the KNL machine at TACC (Stampede) showing activity on 60 different Lustre threads even with no I/O going on; kernel tracing also confirms this.
I need a resolution for this; it seriously affects the performance of MPI collectives, especially as the node count increases.
Please respond with your plans to resolve this issue.
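As an illustration (not from the ticket), the per-second wakeups can be observed with stock perf scheduler tracing; the ptlrpcd thread-name match is an assumption based on the usual naming scheme:

    # Record scheduler wakeup events system-wide for 10 seconds on an
    # otherwise idle node, then look for ptlrpcd thread activity.
    perf record -e sched:sched_wakeup -a -- sleep 10
    perf script | grep ptlrpcd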

Comment by Joseph Gmitter (Inactive) [ 10/May/17 ]

Hi Larry,

This is on Dmitry's plate to investigate and propose potential solutions. We will post to the ticket as soon as possible.

Thanks.
Joe

Comment by Gerrit Updater [ 11/Aug/17 ]

Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/28496
Subject: LU-9441 ptlrpc: don't wakeup on 1 second intervals
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6ddeed98c9f65486743b2a30b7ea0468fb4cf952

Comment by Andreas Dilger [ 13/Apr/18 ]

The LU-9660 patch was landed, but I'm wondering if there is more we can do here. For example, batching nodes by NID ranges (i.e. by subnet mask) so that they ping at the same time: if we scheduled pings on clients so that seconds % interval == (NID >> 9) % interval holds, then groups of 2^9 = 512 nodes would ping in the same second. For jobs that are typically scheduled on "nearby" NIDs, or for small clusters, this would put the ping overhead in one timestep instead of introducing jitter across all timesteps.
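A minimal user-space sketch of this slotting rule (the function and parameter names are hypothetical; in Lustre the NID and ping interval come from LNet and the obd pinger):

    #include <stdint.h>
    #include <stdbool.h>

    /* Returns true when this node's one-second ping slot comes up.
     * Nodes whose NIDs agree above bit 9 (groups of 2^9 = 512) fall
     * into the same slot of each interval, so their ping overhead
     * lands in a single timestep. */
    static bool ping_slot_now(uint64_t nid, uint64_t seconds, uint64_t interval)
    {
        return seconds % interval == (nid >> 9) % interval;
    }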

Landing LU-7236 would help further, since pinging is unnecessary once the client has disconnected from the server(s) (at the expense of a small latency hit when it next sends an RPC to the server).

Comment by Andreas Dilger [ 17/Sep/18 ]

Similarly, if a single application batches up operations for the servers in a coordinated manner between clients, that work should be staggered to avoid contention between jobs, and it would avoid adding random jitter to all timesteps of a computation; only one timestep per period would be impacted by the background work. Something like pinging on time % (hash(jobid) & 15) or similar could be used to coordinate autonomously between clients running the same job.
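A sketch of this jobid-based stagger (the hash is illustrative - djb2 here - and the helper names are hypothetical):

    /* Pick one of 16 one-second offsets from the job identifier, so
     * all clients running the same job do their batched background
     * work in the same timestep. */
    static unsigned int jobid_hash(const char *jobid)
    {
        unsigned int h = 5381;

        while (*jobid)
            h = h * 33 + (unsigned char)*jobid++;
        return h;
    }

    static unsigned int job_ping_offset(const char *jobid)
    {
        return jobid_hash(jobid) & 15;  /* offset of 0..15 seconds */
    }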

Comment by James A Simmons [ 27/Mar/20 ]

https://review.whamcloud.com/#/c/38091/ should limit the pinger noise. Any other ideas?

Comment by Andreas Dilger [ 28/Mar/20 ]

Your patch isolates the pinger thread on the client, but it doesn't do anything to isolate the network traffic to avoid jitter in the communication.

I think one of the things that has been lost with the timer changes for the ping (and other) wakeups is coordination between the threads and clients. On the one hand, you don't want all clients and threads to wake at the same time, but some amount of coordination (i.e. a few bad timesteps across all threads at one time and then many good timesteps in a row) is better than each thread having a different bad timestep on a continual basis.

As I described in my previous comments here, aligning the client pings by NID or JobID (e.g. "(seconds & interval) == (hash(jobid) & interval)", where interval is a power-of-two value close to the ping interval) would minimize the number of bad timesteps and maximize the good ones. The synchronization would depend on how closely the client clocks agree, but the noise could be isolated to a few msec per tens of seconds.
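One reading of that test, sketched below: with a power-of-two interval the conventional mask is interval - 1 rather than interval itself, which is presumably what is intended (names are hypothetical):

    /* True once per interval, in the one-second slot selected by the
     * job hash; interval must be a power of two. */
    static bool aligned_ping_now(uint64_t seconds, unsigned int jobhash,
                                 uint64_t interval)
    {
        return (seconds & (interval - 1)) == (jobhash & (interval - 1));
    }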

The same is true of some of the other timeouts - they could be aligned to happen on a coarse-grained interval (every second or few seconds) rather than randomly, so that when some work needs to be done it is done in a batch, and then things stay quiet again for as long as possible.
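The Linux kernel already provides a helper for exactly this kind of batching: round_jiffies() rounds a timer expiry to a whole-second boundary so that periodic timers from many subsystems fire together. A minimal kernel-style sketch (not from any of the patches on this ticket):

    #include <linux/timer.h>
    #include <linux/jiffies.h>

    /* Arm a periodic timer on a whole-second boundary instead of at an
     * arbitrary offset, so background work from many timers batches up
     * and the gaps between batches stay quiet. */
    static void arm_coarse_timer(struct timer_list *t, unsigned long delay)
    {
        mod_timer(t, round_jiffies(jiffies + delay));
    }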

Comment by Andreas Dilger [ 28/Mar/20 ]

Some of this was discussed in the context of https://review.whamcloud.com/36701 but I don't think it was implemented.

Comment by Gerrit Updater [ 27/May/20 ]

James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/38730
Subject: LU-9441 llite: bind kthread thread to accepted node set
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3b0e24c39a994463b6eaaa8c424c33b3815280ad
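For context, the generic kernel mechanism for this kind of binding is the stock kthread API; a minimal sketch follows (this is not the actual patch, and the thread name and helper are hypothetical):

    #include <linux/kthread.h>
    #include <linux/cpumask.h>
    #include <linux/err.h>

    /* Create a kthread and bind it to an allowed cpumask before waking
     * it, so it never migrates onto cores reserved for the application. */
    static struct task_struct *start_bound_thread(int (*fn)(void *), void *arg,
                                                  const struct cpumask *allowed)
    {
        struct task_struct *task;

        task = kthread_create(fn, arg, "lu9441_worker");
        if (IS_ERR(task))
            return task;
        kthread_bind_mask(task, allowed);
        wake_up_process(task);
        return task;
    }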

Comment by Gerrit Updater [ 19/Jun/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38730/
Subject: LU-9441 llite: bind kthread thread to accepted node set
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d6e103e6950d99b88141d4b26982889258c774c5

Comment by Peter Jones [ 19/Jun/20 ]

Landed for 2.14

Comment by James A Simmons [ 19/Jun/20 ]

Andreas wants an offset to running the threads based on the jobid. I think that is one more patch.

Comment by Andreas Dilger [ 20/Jun/20 ]

James, that could be moved to a separate ticket.

Comment by Peter Jones [ 17/Feb/21 ]

2.14 is closing so let's track anything else under a new ticket
