[LU-9441] Use kernel threads in predictable fashion to confine OS noise Created: 03/May/17 Updated: 17/Feb/21 Resolved: 17/Feb/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Improvement | Priority: | Major |
| Reporter: | Larry Meadows (Inactive) | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Epic/Theme: | Performance, jitter |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During benchmarking at large scale on a KNL+Omnipath system (8k nodes) we saw periodic OS noise from Lustre kernel threads that greatly affects the performance of small-message MPI collectives (MPI_Barrier, MPI_Allreduce with small data sizes, etc.). The request is for Lustre to use kernel threads in a deterministic manner when the load is low. In this case the ideal usage would have been to use only the thread(s) in CPT 0, which can be set up to run only on KNL tile 0 (cores 0 and 1). Then, when no significant I/O is going on, a benchmark thread can be bound to a tile other than 0 and not see any Lustre noise. This is more than just a benchmarking concern: allreduce is commonly a bottleneck, especially at scale, and HPC applications usually have long phases with no I/O. |
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 04/May/17 ] |
|
Hi Dmitry, can you please investigate this report? Thanks. |
| Comment by Larry Meadows (Inactive) [ 10/May/17 ] |
|
Looking at the Lustre code for ptlrpcd, it appears that every ptlrpcd thread wakes up at least once per second regardless of activity. |
| Comment by Joseph Gmitter (Inactive) [ 10/May/17 ] |
|
Hi Larry, This is on Dmitry's plate to investigate and propose potential solutions. We will post to the ticket as soon as possible. Thanks. |
| Comment by Gerrit Updater [ 11/Aug/17 ] |
|
Amir Shehata (amir.shehata@intel.com) uploaded a new patch: https://review.whamcloud.com/28496 |
| Comment by Andreas Dilger [ 13/Apr/18 ] |
|
The Landing |
| Comment by Andreas Dilger [ 17/Sep/18 ] |
|
Similarly, if a single application were batching up operations for the servers in a coordinated manner across its clients, that work should be staggered between jobs to avoid contention, and it would avoid adding random jitter to every timestep of a computation. Only one timestep per period would be impacted by the background work. Something like pinging on time % (hash(jobid) & 15) or similar could be used to coordinate autonomously between clients running the same job. |
| Comment by James A Simmons [ 27/Mar/20 ] |
|
https://review.whamcloud.com/#/c/38091/ should limit the pinger noise. Any other ideas? |
| Comment by Andreas Dilger [ 28/Mar/20 ] |
|
Your patch isolates the pinger thread on the client, but it doesn't do anything to isolate the network traffic to avoid jitter in the communication. I think one of the things that has been lost with the timer changes for the ping (and other) wakeups is coordination between the threads and clients. On the one hand, you don't want all clients and threads to wake at the same time, but some amount of coordination (i.e. a few bad timesteps across all threads at one time and then many good timesteps in a row) is better than each thread having a different bad timestep on a continual basis. As I described in my previous comments here, aligning the client pings by NID or JobID (e.g. "(seconds & interval) == (hash(jobid) & interval)", where interval is a power-of-two value close to the ping interval) would minimize the number of bad timesteps and maximize good ones. The synchronicity would depend on how well-sync'd the client clocks are, but it could be isolated to a few msec per tens of seconds. The same is true of some of the other timeouts - they could be aligned to happen on a coarse-grained interval (every second or few seconds) rather than randomly, so that when some work needs to be done, there is a bunch, and then it is quiet again as long as possible. |
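A minimal sketch of the JobID-aligned ping slot described above, assuming the power-of-two interval is applied as a mask (interval - 1); jobid_hash() and PING_INTERVAL are hypothetical names for illustration, not existing Lustre symbols: |

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical: a power-of-two value close to the real ping interval. */
#define PING_INTERVAL 16	/* seconds */

/* Simple FNV-1a string hash, standing in for whatever hash is chosen. */
static uint32_t jobid_hash(const char *jobid)
{
	uint32_t h = 2166136261u;

	while (*jobid) {
		h ^= (unsigned char)*jobid++;
		h *= 16777619u;
	}
	return h;
}

/*
 * All clients running the same job (same jobid) land in the same
 * one-second slot of each PING_INTERVAL-second window, so the other
 * seconds of the window see no ping traffic from that job.
 */
static int ping_slot_now(const char *jobid)
{
	time_t now = time(NULL);

	return (now & (PING_INTERVAL - 1)) ==
	       (time_t)(jobid_hash(jobid) & (PING_INTERVAL - 1));
}
```

With reasonably well-synchronized client clocks, every rank of a job would then ping within the same second of each window, concentrating the affected timesteps rather than spreading them randomly across all timesteps. |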
| Comment by Andreas Dilger [ 28/Mar/20 ] |
|
Some of this was discussed in the context of https://review.whamcloud.com/36701 but I don't think it was implemented. |
| Comment by Gerrit Updater [ 27/May/20 ] |
|
James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/38730 |
| Comment by Gerrit Updater [ 19/Jun/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38730/ |
| Comment by Peter Jones [ 19/Jun/20 ] |
|
Landed for 2.14 |
| Comment by James A Simmons [ 19/Jun/20 ] |
|
Andreas wants the threads' run times offset based on the jobid. I think that will take one more patch. |
| Comment by Andreas Dilger [ 20/Jun/20 ] |
|
James, that could be moved to a separate ticket. |
| Comment by Peter Jones [ 17/Feb/21 ] |
|
2.14 is closing so let's track anything else under a new ticket |