
LU-9441: Use kernel threads in predictable fashion to confine OS noise

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0

    Description

      During benchmarking at large scale on a KNL+Omni-Path system (8k nodes) we saw periodic OS noise from Lustre kernel threads that greatly degrades the performance of small-message MPI collectives (MPI_Barrier, MPI_Allreduce with small data sizes, etc.).

      The request is for Lustre to use kernel threads in a deterministic manner when the load is low. In this case the ideal usage would have been to use only the thread(s) in CPT 0, which can be set up to run only on KNL tile 0 (cores 0 and 1). Then, when no significant I/O is going on, a benchmark thread can be bound to a tile other than 0 and not see any Lustre noise.
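      For illustration, such a CPT layout can be sketched with the libcfs cpu_pattern module parameter (cpu_pattern is a real libcfs option; the exact core range below is illustrative, not a tested configuration):

        # /etc/modprobe.d/lustre.conf -- illustrative sketch, not a tested config
        # Confine CPT 0 to KNL tile 0 (cores 0 and 1) and group the remaining
        # cores into a second partition, so that low-load service threads can
        # be restricted to CPT 0 and kept off the application cores.
        options libcfs cpu_pattern="0[0,1] 1[2-271]"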

      This is more than just a benchmarking concern: it is common for allreduce to be a bottleneck, especially at scale, and HPC applications usually have long phases with no I/O.


          Activity

            pjones Peter Jones added a comment -

            2.14 is closing so let's track anything else under a new ticket

            adilger Andreas Dilger added a comment -

            James, that could be moved to a separate ticket.

            simmonsja James A Simmons added a comment -

            Andreas wants an offset for when the threads run, based on the jobid. I think that is one more patch.
            pjones Peter Jones added a comment -

            Landed for 2.14

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38730/
            Subject: LU-9441 llite: bind kthread thread to accepted node set
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d6e103e6950d99b88141d4b26982889258c774c5

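            As a rough illustration of the technique named in the patch subject (a sketch using stock kernel kthread APIs, not the actual Lustre patch; the worker and function names are hypothetical):

            #include <linux/err.h>
            #include <linux/kthread.h>
            #include <linux/cpumask.h>
            #include <linux/sched.h>

            /* Hypothetical background worker standing in for a Lustre service thread. */
            static int confined_worker(void *data)
            {
                    while (!kthread_should_stop()) {
                            /* ... background work ... */
                            schedule_timeout_interruptible(HZ);
                    }
                    return 0;
            }

            static struct task_struct *start_confined_worker(const struct cpumask *allowed)
            {
                    struct task_struct *task;

                    task = kthread_create(confined_worker, NULL, "confined_worker");
                    if (IS_ERR(task))
                            return task;
                    /* Bind to the accepted CPU set before the first wakeup, so the
                     * thread's periodic work never lands on application cores. */
                    kthread_bind_mask(task, allowed);
                    wake_up_process(task);
                    return task;
            }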

            gerrit Gerrit Updater added a comment -

            James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/38730
            Subject: LU-9441 llite: bind kthread thread to accepted node set
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3b0e24c39a994463b6eaaa8c424c33b3815280ad

            adilger Andreas Dilger added a comment -

            Some of this was discussed in the context of https://review.whamcloud.com/36701 but I don't think it was implemented.

            adilger Andreas Dilger added a comment -

            Your patch isolates the pinger thread on the client, but it doesn't do anything to isolate the network traffic to avoid jitter in the communication.

            I think one of the things that has been lost with the timer changes for the ping (and other) wakeups is coordination between the threads and clients. On the one hand, you don't want all clients and threads to wake at the same time, but some amount of coordination (i.e. a few bad timesteps across all threads at one time and then many good timesteps in a row) is better than each thread having a different bad timestep on a continual basis.

            As I described in my previous comments here, aligning the client pings by NID or JobID (e.g. "(seconds & interval) == (hash(jobid) & interval)", where interval is a power-of-two value close to the ping interval) would minimize the number of bad timesteps and maximize good ones. The synchronicity would depend on how well-sync'd the client clocks are, but it could be isolated to a few msec per tens of seconds.

            The same is true of some of the other timeouts - they could be aligned to happen on a coarse-grained interval (every second or few seconds) rather than randomly, so that when some work needs to be done, there is a bunch, and then it is quiet again as long as possible.

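            A minimal sketch of that alignment check, assuming jhash() as the hash and masking with (PING_ALIGN - 1), the conventional power-of-two modulo; all names here are illustrative:

            #include <linux/jhash.h>
            #include <linux/ktime.h>
            #include <linux/string.h>
            #include <linux/types.h>

            #define PING_ALIGN 32 /* power of two close to the ping interval, in seconds */

            /* All clients hashing the same jobid agree on one slot per interval,
             * so their pings (and the resulting noise) land in the same second. */
            static bool ping_slot_now(const char *jobid)
            {
                    u32 slot = jhash(jobid, strlen(jobid), 0) & (PING_ALIGN - 1);
                    time64_t now = ktime_get_real_seconds();

                    return (now & (PING_ALIGN - 1)) == slot;
            }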

            simmonsja James A Simmons added a comment -

            https://review.whamcloud.com/#/c/38091/ should limit the pinger noise. Any other ideas?

            adilger Andreas Dilger added a comment -

            Similarly, if a single application was batching up operations for the servers in a coordinated manner between clients, this should be staggered to avoid contention between jobs, and would avoid adding random jitter to all timesteps of a computation. Only one timestep per period would be impacted by the background work. Something like ping on time % (hash(jobid) & 15) or similar could be used to coordinate autonomously between clients running the same job.
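            A companion sketch of the staggering idea, reading the suggestion as "flush when the current second's slot matches the job's hashed slot"; BATCH_PERIOD and the helper name are hypothetical:

            #include <linux/ktime.h>
            #include <linux/types.h>

            #define BATCH_PERIOD 16 /* seconds; power of two, so (& 15) acts as the modulo */

            /* Clients of one job flush batched operations in the same second,
             * while jobs with different hashes are spread across the period. */
            static bool batch_flush_due(u32 jobid_hash)
            {
                    time64_t now = ktime_get_real_seconds();

                    return (now & (BATCH_PERIOD - 1)) == (jobid_hash & (BATCH_PERIOD - 1));
            }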

            People

              Assignee: simmonsja James A Simmons
              Reporter: lfmeadow Larry Meadows (Inactive)
              Votes: 0
              Watchers: 4
