
LU-9441: Use kernel threads in predictable fashion to confine OS noise

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0

    Description

      During benchmarking at large scale on a KNL+Omni-Path system (8k nodes) we saw periodic OS noise from Lustre kernel threads that greatly degrades the performance of small-message MPI collectives (MPI_Barrier, MPI_Allreduce with small data sizes, etc.).

      The request is for Lustre to use kernel threads in a deterministic manner when the load is low. In this case the ideal usage would have been to use only the thread(s) in CPT 0, which can be set up to run only on KNL tile 0 (cores 0 and 1). Then, when no significant I/O is going on, a benchmark thread can be bound to a tile other than 0 and not see any Lustre noise.
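      For illustration, such a CPT layout can be sketched with the libcfs cpu_pattern module parameter (cpu_pattern is a real libcfs option; the exact core range below is illustrative, not a tested configuration):

        # /etc/modprobe.d/lustre.conf -- illustrative sketch, not a tested config
        # Confine CPT 0 to KNL tile 0 (cores 0 and 1) and group the remaining
        # cores into a second partition, so that low-load service threads can
        # be restricted to CPT 0 and kept off the application cores.
        options libcfs cpu_pattern="0[0,1] 1[2-271]"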

      This is more than just a benchmarking concern: it is common for allreduce to be a bottleneck, especially at scale, and HPC applications usually have long phases with no I/O.


          Activity

            pjones Peter Jones added a comment -

            2.14 is closing so let's track anything else under a new ticket

            adilger Andreas Dilger added a comment -

            James, that could be moved to a separate ticket.

            simmonsja James A Simmons added a comment -

            Andreas wants an offset for when the threads run, based on the jobid. I think that is one more patch.
            pjones Peter Jones added a comment -

            Landed for 2.14

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38730/
            Subject: LU-9441 llite: bind kthread thread to accepted node set
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d6e103e6950d99b88141d4b26982889258c774c5

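            As a rough illustration of the technique named in the patch subject (a sketch using stock kernel kthread APIs, not the actual Lustre patch; the worker and function names are hypothetical):

            #include <linux/err.h>
            #include <linux/kthread.h>
            #include <linux/cpumask.h>
            #include <linux/sched.h>

            /* Hypothetical background worker standing in for a Lustre service thread. */
            static int confined_worker(void *data)
            {
                    while (!kthread_should_stop()) {
                            /* ... background work ... */
                            schedule_timeout_interruptible(HZ);
                    }
                    return 0;
            }

            static struct task_struct *start_confined_worker(const struct cpumask *allowed)
            {
                    struct task_struct *task;

                    task = kthread_create(confined_worker, NULL, "confined_worker");
                    if (IS_ERR(task))
                            return task;
                    /* Bind to the accepted CPU set before the first wakeup, so the
                     * thread's periodic work never lands on application cores. */
                    kthread_bind_mask(task, allowed);
                    wake_up_process(task);
                    return task;
            }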

            gerrit Gerrit Updater added a comment -

            James Simmons (jsimmons@infradead.org) uploaded a new patch: https://review.whamcloud.com/38730
            Subject: LU-9441 llite: bind kthread thread to accepted node set
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3b0e24c39a994463b6eaaa8c424c33b3815280ad

            adilger Andreas Dilger added a comment -

            Some of this was discussed in the context of https://review.whamcloud.com/36701 but I don't think it was implemented.

            adilger Andreas Dilger added a comment -

            Your patch isolates the pinger thread on the client, but it doesn't do anything to isolate the network traffic to avoid jitter in the communication.

            I think one of the things that has been lost with the timer changes for the ping (and other) wakeups is coordination between the threads and clients. On the one hand, you don't want all clients and threads to wake at the same time, but some amount of coordination (i.e. a few bad timesteps across all threads at one time and then many good timesteps in a row) is better than each thread having a different bad timestep on a continual basis.

            As I described in my previous comments here, aligning the client pings by NID or JobID (e.g. "(seconds & interval) == (hash(jobid) & interval)", where interval is a power-of-two value close to the ping interval) would minimize the number of bad timesteps and maximize good ones. The synchronicity would depend on how well-sync'd the client clocks are, but it could be isolated to a few msec per tens of seconds.

            The same is true of some of the other timeouts - they could be aligned to happen on a coarse-grained interval (every second or few seconds) rather than randomly, so that when some work needs to be done, there is a bunch, and then it is quiet again as long as possible.

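            A minimal sketch of that alignment check, assuming jhash() as the hash and masking with (PING_ALIGN - 1), the conventional power-of-two modulo; all names here are illustrative:

            #include <linux/jhash.h>
            #include <linux/ktime.h>
            #include <linux/string.h>
            #include <linux/types.h>

            #define PING_ALIGN 32 /* power of two close to the ping interval, in seconds */

            /* All clients hashing the same jobid agree on one slot per interval,
             * so their pings (and the resulting noise) land in the same second. */
            static bool ping_slot_now(const char *jobid)
            {
                    u32 slot = jhash(jobid, strlen(jobid), 0) & (PING_ALIGN - 1);
                    time64_t now = ktime_get_real_seconds();

                    return (now & (PING_ALIGN - 1)) == slot;
            }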

            simmonsja James A Simmons added a comment -

            https://review.whamcloud.com/#/c/38091/ should limit the pinger noise. Any other ideas?

            adilger Andreas Dilger added a comment -

            Similarly, if a single application was batching up operations for the servers in a coordinated manner between clients, this should be staggered to avoid contention between jobs, and would avoid adding random jitter to all timesteps of a computation. Only one timestep per period would be impacted by the background work. Something like ping on time % (hash(jobid) & 15) or similar could be used to coordinate autonomously between clients running the same job.
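            A companion sketch of the staggering idea, reading the suggestion as "flush when the current second's slot matches the job's hashed slot"; BATCH_PERIOD and the helper name are hypothetical:

            #include <linux/ktime.h>
            #include <linux/types.h>

            #define BATCH_PERIOD 16 /* seconds; power of two, so (& 15) acts as the modulo */

            /* Clients of one job flush batched operations in the same second,
             * while jobs with different hashes are spread across the period. */
            static bool batch_flush_due(u32 jobid_hash)
            {
                    time64_t now = ktime_get_real_seconds();

                    return (now & (BATCH_PERIOD - 1)) == (jobid_hash & (BATCH_PERIOD - 1));
            }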

            People

              Assignee: simmonsja James A Simmons
              Reporter: lfmeadow Larry Meadows (Inactive)
              Votes: 0
              Watchers: 4
