
[LU-7245] Improve SMP scaling support for LND drivers

Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.9.0
    • Fix Version: Lustre 2.9.0
    • Environment: Any TCP, InfiniBand, or Gemini network system

    Description

      While working on enhancing the lnetctl utility, it was discovered that more SMP scaling improvements could be made to the currently supported LND drivers.


    Activity


            simmonsja James A Simmons added a comment -

            Patches for this work already landed, and the multi-rail work filled in the rest of the gaps.

            simmonsja James A Simmons added a comment -

            I suspect more work will be coming from this ticket for 2.9.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16710/
            Subject: LU-7245 socklnd: Bind peers to a specific CPT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 68eb6e841f49d41a289bd0b3f559973b6cb31738
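
            The technique named in the patch subject, binding a peer to a specific CPT, generally means deriving a CPT from the peer's NID and allocating the peer structure from that CPT's memory, so later work on the peer stays NUMA-local. A minimal sketch of the idea, assuming the libcfs CPT allocator and the lnet_cpt_of_nid() helper from this era of the tree; the function ksocknal_peer_alloc() below is illustrative, not the actual landed patch:

            /* Illustrative sketch only -- not the landed LU-7245 change. */
            static struct ksock_peer *
            ksocknal_peer_alloc(lnet_process_id_t id)
            {
                    struct ksock_peer *peer;
                    int cpt = lnet_cpt_of_nid(id.nid); /* hash the NID to a CPT */

                    /* allocate from memory belonging to that CPT's NUMA node */
                    LIBCFS_CPT_ALLOC(peer, lnet_cpt_table(), cpt, sizeof(*peer));
                    if (peer == NULL)
                            return NULL;

                    peer->ksnp_id = id;
                    atomic_set(&peer->ksnp_refcount, 1);
                    return peer;
            }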

            simmonsja James A Simmons added a comment -

            The patch for the Gemini SMP work under LU-2544 reworked a lot of the data structures to deal with what you described. I never looked closely at the other LND drivers, but I can see the problem there. It will require a lot of data structure reworking :-/

            olaf Olaf Weber added a comment -

            James, since you asked, a few notes about NUMA.

            Reasoning about NUMA is a lot like thinking about a cluster, except you spend a lot of time worrying about cache lines instead of files. (Much of the terminology is also similar, which becomes confusing when discussing NUMA issues for systems that are part of a cluster.) It is worth noting that NUMA considerations already apply once a system has more than one socket. Ideally, the process driving I/O, the memory involved, and the interface involved all live on the same socket.
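
            As a concrete illustration of keeping memory close to the interface, the generic kernel helpers dev_to_node() and kmalloc_node() express the idiom directly; this is a sketch of the underlying mechanism, not LND code:

            #include <linux/device.h>
            #include <linux/slab.h>

            /* Sketch: allocate a buffer on the same NUMA node as a network
             * device, so the I/O path does not cross sockets to reach its
             * working memory. */
            static void *alloc_near_device(struct device *dev, size_t size)
            {
                    int node = dev_to_node(dev); /* NUMA node of the interface */

                    return kmalloc_node(size, GFP_KERNEL, node);
            }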

            In practice, the memory placement may have been done by some user space process outside our control, and the same goes for the process that initiates the I/O. Selection of an interface that is close in the topology of the system is useful, but that already assumes a multi-rail type configuration. Much of the time there will be no choice because there is only one interface. (LNet routers are an exception: there we do have full control over the location of all buffers and threads relative to the interfaces used.)

            So the main concern becomes doing the best we can in the areas we do control, in particular avoiding cache line bouncing. Placing a data structure like ksock_peer in the same CPT as the interface helps a little bit here, but only a little. The layout of the ksock_peer structure is actually a good example of what not to do. Take a look at the first few members, which will likely all be in the same cache line:

            typedef struct ksock_peer
            {
                    struct list_head        ksnp_list;        /* stash on global peer list */
                    cfs_time_t              ksnp_last_alive;  /* when (in jiffies) I was last alive */
                    lnet_process_id_t       ksnp_id;          /* who's on the other end(s) */
                    atomic_t                ksnp_refcount;    /* # users */
                    int                     ksnp_sharecount;  /* lconf usage counter */
                    /* ... remaining members elided ... */

            ksnp_list and ksnp_id are semi-constant, and are read by any thread that looks up some peer in the hash table (shared/read access). In contrast, ksnp_refcount and ksnp_last_alive are updated by threads doing work for this particular peer (exclusive/write access). So a lookup of some unrelated peer causes a cache line bounce between the CPU doing the lookup and the CPU managing the I/O. This particular case can be mitigated by being very careful with the layout of a data structure, and by making sure that the threads that modify the structure run on the same socket, even if that socket is not where the data structure lives.
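
            A sketch of the kind of layout fix implied here: keep the read-mostly members together and push the write-hot members onto their own cache line with the kernel's ____cacheline_aligned_in_smp annotation. The grouping is an illustration of the technique, not a proposed patch:

            /* Illustrative layout: read-mostly fields first, write-hot fields
             * forced onto a separate cache line, so peer lookups no longer
             * bounce the line that I/O threads keep dirtying. */
            typedef struct ksock_peer
            {
                    /* read-mostly: shared by threads doing hash lookups */
                    struct list_head        ksnp_list;        /* stash on global peer list */
                    lnet_process_id_t       ksnp_id;          /* who's on the other end(s) */

                    /* write-hot: touched by threads working on this peer */
                    atomic_t                ksnp_refcount ____cacheline_aligned_in_smp;
                    cfs_time_t              ksnp_last_alive;  /* when (in jiffies) I was last alive */
                    int                     ksnp_sharecount;  /* lconf usage counter */
            } ksock_peer_t;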

            pjones Peter Jones added a comment -

            James

            Are you thinking of targeting this work for 2.9?

            Peter


            gerrit Gerrit Updater added a comment -

            James Simmons (uja.ornl@yahoo.com) uploaded a new patch: http://review.whamcloud.com/16710
            Subject: LU-7245 socklnd: Bind peers to a specific CPT
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 05f09aec7aecbd1d57eb321da2f4a65056d9a483

            People

              Assignee: simmonsja James A Simmons
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 8
