[LU-12667] Read doesn't perform well in complex NUMA configuration

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Environment: master branch, AMD EPYC CPU
    • Severity: 3

    Description

      For instance, the AMD EPYC 7551 (32 CPU cores) has 4 dies per CPU socket; each die consists of 8 CPU cores and forms its own NUMA node.
      With two CPU sockets per client, that is 64 CPU cores in total (128 with logical processors) and 8 NUMA nodes.

      # numactl -H
      available: 8 nodes (0-7)
      node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
      node 0 size: 32673 MB
      node 0 free: 31561 MB
      node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
      node 1 size: 32767 MB
      node 1 free: 31930 MB
      node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
      node 2 size: 32767 MB
      node 2 free: 31792 MB
      node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
      node 3 size: 32767 MB
      node 3 free: 31894 MB
      node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
      node 4 size: 32767 MB
      node 4 free: 31892 MB
      node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
      node 5 size: 32767 MB
      node 5 free: 30676 MB
      node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
      node 6 size: 32767 MB
      node 6 free: 30686 MB
      node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
      node 7 size: 32767 MB
      node 7 free: 32000 MB
      node distances:
      node   0   1   2   3   4   5   6   7 
        0:  10  16  16  16  32  32  32  32 
        1:  16  10  16  16  32  32  32  32 
        2:  16  16  10  16  32  32  32  32 
        3:  16  16  16  10  32  32  32  32 
        4:  32  32  32  32  10  16  16  16 
        5:  32  32  32  32  16  10  16  16 
        6:  32  32  32  32  16  16  10  16 
        7:  32  32  32  32  16  16  16  10 
      

      Also, first-generation EPYC (Naples) has a PCIe controller per die (NUMA node), and the IB HCA is connected to one of those PCIe controllers, as shown below.

      # cat /sys/class/infiniband/mlx5_0/device/numa_node 
      5
      

      The mlx5_0 adapter is connected to CPU1's NUMA node 1, which is NUMA node 5 in the two-socket configuration.
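
      A quick way to confirm which CPU cores are local to that NUMA node (assuming the usual sysfs layout; the output below is what the numactl listing above implies for node 5):

      # cat /sys/devices/system/node/node5/cpulist
      40-47,104-111
      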

      In this case, LNet does not perform well with the default CPT configuration and requires manual CPT tuning, and even then the result depends heavily on which CPT configuration and CPU cores are involved.
      Here are quick LNet selftest results with the default CPT setting and with a NUMA-aware CPT configuration.

      default CPT setting(cpu_npartitions=8)
      client:server   PUT(GB/s)  GET(GB/s)
           1:1          7.0        6.8 
           1:2         11.3        3.2
           1:4         11.4        3.4
      
      1 CPT(cpu_npartitions=1 cpu_pattern="0[40-47,104,111]")
      client:server   PUT(GB/s)  GET(GB/s)
           1:1         11.0       11.0
           1:2         11.4       11.4
           1:4         11.4       11.4
      

      The NUMA-aware CPT configuration gives much better LNet performance, but CPTs affect not only LNet threads but all other Lustre client threads as well. In general, we want more CPU cores and CPTs involved, but LNet needs to be aware of the CPT and the NUMA node where the network interface is installed.
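
      For illustration, a NUMA-aware setup could be expressed with module options roughly like the sketch below. The CPT-list-in-brackets syntax for networks= comes from the Lustre manual; whether CPT 5 actually ends up covering NUMA node 5 depends on how libcfs generates the partitions, so the index would need to be checked against /sys/kernel/debug/lnet/cpu_partition_table.

      # /etc/modprobe.d/lustre.conf -- illustrative sketch only
      # create one CPT per NUMA node (8 on this client)
      options libcfs cpu_npartitions=8
      # bind the o2ib NI to the CPT covering the HCA's NUMA node (node 5 here)
      options lnet networks="o2ib10(ib0)[5]"
      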

          Activity


            Actually, I understand that there is a PCIe bandwidth limitation on a dual-port HCA, but 3-4 GB/s is REALLY much lower than the expected single-EDR bandwidth, and I suspected some NUMA, NUMA/IO, or CPT-related problem behind it. I'm not trying to get higher bandwidth by adding more HCAs here; I am trying a number of configurations (e.g. increasing peers, pinning a CPT to the interface, using a dedicated CPT, etc.) because, as I said before, we get better performance on a single-EPYC client, but once another CPU is added, read performance drops.

            sihara Shuichi Ihara added a comment
            bobhawkins Bob Hawkins added a comment (edited)

            Perhaps this is why two ports on one HCA are not scaling?

            Examine the card slot:

            One PCIe gen3 lane has a max electrical signaling bandwidth of 984.6 MB/s. One “PCIe gen3 x16” slot has sixteen lanes: 16 * 984.6 = 15.75 GB/s max (guaranteed not to exceed).

            And the dual-port HCA:
            One dual-port EDR-IB card requires an x16 slot but “offers” two 100 Gb/s (12.5 GB/s) ports. Data encoding allows 64 of 66 bits to be used; 2 bits are overhead. 12.5 GB/s max * (64/66) leaves 12.1 GB/s usable bandwidth for one port to run at full speed.

            Therefore, the 15.75 GB/s “x16” slot only allows one port to run at full 12.1 GB/s EDR-IB speed. Cabling both ports, and assigning LNETs to both ports, without LNET knowing how to apportion bandwidth among the two ports, seems problematic. The x16 slot only provides ~65% of the bandwidth required to run both ports at speed.

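            Summarizing the arithmetic in the comment above:

            16 lanes * 984.6 MB/s   = 15.75 GB/s  (PCIe gen3 x16 slot ceiling)
            12.5 GB/s * 64/66       = 12.12 GB/s  (one EDR port after 64b/66b encoding)
            2 ports * 12.12 GB/s    = 24.24 GB/s  (needed to run both ports at speed)
            15.75 / 24.24           = ~65%        (slot bandwidth vs. requirement)
            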

            I've observed that if you add two ports on the same HCA as different interfaces to the LNet network there is no performance boost. Performance boost is only seen when you add two different physical HCA cards. Not 100% sure why that is.

            A read test would do an RDMA write from the server to the client. Have you tried a write selftest from the two servers to the client? I'm wondering if you'd get the 11GB/s performance in this case.

            ashehata Amir Shehata (Inactive) added a comment
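
            For reference, a server-to-client bulk write could be exercised with lnet_selftest along these lines (a minimal sketch; the NIDs, group names, concurrency and size are placeholders, not values from this ticket):

            modprobe lnet_selftest
            export LST_SESSION=$$
            lst new_session write_to_client
            lst add_group clients 192.168.10.1@o2ib10                        # placeholder client NID
            lst add_group servers 192.168.10.11@o2ib10 192.168.10.12@o2ib10  # placeholder server NIDs
            lst add_batch bulk_write
            lst add_test --batch bulk_write --concurrency 8 --from servers --to clients brw write size=1M
            lst run bulk_write
            lst stat clients servers    # sample for a while, then Ctrl-C
            lst end_session
            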

            If another interface is added on the client as Multi-Rail, read performance bumps up.
            But it's a bit strange: if I add an interface on the same NUMA node as the primary interface, performance doesn't scale, but if I add an interface on a different NUMA node than the primary interface, performance improves.

            e.g.

            root@mds15:~# cat /sys/class/net/ib0/device/numa_node 
            5
            root@mds15:~# cat /sys/class/net/ib1/device/numa_node 
            5
            root@mds15:~# cat /sys/class/net/ib2/device/numa_node 
            6
            root@mds15:~# cat /sys/class/net/ib3/device/numa_node 
            6
            
            options lnet networks="o2ib10(ib0)"
            Max Read:  3881.91 MiB/sec (4070.48 MB/sec)
            
            options lnet networks="o2ib10(ib0,ib1)"
            Max Read:  3193.72 MiB/sec (3348.86 MB/sec)
            
            options lnet networks="o2ib10(ib0,ib2)"
            Max Read:  6110.81 MiB/sec (6407.65 MB/sec)
            
            sihara Shuichi Ihara added a comment
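
            (The same two-interface Multi-Rail setup can also be configured dynamically with lnetctl; a sketch mirroring the last module-option line above:)

            lnetctl lnet configure
            lnetctl net add --net o2ib10 --if ib0
            lnetctl net add --net o2ib10 --if ib2
            lnetctl net show    # verify both NIs are up
            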

            OK, here is a test result with only one CPT configured, allocating all CPUs in NUMA node 5 to that CPT, like this:

            options lnet networks="o2ib10(ib0)"
            options libcfs cpu_npartitions=1 cpu_pattern="0[40-47,104,111]"
            
            # cat /sys/kernel/debug/lnet/cpu_partition_table
            0	: 40 41 42 43 44 45 46 47 104 111
            

            LNet selftest delivers 11 GB/s from a single client against either 2 or 4 servers, but IOR read still hits only 4 GB/s against 2 OSSes (and likewise against any larger number of servers).
            If the number of OSSes is reduced to 1, performance goes up to 8 GB/s. These are exactly the same as the IOR results above.

            sihara Shuichi Ihara added a comment

            One thing to consider is that when RDMAing to/from buffers, these buffers are allocated on a specific NUMA node and could be spread across all the NUMA nodes. If the NUMA node a buffer is allocated on is far from the IB interface doing the RDMA, it will impact performance. When we were doing the MR testing we noticed a significant impact due to these NUMA penalties. Granted, that was on a large UV machine, but the same problem could be happening here as well.

            One thing to try is to restrict buffer allocation to NUMA node 5. Can we try this and see how it impacts performance?

            ashehata Amir Shehata (Inactive) added a comment
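
            A simple way to test that from userspace (a sketch; the benchmark command is a placeholder) is to pin both the threads and the memory of the I/O job to the HCA's NUMA node with numactl. Note this only constrains the application's buffers, not kernel-side allocations:

            # restrict CPU and memory allocation to NUMA node 5 (where mlx5_0 lives)
            numactl --cpunodebind=5 --membind=5 <io_benchmark_command>
            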

            This is not a striped file; it's file-per-process.

            sihara Shuichi Ihara added a comment
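
            For context, a representative file-per-process IOR run would look something like the sketch below (the rank count, transfer/block sizes and mount point are assumptions, not the actual job parameters from this ticket):

            # -F = file-per-process, -w/-r = write phase then read phase,
            # -t/-b = transfer size / per-rank block size, -e = fsync after write
            mpirun -np 16 ior -F -w -r -e -t 1m -b 16g -o /mnt/lustre/ior/testfile
            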

            FYI, there is a known problem for read for striped files:

            https://review.whamcloud.com/#/c/35438/

            This should help read performance for files striped across different OSTs/OSSes, I guess.

            wshilong Wang Shilong (Inactive) added a comment

            It's the same read performance degradation regardless of CPT binding.
            But at least I saw good LNet selftest performance even against multiple OSSes with CPT binding; when the client does actual read I/O, however, performance doesn't scale with the number of OSSes.

            sihara Shuichi Ihara added a comment

            Well, that makes sense to me if it's a CPT binding issue of some kind, because the CPT binding is linked to the OSS, not OST.  And the CPT binding stuff in Lustre on the client mostly matters at the Lnet/o2ib type layers, as you know, so...  That sort of fits.

            Hm.  I'll reply to your email.

            pfarrell Patrick Farrell (Inactive) added a comment

            People

              Assignee: wc-triage WC Triage
              Reporter: sihara Shuichi Ihara