[LU-12667] Read doesn't perform well in complex NUMA configuration

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Environment: master branch, AMD EPYC CPU
    • Severity: 3

    Description

      For instance, the AMD EPYC 7551 (32 CPU cores) has 4 dies per CPU socket; each die consists of 8 CPU cores and forms its own NUMA node.
      With two CPU sockets per client, that gives 64 CPU cores in total (128 logical CPUs with SMT) and 8 NUMA nodes.

      # numactl -H
      available: 8 nodes (0-7)
      node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
      node 0 size: 32673 MB
      node 0 free: 31561 MB
      node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
      node 1 size: 32767 MB
      node 1 free: 31930 MB
      node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
      node 2 size: 32767 MB
      node 2 free: 31792 MB
      node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
      node 3 size: 32767 MB
      node 3 free: 31894 MB
      node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
      node 4 size: 32767 MB
      node 4 free: 31892 MB
      node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
      node 5 size: 32767 MB
      node 5 free: 30676 MB
      node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
      node 6 size: 32767 MB
      node 6 free: 30686 MB
      node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
      node 7 size: 32767 MB
      node 7 free: 32000 MB
      node distances:
      node   0   1   2   3   4   5   6   7 
        0:  10  16  16  16  32  32  32  32 
        1:  16  10  16  16  32  32  32  32 
        2:  16  16  10  16  32  32  32  32 
        3:  16  16  16  10  32  32  32  32 
        4:  32  32  32  32  10  16  16  16 
        5:  32  32  32  32  16  10  16  16 
        6:  32  32  32  32  16  16  10  16 
        7:  32  32  32  32  16  16  16  10 
      

      Also, first-generation EPYC (Naples) has a PCIe controller per die (NUMA node), and the IB HCA is connected to one of those PCIe controllers, as shown below.

      # cat /sys/class/infiniband/mlx5_0/device/numa_node 
      5
      

      The mlx5_0 adapter is connected to CPU1's NUMA node 1, which is NUMA node 5 in the two-socket configuration.
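
      The same affinity can be cross-checked from the device's local CPU list (a quick sanity check, assuming the standard sysfs layout; given the numactl output above, this should report node 5's CPUs, i.e. 40-47,104-111):

      # cat /sys/class/infiniband/mlx5_0/device/local_cpulist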

      In this case, LNet does not perform well with the default CPT setting; it requires manual CPT configuration, and performance still depends heavily on which CPT configuration and CPU cores are involved.
      Here are quick LNet selftest results with the default CPT setting and with a NUMA-aware CPT configuration.

      default CPT setting(cpu_npartitions=8)
      client:server   PUT(GB/s)  GET(GB/s)
           1:1          7.0        6.8 
           1:2         11.3        3.2
           1:4         11.4        3.4
      
      1 CPT (cpu_npartitions=1 cpu_pattern="0[40-47,104-111]")
      client:server   PUT(GB/s)  GET(GB/s)
           1:1         11.0       11.0
           1:2         11.4       11.4
           1:4         11.4       11.4
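
      For reference, numbers like the above come from lnet_selftest; a minimal run along these lines (the NIDs, concurrency and transfer size are placeholders, not values from this ticket) measures bulk throughput between a client group and a server group:

      # modprobe lnet_selftest
      # export LST_SESSION=$$
      # lst new_session read_write
      # lst add_group clients 10.0.0.10@o2ib
      # lst add_group servers 10.0.0.1@o2ib 10.0.0.2@o2ib
      # lst add_batch bulk
      # lst add_test --batch bulk --concurrency 16 --from clients --to servers brw read size=1M
      # lst run bulk
      # lst stat clients servers
      # lst stop bulk
      # lst end_session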
      

      The NUMA-aware CPT configuration gives much better LNet performance, but CPTs are used not only by LNet but also by all the other Lustre client threads. In general, we want more CPU cores and more CPTs involved, but LNet needs to be aware of the CPT and the NUMA node where the network interface is installed.
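
      On this kind of topology, the NUMA-aware setup used in the test above can be made persistent through module options; a minimal sketch (the interface and network names are placeholders, the CPT pattern corresponds to node 5 in the numactl output, and the trailing "[0]" is LNet's standard syntax for binding the NI to CPT 0):

      # cat /etc/modprobe.d/lustre.conf
      options libcfs cpu_npartitions=1 cpu_pattern="0[40-47,104-111]"
      options lnet networks="o2ib0(ib0)[0]"

      This pins LNet, and everything else that runs per CPT, to the HCA's NUMA node, which is exactly the trade-off described above.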


            People

              Assignee: WC Triage (wc-triage)
              Reporter: Shuichi Ihara (sihara)
