<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 02:54:36 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-12667] Read doesn&apos;t perform well in complex NUMA configuration</title>
                <link>https://jira.whamcloud.com/browse/LU-12667</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;For instance, the AMD EPYC 7551 (32 CPU cores) has 4 dies per CPU socket, and each die consists of 8 CPU cores and forms its own NUMA node.&lt;br/&gt;
 With two CPU sockets per client, there are 64 CPU cores in total (128 CPU cores with logical processors) and 8 NUMA nodes.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 32673 MB
node 0 free: 31561 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 32767 MB
node 1 free: 31930 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 32767 MB
node 2 free: 31792 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 32767 MB
node 3 free: 31894 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 32767 MB
node 4 free: 31892 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 32767 MB
node 5 free: 30676 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 32767 MB
node 6 free: 30686 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 32767 MB
node 7 free: 32000 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  16  16  16  32  32  32  32 
  1:  16  10  16  16  32  32  32  32 
  2:  16  16  10  16  32  32  32  32 
  3:  16  16  16  10  32  32  32  32 
  4:  32  32  32  32  10  16  16  16 
  5:  32  32  32  32  16  10  16  16 
  6:  32  32  32  32  16  16  10  16 
  7:  32  32  32  32  16  16  16  10 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Also, first-generation EPYC (Naples) has a PCIe controller per die (NUMA node), and the IB HCA is connected to one of the PCIe controllers, as shown below.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# cat /sys/class/infiniband/mlx5_0/device/numa_node 
5
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The mlx5_0 adapter is connected to CPU1&apos;s NUMA node 1, which is NUMA node 5 in the 2-socket configuration.&lt;/p&gt;

&lt;p&gt;In this case, LNET doesn&apos;t perform well with the default configuration and requires manual CPT settings, and performance still depends heavily on which CPT configuration and CPU cores are involved.&lt;br/&gt;
 Here are quick LNET selftest results with the default CPT configuration and with a NUMA-aware CPT configuration.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;default CPT setting(cpu_npartitions=8)
client:server   PUT(GB/s)  GET(GB/s)
     1:1          7.0        6.8 
     1:2         11.3        3.2
     1:4         11.4        3.4

1 CPT(cpu_npartitions=1 cpu_pattern=&quot;0[40-47,104,111]&quot;)
client:server   PUT(GB/s)  GET(GB/s)
     1:1         11.0       11.0
     1:2         11.4       11.4
     1:4         11.4       11.4
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The NUMA-aware CPT configuration gives much better LNET performance, but CPTs are used not only by LNET but also by all other Lustre client threads. In general, we want more CPU cores and CPTs involved, but LNET needs to be aware of the CPT and the NUMA node where the network interface is installed.&lt;/p&gt;</description>
                <environment>master branch, AMD EPYC CPU</environment>
        <key id="56667">LU-12667</key>
            <summary>Read doesn&apos;t perform well in complex NUMA configuration</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.whamcloud.com/images/icons/priorities/major.svg">Major</priority>
                        <status id="1" iconUrl="https://jira.whamcloud.com/images/icons/statuses/open.png" description="The issue is open and ready for the assignee to start work on it.">Open</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="wc-triage">WC Triage</assignee>
                                    <reporter username="sihara">Shuichi Ihara</reporter>
                        <labels>
                    </labels>
                <created>Thu, 15 Aug 2019 14:19:41 +0000</created>
                <updated>Mon, 8 Jun 2020 22:19:40 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>13</watches>
                                                                            <comments>
                            <comment id="253130" author="sihara" created="Thu, 15 Aug 2019 14:55:12 +0000"  >&lt;p&gt;Then, assigning a CPT to the NI works. Automated fine tuning (e.g. detecting the NUMA node for the NI) might be better, but the manual setting is a workable solution as a workaround.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=&quot;o2ib10(ib0)[5]&quot;

# cat /sys/kernel/debug/lnet/cpu_partition_table
0	: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
1	: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
2	: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
3	: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
4	: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
5	: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
6	: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
7	: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="253134" author="ashehata" created="Thu, 15 Aug 2019 15:27:24 +0000"  >&lt;p&gt;For the sake of keeping a record. As discussed:&lt;/p&gt;

&lt;p&gt;The performance issue comes from the fact that the LND threads are spread across all the CPTs;&lt;br/&gt;
I&apos;m guessing that has a performance impact in this NUMA configuration.&lt;br/&gt;
By restricting the NI to the set of CPTs you&apos;re interested in, the LND threads are only spawned on those cores,&lt;br/&gt;
and RDMA is more efficient since it doesn&apos;t have to cross a NUMA boundary.&lt;/p&gt;

&lt;p&gt;The issue with automated tuning is that there are no criteria to base the tuning on in this case. Do you have a suggestion on how to automate this configuration?&lt;/p&gt;</comment>
                            <comment id="253355" author="sihara" created="Wed, 21 Aug 2019 09:05:19 +0000"  >&lt;p&gt;Originally, we wanted good single-client write and read performance, but initial read performance was bad. I was thinking this was because the CPT configuration for LNET was not optimal for such a NUMA node configuration. However, the problem seems to be even more complex, and I&apos;m still not sure whether this is an LNET problem or something else.&lt;/p&gt;

&lt;p&gt;Here is quick test results.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;# cat /etc/modprobe.d/lustre.conf 
options lnet networks=&quot;o2ib10(ib0)[5]&quot;
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;LNET selftest (RPC PUT)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;client - oss1 11.0GB/sec
client - oss2 11.0GB/sec
client - oss1,oss2 (distributed) 11.4GB/sec 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;All LNET performance seems good.&lt;/p&gt;

&lt;p&gt;IOR (read, FPP, 1MB)&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;mpirun -np 32 /work/tools/bin/ior -o /scratch0/2ost/file -a POSIX -r -e -b 4g -t 1m -F -C -vv  -k
client - oss1 (2xOST) 8.1GB/sec 
client - oss2 (2xOST) 8.1GB/sec
client - oss1,oss2 (4xOST) distributed 3.9GB/sec
client - oss1 (4xOST) 8.1GB/sec
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;If the client reads data from a single OSS, performance is reasonable (still not perfect, but not bad), but if the client talks to multiple OSSs, read performance drops.&lt;br/&gt;
 I thought it was related to the number of OSCs, but when I tested with the same 4xOSC against a single OSS, performance was good. I will dig more.&lt;/p&gt;</comment>
                            <comment id="253394" author="pfarrell" created="Wed, 21 Aug 2019 14:30:54 +0000"  >&lt;p&gt;Well, that makes sense to me if it&apos;s a CPT binding issue of some kind, because the CPT binding is linked to the OSS, not OST.&#160; And the CPT binding stuff in Lustre on the client mostly matters at the Lnet/o2ib type layers, as you know, so...&#160; That sort of fits.&lt;/p&gt;

&lt;p&gt;Hm.&#160; I&apos;ll reply to your email.&lt;/p&gt;</comment>
                            <comment id="253398" author="sihara" created="Wed, 21 Aug 2019 14:52:37 +0000"  >&lt;p&gt;It&apos;s the same read performance degradation regardless of CPT binding.&lt;br/&gt;
But at least I saw good LNET selftest performance even against multiple OSSs with CPT binding; when the client does actual I/O read operations, performance doesn&apos;t scale with the number of OSSs.&lt;/p&gt;</comment>
                            <comment id="253399" author="wshilong" created="Wed, 21 Aug 2019 15:04:27 +0000"  >&lt;p&gt;FYI, there is a known problem for read for striped files:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://review.whamcloud.com/#/c/35438/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://review.whamcloud.com/#/c/35438/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This should help read performance for files striped across different OSTs/OSSs, I guess.&lt;/p&gt;</comment>
                            <comment id="253400" author="sihara" created="Wed, 21 Aug 2019 15:08:14 +0000"  >&lt;p&gt;This is not a striped file; it&apos;s file-per-process.&lt;/p&gt;</comment>
                            <comment id="253421" author="ashehata" created="Thu, 22 Aug 2019 00:13:33 +0000"  >&lt;p&gt;One thing to consider is that when RDMAing to/from buffers, these buffers are allocated on a specific NUMA node. They could be spread across all the NUMA nodes. If the NUMA node a buffer is allocated on is far from the IB interface doing the RDMA, then it would impact performance. When we were doing the MR testing we noticed a significant impact due to these NUMA penalties. Granted, that was on a large UV machine, but the same problem could be happening here as well.&lt;/p&gt;

&lt;p&gt;One thing to try is to restrict buffer allocation to NUMA node 5. Can we try this and see how it impacts performance?&lt;/p&gt;</comment>
                            <comment id="253422" author="sihara" created="Thu, 22 Aug 2019 01:07:37 +0000"  >&lt;p&gt;OK, here is a test result configured with only one CPT, allocating all CPUs in NUMA node 5 to that CPT,&lt;br/&gt;
 like this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=&quot;o2ib10(ib0)&quot;
options libcfs cpu_npartitions=1 cpu_pattern=&quot;0[40-47,104,111]&quot;

# cat /sys/kernel/debug/lnet/cpu_partition_table
0	: 40 41 42 43 44 45 46 47 104 111
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;LNET selftest performs at 11GB/sec from a single client against either 2 or 4 servers, but IOR read still hits 4GB/sec against 2 OSSs (not only two OSSs, but any number of multiple servers).&lt;br/&gt;
 If the number of OSSs is reduced to 1, performance goes up to 8GB/sec. These are the exact same IOR results as above.&lt;/p&gt;</comment>
                            <comment id="253807" author="sihara" created="Wed, 28 Aug 2019 23:54:00 +0000"  >&lt;p&gt;If another interface is added on the client as multi-rail, read performance goes up.&lt;br/&gt;
But it&apos;s a bit strange: if I add an interface on the same NUMA node as the primary interface, performance doesn&apos;t scale, but if I add an interface on a different NUMA node from the primary interface, performance improves.&lt;/p&gt;

&lt;p&gt;e.g. &lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;root@mds15:~# cat /sys/class/net/ib0/device/numa_node 
5
root@mds15:~# cat /sys/class/net/ib1/device/numa_node 
5
root@mds15:~# cat /sys/class/net/ib2/device/numa_node 
6
root@mds15:~# cat /sys/class/net/ib3/device/numa_node 
6
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;options lnet networks=&quot;o2ib10(ib0)&quot;
Max Read:  3881.91 MiB/sec (4070.48 MB/sec)

options lnet networks=&quot;o2ib10(ib0,ib1)&quot;
Max Read:  3193.72 MiB/sec (3348.86 MB/sec)

options lnet networks=&quot;o2ib10(ib0,ib2)&quot;
Max Read:  6110.81 MiB/sec (6407.65 MB/sec)
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="253862" author="ashehata" created="Thu, 29 Aug 2019 18:03:54 +0000"  >&lt;p&gt;I&apos;ve observed that if you add two ports on the same HCA as different interfaces to the LNet network, there is no performance boost. A performance boost is only seen when you add two different physical HCA cards. I&apos;m not 100% sure why that is.&lt;/p&gt;

&lt;p&gt;A read test would do an RDMA write from the server to the client. Have you tried a write selftest from the two servers to the client? I&apos;m wondering if you&apos;d get the 11GB/s performance in this case.&lt;/p&gt;</comment>
                            <comment id="253872" author="bobhawkins" created="Thu, 29 Aug 2019 20:39:53 +0000"  >&lt;p&gt;&lt;font color=&quot;#1f497d&quot;&gt;Perhaps this is why two ports on one HCA are not scaling?&lt;/font&gt;&lt;/p&gt;

&lt;p&gt;&lt;font color=&quot;#1f497d&quot;&gt;Examine the &lt;/font&gt;&lt;font color=&quot;#1f497d&quot;&gt;card slot:&lt;/font&gt;&lt;/p&gt;

&lt;p&gt;&lt;font color=&quot;#1f497d&quot;&gt;One PCIe gen3 lane has a max electrical signaling bandwidth of 984.6 MB/s. &lt;/font&gt;&lt;font color=&quot;#1f497d&quot;&gt;One &#8220;PCIe gen3 x16&#8221; slot has sixteen lanes: 16 * 984.6 = 15.75 GB/s max (guaranteed not to exceed).&lt;/font&gt;&lt;/p&gt;

&lt;p&gt;&lt;font color=&quot;#1f497d&quot;&gt;And the dual-port HCA:&lt;br/&gt;
&lt;/font&gt;&lt;font color=&quot;#1f497d&quot;&gt;One dual-port EDR-IB card requires an x16 slot but &#8220;offers&#8221; two 100Gb/s (12.5 GB/s) ports. &lt;/font&gt;&lt;font color=&quot;#1f497d&quot;&gt;Data encoding allows 64 of 66 bits to be used; 2 bits are overhead. &lt;/font&gt;&lt;font color=&quot;#1f497d&quot;&gt;12.5 GB/s max * (64/66) leaves 12.1 GB/s usable bandwidth for one port to run at full speed.&lt;br/&gt;
 &lt;/font&gt;&lt;/p&gt;

&lt;p&gt;&lt;font color=&quot;#1f497d&quot;&gt;Therefore, the 15.75 GB/s &#8220;x16&#8221; slot only allows one port to run at full 12.1 GB/s EDR-IB speed. &lt;/font&gt;&lt;font color=&quot;#1f497d&quot;&gt;Cabling both ports, and assigning LNETs to both ports, without LNET knowing how to apportion bandwidth among the two ports, seems problematic. &lt;/font&gt;&lt;font color=&quot;#1f497d&quot;&gt;The x16 slot only provides ~65% of the bandwidth required to run both ports at full speed.&lt;/font&gt;&lt;/p&gt;</comment>
                            <comment id="253882" author="sihara" created="Thu, 29 Aug 2019 22:46:59 +0000"  >&lt;p&gt;Actually, I understand that there is a PCIe bandwidth limitation on the dual-port HCA, but 3-4GB/s is REALLY much lower than the expected single-EDR bandwidth, and I suspect some NUMA, NUMA/IO, or CPT-related problem behind it. I don&apos;t want to get higher bandwidth by adding HCAs here, but I am trying a number of configurations (e.g. increasing peers, pinning a CPT to the interface, using a dedicated CPT, etc.), since, as I said before, we get better performance on an EPYC client with a single CPU, but once another CPU is added, read performance drops.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                            <outwardlinks description="is related to ">
                                        <issuelink>
            <issuekey id="55434">LU-12194</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="33375" name="LU-12667-lnetselftest-results.txt" size="8708" author="sihara" created="Thu, 15 Aug 2019 14:23:01 +0000"/>
                            <attachment id="33376" name="lnet_selftest.sh" size="2236" author="sihara" created="Thu, 15 Aug 2019 14:23:01 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i00l9b:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>