[LU-10373] LNet OPA Performance Drop Created: 12/Dec/17  Updated: 17/Jan/18  Resolved: 17/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Major
Reporter: Ian Ziemba Assignee: Amir Shehata (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.3


Issue Links:
Related
is related to LU-10394 IB_MR_TYPE_SG_GAPS mlx5 LNet performa... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A drop in OPA LNet bandwidth has occurred since Lustre 2.10.0.

# lctl --version
lctl 2.10.0

----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 32 --distribute 1:1 --from clients --to servers brw read size=1M
Client Read RPC/s: 23426.6428571429
Client Write RPC/s: 11714.1428571429
Client Read MiB/s: 11713.6164285714
Client Write MiB/s: 1.78714285714286
----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 64 --distribute 1:1 --from clients --to servers brw read size=1M
Client Read RPC/s: 23577.5714285714
Client Write RPC/s: 11790.2857142857
Client Read MiB/s: 11789.2135714286
Client Write MiB/s: 1.79928571428571
----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 128 --distribute 1:1 --from clients --to servers brw read size=1M
Client Read RPC/s: 23595.5714285714
Client Write RPC/s: 11798.2857142857
Client Read MiB/s: 11799.1114285714
Client Write MiB/s: 1.8
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 32 --distribute 1:1 --from clients --to servers brw write size=1M
Client Read RPC/s: 21268.3571428571
Client Write RPC/s: 10635.2142857143
Client Read MiB/s: 1.62357142857143
Client Write MiB/s: 10634.2071428571
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 64 --distribute 1:1 --from clients --to servers brw write size=1M
Client Read RPC/s: 22236.9285714286
Client Write RPC/s: 11118.9285714286
Client Read MiB/s: 1.69714285714286
Client Write MiB/s: 11118.7914285714
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 128 --distribute 1:1 --from clients --to servers brw write size=1M
Client Read RPC/s: 22178.6428571429
Client Write RPC/s: 11087.2142857143
Client Read MiB/s: 1.69142857142857
Client Write MiB/s: 11089.0557142857


# lctl --version
lctl 2.10.55_127_g063a83a

----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 32 --distribute 1:1 --from clients --to servers brw read size=1M
Client Read RPC/s: 16879.5
Client Write RPC/s: 8441.14285714286
Client Read MiB/s: 8439.57857142857
Client Write MiB/s: 1.28785714285714
----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 64 --distribute 1:1 --from clients --to servers brw read size=1M
Client Read RPC/s: 21844
Client Write RPC/s: 10923.2857142857
Client Read MiB/s: 10922.4635714286
Client Write MiB/s: 1.66714285714286
----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 128 --distribute 1:1 --from clients --to servers brw read size=1M
Client Read RPC/s: 21928.4285714286
Client Write RPC/s: 10964.7857142857
Client Read MiB/s: 10965.17
Client Write MiB/s: 1.67357142857143
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 32 --distribute 1:1 --from clients --to servers brw write size=1M
Client Read RPC/s: 17288.2142857143
Client Write RPC/s: 8645.07142857143
Client Read MiB/s: 1.32
Client Write MiB/s: 8643.84928571428
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 64 --distribute 1:1 --from clients --to servers brw write size=1M
Client Read RPC/s: 18382.8571428571
Client Write RPC/s: 9192.92857142857
Client Read MiB/s: 1.40214285714286
Client Write MiB/s: 9191.25285714285
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 128 --distribute 1:1 --from clients --to servers brw write size=1M
Client Read RPC/s: 14966.3571428571
Client Write RPC/s: 7486.07142857143
Client Read MiB/s: 1.14285714285714
Client Write MiB/s: 7482.79071428571
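
For reference, the runs above were driven with LNet selftest; the following is a minimal sketch of such a session. The group names match the commands above, but the session name and client NID are placeholders, and the wrapper script that formatted the RPC/s and MiB/s output is not shown.

# export LST_SESSION=$$
# lst new_session --timeout 300 rw_perf
# lst add_group servers 10.2.0.40@o2ib1     # server NID from the LNet configuration below
# lst add_group clients 10.2.0.41@o2ib1     # placeholder client NID
# lst add_batch rperf
# lst add_test --batch rperf --concurrency 32 --distribute 1:1 --from clients --to servers brw read size=1M
# lst run rperf
# lst stat clients servers                  # per-group RPC/s and MiB/s while the batch runs
# lst stop rperf
# lst end_session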


LNet configuration is:

# cat /etc/lnet.conf
net:
    - net type: o2ib1
      local NI(s):
        - nid: 10.2.0.40@o2ib1
          interfaces:
              0: ib0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 256
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 2
          CPT: "[0,1]"

OPA driver configuration is:

# cat /etc/modprobe.d/hfi1.conf
options hfi1 piothreshold=0 sge_copy_mode=2 wss_threshold=70
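
As a sanity check (a sketch; the exact show output varies by Lustre version), the YAML above can be applied and read back with lnetctl, and the hfi1 options can be confirmed through the module parameters in sysfs:

# lnetctl lnet configure
# lnetctl import < /etc/lnet.conf           # load the YAML configuration shown above
# lnetctl net show --net o2ib1 --verbose    # confirm credits, map_on_demand, conns_per_peer, etc.
# cat /sys/module/hfi1/parameters/piothreshold
0
# cat /sys/module/hfi1/parameters/sge_copy_mode
2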



 Comments   
Comment by Ian Ziemba [ 12/Dec/17 ]

For Lustre 2.10.0 results, map_on_demand was set to 32.
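
For comparison, a sketch of one way map_on_demand could have been pinned to 32 for those runs, as a ko2iblnd module option (the file name is illustrative; it could equally have been set through the lnd tunables in /etc/lnet.conf):

# cat /etc/modprobe.d/ko2iblnd.conf
options ko2iblnd map_on_demand=32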

Comment by Peter Jones [ 19/Dec/17 ]

Amir

Please can you advise

Peter

Comment by Ian Ziemba [ 20/Dec/17 ]

It looks like with master, 256 RDMA fragments are used for a 1M OPA LNet transfer whereas Lustre 2.10 used a single RDMA fragment. Could this be a possible reason for the performance drop?

Comment by Doug Oucharek (Inactive) [ 20/Dec/17 ]

That's strange. I would have thought switching from 256 fragments to one would be better for performance.

Comment by Ian Ziemba [ 20/Dec/17 ]

Doug - That is what I am seeing. A single RDMA fragment (Lustre 2.10) does perform much better than 256 RDMA fragments (Lustre master). Sorry if my prior comment did not make that clear.

Comment by Ian Ziemba [ 20/Dec/17 ]

Here is the latest data I have with CentOS 7.4. Note that Lustre 2.10.2 does not experience the performance issues that master does.

[root@client01 lst_performance]# uname -r
3.10.0-693.11.1.el7.x86_64
[root@client01 lst_performance]# lctl --version
lctl 2.10.2
[root@client01 lst_performance]# opaconfig -V
10.6.1.0.2

----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 32 --distribute 1:1 --from clients --to servers brw read size=1m
Client Read RPC/s: 16600.1428571429
Client Write RPC/s: 8301.85714285714
Client Read MiB/s: 8299.86857142857
Client Write MiB/s: 1.26785714285714
----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 64 --distribute 1:1 --from clients --to servers brw read size=1m
Client Read RPC/s: 16048.0714285714
Client Write RPC/s: 8025.28571428571
Client Read MiB/s: 8023.49428571428
Client Write MiB/s: 1.22428571428571
----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 128 --distribute 1:1 --from clients --to servers brw read size=1m
Client Read RPC/s: 16942.7857142857
Client Write RPC/s: 8471.21428571429
Client Read MiB/s: 8471.78357142857
Client Write MiB/s: 1.29428571428571
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 32 --distribute 1:1 --from clients --to servers brw write size=1m
Client Read RPC/s: 21703.3571428571
Client Write RPC/s: 10852.9285714286
Client Read MiB/s: 1.65571428571429
Client Write MiB/s: 10851.7657142857
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 64 --distribute 1:1 --from clients --to servers brw write size=1m
Client Read RPC/s: 21922.0714285714
Client Write RPC/s: 10961.4285714286
Client Read MiB/s: 1.67214285714286
Client Write MiB/s: 10961.2514285714
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 128 --distribute 1:1 --from clients --to servers brw write size=1m
Client Read RPC/s: 21912.2857142857
Client Write RPC/s: 10953.8571428571
Client Read MiB/s: 1.67071428571429
Client Write MiB/s: 10956.0221428571



[root@client01 lst_performance]# uname -r
3.10.0-693.11.1.el7.x86_64
[root@client01 lst_performance]# lctl --version
lctl 2.10.56_39_gbe4507f
[root@client01 lst_performance]# opaconfig -V
10.6.1.0.2

----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 32 --distribute 1:1 --from clients --to servers brw read size=1m
Client Read RPC/s: 14908.8571428571
Client Write RPC/s: 7456
Client Read MiB/s: 7453.895
Client Write MiB/s: 1.13928571428571
----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 64 --distribute 1:1 --from clients --to servers brw read size=1m
Client Read RPC/s: 14782.8571428571
Client Write RPC/s: 7393.5
Client Read MiB/s: 7390.86071428571
Client Write MiB/s: 1.12928571428571
----------------------------------------------------------
Running test: lst add_test --batch rperf --concurrency 128 --distribute 1:1 --from clients --to servers brw read size=1m
Client Read RPC/s: 14793.1428571429
Client Write RPC/s: 7397.5
Client Read MiB/s: 7396.55285714286
Client Write MiB/s: 1.13
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 32 --distribute 1:1 --from clients --to servers brw write size=1m
Client Read RPC/s: 14475.2857142857
Client Write RPC/s: 7238.64285714286
Client Read MiB/s: 1.10642857142857
Client Write MiB/s: 7237.25142857143
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 64 --distribute 1:1 --from clients --to servers brw write size=1m
Client Read RPC/s: 18805
Client Write RPC/s: 9403.14285714286
Client Read MiB/s: 1.43428571428571
Client Write MiB/s: 9402.445
----------------------------------------------------------
Running test: lst add_test --batch wperf --concurrency 128 --distribute 1:1 --from clients --to servers brw write size=1m
Client Read RPC/s: 14235
Client Write RPC/s: 7115.71428571429
Client Read MiB/s: 1.08714285714286
Client Write MiB/s: 7116.90714285714



In addition, I am seeing ECONNABORTED errors with Lustre master that I do not see with 2.10:

00000800:00000100:0.0F:1513796499.976702:0:117:0:(o2iblnd_cb.c:1920:kiblnd_close_conn_locked()) Closing conn to 10.2.0.40@o2ib1: error 0(waiting)
00000400:00000100:11.0F:1513796499.977076:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 5, RPC errors 1
00000400:00000100:11.0:1513796499.977081:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 3, RPC errors 2
00000001:00020000:13.0F:1513796499.977088:0:2329:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.2.0.40@o2ib1: -103
00000400:00000100:11.0:1513796499.977114:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 5, RPC errors 3
00000400:00000100:11.0:1513796499.977116:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 3, RPC errors 4
00000001:00020000:1.0F:1513796499.977122:0:2325:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.2.0.40@o2ib1: -103
00000400:00000100:1.0:1513796499.977125:0:2325:0:(rpc.c:905:srpc_server_rpc_done()) Server RPC ffff881049cd9400 done: service brw_test, peer 12345-10.2.0.40@o2ib1, status SWI_STATE_BULK_STARTED:-5
00000001:00020000:1.0:1513796499.977128:0:2325:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.2.0.40@o2ib1 has failed: -5
00000400:00000100:19.0F:1513796499.977146:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 5, RPC errors 5
00000400:00000100:19.0:1513796499.977149:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 3, RPC errors 6
00000001:00020000:5.0F:1513796499.977155:0:2330:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.2.0.40@o2ib1: -103
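
For reproduction, traces like the above can be captured with the standard LNet debug machinery; a sketch, where the debug mask and buffer size are just reasonable choices:

# lctl set_param debug=+net                 # add network tracing to the debug mask
# lctl set_param debug_mb=256               # enlarge the kernel debug buffer
  ... run the lst batch ...
# lctl dk /tmp/lnet-debug.log               # dump and clear the trace buffer
# grep -E 'kiblnd_close_conn_locked|srpc_lnet_ev_handler' /tmp/lnet-debug.log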


Comment by Amir Shehata (Inactive) [ 05/Jan/18 ]

Can you let me know how you determined it's using 256 fragments? Did you conclude that by looking at the map_on_demand value in the stats?

This value indicates the maximum number of fragments negotiated between the peers. However, with OPA we should always be collapsing everything into one fragment.

There is also a Cray OPA Bugzilla ticket open that indicates a ~2GB/s performance drop between different IFS versions: Bug 142506.

Is this the same issue?
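
As an aside, the configured (as opposed to negotiated) value can be read back with lnetctl; a sketch, with the output abbreviated and the indentation approximate:

# lnetctl net show --net o2ib1 --verbose
      ...
          lnd tunables:
              map_on_demand: 256
              concurrent_sends: 256
      ...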

Comment by Ian Ziemba [ 11/Jan/18 ]

Hi Amir,

I was monitoring the number of work requests on the transmit message when kiblnd_init_rdma() finished. I pulled the latest master version and verified that only 1 RDMA fragment is being used with OPA. It appears that the LU-10129 patch resolved the issue I was seeing. Just to verify, I built Lustre from the commit before the LU-10129 patch and confirmed I was seeing 256 RDMA fragments with 1M messages.

I think we can close this ticket.
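
A sketch of the verification step described above, assuming a lustre-release checkout (the commit hash is a placeholder and the configure options are omitted):

# git log --oneline --grep='LU-10129' -- lnet/klnds/o2iblnd   # locate the LU-10129 o2iblnd commits
# git checkout <lu-10129-commit>^                             # check out the parent commit
# sh autogen.sh && ./configure && make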

Comment by Amir Shehata (Inactive) [ 11/Jan/18 ]

Hi Ian,

Thanks for verifying. And you are correct that prior to LU-10129, 256 fragments would be used. LU-10129 reworked the map-on-demand code to behave more appropriately. But it's quite interesting that using more fragments causes reduced performance.

Comment by Joseph Gmitter (Inactive) [ 17/Jan/18 ]

Issue fixed by the patch for LU-10394.
