Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: None
    • Labels: None
    • Environment: CentOS 7.3
    • Severity: 3

    Description

      A drop in OPA LNet bandwidth has occurred since Lustre 2.10.0.

      # lctl --version
      lctl 2.10.0
      
      ----------------------------------------------------------
      Running test: lst add_test --batch rperf --concurrency 32 --distribute 1:1 --from clients --to servers brw read size=1M
      Client Read RPC/s: 23426.6428571429
      Client Write RPC/s: 11714.1428571429
      Client Read MiB/s: 11713.6164285714
      Client Write MiB/s: 1.78714285714286
      ----------------------------------------------------------
      Running test: lst add_test --batch rperf --concurrency 64 --distribute 1:1 --from clients --to servers brw read size=1M
      Client Read RPC/s: 23577.5714285714
      Client Write RPC/s: 11790.2857142857
      Client Read MiB/s: 11789.2135714286
      Client Write MiB/s: 1.79928571428571
      ----------------------------------------------------------
      Running test: lst add_test --batch rperf --concurrency 128 --distribute 1:1 --from clients --to servers brw read size=1M
      Client Read RPC/s: 23595.5714285714
      Client Write RPC/s: 11798.2857142857
      Client Read MiB/s: 11799.1114285714
      Client Write MiB/s: 1.8
      ----------------------------------------------------------
      Running test: lst add_test --batch wperf --concurrency 32 --distribute 1:1 --from clients --to servers brw write size=1M
      Client Read RPC/s: 21268.3571428571
      Client Write RPC/s: 10635.2142857143
      Client Read MiB/s: 1.62357142857143
      Client Write MiB/s: 10634.2071428571
      ----------------------------------------------------------
      Running test: lst add_test --batch wperf --concurrency 64 --distribute 1:1 --from clients --to servers brw write size=1M
      Client Read RPC/s: 22236.9285714286
      Client Write RPC/s: 11118.9285714286
      Client Read MiB/s: 1.69714285714286
      Client Write MiB/s: 11118.7914285714
      ----------------------------------------------------------
      Running test: lst add_test --batch wperf --concurrency 128 --distribute 1:1 --from clients --to servers brw write size=1M
      Client Read RPC/s: 22178.6428571429
      Client Write RPC/s: 11087.2142857143
      Client Read MiB/s: 1.69142857142857
      Client Write MiB/s: 11089.0557142857
      
      
      
      # lctl --version
      lctl 2.10.55_127_g063a83a
      
      ----------------------------------------------------------
      Running test: lst add_test --batch rperf --concurrency 32 --distribute 1:1 --from clients --to servers brw read size=1M
      Client Read RPC/s: 16879.5
      Client Write RPC/s: 8441.14285714286
      Client Read MiB/s: 8439.57857142857
      Client Write MiB/s: 1.28785714285714
      ----------------------------------------------------------
      Running test: lst add_test --batch rperf --concurrency 64 --distribute 1:1 --from clients --to servers brw read size=1M
      Client Read RPC/s: 21844
      Client Write RPC/s: 10923.2857142857
      Client Read MiB/s: 10922.4635714286
      Client Write MiB/s: 1.66714285714286
      ----------------------------------------------------------
      Running test: lst add_test --batch rperf --concurrency 128 --distribute 1:1 --from clients --to servers brw read size=1M
      Client Read RPC/s: 21928.4285714286
      Client Write RPC/s: 10964.7857142857
      Client Read MiB/s: 10965.17
      Client Write MiB/s: 1.67357142857143
      ----------------------------------------------------------
      Running test: lst add_test --batch wperf --concurrency 32 --distribute 1:1 --from clients --to servers brw write size=1M
      Client Read RPC/s: 17288.2142857143
      Client Write RPC/s: 8645.07142857143
      Client Read MiB/s: 1.32
      Client Write MiB/s: 8643.84928571428
      ----------------------------------------------------------
      Running test: lst add_test --batch wperf --concurrency 64 --distribute 1:1 --from clients --to servers brw write size=1M
      Client Read RPC/s: 18382.8571428571
      Client Write RPC/s: 9192.92857142857
      Client Read MiB/s: 1.40214285714286
      Client Write MiB/s: 9191.25285714285
      ----------------------------------------------------------
      Running test: lst add_test --batch wperf --concurrency 128 --distribute 1:1 --from clients --to servers brw write size=1M
      Client Read RPC/s: 14966.3571428571
      Client Write RPC/s: 7486.07142857143
      Client Read MiB/s: 1.14285714285714
      Client Write MiB/s: 7482.79071428571
      
      
      
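      For reference, the numbers above come from lnet_selftest batches of the form shown below. This is only a minimal sketch of such a wrapper script; the client NID and the 30-second sampling window are illustrative and not taken from this ticket.

      #!/bin/bash
      # Minimal lnet_selftest sketch (illustrative NIDs; adjust for the real cluster).
      export LST_SESSION=$$

      lst new_session rw_perf
      lst add_group servers 10.2.0.40@o2ib1       # server NID from the lnet.conf below
      lst add_group clients 10.2.0.41@o2ib1       # placeholder client NID
      lst add_batch rperf
      lst add_test --batch rperf --concurrency 32 --distribute 1:1 \
          --from clients --to servers brw read size=1M
      lst run rperf
      lst stat clients servers &                  # reports RPC/s and MiB/s periodically
      sleep 30
      kill %1
      lst end_session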

      LNet configuration is:

      # cat /etc/lnet.conf
      net:
          - net type: o2ib1
            local NI(s):
              - nid: 10.2.0.40@o2ib1
                interfaces:
                    0: ib0
                tunables:
                    peer_timeout: 180
                    peer_credits: 128
                    peer_buffer_credits: 0
                    credits: 1024
                lnd tunables:
                    peercredits_hiw: 64
                    map_on_demand: 256
                    concurrent_sends: 256
                    fmr_pool_size: 2048
                    fmr_flush_trigger: 512
                    fmr_cache: 1
                    ntx: 2048
                    conns_per_peer: 2
                CPT: "[0,1]"
      
      
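      The configuration above can be applied and checked with lnetctl roughly as follows (a sketch; exact option spellings may vary slightly between releases). "lnetctl net show --verbose" should echo back the peer_credits, map_on_demand and conns_per_peer values listed above.

      # lnetctl lnet configure
      # lnetctl import < /etc/lnet.conf
      # lnetctl net show --verbose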

      OPA driver configuration is:

      # cat /etc/modprobe.d/hfi1.conf
      options hfi1 piothreshold=0 sge_copy_mode=2 wss_threshold=70
      
      
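      If the hfi1 module exposes these parameters through sysfs (it normally does), the running values can be cross-checked against the file above with something like:

      # for p in piothreshold sge_copy_mode wss_threshold; do printf '%s = ' $p; cat /sys/module/hfi1/parameters/$p; done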

          Activity

            [LU-10373] LNet OPA Performance Drop

            Issue fixed by patch for LU-10394

            jgmitter Joseph Gmitter (Inactive) added a comment

            Hi Ian,

            Thanks for verifying. And you are correct that prior to LU-10129, 256 fragments would be used. LU-10129 re-worked the map-on-demand code to behave more appropriately. But it's quite interesting that using more fragments causes reduced performance.

            ashehata Amir Shehata (Inactive) added a comment

            Hi Amir,

            I was monitoring the number of work requests of the transmit message when kiblnd_init_rdma finished. I pulled the latest master version, and have verified that only 1 RDMA fragment is being used with OPA. It appears that the LU-10129 patch resolved the issue I was seeing. Just to verify, I built Lustre from the commit before the LU-10129 patch, and confirmed I was seeing 256 RDMA fragments with 1M messages.

            I think we can close this ticket.

            iziemba Ian Ziemba (Inactive) added a comment
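            For anyone reproducing this, one low-effort way to watch the o2iblnd traffic is the LNet debug log. This is only a sketch; whether kiblnd_init_rdma itself prints the work-request count depends on the debug messages compiled into the build, so the final grep pattern is illustrative.

            # lctl set_param debug=+net
            # lctl clear
            ... run the lst batch ...
            # lctl dk /tmp/lnet-debug.log
            # grep kiblnd_init_rdma /tmp/lnet-debug.log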

            Can you let me know how you determined it's using 256 fragments? Did you conclude that by looking at the map_on_demand value in the stats?

            This value indicates the maximum number of fragments being negotiated between the peers. However, with OPA we should always be collapsing everything into one fragment.

            There is also a Cray OPA bugzilla open that indicates there is a ~2 GB/s performance drop between different IFS versions: Bug 142506.

            Is this the same issue?

            ashehata Amir Shehata (Inactive) added a comment
            iziemba Ian Ziemba (Inactive) added a comment - - edited

            Here is the latest data I have with CentOS 7.4. Note that Lustre 2.10.2 does not experience the performance issues that master does.

            [root@client01 lst_performance]# uname -r
            3.10.0-693.11.1.el7.x86_64
            [root@client01 lst_performance]# lctl --version
            lctl 2.10.2
            [root@client01 lst_performance]# opaconfig -V
            10.6.1.0.2
            
            ----------------------------------------------------------
            Running test: lst add_test --batch rperf --concurrency 32 --distribute 1:1 --from clients --to servers brw read size=1m
            Client Read RPC/s: 16600.1428571429
            Client Write RPC/s: 8301.85714285714
            Client Read MiB/s: 8299.86857142857
            Client Write MiB/s: 1.26785714285714
            ----------------------------------------------------------
            Running test: lst add_test --batch rperf --concurrency 64 --distribute 1:1 --from clients --to servers brw read size=1m
            Client Read RPC/s: 16048.0714285714
            Client Write RPC/s: 8025.28571428571
            Client Read MiB/s: 8023.49428571428
            Client Write MiB/s: 1.22428571428571
            ----------------------------------------------------------
            Running test: lst add_test --batch rperf --concurrency 128 --distribute 1:1 --from clients --to servers brw read size=1m
            Client Read RPC/s: 16942.7857142857
            Client Write RPC/s: 8471.21428571429
            Client Read MiB/s: 8471.78357142857
            Client Write MiB/s: 1.29428571428571
            ----------------------------------------------------------
            Running test: lst add_test --batch wperf --concurrency 32 --distribute 1:1 --from clients --to servers brw write size=1m
            Client Read RPC/s: 21703.3571428571
            Client Write RPC/s: 10852.9285714286
            Client Read MiB/s: 1.65571428571429
            Client Write MiB/s: 10851.7657142857
            ----------------------------------------------------------
            Running test: lst add_test --batch wperf --concurrency 64 --distribute 1:1 --from clients --to servers brw write size=1m
            Client Read RPC/s: 21922.0714285714
            Client Write RPC/s: 10961.4285714286
            Client Read MiB/s: 1.67214285714286
            Client Write MiB/s: 10961.2514285714
            ----------------------------------------------------------
            Running test: lst add_test --batch wperf --concurrency 128 --distribute 1:1 --from clients --to servers brw write size=1m
            Client Read RPC/s: 21912.2857142857
            Client Write RPC/s: 10953.8571428571
            Client Read MiB/s: 1.67071428571429
            Client Write MiB/s: 10956.0221428571
            
            
            
            
            [root@client01 lst_performance]# uname -r
            3.10.0-693.11.1.el7.x86_64
            [root@client01 lst_performance]# lctl --version
            lctl 2.10.56_39_gbe4507f
            [root@client01 lst_performance]# opaconfig -V
            10.6.1.0.2
            
            ----------------------------------------------------------
            Running test: lst add_test --batch rperf --concurrency 32 --distribute 1:1 --from clients --to servers brw read size=1m
            Client Read RPC/s: 14908.8571428571
            Client Write RPC/s: 7456
            Client Read MiB/s: 7453.895
            Client Write MiB/s: 1.13928571428571
            ----------------------------------------------------------
            Running test: lst add_test --batch rperf --concurrency 64 --distribute 1:1 --from clients --to servers brw read size=1m
            Client Read RPC/s: 14782.8571428571
            Client Write RPC/s: 7393.5
            Client Read MiB/s: 7390.86071428571
            Client Write MiB/s: 1.12928571428571
            ----------------------------------------------------------
            Running test: lst add_test --batch rperf --concurrency 128 --distribute 1:1 --from clients --to servers brw read size=1m
            Client Read RPC/s: 14793.1428571429
            Client Write RPC/s: 7397.5
            Client Read MiB/s: 7396.55285714286
            Client Write MiB/s: 1.13
            ----------------------------------------------------------
            Running test: lst add_test --batch wperf --concurrency 32 --distribute 1:1 --from clients --to servers brw write size=1m
            Client Read RPC/s: 14475.2857142857
            Client Write RPC/s: 7238.64285714286
            Client Read MiB/s: 1.10642857142857
            Client Write MiB/s: 7237.25142857143
            ----------------------------------------------------------
            Running test: lst add_test --batch wperf --concurrency 64 --distribute 1:1 --from clients --to servers brw write size=1m
            Client Read RPC/s: 18805
            Client Write RPC/s: 9403.14285714286
            Client Read MiB/s: 1.43428571428571
            Client Write MiB/s: 9402.445
            ----------------------------------------------------------
            Running test: lst add_test --batch wperf --concurrency 128 --distribute 1:1 --from clients --to servers brw write size=1m
            Client Read RPC/s: 14235
            Client Write RPC/s: 7115.71428571429
            Client Read MiB/s: 1.08714285714286
            Client Write MiB/s: 7116.90714285714
            
            
            
            

            In addition, I am seeing ECONNABORTED errors with Lustre master that I do not see with 2.10:

            00000800:00000100:0.0F:1513796499.976702:0:117:0:(o2iblnd_cb.c:1920:kiblnd_close_conn_locked()) Closing conn to 10.2.0.40@o2ib1: error 0(waiting)
            00000400:00000100:11.0F:1513796499.977076:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 5, RPC errors 1
            00000400:00000100:11.0:1513796499.977081:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 3, RPC errors 2
            00000001:00020000:13.0F:1513796499.977088:0:2329:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.2.0.40@o2ib1: -103
            00000400:00000100:11.0:1513796499.977114:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 5, RPC errors 3
            00000400:00000100:11.0:1513796499.977116:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 3, RPC errors 4
            00000001:00020000:1.0F:1513796499.977122:0:2325:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.2.0.40@o2ib1: -103
            00000400:00000100:1.0:1513796499.977125:0:2325:0:(rpc.c:905:srpc_server_rpc_done()) Server RPC ffff881049cd9400 done: service brw_test, peer 12345-10.2.0.40@o2ib1, status SWI_STATE_BULK_STARTED:-5
            00000001:00020000:1.0:1513796499.977128:0:2325:0:(brw_test.c:389:brw_server_rpc_done()) Bulk transfer from 12345-10.2.0.40@o2ib1 has failed: -5
            00000400:00000100:19.0F:1513796499.977146:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 5, RPC errors 5
            00000400:00000100:19.0:1513796499.977149:0:2289:0:(rpc.c:1418:srpc_lnet_ev_handler()) LNet event status -103 type 3, RPC errors 6
            00000001:00020000:5.0F:1513796499.977155:0:2330:0:(brw_test.c:415:brw_bulk_ready()) BRW bulk WRITE failed for RPC from 12345-10.2.0.40@o2ib1: -103
            
            
            
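            (Status -103 in the lines above is ECONNABORTED. Assuming the debug log was dumped to a file with "lctl dk" as sketched earlier, the connection closes can be correlated with the failed bulk RPCs by filtering on the functions involved; the pattern is only illustrative.)

            # grep -E 'kiblnd_close_conn_locked|srpc_lnet_ev_handler|brw_bulk_ready' /tmp/lnet-debug.log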

            Doug - That is what I am seeing. A single RDMA fragment (Lustre 2.10) does perform much better than 256 RDMA fragments (Lustre master). Sorry if my prior comment did not make that clear.

            iziemba Ian Ziemba (Inactive) added a comment

            That's strange. I would have thought switching from 256 fragments to one would be better for performance.

            dougo Doug Oucharek (Inactive) added a comment

            It looks like with master, 256 RDMA fragments are used for a 1M OPA LNet transfer whereas Lustre 2.10 used a single RDMA fragment. Could this be a possible reason for the performance drop?

            iziemba Ian Ziemba (Inactive) added a comment
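            The 256-fragment count is consistent with one RDMA fragment per page for a 1 MiB transfer, assuming the usual 4 KiB page size:

            # echo $(( 1024 * 1024 / 4096 ))    # 1 MiB split into 4 KiB pages -> 256 fragments
            256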
            pjones Peter Jones added a comment -

            Amir

            Please can you advise?

            Peter


            For the Lustre 2.10.0 results, map_on_demand was set to 32.

            iziemba Ian Ziemba (Inactive) added a comment
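            For reference, map_on_demand is an lnd tunable that can be pinned either through the ko2iblnd module options or in the lnd tunables block of lnet.conf. The snippet below is an illustrative example of the module-option form, not the exact configuration used for the 2.10.0 runs:

            # cat /etc/modprobe.d/ko2iblnd.conf
            options ko2iblnd map_on_demand=32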

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: iziemba Ian Ziemba (Inactive)