
2.12.2: IOR read testing reliably inducing LNetError: Timed out RDMA with ...@o2ib1 (6): c: 0, oc: 0, rc: 63

Details


    Description

      Hello,
      I've been doing some more benchmark testing for IO500 today and have been reliably hitting what appears to be an LNET/RDMA/Network problem since upgrading our servers and clients to Lustre 2.12.2.

      The test environment is:

      Server Environment
      # 24 Identical servers, each server contains 12x NVMe SSDs.
      # Each server is both MDS and OSS
      lustre-2.12.2-1.el7.x86_64
      kernel-3.10.0-957.10.1.el7_lustre.x86_64
      
      # Fabric is Intel OPA, using distro's in-kernel OFED
      # Each server has 2x OPA cards, configured with LNET multi-rail
      [root@dac-e-1 ~]# lnetctl net show
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: o2ib1
            local NI(s):
              - nid: 10.47.18.1@o2ib1
                status: up
                interfaces:
                    0: ib0
              - nid: 10.47.18.25@o2ib1
                status: up
                interfaces:
                    0: ib1
      
      # LNET tunables
              tunables:
                    peer_timeout: 180
                    peer_credits: 128
                    peer_buffer_credits: 0
                    credits: 1024
                    peercredits_hiw: 64
                    map_on_demand: 256
                    concurrent_sends: 256
                    fmr_pool_size: 2048
                    fmr_flush_trigger: 512
                    fmr_cache: 1
                    ntx: 2048
                    conns_per_peer: 4
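      
      For reference, a minimal sketch of how a two-NI multi-rail setup like the one above is typically brought up with lnetctl (the interface names ib0/ib1 match the output above; the exact commands are illustrative, not our actual provisioning scripts):
      
      # Load LNet and add both OPA interfaces to the same o2ib1 network
      lnetctl lnet configure
      lnetctl net add --net o2ib1 --if ib0
      lnetctl net add --net o2ib1 --if ib1
      # Persist the running configuration so it can be reloaded at boot
      lnetctl export > /etc/lnet.conf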
      
      Client Environment
      # Test involves 10 clients
      lustre-client-2.12.2-1.el7.x86_64
      kernel-3.10.0-957.10.1.el7.x86_64
      
      # Client has 1x Intel OPA card
          - net type: o2ib1
            local NI(s):
              - nid: 10.47.20.61@o2ib1
                status: up
                interfaces:
                    0: ib0
                statistics:
                    send_count: 21957610
                    recv_count: 29253267
                    drop_count: 2
                tunables:
                    peer_timeout: 180
                    peer_credits: 128
                    peer_buffer_credits: 0
                    credits: 1024
                    peercredits_hiw: 64
                    map_on_demand: 256
                    concurrent_sends: 256
                    fmr_pool_size: 2048
                    fmr_flush_trigger: 512
                    fmr_cache: 1
                    ntx: 2048
                    conns_per_peer: 4
                lnd tunables:
                dev cpt: 0
                tcp bonding: 0
                CPT: "[0,1]"
      

      I'm testing with the 'ior_easy' phase of IO500, which just writes a single file per process to obtain a maximum bandwidth number.

      The write phase of the test completes fine; however, after a few minutes of the read phase, the clients start to show LNetError messages such as:

      Client Errors
      Jun 04 19:47:08 cpu-e-1149 kernel: Lustre: Mounted fs1-client
      Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
      Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Skipped 1 previous similar message
      Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.47.18.14@o2ib1 (5): c: 0, oc: 0, rc: 63
      Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Skipped 1 previous similar message
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1559674918/real 1559674924]  req@ffff8ddc96787500 x1635429905047040/t0(0) o3->fs1-OST0089-osc-ffff8df3bf0c8000@10.47.18.12@o2ib1:6/4 lens 488/440 e 0 to 1 dl 1559674962 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 36 previous similar messages
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection to fs1-OST0089 (at 10.47.18.12@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection restored to 10.47.18.12@o2ib1 (at 10.47.18.12@o2ib1)
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
      Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.47.18.10@o2ib1 - queue depth reduced from 128 to 63  to allow for qp creation
      Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 83 previous similar messages
      

      And the servers:

      Server Errors
      Jun 04 20:02:08 dac-e-10 kernel: Lustre: fs1-OST006f: Client 899ea795-c4b3-f3f8-43dd-db593fc06301 (at 10.47.21.126@o2ib1) reconnecting
      Jun 04 20:02:08 dac-e-10 kernel: Lustre: Skipped 1 previous similar message
      Jun 04 20:02:59 dac-e-10 kernel: LustreError: 303530:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(1048576)  req@ffff8b787babd050 x1635429904859568/t0(0) o3->899ea795-c4b3-f3f8-43dd-db593fc06301@10.47.21.126@o2ib1:527/0 lens 488/440 e 3 to 0 dl 1559674997 ref 1 fl Interpret:/0/0 rc 0/0
      Jun 04 20:02:59 dac-e-10 kernel: LustreError: 303530:0:(ldlm_lib.c:3268:target_bulk_io()) Skipped 1 previous similar message
      Jun 04 20:02:59 dac-e-10 kernel: Lustre: fs1-OST0074: Bulk IO read error with 899ea795-c4b3-f3f8-43dd-db593fc06301 (at 10.47.21.126@o2ib1), client will retry: rc -110
      Jun 04 20:02:59 dac-e-10 kernel: Lustre: Skipped 6 previous similar messages
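      
      If it is useful, extra LNet debug can be captured around the failure window on an affected client with something like the following (a sketch; the output path is illustrative):
      
      # Enable net debug before the read phase, then dump the kernel debug buffer after the errors appear
      lctl set_param debug=+net
      lctl dk /tmp/lnet-debug.log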
      

      I've attached log files to the ticket with full logs for all the servers and the clients.

      The IOR output with command used is:

      IOR-3.3alpha1: MPI Coordinated Test of Parallel I/O
      Began               : Tue Jun  4 19:56:12 2019
      Command line        : /home/mjr208/projects/benchmarking/io-500-src-stonewall-fix/bin/ior -r -R -a POSIX --posix.odirect -C -t 1m -b 140g -F -i 1 -C -Q 1 -g -G 27 -k -e -o /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/ior_file_easy -O stoneWallingStatusFile=/dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/stonewall
      Machine             : Linux cpu-e-829
      TestID              : 0
      StartTime           : Tue Jun  4 19:56:12 2019
      Path                : /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy
      FS                  : 412.6 TiB   Used FS: 8.5%   Inodes: 960.0 Mi   Used Inodes: 0.0%
      
      Options: 
      api                 : POSIX
      apiVersion          : 
      test filename       : /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/ior_file_easy
      access              : file-per-process
      type                : independent
      segments            : 1
      ordering in a file  : sequential
      tasks               : 320
      clients per node    : 32
      repetitions         : 1
      xfersize            : 1 MiB
      blocksize           : 140 GiB
      aggregate filesize  : 43.75 TiB
      
      Results: 
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ---------- ---------  --------   --------   --------   --------   ----
      WARNING: Expected aggregate file size       = 48103633715200.
      WARNING: Stat() of aggregate file size      = 38543976038400.
      WARNING: Using actual aggregate bytes moved = 38543976038400.
      read      16882      146800640  1024.00    0.023208   2177.30    0.023764   2177.33    0   
      Max Read:  16882.35 MiB/sec (17702.42 MB/sec)
      
      Summary of all tests:
      Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
      read        16882.35   16882.35   16882.35       0.00   16882.35   16882.35   16882.35       0.00 2177.32777     0    320  32    1   1     2        1         0    0      1 150323855360  1048576 36758400.0 POSIX      0
      Finished            : Tue Jun  4 20:32:29 2019
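      
      For what it's worth, the "Expected aggregate file size" warning above is just 320 tasks x 140 GiB per file, while the smaller "actual aggregate bytes moved" presumably reflects the stonewalled write phase (the kept files are shorter than the full 140 GiB block size). A quick check of the expected figure:
      
      $ echo $((320 * 140 * 1024**3))
      48103633715200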
      

      My main cause for suspicion here is that this same hardware and client combination ran the same benchmark this morning without hitting this problem; at that time, however, the servers were running Lustre 2.12.0 and the clients 2.10.7.

      Is there anything notable in the logs provided that points to what might be going on?

      Thanks,
      Matt

      Attachments

        1. clients.20190604.log
          96 kB
        2. clients-peer-credits-128.log
          32 kB
        3. clients-peer-credits-16.run1.log
          10 kB
        4. clients-peer-credits-16.run2.log
          68 kB
        5. clients-peer-credits-16.run3.log
          47 kB
        6. qp_stats_client_good.txt
          69 kB
        7. qp_stats_server_good.txt
          89 kB
        8. Screenshot from 2019-06-05 14-20-52.png
          285 kB
        9. servers.20190604.log
          462 kB
        10. servers-peer-credits-128.log
          73 kB
        11. servers-peer-credits-16.run1.log
          49 kB
        12. servers-peer-credits-16.run2.log
          233 kB
        13. servers-peer-credits-16.run3.log
          141 kB


          Activity

            [LU-12385] 2.12.2: IOR read testing reliably inducing LNetError: Timed out RDMA with ...@o2ib1 (6): c: 0, oc: 0, rc: 63
            pjones Peter Jones added a comment -

            Excellent - thanks!


            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hi Peter,
            I've just run the test and can confirm that these three patches resolve the issue for me. I re-ran the test case against clients and servers running upstream 2.12.2, which showed the problem as expected. Then I took the current tip of branch b2_12, applied those three patches from Gerrit on top, and re-ran the test, which did not show the issue.
            Cheers,
            Matt

            pjones Peter Jones added a comment -

            Great - thanks Matt!


            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hi Peter, yes, sorry - I can get this done before the weekend for sure; I'll send an update tomorrow once I've done the test.
            Cheers,
            Matt

            pjones Peter Jones added a comment -

            Matt

            Do you expect to get to this shortly? I understand that you are probably pressed with other priorities (not least prepping for your LAD talk) but I'd like to move to plan B for how we handle this issue for 2.12.3 if we're not going to be able to validate whether this proposed fix works soon.

            Thanks

            Peter

            pjones Peter Jones added a comment -

            If you can, that would be perfect, but testing on master would work too. Basically, as long as you can run without the patches and reproduce this behaviour, then run with them and avoid it.


            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hi Peter, yes I'll give this a test - shall I apply these on top of branch b2_12 for testing?
            Cheers,
            Matt

            pjones Peter Jones added a comment -

            mrb, are you able to verify whether these three patches do, as hoped, address this issue?


            ashehata Amir Shehata (Inactive) added a comment -

            Hi Matt,

            There are 3 patches which should resolve this issue:

            https://review.whamcloud.com/36072 LU-12385 lnet: update opa defaults
            https://review.whamcloud.com/36073 LU-12621 o2iblnd: cache max_qp_wr
            https://review.whamcloud.com/35578 LU-12569 o2iblnd: Make credits hiw connection aware

            LU-12385: based on your testing, I changed the defaults to use peer_credits 32 and peer_credits_hiw 16.

            LU-12621: change the algorithm to cache the maximum number of work requests per qp which a device can handle, and make sure we do not exceed it when creating the qp.

            LU-12569: if we do end up reducing the queue depth, make sure to use the negotiated queue depth when calculating the hiw.

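
            For anyone hitting this before the patches land, the new OPA defaults from the first patch could presumably also be applied by hand as ko2iblnd module options, e.g. (illustrative values taken from the comment above, not copied from the patch itself):

            # /etc/modprobe.d/ko2iblnd.conf
            options ko2iblnd peer_credits=32 peer_credits_hiw=16
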

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36072
            Subject: LU-12385 lnet: update opa defaults
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 0b74daac778694097a445b92812ef8c99ffb1881


            mrb Matt Rásó-Barnett (Inactive) added a comment -

            I tried with peer_credits=128, peer_credits_hiw=64 and max_qp_wrs=32896 over the weekend and this also appeared to fix the problem. Read performance was again similar to the other configurations tried with lower peer_credits.

            So it seems that this hfi module parameter increase is also a fix for this issue.
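
            Assuming max_qp_wrs is exposed as an hfi1 module parameter (as the comment above implies; not verified here), the setting and a quick check of the value the driver actually loaded with would look something like:

            # /etc/modprobe.d/hfi1.conf (illustrative)
            options hfi1 max_qp_wrs=32896

            # Confirm the running value
            cat /sys/module/hfi1/parameters/max_qp_wrs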

            People

              ashehata Amir Shehata (Inactive)
              mrb Matt Rásó-Barnett (Inactive)