
[LU-12385] 2.12.2: IOR read testing reliably inducing LNetError: Timed out RDMA with ...@o2ib1 (6): c: 0, oc: 0, rc: 63


    Description

      Hello,
      I've been doing some more benchmark testing for IO500 today and have been reliably hitting what appears to be an LNET/RDMA/Network problem since upgrading our servers and clients to Lustre 2.12.2.

      The test environment is:

      Server Environment
      # 24 Identical servers, each server contains 12x NVMe SSDs.
      # Each server is both MDS and OSS
      lustre-2.12.2-1.el7.x86_64
      kernel-3.10.0-957.10.1.el7_lustre.x86_64
      
      # Fabric is Intel OPA, using distro's in-kernel OFED
      # Each server has 2x OPA cards, configured with LNET multi-rail
      [root@dac-e-1 ~]# lnetctl net show
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: o2ib1
            local NI(s):
              - nid: 10.47.18.1@o2ib1
                status: up
                interfaces:
                    0: ib0
              - nid: 10.47.18.25@o2ib1
                status: up
                interfaces:
                    0: ib1
      
      # LNET tunables
              tunables:
                    peer_timeout: 180
                    peer_credits: 128
                    peer_buffer_credits: 0
                    credits: 1024
                    peercredits_hiw: 64
                    map_on_demand: 256
                    concurrent_sends: 256
                    fmr_pool_size: 2048
                    fmr_flush_trigger: 512
                    fmr_cache: 1
                    ntx: 2048
                    conns_per_peer: 4
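
      # A minimal sketch of how an LND tunables set like the above is commonly applied
      # via ko2iblnd module options. The file path and exact option list here are
      # assumptions for illustration, not this system's actual configuration; the
      # values simply mirror the lnetctl output above.
      # /etc/modprobe.d/ko2iblnd.conf
      options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 \
              concurrent_sends=256 map_on_demand=256 ntx=2048 \
              fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4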
      
      Client Environment
      # Test involves 10 clients
      lustre-client-2.12.2-1.el7.x86_64
      kernel-3.10.0-957.10.1.el7.x86_64
      
      # Client has 1x Intel OPA card
          - net type: o2ib1
            local NI(s):
              - nid: 10.47.20.61@o2ib1
                status: up
                interfaces:
                    0: ib0
                statistics:
                    send_count: 21957610
                    recv_count: 29253267
                    drop_count: 2
                tunables:
                    peer_timeout: 180
                    peer_credits: 128
                    peer_buffer_credits: 0
                    credits: 1024
                    peercredits_hiw: 64
                    map_on_demand: 256
                    concurrent_sends: 256
                    fmr_pool_size: 2048
                    fmr_flush_trigger: 512
                    fmr_cache: 1
                    ntx: 2048
                    conns_per_peer: 4
                lnd tunables:
                dev cpt: 0
                tcp bonding: 0
                CPT: "[0,1]"
      

      I'm testing with the 'ior_easy' phase of IO500, which just writes a single file per-process to achieve a maximum bandwidth number.
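
      For comparison, the write phase that precedes the read test shown further down uses essentially the same IOR invocation with -w in place of -r -R. A sketch follows; the exact flags the IO500 driver passes for the write phase are an assumption, and only the read command below is taken from the actual run:

      # hypothetical write-phase invocation, mirroring the read command shown later
      ior -w -a POSIX --posix.odirect -C -t 1m -b 140g -F -i 1 -Q 1 -g -G 27 -k -e \
          -o /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/ior_file_easy \
          -O stoneWallingStatusFile=/dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/stonewall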

      The write phase of the test completes fine; however, after a few minutes of the read phase, the clients start to show LNetError messages such as:

      Client Errors
      Jun 04 19:47:08 cpu-e-1149 kernel: Lustre: Mounted fs1-client
      Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 0 seconds
      Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Skipped 1 previous similar message
      Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.47.18.14@o2ib1 (5): c: 0, oc: 0, rc: 63
      Jun 04 20:01:54 cpu-e-1149 kernel: LNetError: 19644:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Skipped 1 previous similar message
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1559674918/real 1559674924]  req@ffff8ddc96787500 x1635429905047040/t0(0) o3->fs1-OST0089-osc-ffff8df3bf0c8000@10.47.18.12@o2ib1:6/4 lens 488/440 e 0 to 1 dl 1559674962 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: 19805:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 36 previous similar messages
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection to fs1-OST0089 (at 10.47.18.12@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: fs1-OST0089-osc-ffff8df3bf0c8000: Connection restored to 10.47.18.12@o2ib1 (at 10.47.18.12@o2ib1)
      Jun 04 20:02:04 cpu-e-1149 kernel: Lustre: Skipped 4 previous similar messages
      Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) peer 10.47.18.10@o2ib1 - queue depth reduced from 128 to 63  to allow for qp creation
      Jun 04 20:02:10 cpu-e-1149 kernel: LNet: 34294:0:(o2iblnd.c:941:kiblnd_create_conn()) Skipped 83 previous similar messages
      

      And the servers:

      Server Errors
      Jun 04 20:02:08 dac-e-10 kernel: Lustre: fs1-OST006f: Client 899ea795-c4b3-f3f8-43dd-db593fc06301 (at 10.47.21.126@o2ib1) reconnecting
      Jun 04 20:02:08 dac-e-10 kernel: Lustre: Skipped 1 previous similar message
      Jun 04 20:02:59 dac-e-10 kernel: LustreError: 303530:0:(ldlm_lib.c:3268:target_bulk_io()) @@@ truncated bulk READ 0(1048576)  req@ffff8b787babd050 x1635429904859568/t0(0) o3->899ea795-c4b3-f3f8-43dd-db593fc06301@10.47.21.126@o2ib1:527/0 lens 488/440 e 3 to 0 dl 1559674997 ref 1 fl Interpret:/0/0 rc 0/0
      Jun 04 20:02:59 dac-e-10 kernel: LustreError: 303530:0:(ldlm_lib.c:3268:target_bulk_io()) Skipped 1 previous similar message
      Jun 04 20:02:59 dac-e-10 kernel: Lustre: fs1-OST0074: Bulk IO read error with 899ea795-c4b3-f3f8-43dd-db593fc06301 (at 10.47.21.126@o2ib1), client will retry: rc -110
      Jun 04 20:02:59 dac-e-10 kernel: Lustre: Skipped 6 previous similar messages
      

      I've attached full logs for all of the servers and clients to the ticket.

      The IOR output, including the command used, is:

      IOR-3.3alpha1: MPI Coordinated Test of Parallel I/O
      Began               : Tue Jun  4 19:56:12 2019
      Command line        : /home/mjr208/projects/benchmarking/io-500-src-stonewall-fix/bin/ior -r -R -a POSIX --posix.odirect -C -t 1m -b 140g -F -i 1 -C -Q 1 -g -G 27 -k -e -o /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/ior_file_easy -O stoneWallingStatusFile=/dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/stonewall
      Machine             : Linux cpu-e-829
      TestID              : 0
      StartTime           : Tue Jun  4 19:56:12 2019
      Path                : /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy
      FS                  : 412.6 TiB   Used FS: 8.5%   Inodes: 960.0 Mi   Used Inodes: 0.0%
      
      Options: 
      api                 : POSIX
      apiVersion          : 
      test filename       : /dac/fs1/mjr208/job12066244-2019-06-04-1950/ior_easy/ior_file_easy
      access              : file-per-process
      type                : independent
      segments            : 1
      ordering in a file  : sequential
      tasks               : 320
      clients per node    : 32
      repetitions         : 1
      xfersize            : 1 MiB
      blocksize           : 140 GiB
      aggregate filesize  : 43.75 TiB
      
      Results: 
      
      access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
      ------    ---------  ---------- ---------  --------   --------   --------   --------   ----
      WARNING: Expected aggregate file size       = 48103633715200.
      WARNING: Stat() of aggregate file size      = 38543976038400.
      WARNING: Using actual aggregate bytes moved = 38543976038400.
      read      16882      146800640  1024.00    0.023208   2177.30    0.023764   2177.33    0   
      Max Read:  16882.35 MiB/sec (17702.42 MB/sec)
      
      Summary of all tests:
      Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
      read        16882.35   16882.35   16882.35       0.00   16882.35   16882.35   16882.35       0.00 2177.32777     0    320  32    1   1     2        1         0    0      1 150323855360  1048576 36758400.0 POSIX      0
      Finished            : Tue Jun  4 20:32:29 2019
      

      My main cause for suspicion here is that this same hardware and client combination ran the same benchmark this morning without hitting this problem; however, at that time the servers were running Lustre 2.12.0 and the clients were on 2.10.7.

      Is there anything notable in the logs provided as to what might be going on?

      Thanks,
      Matt

      Attachments

        1. clients.20190604.log
          96 kB
        2. clients-peer-credits-128.log
          32 kB
        3. clients-peer-credits-16.run1.log
          10 kB
        4. clients-peer-credits-16.run2.log
          68 kB
        5. clients-peer-credits-16.run3.log
          47 kB
        6. qp_stats_client_good.txt
          69 kB
        7. qp_stats_server_good.txt
          89 kB
        8. Screenshot from 2019-06-05 14-20-52.png
          285 kB
        9. servers.20190604.log
          462 kB
        10. servers-peer-credits-128.log
          73 kB
        11. servers-peer-credits-16.run1.log
          49 kB
        12. servers-peer-credits-16.run2.log
          233 kB
        13. servers-peer-credits-16.run3.log
          141 kB


          Activity


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36252/
            Subject: LU-12385 lnet: update opa defaults
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: eaed79bd261177b27b9993aa2ccb167d381c1e36

            gerrit Gerrit Updater added a comment -

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36252
            Subject: LU-12385 lnet: update opa defaults
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 371ba5e6c74cfa3a709254a97905f3506092f446

            pjones Peter Jones added a comment -

            Landed for 2.13


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36072/
            Subject: LU-12385 lnet: update opa defaults
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 7f199dbf0261b89afe0dc8185db4403ae0efdefa

            pjones Peter Jones added a comment -

            Excellent - thanks!


            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hi Peter,
            I've run the test just now and can confirm that these three patches resolve the issue for me. I re-ran the test case against clients and servers running upstream 2.12.2, which showed the problem as expected. Then I took the current tip of branch b2_12, applied those three patches from Gerrit on top, and re-ran the test, which didn't show the issue.
            Cheers,
            Matt
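
            For reference, applying Gerrit changes on top of b2_12 in this way looks roughly like the sketch below. Change 36252 from this ticket is used as the example; the repository URLs and the patch-set number in the refspec are assumptions and would need to match the changes being tested.

            # hypothetical example: check out b2_12 and cherry-pick one change from Gerrit
            git clone git://git.whamcloud.com/fs/lustre-release.git
            cd lustre-release
            git checkout b2_12
            # Gerrit changes are fetchable as refs/changes/<last two digits>/<change>/<patch set>
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/52/36252/1
            git cherry-pick FETCH_HEAD
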
            pjones Peter Jones added a comment -

            Great - thanks Matt!


            mrb Matt Rásó-Barnett (Inactive) added a comment -

            Hi Peter, yes, sorry - I can get this done before the weekend for sure; I'll send an update tomorrow once I've done the test.
            Cheers,
            Matt

            pjones Peter Jones added a comment -

            Matt

            Do you expect to get to this shortly? I understand that you are probably pressed with other priorities (not least prepping for your LAD talk) but I'd like to move to plan B for how we handle this issue for 2.12.3 if we're not going to be able to validate whether this proposed fix works soon.

            Thanks

            Peter

            pjones Peter Jones added a comment -

            If you can, that would be perfect, but testing on master would work too. Basically, as long as you can run without the patches and reproduce this behaviour, then run with them and avoid it.


            People

              ashehata Amir Shehata (Inactive)
              mrb Matt Rásó-Barnett (Inactive)
              Votes: 0
              Watchers: 14
