Lustre / LU-11230

QIB route to OPA LNet drops / selftest fail

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.4
    • Environment: OPA, Lustre 2.10.4, QLogic QDR IB, CentOS 7.5

    Description

      Hi Folks,

      Looking for some assistance on this one. We're having trouble with reliable LNet routing between Qlogic and OPA clients. Basically, we see long pauses in I/O transfers when moving data between the two fabric types. Testing with lnet_selftest has shown that, over many hours, some tests (300-second runs) will randomly fail.

      In recent testing over the last few nights it seems to fail more often when LTO = OPA and LFROM = QIB. As far as I can tell, buffers, lnetctl stats, etc. look fine during transfers, then suddenly msgs_alloc and the /proc/sys/lnet/peers queues drop to zero right when lnet_selftest starts showing zero-sized transfers.
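
      (For reference, a minimal sketch of one way to watch those counters during a run; it assumes stock lnetctl and the /proc/sys/lnet interface on these 2.10 nodes, and the 5-second interval is arbitrary.)

      # Sketch: poll LNet message allocation and per-peer queues while
      # lnet_selftest runs.
      while true; do
          date
          lnetctl stats show            # msgs_alloc / msgs_max / drop counters
          cat /proc/sys/lnet/peers      # per-peer credits and queue depths
          sleep 5
      done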

      For LNet settings: with mismatched settings (i.e. ko2iblnd settings that aren't the same), LNet router OPA <-> compute/storage node OPA would basically always give me errors. With matched and 'Intel optimized' settings I've not yet seen it fail. Ethernet routing to OPA also seems to work fine.
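
      (For illustration, 'matched' here means the ko2iblnd module parameters are identical on both ends of a given fabric. A sketch of pinning them via modprobe follows; the values are examples only, borrowed from the OPA-style tunables discussed later in this ticket, not a recommendation.)

      # /etc/modprobe.d/ko2iblnd.conf -- sketch only; keep the values identical
      # on the router's OPA NI and the OPA compute/storage nodes.
      options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 \
          concurrent_sends=256 map_on_demand=32 fmr_pool_size=2048 \
          fmr_flush_trigger=512 fmr_cache=1 ntx=2048 conns_per_peer=4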

      We have the QIB nodes' LNet configuration set the same as the other nodes on the QIB fabric. I'll attach the config to this ticket in case we have some settings incorrectly applied to one of the IB nets.

      Are there any special settings we need to apply when routing between the old QLogic 'TrueScale' fabric and the new OPA fabric?

      Shortened example of failed selftest:

      [root@gstar057 ~]# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.77@o2ib /lfs/data0/lst-bench.sh
      LST_SESSION = 755
      SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
      192.168.55.77@o2ib are added to session
      192.168.44.199@o2ib44 are added to session
      Test was added successfully
      bulk_read is running now
      Capturing statistics for 300 secs [LNet Rates of lfrom]
      [R] Avg: 3163     RPC/s Min: 3163     RPC/s Max: 3163     RPC/s
      [W] Avg: 1580     RPC/s Min: 1580     RPC/s Max: 1580     RPC/s
      [LNet Bandwidth of lfrom]
      [R] Avg: 1581.81  MiB/s Min: 1581.81  MiB/s Max: 1581.81  MiB/s
      [W] Avg: 0.24     MiB/s Min: 0.24     MiB/s Max: 0.24     MiB/s
      
      etc...
      
      [LNet Bandwidth of lfrom]
      [R] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
      [W] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
      [LNet Rates of lto]
      [R] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
      [W] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
      [LNet Bandwidth of lto]
      [R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
      [W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
      
      lfrom:
      12345-192.168.55.77@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 18 errors, 0 dropped, 94 expired]
      Total 1 error nodes in lfrom
      lto:
      Total 0 error nodes in lto
      Batch is stopped
      session is ended
      

      A bit of help with this would be really appreciated. Let me know which logs would be most helpful; e.g. repeating tests with debug flags enabled can be done if that helps. I could certainly have made a configuration error, so if something doesn't look right with the lnet.conf let me know. We can't seem to find any ko2iblnd settings that are reliable.

      Cheers,
      Simon

      Attachments

        1. dk.log.xz
          6.23 MB
        2. lnet.conf
          2 kB
        3. lnetctl_export_lnet-router.txt
          16 kB
        4. lnetctl_export_opa.txt
          10 kB
        5. lnetctl_export_qlogic.txt
          4 kB
        6. lnet-tests-21_aug_2018.txt
          12 kB

        Activity

          pjones Peter Jones added a comment -

          Has this proposed meeting taken place yet?

          scadmin SC Admin added a comment -

          Hi Amir,

          Yep. That time will work. I'll email you through some details for a meeting with prospective dates.

          Cheers,
          Simon


          ashehata Amir Shehata (Inactive) added a comment -

          Does 4pm PST / 9am (your time) work? If so, let me know the date that works for you. We would need to be able to share screens or something of that sort to debug further.

           

          scadmin SC Admin added a comment -

          Hi Amir,

           

          > The above is from the export-opa config file. The min tx credits are quite low. That indicates a lot of queuing is happening on these peers. Are these peers relevant to the test you're running? They appear to be on the OPA network (o2ib44).

          These peers are not relevant for the purposes of the lnet_selftest (to my understanding). They are, however, important for actual file transfers, which is why we're going back to basic lnet_selftest runs to verify the network between fabrics.

          The peers below are (respectively) MDS1, MDS2, OSS1 for home, apps, etc., and OSS2 for home, apps, etc. There are another 8 OSSes for the main large filesystem, not listed here, which use IPs 192.168.44.13[1-8]@o2ib44:

          peer:
           - primary nid: 192.168.44.21@o2ib44
           - primary nid: 192.168.44.22@o2ib44
           - primary nid: 192.168.44.51@o2ib44
           - primary nid: 192.168.44.52@o2ib44
          

          > I didn't see any relevant errors in the log file you sent me. Are there any other errors in /var/log/messages besides the ones you pasted?

          Yeah, dmesg and /var/log/messages are really light on errors. The only errors that appear during the test period were the ones I pasted in, e.g. the "failed with -103" and "failed with -110" examples.

          > Would you also be able to share the lnet-selftest script you're using?

          Yup. It's a pretty standard one: 

          #!/bin/sh
          #
          # Simple wrapper script for LNET Selftest
          #
          
          # Parameters are supplied as environment variables
          # The defaults are reasonable for quick verification.
          # For in-depth benchmarking, increase the time (TM)
          # variable to e.g. 60 seconds, and iterate over
          # concurrency to find optimal values.
          #
          # Reference: http://wiki.lustre.org/LNET_Selftest
          
          # Concurrency
          CN=${CN:-32}
          #Size
          SZ=${SZ:-1M}
          # Length of time to run test (secs)
          TM=${TM:-10}
          # Which BRW test to run (read or write)
          BRW=${BRW:-"read"}
          # Checksum calculation (simple or full)
          CKSUM=${CKSUM:-"simple"}
          
          # The LST "from" list -- e.g. Lustre clients. Space separated list of NIDs.
          # LFROM="10.10.2.21@tcp"
          LFROM=${LFROM:?ERROR: the LFROM variable is not set}
          # The LST "to" list -- e.g. Lustre servers. Space separated list of NIDs.
          # LTO="10.10.2.22@tcp"
          LTO=${LTO:?ERROR: the LTO variable is not set}
          
          ### End of customisation.
          
          export LST_SESSION=$$
          echo LST_SESSION = ${LST_SESSION}
          lst new_session lst${BRW}
          lst add_group lfrom ${LFROM}
          lst add_group lto ${LTO}
          lst add_batch bulk_${BRW}
          lst add_test --batch bulk_${BRW} --from lfrom --to lto brw ${BRW} \
            --concurrency=${CN} check=${CKSUM} size=${SZ}
          lst run bulk_${BRW}
          echo -n "Capturing statistics for ${TM} secs "
          lst stat lfrom lto &
          LSTPID=$!
          # Delay loop with interval markers displayed every 5 secs.
          # Test time is rounded up to the nearest 5 seconds.
          i=1
          j=$((${TM}/5))
          if [ $((${TM}%5)) -ne 0 ]; then let j++; fi
          while [ $i -le $j ]; do
            sleep 5
            let i++
          done
          kill ${LSTPID} && wait ${LSTPID} >/dev/null 2>&1
          echo
          lst show_error lfrom lto
          lst stop bulk_${BRW}
          lst end_session
          

          > If you run lnet_selftest from the router to the QLOGIC node, do you get any errors? I'm trying to see if the problem is restricted between the router under test and the node.

          In my testing I found that a Qlogic compute node to the Qlogic interface on the LNet router worked reliably. The same goes for OPA compute nodes to the OPA interface on the LNet router: they worked just fine. In both cases, though (now this is testing my memory!), if I had mismatched the ko2iblnd settings between a compute node's and the router's respective fabric interfaces then I would get issues (depending on which settings were mismatched), but having them matched works just fine.

          Apart from the two Qlogic configs you just mentioned, I'd also tested the configuration below, which also gave poor results when routing between fabric types. This is actually our current LNet setup on all Qlogic compute nodes, with the exception of my test host / LNet router, where I've been changing the parameters to try and figure this all out:

              - net type: o2ib
                local NI(s):
                  - nid: 192.168.55.75@o2ib
                    status: up
                    interfaces:
                        0: ib0
                    tunables:
                        peer_timeout: 180
                        peer_credits: 8
                        peer_buffer_credits: 0
                        credits: 256
                    lnd tunables:
                        peercredits_hiw: 4
                        map_on_demand: 0
                        concurrent_sends: 8
                        fmr_pool_size: 512
                        fmr_flush_trigger: 384
                        fmr_cache: 1
                        ntx: 512
                        conns_per_peer: 1
                    tcp bonding: 0
                    dev cpt: 1
                    CPT: "[0,1]"
          

          > Finally, would we be able to set up a live debug session?

          Not a problem at all. We're on the east coast of Australia; I can set up a live session to help debug this if you want to pick a time that suits us both.

          Cheers,
          Simon

          ashehata Amir Shehata (Inactive) added a comment - edited

          Hi Simon,

          peer:
              - primary nid: 192.168.44.21@o2ib44
                Multi-Rail: False
                peer ni:
                  - nid: 192.168.44.21@o2ib44
                    min_tx_credits: -4815
              - primary nid: 192.168.44.22@o2ib44
                Multi-Rail: False
                peer ni:
                  - nid: 192.168.44.22@o2ib44
                    min_tx_credits: -4868
              - primary nid: 192.168.44.51@o2ib44
                Multi-Rail: False
                peer ni:
                  - nid: 192.168.44.51@o2ib44
                    state: NA
                    min_tx_credits: -10849
              - primary nid: 192.168.44.52@o2ib44
                Multi-Rail: False
                peer ni:
                  - nid: 192.168.44.52@o2ib44
                    min_tx_credits: -12366
          

          The above is from the export-opa config file. The min tx credits are quite low. That indicates a lot of queuing is happening on these peers. Are these peers relevant to the test you're running? They appear to be on the OPA network (o2ib44).
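
          (A minimal sketch of how that credit state can be inspected on the node, assuming the verbose peer output available in this 2.10 lnetctl build:)

          # Sketch: min_tx_credits going strongly negative means sends have been
          # queuing behind exhausted peer_credits for that peer.
          lnetctl peer show -v              # verbose per-peer counters
          lnetctl export > export-opa.txt   # full snapshot, as attached here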

          I didn't see any relevant errors in the log file you sent me. Are there any other errors in /var/log/messages besides the ones you pasted?

          Would you also be able to share the lnet-selftest script you're using?

          Also for the QIB I see that you tried both of these configs:

                        peercredits_hiw: 64
                        map_on_demand: 32
                        concurrent_sends: 256
                        fmr_pool_size: 2048
                        fmr_flush_trigger: 512
                        fmr_cache: 1
                        ntx: 2048
                        conns_per_peer: 4 

          and

                        peercredits_hiw: 64
                        map_on_demand: 0
                        concurrent_sends: 256
                        fmr_pool_size: 2048
                        fmr_flush_trigger: 512
                        fmr_cache: 1
                        ntx: 2048
                        conns_per_peer: 1

          If you run lnet_selftest from the router to the QLOGIC node, do you get any errors? I'm trying to see if the problem is restricted between the router under test and the node.

          My preference, though, is to stick with conns_per_peer: 1 for QLOGIC; the conns_per_peer: 4 setting was intended for OPA interfaces only.
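
          (Illustrative only: one way to express that split is per-NI lnd tunables in the YAML that lnetctl imports, as in the exports above, so the QLogic and OPA NIs carry different values. Fragment only; the interface names are placeholders.)

          # lnd-tunables.yaml (fragment, placeholder interface names); it would be
          # applied with: lnetctl import < lnd-tunables.yaml
          net:
              - net type: o2ib
                local NI(s):
                  - interfaces:
                        0: ib0
                    lnd tunables:
                        conns_per_peer: 1
              - net type: o2ib44
                local NI(s):
                  - interfaces:
                        0: ib1
                    lnd tunables:
                        conns_per_peer: 4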

          Finally, would we be able to set up a live debug session?

          thanks

          amir

          scadmin SC Admin added a comment -

          Hi Amir,

          I should add: there are no issues we can see with routes being marked down on either side or lctl pings failing. In general, everything appears OK. I wasn't sure if a really short test would capture it, so I ran the standard 5-minute test, which failed maybe 30 seconds to a minute in. I've attached three configs and the dk log as requested.
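
          (For completeness, a sketch of the checks being referred to; the NIDs are the ones used in the selftest runs above:)

          # Sketch: confirm routes stay up and basic LNet pings cross the router.
          lnetctl route show -v               # gateway/route state should stay up
          lnetctl ping 192.168.44.199@o2ib44  # OPA-side NID from the selftests
          lctl ping 192.168.55.77@o2ib        # QIB-side NID from the selftests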

          Cheers,

          Simon

           


          ashehata Amir Shehata (Inactive) added a comment -

          Hi Simon,

          If you can get me the following info that would be great:

          1. Configuration from the OPA node, router node and QLogic node (lnetctl export > config.yaml). It would be great if each one is in a separate file.
          2. Are you able to ping from OPA -> QLOGIC and from QLOGIC -> OPA with no problem? (lnetctl ping <NID>). If you're encountering a failure with a simple ping, let's turn on and capture the logging: lctl set_param debug=+"net neterror", THEN run the ping test, THEN lctl dk > log.dk.
          3. If the problem is not reproducible via ping, then turn on debugging as above, run a short selftest run (which would contain errors) and then capture the logging (see the sketch after this list).
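
          (A sketch of that capture sequence, using the commands from item 2; log.dk is the output file name suggested there:)

          lctl set_param debug=+"net neterror"   # enable net/neterror debugging
          # ... run lnetctl ping <NID>, or a short lnet_selftest run ...
          lctl dk > log.dk                       # dump and clear the debug log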

          thanks

          amir

          scadmin SC Admin added a comment -

          Hi Guys,

          To update this: I went through all the scenarios, doing a 5-minute selftest for each combination of eth/qdr/opa via our routers. This included tests between a node of each fabric type and the router's respective HCA/NIC, and between nodes on different fabrics. The common factor in each failure event is the Qlogic HCA. We cannot reliably route between Qlogic and Ethernet or OPA. We can route fine between Ethernet and OPA, and Ethernet to Ethernet. Failed selftests show up like this in dmesg or the message logs:

          Eg.

          QDR <-> OPA Test 2:
          LTO - OPA Compute node
          LFROM - Qlogic Compute node
          
          cmdline# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.78@o2ib /opt/lustre/bin/lst-bench.sh
          
          ..snip.
          [LNet Bandwidth of lfrom]
          [R] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
          [W] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
          [LNet Rates of lto]
          [R] Avg: 2        RPC/s Min: 2        RPC/s Max: 2        RPC/s
          [W] Avg: 2        RPC/s Min: 2        RPC/s Max: 2        RPC/s
          [LNet Bandwidth of lto]
          [R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          [W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          
          lfrom:
          12345-192.168.55.78@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped, 31 expired]
          Total 1 error nodes in lfrom
          lto:
          Total 0 error nodes in lto
          Batch is stopped
          session is ended
          [root@john99 ~]#
          
          
          
          LFROM node dmesg:
          
          LustreError: 1512:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -103
          LNet: 1514:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-192.168.44.199@o2ib44, timeout 64.
          LustreError: 1509:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -110
          LustreError: 1510:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -110
          LustreError: 1509:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 29 previous similar messages
          

          Or..

          Eth <-> Qlogic Test 2:
          LTO - Qlogic Compute node
          LFROM - VM with Mellanox 100G NIC
          
          cmdline# TM=300 LTO=192.168.55.78@o2ib LFROM=10.8.49.155@tcp201 /opt/lustre/bin/lst-bench.sh
          
          ..snip.
          [LNet Bandwidth of lfrom]
          [R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          [W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          [LNet Rates of lto]
          [R] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
          [W] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
          [LNet Bandwidth of lto]
          [R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          [W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          
          lfrom:
          12345-10.8.49.155@tcp201: [Session 32 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 64 expired]
          Total 1 error nodes in lfrom
          lto:
          12345-192.168.55.78@o2ib: [Session 0 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped, 63 expired]
          Total 1 error nodes in lto
          Batch is stopped
          session is ended
          [root@john99 ~]# 
          
          
          LTO node dmesg:
          
          [Tue Aug 21 22:26:11 2018] LNet: 25532:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-192.168.55.78@o2ib, timeout 64.
          [Tue Aug 21 22:26:11 2018] LNet: 25532:0:(rpc.c:1069:srpc_client_rpc_expired()) Skipped 31 previous similar messages
          [Tue Aug 21 22:26:11 2018] LustreError: 25512:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.55.78@o2ib failed with -110
          [Tue Aug 21 22:26:11 2018] LustreError: 25512:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 31 previous similar messages
          

          Summary of passed tests:

          QDR <-> QDR Test 1:
          LTO - Qlogic Compute node
          LFROM - Qlogic Lnet router HCA
          
          QDR <-> QDR Test 2:
          LTO - Qlogic Lnet router HCA
          LFROM - Qlogic Compute node
          
          OPA <-> OPA Test 1:
          LTO - OPA Compute node
          LFROM - OPA Lnet router HCA
          
          OPA <-> OPA Test 2:
          LTO - OPA Lnet router HCA
          LFROM - OPA Compute node
          
          Ethernet <-> Ethernet Test 1:
          LTO - VM with Mellanox 100G NIC
          LFROM - Lnet router with Mellanox 100G NIC
          
          Ethernet <-> Ethernet Test 2:
          LTO - Lnet router with Mellanox 100G NIC
          LFROM - VM with Mellanox 100G NIC
          
          QDR <-> OPA Test 1:
          LTO - Qlogic Compute node
          LFROM - OPA Compute node
          
          Eth <-> OPA Test 1:
          LTO - VM with Mellanox 100G NIC
          LFROM - OPA Compute node
          
          Eth <-> OPA Test 2:
          LTO - VM with Mellanox 100G NIC
          LFROM - OPA Compute node
          

          Summary of failed tests:

          QDR <-> OPA Test 2:
          LTO - OPA Compute node
          LFROM - Qlogic Compute node
          
          Eth <-> Qlogic Test 1:
          LTO - VM with Mellanox 100G NIC
          LFROM - Qlogic Compute node
          
          Eth <-> Qlogic Test 2:
          LTO - Qlogic Compute node
          LFROM - VM with Mellanox 100G NIC
          

          I modified one of our compute nodes today and re-configured the Qlogic HCAs on that node (as well as the Qlogic HCA on the router). Running either of the following lnetctl net configurations on the Qlogic HCA (applied as sketched after the configs below) showed the same failed results as above. Selftests within Qlogic only, on either of these configs, work without fail; the problems only appear between Qlogic and some other fabric type.

          Config 1:

              - net type: o2ib
                local NI(s):
                  - nid: 192.168.55.231@o2ib
                    status: up
                    interfaces:
                        0: ib0
                    tunables:
                        peer_timeout: 180
                        peer_credits: 128
                        peer_buffer_credits: 0
                        credits: 1024
                    lnd tunables:
                        peercredits_hiw: 64
                        map_on_demand: 32
                        concurrent_sends: 256
                        fmr_pool_size: 2048
                        fmr_flush_trigger: 512
                        fmr_cache: 1
                        ntx: 2048
                        conns_per_peer: 4
                    tcp bonding: 0
                    dev cpt: 1
                    CPT: "[0,1]"
          

          Config 2:

              - net type: o2ib
                local NI(s):
                  - nid: 192.168.55.231@o2ib
                    status: up
                    interfaces:
                        0: ib0
                    tunables:
                        peer_timeout: 180
                        peer_credits: 8
                        peer_buffer_credits: 0
                        credits: 256
                    lnd tunables:
                        peercredits_hiw: 4
                        map_on_demand: 0
                        concurrent_sends: 8
                        fmr_pool_size: 512
                        fmr_flush_trigger: 384
                        fmr_cache: 1
                        ntx: 512
                        conns_per_peer: 1
                    tcp bonding: 0
                    dev cpt: 1
                    CPT: "[0,1]"
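
          (For reference, a sketch of how a YAML block like the above gets applied and verified on a node; qlogic-config.yaml is a placeholder file name, and the persistent copy would normally live in the attached lnet.conf:)

          # Sketch: load LNet, apply the saved net configuration, then confirm
          # which lnd tunables actually took effect.
          lnetctl lnet configure
          lnetctl import < qlogic-config.yaml
          lnetctl net show -v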
          

          lnet-tests-21_aug_2018.txt

          Any thoughts on what we should be looking at?

          Cheers,
          Simon

          pjones Peter Jones added a comment -

          Amir

          Could you please help here?

          Thanks

          Peter


          People

            Assignee: ashehata Amir Shehata (Inactive)
            Reporter: scadmin SC Admin
            Votes: 0
            Watchers: 4
