[LU-11230] QIB route to OPA LNet drops / selftest fail Created: 09/Aug/18  Updated: 01/Oct/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: SC Admin (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: LNet, lnet-testing
Environment:

OPA, lustre 2.10.4, Qlogic QDR IB, Centos 7.5


Attachments: File dk.log.xz     Text File lnet-tests-21_aug_2018.txt     File lnet.conf     Text File lnetctl_export_lnet-router.txt     Text File lnetctl_export_opa.txt     Text File lnetctl_export_qlogic.txt    
Epic/Theme: lnet
Severity: 3
Epic: lnet, performance
Rank (Obsolete): 9223372036854775807

 Description   

Hi Folks,

Looking for some assistance on this one. We're having trouble with reliable LNet routing between Qlogic and OPA clients. Basically, we see long pauses in I/O when moving data between the two fabric types. Testing with lnet_selftest has shown that, over many hours, some tests (300-second runs) will randomly fail.

In recent testing over the last few nights it seems to fail more often when LTO = OPA and LFROM = QIB. As far as I can tell, the buffers, lnetctl stats, etc. look fine during transfers, then suddenly msgs_alloc and the /proc/sys/lnet/peers queue drop to zero right when lnet_selftest starts showing zero-sized transfers.
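For reference, this is roughly how I've been watching the router during a run - just the standard lnetctl and proc interfaces, nothing site-specific:

# rough monitoring loop used on the router while a selftest runs
# (on some builds these files live under /sys/kernel/debug/lnet instead of /proc/sys/lnet)
watch -n 5 'lnetctl stats show; cat /proc/sys/lnet/buffers; cat /proc/sys/lnet/peers'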

Regarding LNet settings: with mismatched ko2iblnd settings (i.e. settings that aren't the same on both ends), LNet router OPA <-> compute/storage node OPA would basically always give me errors. With matched and 'Intel optimized' settings I've not yet seen it fail. Ethernet routing to OPA also seems to work fine.

We have the QIB node's LNet configuration set the same as the other nodes on the QIB fabric. I'll attach the config to this ticket in case we have some settings incorrectly applied to one of the IB nets.
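For what it's worth, this is roughly how I've been confirming the tunables really do match across the QIB nodes (host names here are just placeholders):

# quick check that the LNet/ko2iblnd tunables match on both ends
# (host names are placeholders for a QIB compute node and the LNet router)
for h in qib-compute qib-router; do
  echo "== $h =="
  ssh $h 'lnetctl net show --net o2ib --verbose'
done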

Are there any special settings we need to apply when trying routing between old & new 'Truescale' fabrics?

Shortened example of failed selftest:

[root@gstar057 ~]# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.77@o2ib /lfs/data0/lst-bench.sh
LST_SESSION = 755
SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
192.168.55.77@o2ib are added to session
192.168.44.199@o2ib44 are added to session
Test was added successfully
bulk_read is running now
Capturing statistics for 300 secs [LNet Rates of lfrom]
[R] Avg: 3163     RPC/s Min: 3163     RPC/s Max: 3163     RPC/s
[W] Avg: 1580     RPC/s Min: 1580     RPC/s Max: 1580     RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 1581.81  MiB/s Min: 1581.81  MiB/s Max: 1581.81  MiB/s
[W] Avg: 0.24     MiB/s Min: 0.24     MiB/s Max: 0.24     MiB/s

etc...

[LNet Bandwidth of lfrom]
[R] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
[W] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
[LNet Rates of lto]
[R] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
[W] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
[W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s

lfrom:
12345-192.168.55.77@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 18 errors, 0 dropped, 94 expired]
Total 1 error nodes in lfrom
lto:
Total 0 error nodes in lto
Batch is stopped
session is ended

A bit of help with this would be really appreciated. Let me know which logs would be the most helpful, e.g. we can repeat the tests with debug flags enabled if that helps. I could certainly have made a configuration error - if something doesn't look right with the lnet.conf, let me know. We can't seem to find any ko2iblnd settings that are reliable.

Cheers,
Simon



 Comments   
Comment by Peter Jones [ 09/Aug/18 ]

Amir

Could you please help here?

Thanks

Peter

Comment by SC Admin (Inactive) [ 22/Aug/18 ]

Hi Guys,

To update this: I went through all the scenarios, running a 5-minute selftest for each combination of eth/qdr/opa via our routers. This included tests between a node of each fabric type and the router's respective HCA/NIC, and between nodes on different fabrics. The common factor in each failure event is the Qlogic HCA. We cannot reliably route between Qlogic and Ethernet or OPA. We can route fine between Ethernet and OPA, or Ethernet and Ethernet. Failed selftests show up like this in dmesg or the message logs:

Eg.

QDR <-> OPA Test 2:
LTO - OPA Compute node
LFROM - Qlogic Compute node

cmdline# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.78@o2ib /opt/lustre/bin/lst-bench.sh

..snip.
[LNet Bandwidth of lfrom]
[R] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
[W] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
[LNet Rates of lto]
[R] Avg: 2        RPC/s Min: 2        RPC/s Max: 2        RPC/s
[W] Avg: 2        RPC/s Min: 2        RPC/s Max: 2        RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
[W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s

lfrom:
12345-192.168.55.78@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped, 31 expired]
Total 1 error nodes in lfrom
lto:
Total 0 error nodes in lto
Batch is stopped
session is ended
[root@john99 ~]#



LFROM node dmesg:

LustreError: 1512:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -103
LNet: 1514:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-192.168.44.199@o2ib44, timeout 64.
LustreError: 1509:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -110
LustreError: 1510:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -110
LustreError: 1509:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 29 previous similar messages

Or..

Eth <-> Qlogic Test 2:
LTO - Qlogic Compute node
LFROM - VM with Mellanox 100G NIC

cmdline# TM=300 LTO=192.168.55.78@o2ib LFROM=10.8.49.155@tcp201 /opt/lustre/bin/lst-bench.sh

..snip.
[LNet Bandwidth of lfrom]
[R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
[W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
[LNet Rates of lto]
[R] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
[W] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
[W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s

lfrom:
12345-10.8.49.155@tcp201: [Session 32 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 64 expired]
Total 1 error nodes in lfrom
lto:
12345-192.168.55.78@o2ib: [Session 0 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped, 63 expired]
Total 1 error nodes in lto
Batch is stopped
session is ended
[root@john99 ~]# 


LTO node dmesg:

[Tue Aug 21 22:26:11 2018] LNet: 25532:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-192.168.55.78@o2ib, timeout 64.
[Tue Aug 21 22:26:11 2018] LNet: 25532:0:(rpc.c:1069:srpc_client_rpc_expired()) Skipped 31 previous similar messages
[Tue Aug 21 22:26:11 2018] LustreError: 25512:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.55.78@o2ib failed with -110
[Tue Aug 21 22:26:11 2018] LustreError: 25512:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 31 previous similar messages

Summary of passed test:

QDR <-> QDR Test 1:
LTO - Qlogic Compute node
LFROM - Qlogic Lnet router HCA

QDR <-> QDR Test 2:
LTO - Qlogic Lnet router HCA
LFROM - Qlogic Compute node

OPA <-> OPA Test 1:
LTO - OPA Compute node
LFROM - OPA Lnet router HCA

OPA <-> OPA Test 2:
LTO - OPA Lnet router HCA
LFROM - OPA Compute node

Ethernet <-> Ethernet Test 1:
LTO - VM with Mellanox 100G NIC
LFROM - Lnet router with Mellanox 100G NIC

Ethernet <-> Ethernet Test 2:
LTO - Lnet router with Mellanox 100G NIC
LFROM - VM with Mellanox 100G NIC

QDR <-> OPA Test 1:
LTO - Qlogic Compute node
LFROM - OPA Compute node

Eth <-> OPA Test 1:
LTO - VM with Mellanox 100G NIC
LFROM - OPA Compute node

Eth <-> OPA Test 2:
LTO - VM with Mellanox 100G NIC
LFROM - OPA Compute node

Summary of failed tests:

QDR <-> OPA Test 2:
LTO - OPA Compute node
LFROM - Qlogic Compute node

Eth <-> Qlogic Test 1:
LTO - VM with Mellanox 100G NIC
LFROM - Qlogic Compute node

Eth <-> Qlogic Test 2:
LTO - Qlogic Compute node
LFROM - VM with Mellanox 100G NIC

I modified one of our compute nodes today and re-configured the Qlogic HCAs on that node (as well as the Qlogic HCA on the router). Running either of the following lnetctl net configurations for the Qlogic HCA showed the same failed results as above. Selftests within Qlogic only work without fail on either of these configs; the problems are only between Qlogic and some other fabric type.

Config 1:

    - net type: o2ib
      local NI(s):
        - nid: 192.168.55.231@o2ib
          status: up
          interfaces:
              0: ib0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: 1
          CPT: "[0,1]"

Config 2:

    - net type: o2ib
      local NI(s):
        - nid: 192.168.55.231@o2ib
          status: up
          interfaces:
              0: ib0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          tcp bonding: 0
          dev cpt: 1
          CPT: "[0,1]"
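For reference, each variant was swapped in on the test node with something along these lines (the file name is just illustrative):

# swap in a config variant on the test node (file name illustrative)
lnetctl net del --net o2ib
lnetctl import /tmp/qib-config-1.yaml
lnetctl export > /tmp/qib-config-applied.yaml   # confirm what actually took effect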

lnet-tests-21_aug_2018.txt

Any thoughts on what we should be looking at?

Cheers,
Simon

Comment by Amir Shehata (Inactive) [ 23/Aug/18 ]

Hi Simon,

If you can get me the following info that would be great:

  1. Configuration from OPA node, router node and QLogic node (lnetctl export > config.yaml). Would be great if each one is in a separate file.
  2. Are you able to ping from the OPA -> QLOGIC and from QLOGIC -> OPA with no problem? (lnetctl ping <NID>). If you're encountering a failure with simple ping, let's turn on and capture the logging: lctl set_param debug=+"net neterror" THEN run ping test THEN lctl dk > log.dk.
  3. If the problem is not reproducible via ping, then turn on debugging as above, run a short selftest (long enough to contain errors), and capture the logging the same way - roughly the sequence sketched below.
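Roughly this sequence (substitute the NID you're testing against):

lctl set_param debug=+"net neterror"
lnetctl ping 192.168.44.199@o2ib44    # or kick off a short selftest instead
lctl dk > log.dk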

thanks

amir

Comment by SC Admin (Inactive) [ 23/Aug/18 ]

Hi Amir,

I should add: there are no issues we can see with routes being marked down on either side or with lctl pings failing. In general, everything appears OK. I wasn't sure if a really short test would capture it, so I ran the standard 5-minute test, which failed maybe 30 seconds to a minute into the run. I've attached the three configs and the dk log as requested.

Cheers,

Simon


Comment by Amir Shehata (Inactive) [ 27/Aug/18 ]

Hi Simon,

peer:
    - primary nid: 192.168.44.21@o2ib44
      Multi-Rail: False
      peer ni:
        - nid: 192.168.44.21@o2ib44
          min_tx_credits: -4815
    - primary nid: 192.168.44.22@o2ib44
      Multi-Rail: False
      peer ni:
        - nid: 192.168.44.22@o2ib44
          min_tx_credits: -4868
    - primary nid: 192.168.44.51@o2ib44
      Multi-Rail: False
      peer ni:
        - nid: 192.168.44.51@o2ib44
          state: NA
          min_tx_credits: -10849
    - primary nid: 192.168.44.52@o2ib44
      Multi-Rail: False
      peer ni:
        - nid: 192.168.44.52@o2ib44
          min_tx_credits: -12366

The above is from the export-opa config file. The min tx credits are quite low, which indicates a lot of queuing is happening on these peers. Are these peers relevant to the test you're running? They appear to be on the OPA network (o2ib44).
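If you want to watch whether those peers keep backing up while a test runs, something like this should do it (NID taken from your export as an example):

lnetctl peer show --nid 192.168.44.21@o2ib44 --verbose
# or the proc view, if present on your build:
grep 192.168.44.21 /proc/sys/lnet/peers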

I didn't see any relevant errors in the log file you sent me. Are there any other errors in /var/log/messages besides the ones you pasted?

Would you also be able to share the lnet-selftest script you're using?

Also for the QIB I see that you tried both of these configs:

              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4 

and

              peercredits_hiw: 64
              map_on_demand: 0
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 1

If you run lnet_selftest from the router to the QLogic node, do you get any errors? I'm trying to see if the problem is restricted to the path between the router under test and the node.
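For example, using the same wrapper you've been running, something like this (the router's QIB NID is a placeholder since I don't have it handy):

TM=300 LTO=192.168.55.78@o2ib LFROM=<router-qib-nid>@o2ib /opt/lustre/bin/lst-bench.sh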

My preference, though, is to stick with conns_per_peer: 1 for QLogic; conns_per_peer: 4 was intended for OPA interfaces only.

Finally, would we be able to setup a live debug session?

thanks

amir

Comment by SC Admin (Inactive) [ 28/Aug/18 ]

Hi Amir,

 

> The above is from the export-opa config file. The min tx credits are quiet low. That indicates a lot of queuing is happening on these peers. Are these peers relevant to the test you're running. They appear to be on the OPA network (o2ib44)?

These peers are not relevant to the lnet_selftest (as far as I understand). They are, however, important for actual file transfers, which is why we're going back to basic lnet_selftest runs to verify the network between fabrics.

The peers below are (respectively) MDS1, MDS2, and OSS1 and OSS2 for home, apps, etc. There are another 8 OSSs for the main large filesystem, not listed here, which use the NIDs 192.168.44.13[1-8]@o2ib44:

peer:
 - primary nid: 192.168.44.21@o2ib44
 - primary nid: 192.168.44.22@o2ib44
 - primary nid: 192.168.44.51@o2ib44
 - primary nid: 192.168.44.52@o2ib44

> I didn't see any relevant errors in the log file you sent me. Are there any other errors in /var/log/messages? besides the one you pasted?

Yeah, dmesg and /var/log/messages are really light on errors. The only errors that appeared during the test period were what I pasted in, e.g. the "failed with -103" and "failed with -110" examples.

> Would you also be able to share the lnet-selftest script you're using?

Yup. It's a pretty standard one: 

#!/bin/sh
#
# Simple wrapper script for LNET Selftest
#

# Parameters are supplied as environment variables
# The defaults are reasonable for quick verification.
# For in-depth benchmarking, increase the time (TM)
# variable to e.g. 60 seconds, and iterate over
# concurrency to find optimal values.
#
# Reference: http://wiki.lustre.org/LNET_Selftest

# Concurrency
CN=${CN:-32}
#Size
SZ=${SZ:-1M}
# Length of time to run test (secs)
TM=${TM:-10}
# Which BRW test to run (read or write)
BRW=${BRW:-"read"}
# Checksum calculation (simple or full)
CKSUM=${CKSUM:-"simple"}

# The LST "from" list -- e.g. Lustre clients. Space separated list of NIDs.
# LFROM="10.10.2.21@tcp"
LFROM=${LFROM:?ERROR: the LFROM variable is not set}
# The LST "to" list -- e.g. Lustre servers. Space separated list of NIDs.
# LTO="10.10.2.22@tcp"
LTO=${LTO:?ERROR: the LTO variable is not set}

### End of customisation.

export LST_SESSION=$$
echo LST_SESSION = ${LST_SESSION}
lst new_session lst${BRW}
lst add_group lfrom ${LFROM}
lst add_group lto ${LTO}
lst add_batch bulk_${BRW}
lst add_test --batch bulk_${BRW} --from lfrom --to lto brw ${BRW} \
  --concurrency=${CN} check=${CKSUM} size=${SZ}
lst run bulk_${BRW}
echo -n "Capturing statistics for ${TM} secs "
lst stat lfrom lto &
LSTPID=$!
# Delay loop with interval markers displayed every 5 secs.
# Test time is rounded up to the nearest 5 seconds.
i=1
j=$((${TM}/5))
if [ $((${TM}%5)) -ne 0 ]; then let j++; fi
while [ $i -le $j ]; do
  sleep 5
  let i++
done
kill ${LSTPID} && wait ${LSTPID} >/dev/null 2>&1
echo
lst show_error lfrom lto
lst stop bulk_${BRW}
lst end_session

> If you run lnet_selftest from the router to the QLOGIC node, do you get any errors? I'm trying to see if the problem is restricted between the router under test and the node.

In my testing I found that a Qlogic compute node to the Qlogic interface on the LNet router worked reliably. The same goes for OPA compute nodes to the OPA interface on the LNet router - they worked just fine. In both cases, though (now this is testing my memory!), if I mismatched the ko2iblnd settings between a compute node's and the router's respective fabric interfaces, then I would get issues (depending on which settings were mismatched), but having them matched works just fine.

Apart from the two Qlogic configs you just mentioned, I'd also tested this configuration, which likewise gave poor results when routing between fabric types. This is actually our current LNet setup on all Qlogic compute nodes, with the exception of my test host / LNet router, where I've been changing the parameters to try and figure this all out:

    - net type: o2ib
      local NI(s):
        - nid: 192.168.55.75@o2ib
          status: up
          interfaces:
              0: ib0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 4
              map_on_demand: 0
              concurrent_sends: 8
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          tcp bonding: 0
          dev cpt: 1
          CPT: "[0,1]"

> Finally, would we be able to setup a live debug session?

Not a problem at all. We're on the east coast of Australia; I can set up a live session to help debug this if you want to pick a time that suits us both.

Cheers,
Simon

Comment by Amir Shehata (Inactive) [ 29/Aug/18 ]

Does 4pm PST / 9am your time work? If so, let me know a date that works for you. We'd need to be able to share screens or something of that sort to debug further.


Comment by SC Admin (Inactive) [ 05/Sep/18 ]

Hi Amir,

Yep. That time will work. I'll email you through some details for a meeting with prospective dates.

Cheers,
Simon

Comment by Peter Jones [ 15/Sep/18 ]

Has this proposed meeting taken place yet?
