[LU-11230] QIB route to OPA LNet drops / selftest fail Created: 09/Aug/18 Updated: 01/Oct/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | SC Admin (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | LNet, lnet-testing | ||
| Environment: |
OPA, lustre 2.10.4, Qlogic QDR IB, Centos 7.5 |
||
| Attachments: |
|
| Epic/Theme: | lnet |
| Severity: | 3 |
| Epic: | lnet, performance |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi Folks,

Looking for some assistance on this one. We're having trouble with reliable LNet routing between Qlogic and OPA clients. Basically, we see long pauses in I/O transfers when moving data between the two fabric types. Testing with lnet_selftest has shown that over many hours, some tests (300-second runs) will randomly fail. In recent testing over the last few nights it seems to fail more often when LTO = OPA and LFROM = QIB. As far as I can tell, buffers, lnetctl stats, etc. look fine during transfers, then suddenly msgs_alloc and the /proc/sys/lnet/peers queue drop to zero right when lnet_selftest starts showing zero-sized transfers.

For LNet settings: with mismatched settings (i.e. ko2iblnd settings that aren't the same), LNet router OPA <-> compute/storage node OPA would basically always give me errors. With matched and 'Intel optimized' settings I've not yet seen it fail. Ethernet routing to OPA also seems to work fine. We have the QIB's LNet configuration set to the same as the other nodes on the QIB fabric. I'll attach the config to this ticket if that helps, in case we have some settings incorrectly applied to one of the IB nets. Are there any special settings we need to apply when routing between the old and new 'Truescale' fabrics?

Shortened example of a failed selftest:

[root@gstar057 ~]# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.77@o2ib /lfs/data0/lst-bench.sh
LST_SESSION = 755
SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
192.168.55.77@o2ib are added to session
192.168.44.199@o2ib44 are added to session
Test was added successfully
bulk_read is running now
Capturing statistics for 300 secs
[LNet Rates of lfrom]
[R] Avg: 3163 RPC/s Min: 3163 RPC/s Max: 3163 RPC/s
[W] Avg: 1580 RPC/s Min: 1580 RPC/s Max: 1580 RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 1581.81 MiB/s Min: 1581.81 MiB/s Max: 1581.81 MiB/s
[W] Avg: 0.24 MiB/s Min: 0.24 MiB/s Max: 0.24 MiB/s
etc...
[LNet Bandwidth of lfrom]
[R] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
[W] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
[LNet Rates of lto]
[R] Avg: 0 RPC/s Min: 0 RPC/s Max: 0 RPC/s
[W] Avg: 0 RPC/s Min: 0 RPC/s Max: 0 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00 MiB/s Min: 0.00 MiB/s Max: 0.00 MiB/s
[W] Avg: 0.00 MiB/s Min: 0.00 MiB/s Max: 0.00 MiB/s
lfrom:
12345-192.168.55.77@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 18 errors, 0 dropped, 94 expired]
Total 1 error nodes in lfrom
lto:
Total 0 error nodes in lto
Batch is stopped
session is ended

A bit of help with this would be really appreciated. Let me know which logs would be the most helpful, e.g. repeating tests with debug flags enabled can be done if that helps. I could certainly have made a configuration error - if something doesn't look right with the lnet.conf, let me know. We can't seem to find any ko2iblnd settings that are reliable.

Cheers,
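For reference, counters like msgs_alloc and the peers queue can be watched during a run with a loop along these lines (a minimal sketch only; the interval and output path are illustrative, and some builds expose the peers table under /sys/kernel/debug/lnet rather than /proc/sys/lnet):

# Watch LNet message allocation and per-peer credit queues while lst-bench.sh runs.
while true; do
    date
    lnetctl stats show          # msgs_alloc, send/recv counts
    cat /proc/sys/lnet/peers    # per-peer credits and queue depth
    sleep 5
done | tee /tmp/lnet-monitor.log

|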
| Comments |
| Comment by Peter Jones [ 09/Aug/18 ] |
|
Amir, could you please help here? Thanks, Peter |
| Comment by SC Admin (Inactive) [ 22/Aug/18 ] |
|
Hi Guys,

To update this: I went through all the scenarios, doing a 5 min selftest for each combination of eth/qdr/opa via our routers. This included tests between a node of each fabric type and the router's respective HCA/NIC, and between nodes on different fabrics. The common factor in each failure event is the Qlogic HCA. We cannot reliably route between Qlogic and Ethernet or OPA. We can route fine between Ethernet and OPA / Ethernet.

Failed selftests show up like this in dmesg or the message logs:

Eg. QDR <-> OPA Test 2:
LTO - OPA Compute node
LFROM - Qlogic Compute node

cmdline# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.78@o2ib /opt/lustre/bin/lst-bench.sh
..snip.
[LNet Bandwidth of lfrom]
[R] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
[W] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
[LNet Rates of lto]
[R] Avg: 2 RPC/s Min: 2 RPC/s Max: 2 RPC/s
[W] Avg: 2 RPC/s Min: 2 RPC/s Max: 2 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00 MiB/s Min: 0.00 MiB/s Max: 0.00 MiB/s
[W] Avg: 0.00 MiB/s Min: 0.00 MiB/s Max: 0.00 MiB/s
lfrom:
12345-192.168.55.78@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped, 31 expired]
Total 1 error nodes in lfrom
lto:
Total 0 error nodes in lto
Batch is stopped
session is ended
[root@john99 ~]#

LFROM node dmesg:
LustreError: 1512:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -103
LNet: 1514:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-192.168.44.199@o2ib44, timeout 64.
LustreError: 1509:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -110
LustreError: 1510:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -110
LustreError: 1509:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 29 previous similar messages

Or.. Eth <-> Qlogic Test 2:
LTO - Qlogic Compute node
LFROM - VM with Mellanox 100G NIC

cmdline# TM=300 LTO=192.168.55.78@o2ib LFROM=10.8.49.155@tcp201 /opt/lustre/bin/lst-bench.sh
..snip.
[LNet Bandwidth of lfrom]
[R] Avg: 0.00 MiB/s Min: 0.00 MiB/s Max: 0.00 MiB/s
[W] Avg: 0.00 MiB/s Min: 0.00 MiB/s Max: 0.00 MiB/s
[LNet Rates of lto]
[R] Avg: 0 RPC/s Min: 0 RPC/s Max: 0 RPC/s
[W] Avg: 0 RPC/s Min: 0 RPC/s Max: 0 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00 MiB/s Min: 0.00 MiB/s Max: 0.00 MiB/s
[W] Avg: 0.00 MiB/s Min: 0.00 MiB/s Max: 0.00 MiB/s
lfrom:
12345-10.8.49.155@tcp201: [Session 32 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 64 expired]
Total 1 error nodes in lfrom
lto:
12345-192.168.55.78@o2ib: [Session 0 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped, 63 expired]
Total 1 error nodes in lto
Batch is stopped
session is ended
[root@john99 ~]#

LTO node dmesg:
[Tue Aug 21 22:26:11 2018] LNet: 25532:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-192.168.55.78@o2ib, timeout 64.
[Tue Aug 21 22:26:11 2018] LNet: 25532:0:(rpc.c:1069:srpc_client_rpc_expired()) Skipped 31 previous similar messages
[Tue Aug 21 22:26:11 2018] LustreError: 25512:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.55.78@o2ib failed with -110
[Tue Aug 21 22:26:11 2018] LustreError: 25512:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 31 previous similar messages

Summary of passed tests:
QDR <-> QDR Test 1: LTO - Qlogic Compute node, LFROM - Qlogic Lnet router HCA
QDR <-> QDR Test 2: LTO - Qlogic Lnet router HCA, LFROM - Qlogic Compute node
OPA <-> OPA Test 1: LTO - OPA Compute node, LFROM - OPA Lnet router HCA
OPA <-> OPA Test 2: LTO - OPA Lnet router HCA, LFROM - OPA Compute node
Ethernet <-> Ethernet Test 1: LTO - VM with Mellanox 100G NIC, LFROM - Lnet router with Mellanox 100G NIC
Ethernet <-> Ethernet Test 2: LTO - Lnet router with Mellanox 100G NIC, LFROM - VM with Mellanox 100G NIC
QDR <-> OPA Test 1: LTO - Qlogic Compute node, LFROM - OPA Compute node
Eth <-> OPA Test 1: LTO - VM with Mellanox 100G NIC, LFROM - OPA Compute node
Eth <-> OPA Test 2: LTO - VM with Mellanox 100G NIC, LFROM - OPA Compute node

Summary of failed tests:
QDR <-> OPA Test 2: LTO - OPA Compute node, LFROM - Qlogic Compute node
Eth <-> Qlogic Test 1: LTO - VM with Mellanox 100G NIC, LFROM - Qlogic Compute node
Eth <-> Qlogic Test 2: LTO - Qlogic Compute node, LFROM - VM with Mellanox 100G NIC

I modified one of our compute nodes today and re-configured the Qlogic HCAs on that node (as well as the Qlogic HCA on the router). Running either of the following lnetctl net configurations for the Qlogic HCA showed the same failed results as above. Selftests within Qlogic only, on either of these configs, work without fail; the problems are only between Qlogic and some other fabric type.

Config 1:
- net type: o2ib
local NI(s):
- nid: 192.168.55.231@o2ib
status: up
interfaces:
0: ib0
tunables:
peer_timeout: 180
peer_credits: 128
peer_buffer_credits: 0
credits: 1024
lnd tunables:
peercredits_hiw: 64
map_on_demand: 32
concurrent_sends: 256
fmr_pool_size: 2048
fmr_flush_trigger: 512
fmr_cache: 1
ntx: 2048
conns_per_peer: 4
tcp bonding: 0
dev cpt: 1
CPT: "[0,1]"
Config 2:
- net type: o2ib
local NI(s):
- nid: 192.168.55.231@o2ib
status: up
interfaces:
0: ib0
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
lnd tunables:
peercredits_hiw: 4
map_on_demand: 0
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
tcp bonding: 0
dev cpt: 1
CPT: "[0,1]"
Any thoughts on what we should be looking at? Cheers, |
| Comment by Amir Shehata (Inactive) [ 23/Aug/18 ] |
|
Hi Simon, If you can get me the following info that would be great:
thanks amir |
| Comment by SC Admin (Inactive) [ 23/Aug/18 ] |
|
Hi Amir,

I should add: there are no issues we can see with routes being marked down on either side, or with lctl pings failing. In general, everything appears OK. I wasn't sure if a really short test would capture it, so I ran the standard 5 min test, which failed maybe 30 seconds to a minute into the test. I've attached three configs and the dk log as requested.

Cheers,
Simon
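(For anyone reproducing this: configs and dk logs of this sort are typically gathered along these lines; the paths are illustrative, and the debug mask can be widened beyond "net" if needed.)

# On each node of interest, dump the running LNet configuration.
lnetctl export > /tmp/$(hostname)-lnet-export.yaml

# Around a failing selftest: enable net debugging, clear the kernel debug
# buffer, run lst-bench.sh, then dump the debug log.
lctl set_param debug=+net
lctl clear
# ... run the selftest here ...
lctl dk > /tmp/$(hostname)-dk.log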
|
| Comment by Amir Shehata (Inactive) [ 27/Aug/18 ] |
|
Hi Simon,
peer:
- primary nid: 192.168.44.21@o2ib44
Multi-Rail: False
peer ni:
- nid: 192.168.44.21@o2ib44
min_tx_credits: -4815
- primary nid: 192.168.44.22@o2ib44
Multi-Rail: False
peer ni:
- nid: 192.168.44.22@o2ib44
min_tx_credits: -4868
- primary nid: 192.168.44.51@o2ib44
Multi-Rail: False
peer ni:
- nid: 192.168.44.51@o2ib44
state: NA
min_tx_credits: -10849
- primary nid: 192.168.44.52@o2ib44
Multi-Rail: False
peer ni:
- nid: 192.168.44.52@o2ib44
min_tx_credits: -12366
The above is from the export-opa config file. The min tx credits are quite low, which indicates a lot of queuing is happening on these peers. Are these peers relevant to the test you're running? They appear to be on the OPA network (o2ib44). I didn't see any relevant errors in the log file you sent me. Are there any other errors in /var/log/messages besides the ones you pasted? Would you also be able to share the lnet-selftest script you're using? Also, for the QIB I see that you tried both of these configs:
peercredits_hiw: 64
map_on_demand: 32
concurrent_sends: 256
fmr_pool_size: 2048
fmr_flush_trigger: 512
fmr_cache: 1
ntx: 2048
conns_per_peer: 4
and
peercredits_hiw: 64
map_on_demand: 0
concurrent_sends: 256
fmr_pool_size: 2048
fmr_flush_trigger: 512
fmr_cache: 1
ntx: 2048
conns_per_peer: 1
If you run lnet_selftest from the router to the QLOGIC node, do you get any errors? I'm trying to see if the problem is restricted to the path between the router under test and the node. My preference, though, is to stick with conns_per_peer: 1 for QLOGIC; the conns_per_peer: 4 setting was intended for OPA interfaces only. Finally, would we be able to set up a live debug session? thanks amir
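(A direct router-to-QLOGIC run with the same wrapper script would look roughly like this; the NIDs are placeholders for the router's QIB interface and a QIB compute node:)

# Point-to-point selftest: QIB compute node -> QIB HCA on the router, no routing involved.
# Replace the placeholder NIDs with the real ones before running.
TM=300 BRW=read \
  LTO=<router-qib-ip>@o2ib \
  LFROM=<qib-node-ip>@o2ib \
  /opt/lustre/bin/lst-bench.sh

|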
| Comment by SC Admin (Inactive) [ 28/Aug/18 ] |
|
Hi Amir,
> The above is from the export-opa config file. The min tx credits are quite low. That indicates a lot of queuing is happening on these peers. Are these peers relevant to the test you're running? They appear to be on the OPA network (o2ib44)?

These peers are not relevant for the purposes of the lnet_selftest (from my understanding). They are, however, important for actual file transfers, which is why we're going back to basic lnet_selftests to verify the network between fabrics. The peers below are (respectively) MDS1, MDS2, OSS1 for home & apps etc., and OSS2 for home & apps etc. There are another 8 x OSSs for the main large filesystem, not mentioned here, which use IPs 192.168.44.13[1-8]@o2ib44:

peer:
- primary nid: 192.168.44.21@o2ib44
- primary nid: 192.168.44.22@o2ib44
- primary nid: 192.168.44.51@o2ib44
- primary nid: 192.168.44.52@o2ib44

> I didn't see any relevant errors in the log file you sent me. Are there any other errors in /var/log/messages besides the one you pasted?

Yeah, dmesg and /var/log/messages are really light on errors. The only errors that appear during the test period were what I pasted in, e.g. the "failed with -103" and "failed with -110" examples.

> Would you also be able to share the lnet-selftest script you're using?

Yup. It's a pretty standard one:

#!/bin/sh
#
# Simple wrapper script for LNET Selftest
#
# Parameters are supplied as environment variables
# The defaults are reasonable for quick verification.
# For in-depth benchmarking, increase the time (TM)
# variable to e.g. 60 seconds, and iterate over
# concurrency to find optimal values.
#
# Reference: http://wiki.lustre.org/LNET_Selftest

# Concurrency
CN=${CN:-32}
# Size
SZ=${SZ:-1M}
# Length of time to run test (secs)
TM=${TM:-10}
# Which BRW test to run (read or write)
BRW=${BRW:-"read"}
# Checksum calculation (simple or full)
CKSUM=${CKSUM:-"simple"}
# The LST "from" list -- e.g. Lustre clients. Space separated list of NIDs.
# LFROM="10.10.2.21@tcp"
LFROM=${LFROM:?ERROR: the LFROM variable is not set}
# The LST "to" list -- e.g. Lustre servers. Space separated list of NIDs.
# LTO="10.10.2.22@tcp"
LTO=${LTO:?ERROR: the LTO variable is not set}

### End of customisation.

export LST_SESSION=$$
echo LST_SESSION = ${LST_SESSION}
lst new_session lst${BRW}
lst add_group lfrom ${LFROM}
lst add_group lto ${LTO}
lst add_batch bulk_${BRW}
lst add_test --batch bulk_${BRW} --from lfrom --to lto brw ${BRW} \
  --concurrency=${CN} check=${CKSUM} size=${SZ}
lst run bulk_${BRW}
echo -n "Capturing statistics for ${TM} secs "
lst stat lfrom lto &
LSTPID=$!
# Delay loop with interval markers displayed every 5 secs.
# Test time is rounded up to the nearest 5 seconds.
i=1
j=$((${TM}/5))
if [ $((${TM}%5)) -ne 0 ]; then let j++; fi
while [ $i -le $j ]; do
  sleep 5
  let i++
done
kill ${LSTPID} && wait ${LSTPID} >/dev/null 2>&1
echo
lst show_error lfrom lto
lst stop bulk_${BRW}
lst end_session

> If you run lnet_selftest from the router to the QLOGIC node, do you get any errors? I'm trying to see if the problem is restricted between the router under test and the node.

In my testing I found that a Qlogic compute node to the Qlogic interface on the lnet router proved to be working reliably. The same goes for OPA compute nodes to the OPA interface on the lnet router - they worked just fine. In both cases though (now this is testing my memory!), if I had mismatched the ko2iblnd settings between a compute node's and the router's respective fabric interfaces then I would get issues (depending on which settings were mismatched), but having them matched works just fine.

Apart from the two Qlogic configs you just mentioned, I'd also tested the configuration below, which also gave poor results with routing between fabric types. This is actually our current LNet setup on all Qlogic compute nodes, with the exception of my test host / lnet router where I've been going through changing the parameters to try and figure this all out:

- net type: o2ib
local NI(s):
- nid: 192.168.55.75@o2ib
status: up
interfaces:
0: ib0
tunables:
peer_timeout: 180
peer_credits: 8
peer_buffer_credits: 0
credits: 256
lnd tunables:
peercredits_hiw: 4
map_on_demand: 0
concurrent_sends: 8
fmr_pool_size: 512
fmr_flush_trigger: 384
fmr_cache: 1
ntx: 512
conns_per_peer: 1
tcp bonding: 0
dev cpt: 1
CPT: "[0,1]"
> Finally, would we be able to setup a live debug session?

Not a problem at all. We're east coast Australia; I can set up a live session to help debug this if you want to pick a time that suits us both.

Cheers, |
| Comment by Amir Shehata (Inactive) [ 29/Aug/18 ] |
|
Does 4pm PST / 9am your time work? If so, let me know a date that works for you. We'd need to be able to share screens, or something of that sort, to debug further.
|
| Comment by SC Admin (Inactive) [ 05/Sep/18 ] |
|
Hi Amir, Yep. That time will work. I'll email you through some details for a meeting with prospective dates. Cheers, |
| Comment by Peter Jones [ 15/Sep/18 ] |
|
Has this proposed meeting taken place yet? |