Lustre / LU-11230

QIB route to OPA LNet drops / selftest fail

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.4
    • Environment: OPA, Lustre 2.10.4, QLogic QDR IB, CentOS 7.5

    Description

      Hi Folks,

      Looking for some assistance on this one. We're having trouble with reliable LNet routing between Qlogic and OPA clients. Basically, we see long pauses in I/O transfers when moving data between the two fabric types. Testing with lnet_selftest has shown that, over many hours, some tests (300-second runs) will randomly fail.

      In recent testing over the last few nights it seems to fail more often when LTO = OPA and LFROM = QIB. As far as I can tell, buffers, lnetctl stats, etc. look fine during transfers, then suddenly msgs_alloc and the /proc/sys/lnet/peers queues drop to zero right when lnet_selftest starts showing zero-sized transfers.
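
      (For reference, a minimal sketch of one way to watch those counters during a run; it assumes stock lnetctl and the /proc/sys/lnet interface on these 2.10 nodes, and the 5-second interval is arbitrary.)

      # Sketch: poll LNet message allocation and per-peer queues while
      # lnet_selftest runs.
      while true; do
          date
          lnetctl stats show            # msgs_alloc / msgs_max / drop counters
          cat /proc/sys/lnet/peers      # per-peer credits and queue depths
          sleep 5
      done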

      For LNet settings: with mismatched settings (i.e. ko2iblnd settings that aren't the same), LNet router OPA <-> compute/storage node OPA would basically always give me errors. With matched and 'Intel optimized' settings I've not yet seen it fail. Ethernet routing to OPA also seems to work fine.
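
      (For illustration, 'matched' here means the ko2iblnd module parameters are identical on both ends of a given fabric. A sketch of pinning them via modprobe follows; the values are examples only, borrowed from the OPA-style tunables discussed later in this ticket, not a recommendation.)

      # /etc/modprobe.d/ko2iblnd.conf -- sketch only; keep the values identical
      # on the router's OPA NI and the OPA compute/storage nodes.
      options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 \
          concurrent_sends=256 map_on_demand=32 fmr_pool_size=2048 \
          fmr_flush_trigger=512 fmr_cache=1 ntx=2048 conns_per_peer=4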

      We have the QIB nodes' LNet configuration set the same as the other nodes on the QIB fabric. I'll attach the config to this ticket in case we have some settings incorrectly applied to one of the IB nets.

      Are there any special settings we need to apply when routing between the old QLogic 'TrueScale' fabric and the new OPA fabric?

      Shortened example of failed selftest:

      [root@gstar057 ~]# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.77@o2ib /lfs/data0/lst-bench.sh
      LST_SESSION = 755
      SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
      192.168.55.77@o2ib are added to session
      192.168.44.199@o2ib44 are added to session
      Test was added successfully
      bulk_read is running now
      Capturing statistics for 300 secs [LNet Rates of lfrom]
      [R] Avg: 3163     RPC/s Min: 3163     RPC/s Max: 3163     RPC/s
      [W] Avg: 1580     RPC/s Min: 1580     RPC/s Max: 1580     RPC/s
      [LNet Bandwidth of lfrom]
      [R] Avg: 1581.81  MiB/s Min: 1581.81  MiB/s Max: 1581.81  MiB/s
      [W] Avg: 0.24     MiB/s Min: 0.24     MiB/s Max: 0.24     MiB/s
      
      etc...
      
      [LNet Bandwidth of lfrom]
      [R] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
      [W] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
      [LNet Rates of lto]
      [R] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
      [W] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
      [LNet Bandwidth of lto]
      [R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
      [W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
      
      lfrom:
      12345-192.168.55.77@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 18 errors, 0 dropped, 94 expired]
      Total 1 error nodes in lfrom
      lto:
      Total 0 error nodes in lto
      Batch is stopped
      session is ended
      

      A bit of help with this would be really appreciated. Let me know which logs would be most helpful; e.g. repeating tests with debug flags enabled can be done if that helps. I could certainly have made a configuration error, so if something doesn't look right with the lnet.conf let me know. We can't seem to find any ko2iblnd settings that are reliable.

      Cheers,
      Simon

      Attachments

        1. dk.log.xz
          6.23 MB
        2. lnet.conf
          2 kB
        3. lnetctl_export_lnet-router.txt
          16 kB
        4. lnetctl_export_opa.txt
          10 kB
        5. lnetctl_export_qlogic.txt
          4 kB
        6. lnet-tests-21_aug_2018.txt
          12 kB

        Activity

          pjones Peter Jones added a comment -

          Has this proposed meeting taken place yet?

          scadmin SC Admin added a comment -

          Hi Amir,

          Yep. That time will work. I'll email you through some details for a meeting with prospective dates.

          Cheers,
          Simon


          ashehata Amir Shehata (Inactive) added a comment -

          Does 4pm PST / 9am (your time) work? If so, let me know the date that works for you. We would need to be able to share screens or something of that sort to debug further.

           

          scadmin SC Admin added a comment -

          Hi Amir,

           

          > The above is from the export-opa config file. The min tx credits are quite low. That indicates a lot of queuing is happening on these peers. Are these peers relevant to the test you're running? They appear to be on the OPA network (o2ib44).

          These peers are not relevant for the purposes of the lnet_selftest (to my understanding). They are, however, important for actual file transfers, which is why we're going back to basic lnet_selftest runs to verify the network between fabrics.

          The peers below are (respectively) MDS1, MDS2, OSS1 for home, apps, etc., and OSS2 for home, apps, etc. There are another 8 OSSes for the main large filesystem, not listed here, which use IPs 192.168.44.13[1-8]@o2ib44:

          peer:
           - primary nid: 192.168.44.21@o2ib44
           - primary nid: 192.168.44.22@o2ib44
           - primary nid: 192.168.44.51@o2ib44
           - primary nid: 192.168.44.52@o2ib44
          

          > I didn't see any relevant errors in the log file you sent me. Are there any other errors in /var/log/messages besides the ones you pasted?

          Yeah, dmesg and /var/log/messages are really light on errors. The only errors that appear during the test period were the ones I pasted in, e.g. the "failed with -103" and "failed with -110" examples.

          > Would you also be able to share the lnet-selftest script you're using?

          Yup. It's a pretty standard one: 

          #!/bin/sh
          #
          # Simple wrapper script for LNET Selftest
          #
          
          # Parameters are supplied as environment variables
          # The defaults are reasonable for quick verification.
          # For in-depth benchmarking, increase the time (TM)
          # variable to e.g. 60 seconds, and iterate over
          # concurrency to find optimal values.
          #
          # Reference: http://wiki.lustre.org/LNET_Selftest
          
          # Concurrency
          CN=${CN:-32}
          #Size
          SZ=${SZ:-1M}
          # Length of time to run test (secs)
          TM=${TM:-10}
          # Which BRW test to run (read or write)
          BRW=${BRW:-"read"}
          # Checksum calculation (simple or full)
          CKSUM=${CKSUM:-"simple"}
          
          # The LST "from" list -- e.g. Lustre clients. Space separated list of NIDs.
          # LFROM="10.10.2.21@tcp"
          LFROM=${LFROM:?ERROR: the LFROM variable is not set}
          # The LST "to" list -- e.g. Lustre servers. Space separated list of NIDs.
          # LTO="10.10.2.22@tcp"
          LTO=${LTO:?ERROR: the LTO variable is not set}
          
          ### End of customisation.
          
          export LST_SESSION=$$
          echo LST_SESSION = ${LST_SESSION}
          lst new_session lst${BRW}
          lst add_group lfrom ${LFROM}
          lst add_group lto ${LTO}
          lst add_batch bulk_${BRW}
          lst add_test --batch bulk_${BRW} --from lfrom --to lto brw ${BRW} \
            --concurrency=${CN} check=${CKSUM} size=${SZ}
          lst run bulk_${BRW}
          echo -n "Capturing statistics for ${TM} secs "
          lst stat lfrom lto &
          LSTPID=$!
          # Delay loop with interval markers displayed every 5 secs.
          # Test time is rounded up to the nearest 5 seconds.
          i=1
          j=$((${TM}/5))
          if [ $((${TM}%5)) -ne 0 ]; then let j++; fi
          while [ $i -le $j ]; do
            sleep 5
            let i++
          done
          kill ${LSTPID} && wait ${LSTPID} >/dev/null 2>&1
          echo
          lst show_error lfrom lto
          lst stop bulk_${BRW}
          lst end_session
          

          > If you run lnet_selftest from the router to the QLOGIC node, do you get any errors? I'm trying to see if the problem is restricted between the router under test and the node.

          In my testing I found that a Qlogic compute node to the Qlogic interface on the LNet router worked reliably. The same goes for OPA compute nodes to the OPA interface on the LNet router: they worked just fine. In both cases, though (now this is testing my memory!), if I had mismatched the ko2iblnd settings between a compute node's and the router's respective fabric interfaces then I would get issues (depending on which settings were mismatched), but having them matched works just fine.

          Apart from the two Qlogic configs you just mentioned, I'd also tested the configuration below, which also gave poor results when routing between fabric types. This is actually our current LNet setup on all Qlogic compute nodes, with the exception of my test host / LNet router, where I've been changing the parameters to try and figure this all out:

              - net type: o2ib
                local NI(s):
                  - nid: 192.168.55.75@o2ib
                    status: up
                    interfaces:
                        0: ib0
                    tunables:
                        peer_timeout: 180
                        peer_credits: 8
                        peer_buffer_credits: 0
                        credits: 256
                    lnd tunables:
                        peercredits_hiw: 4
                        map_on_demand: 0
                        concurrent_sends: 8
                        fmr_pool_size: 512
                        fmr_flush_trigger: 384
                        fmr_cache: 1
                        ntx: 512
                        conns_per_peer: 1
                    tcp bonding: 0
                    dev cpt: 1
                    CPT: "[0,1]"
          

          > Finally, would we be able to set up a live debug session?

          Not a problem at all. We're on the east coast of Australia; I can set up a live session to help debug this if you want to pick a time that suits us both.

          Cheers,
          Simon

          ashehata Amir Shehata (Inactive) added a comment - edited

          Hi Simon,

          peer:
              - primary nid: 192.168.44.21@o2ib44
                Multi-Rail: False
                peer ni:
                  - nid: 192.168.44.21@o2ib44
                    min_tx_credits: -4815
              - primary nid: 192.168.44.22@o2ib44
                Multi-Rail: False
                peer ni:
                  - nid: 192.168.44.22@o2ib44
                    min_tx_credits: -4868
              - primary nid: 192.168.44.51@o2ib44
                Multi-Rail: False
                peer ni:
                  - nid: 192.168.44.51@o2ib44
                    state: NA
                    min_tx_credits: -10849
              - primary nid: 192.168.44.52@o2ib44
                Multi-Rail: False
                peer ni:
                  - nid: 192.168.44.52@o2ib44
                    min_tx_credits: -12366
          

          The above is from the export-opa config file. The min tx credits are quite low. That indicates a lot of queuing is happening on these peers. Are these peers relevant to the test you're running? They appear to be on the OPA network (o2ib44).
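
          (A minimal sketch of how that credit state can be inspected on the node, assuming the verbose peer output available in this 2.10 lnetctl build:)

          # Sketch: min_tx_credits going strongly negative means sends have been
          # queuing behind exhausted peer_credits for that peer.
          lnetctl peer show -v              # verbose per-peer counters
          lnetctl export > export-opa.txt   # full snapshot, as attached here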

          I didn't see any relevant errors in the log file you sent me. Are there any other errors in /var/log/messages besides the ones you pasted?

          Would you also be able to share the lnet-selftest script you're using?

          Also for the QIB I see that you tried both of these configs:

                        peercredits_hiw: 64
                        map_on_demand: 32
                        concurrent_sends: 256
                        fmr_pool_size: 2048
                        fmr_flush_trigger: 512
                        fmr_cache: 1
                        ntx: 2048
                        conns_per_peer: 4 

          and

                        peercredits_hiw: 64
                        map_on_demand: 0
                        concurrent_sends: 256
                        fmr_pool_size: 2048
                        fmr_flush_trigger: 512
                        fmr_cache: 1
                        ntx: 2048
                        conns_per_peer: 1

          If you run lnet_selftest from the router to the QLOGIC node, do you get any errors? I'm trying to see if the problem is restricted between the router under test and the node.

          My preference, though, is to stick with conns_per_peer: 1 for QLOGIC; the conns_per_peer: 4 setting was intended for OPA interfaces only.
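
          (Illustrative only: one way to express that split is per-NI lnd tunables in the YAML that lnetctl imports, as in the exports above, so the QLogic and OPA NIs carry different values. Fragment only; the interface names are placeholders.)

          # lnd-tunables.yaml (fragment, placeholder interface names); it would be
          # applied with: lnetctl import < lnd-tunables.yaml
          net:
              - net type: o2ib
                local NI(s):
                  - interfaces:
                        0: ib0
                    lnd tunables:
                        conns_per_peer: 1
              - net type: o2ib44
                local NI(s):
                  - interfaces:
                        0: ib1
                    lnd tunables:
                        conns_per_peer: 4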

          Finally, would we be able to set up a live debug session?

          thanks

          amir

          scadmin SC Admin added a comment -

          Hi Amir,

          I should add: there are no issues we can see with routes being marked down on either side or lctl pings failing. In general, everything appears OK. I wasn't sure if a really short test would capture it, so I ran the standard 5-minute test, which failed maybe 30 seconds to a minute in. I've attached three configs and the dk log as requested.
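
          (For completeness, a sketch of the checks being referred to; the NIDs are the ones used in the selftest runs above:)

          # Sketch: confirm routes stay up and basic LNet pings cross the router.
          lnetctl route show -v               # gateway/route state should stay up
          lnetctl ping 192.168.44.199@o2ib44  # OPA-side NID from the selftests
          lctl ping 192.168.55.77@o2ib        # QIB-side NID from the selftests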

          Cheers,

          Simon

           


          ashehata Amir Shehata (Inactive) added a comment -

          Hi Simon,

          If you can get me the following info that would be great:

          1. Configuration from the OPA node, router node and QLogic node (lnetctl export > config.yaml). It would be great if each one is in a separate file.
          2. Are you able to ping from OPA -> QLOGIC and from QLOGIC -> OPA with no problem? (lnetctl ping <NID>). If you're encountering a failure with a simple ping, let's turn on and capture the logging: lctl set_param debug=+"net neterror", THEN run the ping test, THEN lctl dk > log.dk.
          3. If the problem is not reproducible via ping, then turn on debugging as above, run a short selftest run (which would contain errors) and then capture the logging (see the sketch after this list).
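
          (A sketch of that capture sequence, using the commands from item 2; log.dk is the output file name suggested there:)

          lctl set_param debug=+"net neterror"   # enable net/neterror debugging
          # ... run lnetctl ping <NID>, or a short lnet_selftest run ...
          lctl dk > log.dk                       # dump and clear the debug log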

          thanks

          amir

          scadmin SC Admin added a comment -

          Hi Guys,

          To update this: I went through all the scenarios, doing a 5-minute selftest for each combination of eth/qdr/opa via our routers. This included tests between a node of each fabric type and the router's respective HCA/NIC, and between nodes on different fabrics. The common factor in each failure event is the Qlogic HCA. We cannot reliably route between Qlogic and Ethernet or OPA. We can route fine between Ethernet and OPA, and Ethernet to Ethernet. Failed selftests show up like this in dmesg or the message logs:

          Eg.

          QDR <-> OPA Test 2:
          LTO - OPA Compute node
          LFROM - Qlogic Compute node
          
          cmdline# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.78@o2ib /opt/lustre/bin/lst-bench.sh
          
          ..snip.
          [LNet Bandwidth of lfrom]
          [R] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
          [W] Avg: 0.01     MiB/s Min: 0.01     MiB/s Max: 0.01     MiB/s
          [LNet Rates of lto]
          [R] Avg: 2        RPC/s Min: 2        RPC/s Max: 2        RPC/s
          [W] Avg: 2        RPC/s Min: 2        RPC/s Max: 2        RPC/s
          [LNet Bandwidth of lto]
          [R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          [W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          
          lfrom:
          12345-192.168.55.78@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped, 31 expired]
          Total 1 error nodes in lfrom
          lto:
          Total 0 error nodes in lto
          Batch is stopped
          session is ended
          [root@john99 ~]#
          
          
          
          LFROM node dmesg:
          
          LustreError: 1512:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -103
          LNet: 1514:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-192.168.44.199@o2ib44, timeout 64.
          LustreError: 1509:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -110
          LustreError: 1510:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.44.199@o2ib44 failed with -110
          LustreError: 1509:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 29 previous similar messages
          

          Or..

          Eth <-> Qlogic Test 2:
          LTO - Qlogic Compute node
          LFROM - VM with Mellanox 100G NIC
          
          cmdline# TM=300 LTO=192.168.55.78@o2ib LFROM=10.8.49.155@tcp201 /opt/lustre/bin/lst-bench.sh
          
          ..snip.
          [LNet Bandwidth of lfrom]
          [R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          [W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          [LNet Rates of lto]
          [R] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
          [W] Avg: 0        RPC/s Min: 0        RPC/s Max: 0        RPC/s
          [LNet Bandwidth of lto]
          [R] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          [W] Avg: 0.00     MiB/s Min: 0.00     MiB/s Max: 0.00     MiB/s
          
          lfrom:
          12345-10.8.49.155@tcp201: [Session 32 brw errors, 0 ping errors] [RPC: 0 errors, 0 dropped, 64 expired]
          Total 1 error nodes in lfrom
          lto:
          12345-192.168.55.78@o2ib: [Session 0 brw errors, 0 ping errors] [RPC: 1 errors, 0 dropped, 63 expired]
          Total 1 error nodes in lto
          Batch is stopped
          session is ended
          [root@john99 ~]# 
          
          
          LTO node dmesg:
          
          [Tue Aug 21 22:26:11 2018] LNet: 25532:0:(rpc.c:1069:srpc_client_rpc_expired()) Client RPC expired: service 11, peer 12345-192.168.55.78@o2ib, timeout 64.
          [Tue Aug 21 22:26:11 2018] LNet: 25532:0:(rpc.c:1069:srpc_client_rpc_expired()) Skipped 31 previous similar messages
          [Tue Aug 21 22:26:11 2018] LustreError: 25512:0:(brw_test.c:344:brw_client_done_rpc()) BRW RPC to 12345-192.168.55.78@o2ib failed with -110
          [Tue Aug 21 22:26:11 2018] LustreError: 25512:0:(brw_test.c:344:brw_client_done_rpc()) Skipped 31 previous similar messages
          

          Summary of passed tests:

          QDR <-> QDR Test 1:
          LTO - Qlogic Compute node
          LFROM - Qlogic Lnet router HCA
          
          QDR <-> QDR Test 2:
          LTO - Qlogic Lnet router HCA
          LFROM - Qlogic Compute node
          
          OPA <-> OPA Test 1:
          LTO - OPA Compute node
          LFROM - OPA Lnet router HCA
          
          OPA <-> OPA Test 2:
          LTO - OPA Lnet router HCA
          LFROM - OPA Compute node
          
          Ethernet <-> Ethernet Test 1:
          LTO - VM with Mellanox 100G NIC
          LFROM - Lnet router with Mellanox 100G NIC
          
          Ethernet <-> Ethernet Test 2:
          LTO - Lnet router with Mellanox 100G NIC
          LFROM - VM with Mellanox 100G NIC
          
          QDR <-> OPA Test 1:
          LTO - Qlogic Compute node
          LFROM - OPA Compute node
          
          Eth <-> OPA Test 1:
          LTO - VM with Mellanox 100G NIC
          LFROM - OPA Compute node
          
          Eth <-> OPA Test 2:
          LTO - VM with Mellanox 100G NIC
          LFROM - OPA Compute node
          

          Summary of failed tests:

          QDR <-> OPA Test 2:
          LTO - OPA Compute node
          LFROM - Qlogic Compute node
          
          Eth <-> Qlogic Test 1:
          LTO - VM with Mellanox 100G NIC
          LFROM - Qlogic Compute node
          
          Eth <-> Qlogic Test 2:
          LTO - Qlogic Compute node
          LFROM - VM with Mellanox 100G NIC
          

          I modified one of our compute nodes today and re-configured the Qlogic HCAs on that node (as well as the Qlogic HCA on the router). Running either of the following lnetctl net configurations on the Qlogic HCA (applied as sketched after the configs below) showed the same failed results as above. Selftests within Qlogic only, on either of these configs, work without fail; the problems only appear between Qlogic and some other fabric type.

          Config 1:

              - net type: o2ib
                local NI(s):
                  - nid: 192.168.55.231@o2ib
                    status: up
                    interfaces:
                        0: ib0
                    tunables:
                        peer_timeout: 180
                        peer_credits: 128
                        peer_buffer_credits: 0
                        credits: 1024
                    lnd tunables:
                        peercredits_hiw: 64
                        map_on_demand: 32
                        concurrent_sends: 256
                        fmr_pool_size: 2048
                        fmr_flush_trigger: 512
                        fmr_cache: 1
                        ntx: 2048
                        conns_per_peer: 4
                    tcp bonding: 0
                    dev cpt: 1
                    CPT: "[0,1]"
          

          Config 2:

              - net type: o2ib
                local NI(s):
                  - nid: 192.168.55.231@o2ib
                    status: up
                    interfaces:
                        0: ib0
                    tunables:
                        peer_timeout: 180
                        peer_credits: 8
                        peer_buffer_credits: 0
                        credits: 256
                    lnd tunables:
                        peercredits_hiw: 4
                        map_on_demand: 0
                        concurrent_sends: 8
                        fmr_pool_size: 512
                        fmr_flush_trigger: 384
                        fmr_cache: 1
                        ntx: 512
                        conns_per_peer: 1
                    tcp bonding: 0
                    dev cpt: 1
                    CPT: "[0,1]"
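
          (For reference, a sketch of how a YAML block like the above gets applied and verified on a node; qlogic-config.yaml is a placeholder file name, and the persistent copy would normally live in the attached lnet.conf:)

          # Sketch: load LNet, apply the saved net configuration, then confirm
          # which lnd tunables actually took effect.
          lnetctl lnet configure
          lnetctl import < qlogic-config.yaml
          lnetctl net show -v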
          

          lnet-tests-21_aug_2018.txt

          Any thoughts on what we should be looking at?

          Cheers,
          Simon

          pjones Peter Jones added a comment -

          Amir

          Could you please help here?

          Thanks

          Peter


          People

            Assignee: ashehata Amir Shehata (Inactive)
            Reporter: scadmin SC Admin
            Votes: 0
            Watchers: 4
