Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.4
- Environment: OPA, Lustre 2.10.4, QLogic QDR IB, CentOS 7.5
- Severity: 3
Description
Hi Folks,
Looking for some assistance on this one. We're having trouble with reliable LNet routing between QLogic and OPA clients. Basically, we see long pauses in I/O when moving data between the two fabric types. Testing with lnet_selftest has shown that over many hours, some tests (300-second runs) will randomly fail.
In recent testing over the last few nights it seems to fail more often when LTO = OPA and LFROM = QIB. As far as I can tell, the buffers, lnetctl stats, etc. look fine during transfers, then suddenly msgs_alloc and the /proc/sys/lnet/peers queues drop to zero right when lnet_selftest starts showing zero-sized transfers.
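For reference, this is roughly the kind of monitoring used to watch those counters on the router while a test runs (just a sketch; the 5-second interval is arbitrary):

# watch msgs_alloc (lnetctl stats) and the per-peer queues while lnet_selftest runs
watch -n 5 'lnetctl stats show; cat /proc/sys/lnet/peers'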
For LNet settings: with mismatched settings (i.e. ko2iblnd settings that aren't the same on each side), LNet routing between the OPA router and OPA compute/storage nodes would basically always give me errors. With matched, 'Intel optimized' settings I've not yet seen it fail. Ethernet routing to OPA also seems to work fine.
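For illustration, 'matched' here means the same ko2iblnd module options on the router's OPA interface and on the OPA compute/storage nodes. The values below are OPA-style tuning along the lines of the ko2iblnd-opa defaults shipped with Lustre, shown only as an example and not necessarily what is in our attached config:

# /etc/modprobe.d/ko2iblnd.conf (applied identically on the router and the OPA nodes)
# Illustrative OPA-tuned values; verify against the ko2iblnd.conf shipped with your Lustre version.
options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4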
We have the QIB side's LNet configuration set to the same as the other nodes on the QIB fabric. I'll attach the config to this ticket in case we have some settings incorrectly applied to one of the IB nets.
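(For the attachment, the running LNet configuration can be captured with something like the following; the file name is arbitrary.)

lnetctl export > /tmp/lnet-running.conf    # dump the live LNet config as YAML
lnetctl net show -v                        # per-NI tunables, including the ko2iblnd values in use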
Are there any special settings we need to apply when routing between the old 'TrueScale' and new OPA fabrics?
Shortened example of failed selftest:
[root@gstar057 ~]# TM=300 LTO=192.168.44.199@o2ib44 LFROM=192.168.55.77@o2ib /lfs/data0/lst-bench.sh
LST_SESSION = 755
SESSION: lstread FEATURES: 1 TIMEOUT: 300 FORCE: No
192.168.55.77@o2ib are added to session
192.168.44.199@o2ib44 are added to session
Test was added successfully
bulk_read is running now
Capturing statistics for 300 secs
[LNet Rates of lfrom]
[R] Avg: 3163 RPC/s Min: 3163 RPC/s Max: 3163 RPC/s
[W] Avg: 1580 RPC/s Min: 1580 RPC/s Max: 1580 RPC/s
[LNet Bandwidth of lfrom]
[R] Avg: 1581.81 MiB/s Min: 1581.81 MiB/s Max: 1581.81 MiB/s
[W] Avg: 0.24 MiB/s Min: 0.24 MiB/s Max: 0.24 MiB/s
etc...
[LNet Bandwidth of lfrom]
[R] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
[W] Avg: 0.01 MiB/s Min: 0.01 MiB/s Max: 0.01 MiB/s
[LNet Rates of lto]
[R] Avg: 0 RPC/s Min: 0 RPC/s Max: 0 RPC/s
[W] Avg: 0 RPC/s Min: 0 RPC/s Max: 0 RPC/s
[LNet Bandwidth of lto]
[R] Avg: 0.00 MiB/s Min: 0.00 MiB/s Max: 0.00 MiB/s
[W] Avg: 0.00 MiB/s Min: 0.00 MiB/s Max: 0.00 MiB/s
lfrom: 12345-192.168.55.77@o2ib: [Session 32 brw errors, 0 ping errors] [RPC: 18 errors, 0 dropped, 94 expired]
Total 1 error nodes in lfrom
lto: Total 0 error nodes in lto
Batch is stopped
session is ended
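For context, a rough sketch of what a wrapper like lst-bench.sh typically does is below. The actual script isn't attached here, so the group names, test type, transfer size, and the stat loop are placeholders that just mirror the output above:

#!/bin/bash
# Hypothetical sketch of an lnet_selftest wrapper; TM / LTO / LFROM come from the environment.
modprobe lnet_selftest                         # selftest module must be loaded on all nodes
export LST_SESSION=$$                          # printed as "LST_SESSION = ..."
lst new_session --timeout ${TM:-300} lstread
lst add_group lfrom ${LFROM}                   # source NIDs
lst add_group lto ${LTO}                       # destination NIDs
lst add_batch bulk_read
lst add_test --batch bulk_read --from lfrom --to lto brw read size=1M
lst run bulk_read
lst stat lfrom lto &                           # print LNet rates/bandwidth while the batch runs
STAT_PID=$!
sleep ${TM:-300}
kill ${STAT_PID}
lst show_error lfrom lto                       # report brw/RPC errors per group
lst stop bulk_read
lst end_session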
A bit of help with this would be really appreciated. Let me know which logs would be most helpful, e.g. we can repeat the tests with debug flags enabled if that helps. I could certainly have made a configuration error, so if something doesn't look right with the lnet.conf, let me know. We can't seem to find any ko2iblnd settings that are reliable.
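If specific logs would help, a typical way to capture LNet debug data around a failing run would be something like this (the debug mask and buffer size here are guesses at what would be most useful):

lctl set_param debug=+net        # add LNet message tracing to the debug mask
lctl set_param debug_mb=512      # enlarge the debug buffer so the window of interest isn't overwritten
# ... reproduce the failure with lnet_selftest ...
lctl dk /tmp/lnet-debug.log      # dump the kernel debug log to attach to the ticket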
Cheers,
Simon