Details
Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.12.0
Environment:
ARM clients: kernel 4.14.0-115.2.2.el7a.aarch64, MLNX_OFED_LINUX-4.5-1.0.1.0 (OFED-4.5-1.0.1)
x86 servers: RHEL 7.6, same MOFED
Lustre 2.12.0, no patches
Severity: 3
Rank: 9223372036854775807
Description
The client-side symptom is alternating success/failure of LNet pings to an OSS. On the OSS we see:
# lnetctl peer show --nid n1-ib0@o2ib
peer:
    - primary nid: xxx.xxx.xxx.17@o2ib
      Multi-Rail: True
      peer ni:
        - nid: xxx.xxx.xxx.17@o2ib
          state: NA
        - nid: xxx.xxx.xxx.182@o2ib
          state: NA
# lnetctl peer show --nid n2-ib0@o2ib
peer:
    - primary nid: xxx.xxx.xxx.17@o2ib
      Multi-Rail: True
      peer ni:
        - nid: xxx.xxx.xxx.17@o2ib
          state: NA
        - nid: xxx.xxx.xxx.182@o2ib
          state: NA
where n1 has the IP address ending in 182 and n2 has the IP address ending in 17; that is, the two distinct clients have been merged into a single Multi-Rail peer on the server.
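For reference, the client-side check is just repeated LNet pings of the OSS; a minimal sketch (oss1-ib0@o2ib is a hypothetical NID used for illustration, not one from this system):

# run on an affected client; in the bad state attempts alternate between success and failure
for i in $(seq 1 10); do
    echo "attempt $i:"
    lnetctl ping oss1-ib0@o2ib
    sleep 1
done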
The result in the logs is lots of timeouts, PUT NAKs, mount failures, and general chaos, including plenty of the following message:
kernel: LustreError: 21309:0:(events.c:450:server_bulk_callback()) event type 3, status -61, desc ffff9c46e0303200
The logs led us to believe there were IB problems, but the fabric was found to be clean and responsive between the affected client nodes and OSS servers.
Planning to turn off discovery going forward. I'll leave a few clients drained for a while in case there is info you might need.
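For reference, a minimal sketch of turning discovery off, assuming stock 2.12 lnetctl and the usual module-options file (the path is illustrative):

# runtime: disable dynamic peer discovery on a running node
lnetctl set discovery 0
# confirm the global setting
lnetctl global show
# persistent across reboots: lnet module parameter
echo "options lnet lnet_peer_discovery_disabled=1" >> /etc/modprobe.d/lustre.conf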
fyi, rebooting the client does not change the behavior, but rebooting the server clears it. Manually deleting the incorrect peer NID on the server and re-adding the correct info for the missing peer also clears it.
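A sketch of that kind of manual cleanup, assuming 2.12 lnetctl syntax and using the placeholder NIDs from the peer show output above:

# on the OSS: remove the NID that was wrongly folded into n2's peer record
lnetctl peer del --prim_nid xxx.xxx.xxx.17@o2ib --nid xxx.xxx.xxx.182@o2ib
# re-add n1 as its own peer with its correct primary NID
lnetctl peer add --prim_nid xxx.xxx.xxx.182@o2ib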
Also, the clients are running Socket Direct, but only one IPoIB interface is configured and in use by LNet.
Issue Links
duplicates: LU-11478 LNet: discovery sequence numbers could be misleading (Resolved)
Software stack as described above.
2 MDS nodes, 1 MDT each. 40 OSS nodes, 2 ZFS OSTs each. Module params:
Steps before seeing this issue: basically rebooting 2500+ nodes, some of them multiple times.
Out of the 2500+ clients I found the error on 4 pairs of nodes, one pair on each of 4 distinct servers out of the 40 OSS nodes.
Some relevant items:
The servers and clients are running the Mellanox stack as listed above. As I said above, the servers have only one port into the IB network. The clients also have only one port, which appears as two due to the use of the Mellanox Socket Direct feature. Only one of the ports is configured for use by LNet, and in fact only one (ib0) has an IPoIB address configured. So if the code is automatically detecting another port via some IB call and assuming it can be used, that is an error. I can imagine that bad assumption leading to the addition of an inappropriate NID from some list to the wrong client under high-load conditions, exposing a concurrency flaw not previously seen.
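For context, LNet on these nodes is restricted to the single configured IPoIB interface; a minimal sketch of that kind of configuration, assuming the standard module-options style (values illustrative, not the site's actual settings):

# /etc/modprobe.d/lustre.conf (illustrative)
# bind LNet to ib0 only, even though Socket Direct exposes a second port
options lnet networks="o2ib(ib0)"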
It seems unlikely to be reproducible on a VM with TCP, and I'm afraid that a reproducer would have to come from HPE internal testing, if it is possible at all. The machine where this occurred is in production with Dynamic Discovery disabled. It's likely that someone familiar with Dynamic Discovery will have to look at the code and work out how that feature does the automatic detection.
I think my point here is that nothing on this machine should have been detected as multirail.
If you have other more specific questions about the node configuration I'll be glad to answer what I can.