Details
- Type: Bug
- Resolution: Duplicate
- Priority: Major
- None
- Affects Version/s: Lustre 2.12.0
- Environment:
  ARM clients: kernel 4.14.0-115.2.2.el7a.aarch64, MLNX_OFED_LINUX-4.5-1.0.1.0 (OFED-4.5-1.0.1)
  x86 servers: RHEL 7.6, same MOFED
  Lustre 2.12.0, no patches
- Severity: 3
Description
The client-side symptom is alternating success/failure of LNet ping to an OSS. On the OSS we see:
# lnetctl peer show --nid n1-ib0@o2ib
peer:
    - primary nid: xxx.xxx.xxx.17@o2ib
      Multi-Rail: True
      peer ni:
        - nid: xxx.xxx.xxx.17@o2ib
          state: NA
        - nid: xxx.xxx.xxx.182@o2ib
          state: NA
# lnetctl peer show --nid n2-ib0@o2ib
peer:
    - primary nid: xxx.xxx.xxx.17@o2ib
      Multi-Rail: True
      peer ni:
        - nid: xxx.xxx.xxx.17@o2ib
          state: NA
        - nid: xxx.xxx.xxx.182@o2ib
          state: NA
where n1 has the IP address ending in 182 and n2 has the IP address ending in 17, so both peer records report the same primary NID and the same list of peer NIs even though they are two different client nodes.
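For contrast, with a single IPoIB interface per client and discovery behaving correctly, each client would normally appear as its own peer with only its own NID, roughly like the sketch below (illustrative, not captured output; the NID values are taken from the listing above):

# lnetctl peer show --nid n1-ib0@o2ib
peer:
    - primary nid: xxx.xxx.xxx.182@o2ib
      Multi-Rail: True
      peer ni:
        - nid: xxx.xxx.xxx.182@o2ib
          state: NA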
The result in the logs is lots of timeouts, PUT NAKs, mount failures, and general chaos, including plenty of the following message:
kernel: LustreError: 21309:0:(events.c:450:server_bulk_callback()) event type 3, status -61, desc ffff9c46e0303200
The logs led us to believe there were IB problems, but the fabric was found to be clean and responsive between the affected client nodes and the OSS servers.
Planning to turn off discovery going forward (see the sketch below). I'll leave a few clients drained for a while in case there is info you might need.
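A minimal sketch of one way to turn discovery off, assuming lnetctl from Lustre 2.11 or later; the persistent-config path and YAML layout are assumptions and depend on how LNet is configured at boot:

# Disable dynamic peer discovery on a running node
lnetctl set discovery 0

# Confirm the global setting took effect
lnetctl global show

# One way to persist it: add a global section to the LNet YAML config
# imported at startup (path assumed, commonly /etc/lnet.conf):
# global:
#     discovery: 0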
FYI, rebooting the client does not change the behavior; rebooting the server clears it. Manually deleting the incorrect peer NID on the server and re-adding the correct info for the missing peer also clears it (see the sketch below).
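For reference, a minimal sketch of that manual cleanup using the NIDs from the listing above, run on the OSS; the exact NIDs and the need to re-add the second peer explicitly are assumptions about this particular setup:

# Remove the wrong NID (.182) from the peer whose primary NID is .17
lnetctl peer del --prim_nid xxx.xxx.xxx.17@o2ib --nid xxx.xxx.xxx.182@o2ib

# Re-create the missing peer under its own primary NID
lnetctl peer add --prim_nid xxx.xxx.xxx.182@o2ib

# Verify each peer now lists only its own NI
lnetctl peer show --nid xxx.xxx.xxx.17@o2ib
lnetctl peer show --nid xxx.xxx.xxx.182@o2ib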
Also, the clients are running Socket Direct, but only one IPoIB interface is configured and in use by LNet.
Issue Links
- duplicates LU-11478: LNet: discovery sequence numbers could be misleading (Resolved)
I talked to Amir about this at LUG.
Amir pointed to queued patches and suggested that the patches submitted for LU-11478 may address this. From my reading, I think that fixing LU-11478 will allow the problem to be corrected automatically, but I am not sure that it will actually prevent the problem from occurring. "The problem" being the insertion of the wrong NID into a peer's list of NIs. I think there is a fair chance of finding this through code inspection by someone more familiar with the peer discovery process. My suspicion is that a peer node ID is held by an unlocked reference as a peer NI is inserted.

I noticed LU-12264 in the queued patches; it looks very close, but I'm not sure that this specific problem is addressed. I suspect that the number of peer entries on the host is a factor, so duplication may not be possible without a large cluster.