Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.15.0
-
None
-
3
Description
On kjlmo13 we saw incorrect peer entries for two servers after client mount:
[root@c-lmo1049 ~]# lnetctl debug recovery -p
peer NI recovery:
    nid-0: 10.230.77.11@o2ib1
    nid-1: 10.230.77.9@o2ib1
[root@c-lmo1049 ~]# lnetctl debug recovery -l
[root@c-lmo1049 ~]# lnetctl peer show --nid 10.230.77.11@o2ib1
peer:
    - primary nid: 10.230.77.10@o2ib1
      Multi-Rail: True
      peer ni:
        - nid: 10.230.77.10@o2ib1
          state: NA
        - nid: 10.230.77.11@o2ib1
          state: NA
[root@c-lmo1049 ~]# lnetctl peer show --nid 10.230.77.9@o2ib1
peer:
    - primary nid: 10.230.77.8@o2ib1
      Multi-Rail: True
      peer ni:
        - nid: 10.230.77.8@o2ib1
          state: NA
        - nid: 10.230.77.9@o2ib1
          state: NA
[root@c-lmo1049 ~]#
Those servers' actual NIDs were:
----------------
kjlmo1304
----------------
10.230.77.8@o2ib1
----------------
kjlmo1305
----------------
10.230.77.10@o2ib1
----------------
The issue is with config log processing under LUS-9293/LU-14661. The config log says these servers have two NIDs each. Discovery correctly deletes the missing NIDs, but later config log processing adds them back. At that point the peer is already marked "up to date", so discovery is not performed again and the stale NIDs remain.
We should either mark this peer as out of date or simply skip adding temporary peer NIs to a peer that is already considered up to date. The latter is probably better because it does not require an additional discovery handshake.
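The sequence described above, and the effect of the second proposal, can be illustrated with a small Python model. This is only a sketch of the ordering problem, not LNet code: the `Peer` class, `add_temp_peer_ni`, `discover`, `replay`, and the `skip_if_up_to_date` flag are all hypothetical names invented for this illustration. The NIDs are taken from the output for kjlmo1305 above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Peer:
    """Hypothetical stand-in for a peer entry; not the real LNet structure."""
    primary_nid: str
    nids: List[str] = field(default_factory=list)
    up_to_date: bool = False  # set once discovery has completed


def add_temp_peer_ni(peer: Peer, nid: str, skip_if_up_to_date: bool) -> bool:
    """Add a NID taken from the config log rather than from discovery."""
    if nid in peer.nids:
        return False
    if skip_if_up_to_date and peer.up_to_date:
        # Proposed behaviour: trust the discovered NI list over the config log.
        return False
    peer.nids.append(nid)
    return True


def discover(peer: Peer, actual_nids: List[str]) -> None:
    """Reconcile the NI list with what the peer itself reports."""
    peer.nids = [n for n in peer.nids if n in actual_nids]
    for n in actual_nids:
        if n not in peer.nids:
            peer.nids.append(n)
    peer.up_to_date = True


def replay(skip_if_up_to_date: bool) -> List[str]:
    """Config log pass, then discovery, then a later config log pass."""
    peer = Peer("10.230.77.10@o2ib1", ["10.230.77.10@o2ib1"])
    config_log_nids = ["10.230.77.10@o2ib1", "10.230.77.11@o2ib1"]
    for nid in config_log_nids:
        add_temp_peer_ni(peer, nid, skip_if_up_to_date)
    discover(peer, ["10.230.77.10@o2ib1"])  # the server really has one NID
    for nid in config_log_nids:
        add_temp_peer_ni(peer, nid, skip_if_up_to_date)
    return peer.nids
```

In this model, `replay(False)` ends with the stale 10.230.77.11@o2ib1 entry back in the peer's NI list, matching the reported behaviour, while `replay(True)` leaves only the NID that discovery confirmed, with no extra discovery round needed.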