[LU-15852] Don't add "temp" peer NIs after discovery completes Created: 12/May/22  Updated: 06/Dec/22  Resolved: 02/Nov/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Chris Horn Assignee: Chris Horn
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3

 Description   

On kjlmo13 we saw incorrect peer entries for two servers after a client mount:

[root@c-lmo1049 ~]# lnetctl debug recovery -p
peer NI recovery:
    nid-0: 10.230.77.11@o2ib1
    nid-1: 10.230.77.9@o2ib1
[root@c-lmo1049 ~]# lnetctl debug recovery -l
[root@c-lmo1049 ~]# lnetctl peer show --nid 10.230.77.11@o2ib1
peer:
    - primary nid: 10.230.77.10@o2ib1
      Multi-Rail: True
      peer ni:
        - nid: 10.230.77.10@o2ib1
          state: NA
        - nid: 10.230.77.11@o2ib1
          state: NA
[root@c-lmo1049 ~]# lnetctl peer show --nid 10.230.77.9@o2ib1
peer:
    - primary nid: 10.230.77.8@o2ib1
      Multi-Rail: True
      peer ni:
        - nid: 10.230.77.8@o2ib1
          state: NA
        - nid: 10.230.77.9@o2ib1
          state: NA
[root@c-lmo1049 ~]#

Those servers' actual NIDs were:

----------------
kjlmo1304
----------------
10.230.77.8@o2ib1
----------------
kjlmo1305
----------------
10.230.77.10@o2ib1
----------------

The issue is in config log processing with LUS-9293/LU-14661 applied. The config log lists two NIDs for each of these servers. Discovery correctly deletes the NIDs the servers do not actually have, but later config log processing adds them back as temporary peer NIs. At that point the peer is already marked "up to date", so discovery is not performed again and the stale NIDs remain, as in the output above.

We should either mark this peer as out of date, or simply skip adding temporary peer NIs to a peer that is considered up to date. The latter is probably best, because it does not require an additional discovery handshake.



 Comments   
Comment by Gerrit Updater [ 12/May/22 ]

"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/47322
Subject: LU-15852 lnet: Don't modify uptodate peer with temp NI
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aaa8ee182b3084d76092eb9497c1803fe99b7ad3

Comment by Gerrit Updater [ 02/Nov/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/47322/
Subject: LU-15852 lnet: Don't modify uptodate peer with temp NI
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8f718df474e453fbc69dfe90214e71565963f6db

Comment by Peter Jones [ 02/Nov/22 ]

Landed for 2.16

Generated at Sat Feb 10 03:21:51 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.