[LU-13603] Multirail discovers wrong nids (randomly) Created: 27/May/20  Updated: 06/Jan/21  Resolved: 06/Jan/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Amir Shehata (Inactive)
Resolution: Incomplete Votes: 0
Labels: None

Issue Links:
Related
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

On the client that discovered the wrong NIDs, we get:

  lnetctl peer show
......
   - primary nid: 10.151.27.65@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.151.27.65@o2ib
          state: NA
        - nid: 10.151.27.62@o2ib
          state: NA
 

On the servers, 10.151.27.65 and 10.151.27.62 belong to two different hosts:

nbp8-oss4 ~ # lctl list_nids
10.151.27.65@o2ib
nbp8-oss1 ~ # lctl list_nids
10.151.27.62@o2ib
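
The mismatch above can be checked mechanically. The following is a minimal sketch, not part of the original report: it parses only the subset of `lnetctl peer show` output quoted in this ticket and flags any peer NI whose NID is owned by a different host than the primary NID. The function names `parse_peer_show` and `find_foreign_nids` are hypothetical helpers, and the host-to-NID map is assumed to have been collected by hand from `lctl list_nids` on each server.

```python
import re

# Sample data copied from this ticket.
PEER_SHOW = """\
   - primary nid: 10.151.27.65@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 10.151.27.65@o2ib
          state: NA
        - nid: 10.151.27.62@o2ib
          state: NA
"""

# Assumption: gathered manually from `lctl list_nids` on each server.
HOST_NIDS = {
    "nbp8-oss4": {"10.151.27.65@o2ib"},
    "nbp8-oss1": {"10.151.27.62@o2ib"},
}


def parse_peer_show(text):
    """Return {primary_nid: [peer_ni_nid, ...]} from peer show output.

    Hypothetical helper: handles only the fields quoted in this ticket.
    """
    peers = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*-\s*primary nid:\s*(\S+)", line)
        if m:
            current = m.group(1)
            peers[current] = []
            continue
        m = re.match(r"\s*-\s*nid:\s*(\S+)", line)
        if m and current is not None:
            peers[current].append(m.group(1))
    return peers


def find_foreign_nids(peers, host_nids):
    """List (primary, nid, primary_host, nid_host) where the hosts differ."""
    # Invert the map so each NID points at the host that owns it.
    owner = {nid: host for host, nids in host_nids.items() for nid in nids}
    problems = []
    for primary, nis in peers.items():
        primary_host = owner.get(primary)
        for nid in nis:
            nid_host = owner.get(nid)
            # Skip NIDs with unknown owners; only report known conflicts.
            if primary_host and nid_host and nid_host != primary_host:
                problems.append((primary, nid, primary_host, nid_host))
    return problems


if __name__ == "__main__":
    for primary, nid, p_host, n_host in find_foreign_nids(
        parse_peer_show(PEER_SHOW), HOST_NIDS
    ):
        print(f"peer {primary} ({p_host}) lists {nid}, owned by {n_host}")
```

With the data from this ticket, the check reports that the peer record for nbp8-oss4 includes a NID that belongs to nbp8-oss1.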


 Comments   
Comment by Peter Jones [ 27/May/20 ]

Amir

Could you please advise?

Thanks

Peter

Comment by Amir Shehata (Inactive) [ 29/May/20 ]

Is there a particular sequence of operations which causes this to happen?

There were a few problems fixed around discovery. They stem from toggling discovery on/off. Also, in some cases the PUSH step of discovery is not done, which could cause similar symptoms, i.e., the two peers end up with different views.

Below is a list of recently landed patches that fix these issues:

LU-13477 lnet: Force full discovery cycle
LU-13478 lnet: handle discovery off properly
LU-13471 lnet: use the same src nid for discovery
LU-12312 lnet: handle no discovery flag
LU-13028 lnet: advertise discovery when toggled
LU-13278 lnet: Reconcile discovery push and reply handling
Comment by Mahmoud Hanafi [ 02/Jun/20 ]

We were doing software updates, but during the update we experienced a hardware issue and the servers were down for several hours.

Comment by Mahmoud Hanafi [ 06/Jan/21 ]

This can be closed.
