[LU-13603] Multirail discovers wrong nids (randomly) Created: 27/May/20 Updated: 06/Jan/21 Resolved: 06/Jan/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mahmoud Hanafi | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 2 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
On the client that discovered the wrong NIDs, we get:
lnetctl peer show
......
- primary nid: 10.151.27.65@o2ib
Multi-Rail: True
peer ni:
- nid: 10.151.27.65@o2ib
state: NA
- nid: 10.151.27.62@o2ib
state: NA
On the server with 10.151.27.65
nbp8-oss4 ~ # lctl list_nids
10.151.27.65@o2ib
nbp8-oss1 ~ # lctl list_nids
10.151.27.62@o2ib |
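The mismatch above could be checked for mechanically. The following is only a rough sketch, not part of any Lustre tooling: it assumes the `lnetctl peer show` text has been captured to a string, and that the operator supplies the NIDs each server actually owns (as reported by `lctl list_nids`). The function names `parse_peers` and `find_foreign_nids` and the `EXPECTED` mapping are hypothetical, introduced here for illustration.

```python
import re

def parse_peers(text):
    """Map each primary NID to the peer NI NIDs listed under it.

    Indentation is ignored, so this also works on pasted output
    whose leading whitespace was lost.
    """
    peers = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"-\s*primary nid:\s*(\S+)", line)
        if m:
            current = m.group(1)
            peers[current] = []
            continue
        m = re.match(r"-\s*nid:\s*(\S+)", line)
        if m and current is not None:
            peers[current].append(m.group(1))
    return peers

def find_foreign_nids(peers, expected):
    """Return {primary: [unexpected NIDs]} for peers that list a NID
    the operator did not declare for that primary NID."""
    bad = {}
    for primary, nids in peers.items():
        allowed = expected.get(primary)
        if allowed is None:
            continue  # no ground truth supplied for this peer
        extra = [n for n in nids if n not in allowed]
        if extra:
            bad[primary] = extra
    return bad

# Peer entry taken from the description of this ticket.
SAMPLE = """\
- primary nid: 10.151.27.65@o2ib
  Multi-Rail: True
  peer ni:
    - nid: 10.151.27.65@o2ib
      state: NA
    - nid: 10.151.27.62@o2ib
      state: NA
"""

# Assumed ground truth: each server owns exactly the NID shown by lctl list_nids.
EXPECTED = {
    "10.151.27.65@o2ib": {"10.151.27.65@o2ib"},
    "10.151.27.62@o2ib": {"10.151.27.62@o2ib"},
}

if __name__ == "__main__":
    for primary, extra in find_foreign_nids(parse_peers(SAMPLE), EXPECTED).items():
        print(f"{primary}: unexpected peer NIs {extra}")
```

Run against the output in this ticket, the check flags 10.151.27.62@o2ib as listed under the peer whose primary NID is 10.151.27.65@o2ib, even though that NID belongs to a different server.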
| Comments |
| Comment by Peter Jones [ 27/May/20 ] |
|
Amir, could you please advise? Thanks, Peter |
| Comment by Amir Shehata (Inactive) [ 29/May/20 ] |
|
Is there a particular sequence of operations that causes this to happen? A few problems around discovery were fixed recently. They stem from toggling discovery on and off. Also, in some cases the PUSH step of discovery is not done, which could cause similar symptoms, i.e. the two peers end up with different views. Below is a list of recently landed patches that fix these issues.
LU-13477 lnet: Force full discovery cycle
LU-13478 lnet: handle discovery off properly
LU-13471 lnet: use the same src nid for discovery
LU-12312 lnet: handle no discovery flag
LU-13028 lnet: advertise discovery when toggled
LU-13278 lnet: Reconcile discovery push and reply handling
|
| Comment by Mahmoud Hanafi [ 02/Jun/20 ] |
|
We were doing software updates, but during the update we experienced a hardware issue and the servers were down for several hours. |
| Comment by Mahmoud Hanafi [ 06/Jan/21 ] |
|
This can be closed |