[LU-16709] LNet: locking multiple NIDs of the same MR peer as primary results in incorrect representation Created: 04/Apr/23 Updated: 16/Jul/23 Resolved: 28/Jun/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Serguei Smirnov | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | lnet, multi-rail | ||
| Severity: | 3 | ||||
| Description |
|
Some system configurations cause the Lustre layer to specify the same MR peer using multiple NIDs. This exposes a problem in the primary NID locking logic: when the "primary nid locking" feature is enabled, LNet creates separate peer records, each containing one NID of the MR peer as its "locked primary". After discovery completes in the background, these records are not merged, so the peer is represented incorrectly.

Here's an example:

server:
# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 192.168.122.50@tcp
          status: up
          interfaces:
              0: eth0
        - nid: 192.168.122.134@tcp
          status: up
          interfaces:
              0: ens12

client:
# mount -t lustre 192.168.122.134@tcp:192.168.122.50@tcp:/lustrewt /mnt/lustrefs
# lnetctl peer show
peer:
    - primary nid: 192.168.122.134@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.134@tcp
          state: NA
    - primary nid: 192.168.122.50@tcp
      Multi-Rail: True
      peer ni:
        - nid: 192.168.122.50@tcp
          state: NA |
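The expected behaviour described above can be illustrated with a toy model in Python. This is only a sketch of the idea, not LNet's implementation: `merge_records` is a hypothetical helper, peer records are modelled as a plain dict, and keeping the first-created record's primary NID is an assumption made for the illustration.

```python
def merge_records(peers, discovered):
    """Fold every record that holds one of the discovered NIDs into one record.

    peers:      dict mapping primary NID -> list of NIDs in that record
    discovered: list of NIDs the remote node reports as its own
    Assumption: the first matching record keeps its primary NID.
    """
    found = set(discovered)
    merged = {}
    keeper = None
    for primary, nids in peers.items():
        if found.isdisjoint(nids):
            # Record belongs to some other peer; copy it unchanged.
            merged[primary] = list(nids)
            continue
        if keeper is None:
            keeper = primary
            merged[keeper] = []
        for nid in nids:
            if nid not in merged[keeper]:
                merged[keeper].append(nid)
    if keeper is not None:
        # Add discovered NIDs that no existing record held yet.
        for nid in discovered:
            if nid not in merged[keeper]:
                merged[keeper].append(nid)
    return merged


# The two records from the client example in this description.
before = {
    "192.168.122.134@tcp": ["192.168.122.134@tcp"],
    "192.168.122.50@tcp": ["192.168.122.50@tcp"],
}
after = merge_records(before, ["192.168.122.50@tcp", "192.168.122.134@tcp"])
print(after)
```

In this model the two single-NID records collapse into one record that lists both NIDs, which is the representation the report expects once discovery has finished.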
| Comments |
| Comment by Chris Horn [ 04/Apr/23 ] |
This is an incorrect NID specification, isn't it? NIDs belonging to the same server are supposed to be comma-separated. NIDs belonging to other servers should be colon-separated. LNet arguably did the correct thing here. |
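The separator convention described in this comment can be sketched in Python. This is a simplified illustration, not the parser Lustre uses: `parse_mount_device` is a hypothetical helper, and it assumes a device string of the form `<nids>:/<fsname>` with IPv4-style NIDs that contain no colons.

```python
def parse_mount_device(device):
    """Split '<nids>:/<fsname>' into per-server NID groups and the fsname."""
    nid_part, sep, fsname = device.rpartition(":/")
    if not sep or not nid_part or not fsname:
        raise ValueError(f"not a '<nids>:/<fsname>' device string: {device!r}")
    # Colons separate servers; commas separate NIDs of the same server.
    servers = [group.split(",") for group in nid_part.split(":")]
    return servers, fsname


# Device string from the description: read as two separate servers.
print(parse_mount_device("192.168.122.134@tcp:192.168.122.50@tcp:/lustrewt"))
# Comma-separated form: read as one server with two NIDs.
print(parse_mount_device("192.168.122.134@tcp,192.168.122.50@tcp:/lustrewt"))
```

Under this reading, the mount command in the description names two servers with one NID each, whereas the comma-separated form names a single server with two NIDs.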
| Comment by Shuichi Ihara [ 05/Apr/23 ] |
|
In my case, the problem also occurs on server-to-server connections, without any client mount. Output from master without the patch:
# clush -a lnetctl peer show | dshbak
----------------
ai400x2-3-vm1
----------------
peer:
- primary nid: 10.0.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.202@o2ib12
state: NA
- nid: 10.1.11.202@o2ib12
state: NA
- primary nid: 10.0.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.203@o2ib12
state: NA
- nid: 10.1.11.203@o2ib12
state: NA
- primary nid: 0@lo
Multi-Rail: False
peer ni:
- nid: 0@lo
state: NA
- primary nid: 10.0.11.200@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.200@o2ib12
state: NA
- nid: 10.1.11.200@o2ib12
state: NA
- primary nid: 10.0.11.201@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.201@o2ib12
state: NA
- nid: 10.1.11.201@o2ib12
state: NA
----------------
ai400x2-3-vm2
----------------
peer:
- primary nid: 10.0.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.202@o2ib12
state: NA
- primary nid: 10.1.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.1.11.202@o2ib12
state: NA
- primary nid: 10.0.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.203@o2ib12
state: NA
- primary nid: 10.1.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.1.11.203@o2ib12
state: NA
- primary nid: 0@lo
Multi-Rail: False
peer ni:
- nid: 0@lo
state: NA
- primary nid: 10.0.11.200@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.200@o2ib12
state: NA
- nid: 10.1.11.200@o2ib12
state: NA
- primary nid: 10.0.11.201@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.201@o2ib12
state: NA
- nid: 10.1.11.201@o2ib12
state: NA
----------------
ai400x2-3-vm3
----------------
peer:
- primary nid: 10.0.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.202@o2ib12
state: NA
- primary nid: 10.1.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.1.11.202@o2ib12
state: NA
- primary nid: 10.0.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.203@o2ib12
state: NA
- primary nid: 10.1.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.1.11.203@o2ib12
state: NA
- primary nid: 0@lo
Multi-Rail: False
peer ni:
- nid: 0@lo
state: NA
- primary nid: 10.0.11.200@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.200@o2ib12
state: NA
- nid: 10.1.11.200@o2ib12
state: NA
- primary nid: 10.0.11.201@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.201@o2ib12
state: NA
- nid: 10.1.11.201@o2ib12
state: NA
----------------
ai400x2-3-vm4
----------------
peer:
- primary nid: 10.0.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.202@o2ib12
state: NA
- nid: 10.1.11.202@o2ib12
state: NA
- primary nid: 10.0.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.203@o2ib12
state: NA
- primary nid: 10.1.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.1.11.203@o2ib12
state: NA
- primary nid: 0@lo
Multi-Rail: False
peer ni:
- nid: 0@lo
state: NA
- primary nid: 10.0.11.200@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.200@o2ib12
state: NA
- nid: 10.1.11.200@o2ib12
state: NA
- primary nid: 10.0.11.201@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.201@o2ib12
state: NA
- nid: 10.1.11.201@o2ib12
state: NA
After patch https://review.whamcloud.com/#/c/fs/lustre-release/+/50530/:
# clush -a lnetctl peer show | dshbak
----------------
ai400x2-3-vm1
----------------
peer:
- primary nid: 10.0.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.202@o2ib12
state: NA
- nid: 10.1.11.202@o2ib12
state: NA
- primary nid: 10.0.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.203@o2ib12
state: NA
- nid: 10.1.11.203@o2ib12
state: NA
- primary nid: 0@lo
Multi-Rail: False
peer ni:
- nid: 0@lo
state: NA
- primary nid: 10.0.11.200@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.200@o2ib12
state: NA
- nid: 10.1.11.200@o2ib12
state: NA
- primary nid: 10.0.11.201@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.201@o2ib12
state: NA
- nid: 10.1.11.201@o2ib12
state: NA
----------------
ai400x2-3-vm2
----------------
peer:
- primary nid: 10.0.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.202@o2ib12
state: NA
- nid: 10.1.11.202@o2ib12
state: NA
- primary nid: 10.0.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.203@o2ib12
state: NA
- nid: 10.1.11.203@o2ib12
state: NA
- primary nid: 0@lo
Multi-Rail: False
peer ni:
- nid: 0@lo
state: NA
- primary nid: 10.0.11.200@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.200@o2ib12
state: NA
- nid: 10.1.11.200@o2ib12
state: NA
- primary nid: 10.0.11.201@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.201@o2ib12
state: NA
- nid: 10.1.11.201@o2ib12
state: NA
----------------
ai400x2-3-vm3
----------------
peer:
- primary nid: 10.0.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.202@o2ib12
state: NA
- nid: 10.1.11.202@o2ib12
state: NA
- primary nid: 10.0.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.203@o2ib12
state: NA
- nid: 10.1.11.203@o2ib12
state: NA
- primary nid: 0@lo
Multi-Rail: False
peer ni:
- nid: 0@lo
state: NA
- primary nid: 10.0.11.200@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.200@o2ib12
state: NA
- nid: 10.1.11.200@o2ib12
state: NA
- primary nid: 10.0.11.201@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.201@o2ib12
state: NA
- nid: 10.1.11.201@o2ib12
state: NA
----------------
ai400x2-3-vm4
----------------
peer:
- primary nid: 10.0.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.202@o2ib12
state: NA
- nid: 10.1.11.202@o2ib12
state: NA
- primary nid: 10.0.11.203@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.203@o2ib12
state: NA
- nid: 10.1.11.203@o2ib12
state: NA
- primary nid: 0@lo
Multi-Rail: False
peer ni:
- nid: 0@lo
state: NA
- primary nid: 10.0.11.200@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.200@o2ib12
state: NA
- nid: 10.1.11.200@o2ib12
state: NA
- primary nid: 10.0.11.201@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.201@o2ib12
state: NA
- nid: 10.1.11.201@o2ib12
state: NA
I have also attached an llog dump for reference. |
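The difference between nodes in the "without patch" output can also be checked mechanically. The following Python sketch assumes the `lnetctl peer show` text layout pasted in this comment; `parse_peer_show` and `find_split_peers` are hypothetical helper names and not part of lnetctl. It treats one node's view as the reference and reports NIDs that the reference lists as a secondary peer NI but that another node holds as the primary NID of a separate record.

```python
def parse_peer_show(text):
    """Return {primary NID: [peer NI NIDs]} from `lnetctl peer show` text.

    Lines are matched by prefix after stripping, so indentation is ignored.
    """
    peers = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("- primary nid:"):
            current = line.split(":", 1)[1].strip()
            peers[current] = []
        elif line.startswith("- nid:") and current is not None:
            peers[current].append(line.split(":", 1)[1].strip())
    return peers


def find_split_peers(reference, observed):
    """List (primary, nid) pairs grouped in `reference` but split in `observed`."""
    split = []
    for primary, nids in reference.items():
        for nid in nids:
            if nid != primary and nid in observed:
                split.append((primary, nid))
    return split


# Excerpts of the records for 10.0.11.202 as seen by vm1 and vm2 (no patch).
vm1 = """peer:
- primary nid: 10.0.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.202@o2ib12
state: NA
- nid: 10.1.11.202@o2ib12
state: NA
"""
vm2 = """peer:
- primary nid: 10.0.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.0.11.202@o2ib12
state: NA
- primary nid: 10.1.11.202@o2ib12
Multi-Rail: True
peer ni:
- nid: 10.1.11.202@o2ib12
state: NA
"""
print(find_split_peers(parse_peer_show(vm1), parse_peer_show(vm2)))
# -> [('10.0.11.202@o2ib12', '10.1.11.202@o2ib12')]
```

Running the same comparison on the "after patch" excerpts for these nodes would return an empty list, since every node then shows both NIDs under one primary.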
| Comment by Serguei Smirnov [ 05/Apr/23 ] |
|
Hi Chris,
For example, this sequence also leads to a similar "broken" peer record:
I'm not sure this reproducer is valid either, though, because the mount command does not list both server NIDs. |
| Comment by Gerrit Updater [ 28/Jun/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50530/ |
| Comment by Peter Jones [ 28/Jun/23 ] |
|
Seems to have merged for 2.16 |