[LU-11853] Automatically update peer NID state when a client changes from multi-rail to non multi-rail Created: 11/Jan/19 Updated: 15/Jan/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Shuichi Ihara | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: | 2.12 and master |
| Attachments: | lctl-dk-es14k-vm1.txt, lctl-dk-s184-vm1.txt |
| Severity: | 3 |
| Description |
|
Currently, if a client is changed from a multi-rail to a non multi-rail configuration, it cannot mount the filesystem until its peer NID state is removed on the servers.

With the multi-rail configuration options lnet networks="o2ib10(ib0,ib2)":
[root@s184 ~]# mount -t lustre 10.0.11.90@o2ib10:/cache1 /cache1
[root@s184 ~]# lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: o2ib10
local NI(s):
- nid: 10.0.10.184@o2ib10
status: up
interfaces:
0: ib0
- nid: 10.2.10.184@o2ib10
status: up
interfaces:
0: ib2
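For reference, the networks settings used in this ticket are normally lnet module options; a minimal sketch, assuming they live in a modprobe configuration file such as /etc/modprobe.d/lustre.conf (the file name is an assumption, not taken from this ticket):

# /etc/modprobe.d/lustre.conf -- multi-rail: two IB interfaces on o2ib10
options lnet networks="o2ib10(ib0,ib2)"
# non multi-rail variant used later in this ticket: a single interface
#options lnet networks="o2ib10(ib0)"

Switching between the two forms only takes effect after the Lustre/LNet modules are unloaded and reloaded (lustre_rmmod, as shown below).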
After the client's lnet configuration is changed to non multi-rail with options lnet networks="o2ib10(ib0)", remounting Lustre on the client fails unless that client's peer state is cleared on all servers:

[root@s184 ~]# umount -t lustre -a
[root@s184 ~]# lustre_rmmod
[root@s184 ~]# mount -t lustre 10.0.11.90@o2ib10:/cache1 /cache1
mount.lustre: mount 10.0.11.90@o2ib10:/cache1 at /cache1 failed: Input/output error
Is the MGS running?

On the server side, the client's peer state is still multi-rail:

[root@es14k-vm1 ~]# lnetctl peer show
peer:
- primary nid: 0@lo
Multi-Rail: False
peer ni:
- nid: 0@lo
state: NA
- primary nid: 10.0.11.92@o2ib10
Multi-Rail: True
peer ni:
- nid: 10.0.11.92@o2ib10
state: NA
- nid: 10.1.11.92@o2ib10
state: NA
- primary nid: 10.0.11.91@o2ib10
Multi-Rail: True
peer ni:
- nid: 10.0.11.91@o2ib10
state: NA
- nid: 10.1.11.91@o2ib10
state: NA
- primary nid: 10.0.11.93@o2ib10
Multi-Rail: True
peer ni:
- nid: 10.0.11.93@o2ib10
state: NA
- nid: 10.1.11.93@o2ib10
state: NA
- primary nid: 10.0.10.184@o2ib10
Multi-Rail: True <------ Still Multi-rail
peer ni:
- nid: 10.0.10.184@o2ib10
state: NA
- nid: 10.2.10.184@o2ib10
state: NA
A workaround is to remove the client's peer NID state on all servers and then mount again. That works, but an automated peer state update would be preferable.

[root@es14k-vm1 ~]# clush -g oss lnetctl peer del --prim_nid 10.0.10.184@o2ib10 --nid 10.0.10.184@o2ib10
[root@es14k-vm1 ~]# clush -g oss lnetctl peer del --prim_nid 10.0.10.184@o2ib10 --nid 10.2.10.184@o2ib10
[root@s184 ~]# mount -t lustre 10.0.11.90@o2ib10:/cache1 /cache1
[root@s184 ~]# |
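A slightly more general form of the same workaround is sketched below. It assumes that lnetctl peer del with only --prim_nid removes the entire peer entry (worth verifying against the lnetctl man page for the installed version) and that the clush group covers all servers holding state for this client:

[root@es14k-vm1 ~]# clush -g oss lnetctl peer del --prim_nid 10.0.10.184@o2ib10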
| Comments |
| Comment by Amir Shehata (Inactive) [ 11/Jan/19 ] |
|
When you bring down the client and remount it, the client should re-trigger a discovery round, which would update the local peer entry on the servers. Are you able to collect net/neterror logging from both the client and the server at the time the mount fails, so we can see the reason for the failure? |
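For reference, a minimal sketch of one way to capture the requested net/neterror logging on both nodes, assuming the standard lctl debug interface (the output file name is arbitrary):

# enable net/neterror debug flags and clear the existing debug buffer
lctl set_param debug="+net +neterror"
lctl clear
# reproduce the failed mount on the client, then dump the debug buffer on each node
lctl dk > /tmp/lctl-dk-$(hostname -s).txt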
| Comment by Amir Shehata (Inactive) [ 11/Jan/19 ] |
|
Another question: has the config on the client changed? Was the first (primary) NID removed? I.e., did the config go from:
- primary nid: 10.0.10.184@o2ib10
Multi-Rail: True <------ Still Multi-rail
peer ni:
- nid: 10.0.10.184@o2ib10
state: NA
- nid: 10.2.10.184@o2ib10
state: NA
to
- primary nid: 10.0.10.184@o2ib10
Multi-Rail: True <------ Still Multi-rail
peer ni:
- nid: 10.2.10.184@o2ib10
state: NA
? |
| Comment by Shuichi Ihara [ 12/Jan/19 ] |
Yes, the client's NID state was updated on the client side after the change to non multi-rail. Please see below.

[root@s184 ~]# lustre_rmmod
[root@s184 ~]# mount -t lustre 10.0.11.90@o2ib10:/cache1 /cache1
mount.lustre: mount 10.0.11.90@o2ib10:/cache1 at /cache1 failed: Input/output error
Is the MGS running?
[root@s184 ~]# lnetctl net show
net:
- net type: lo
local NI(s):
- nid: 0@lo
status: up
- net type: o2ib10
local NI(s):
- nid: 10.0.10.184@o2ib10
status: up
interfaces:
0: ib0
[root@s184 ~]# ssh 10.0.11.90 lnetctl peer show
peer:
- primary nid: 0@lo
Multi-Rail: False
peer ni:
- nid: 0@lo
state: NA
- primary nid: 10.0.11.92@o2ib10
Multi-Rail: True
peer ni:
- nid: 10.0.11.92@o2ib10
state: NA
- nid: 10.1.11.92@o2ib10
state: NA
- primary nid: 10.0.11.91@o2ib10
Multi-Rail: True
peer ni:
- nid: 10.0.11.91@o2ib10
state: NA
- nid: 10.1.11.91@o2ib10
state: NA
- primary nid: 10.0.11.93@o2ib10
Multi-Rail: True
peer ni:
- nid: 10.0.11.93@o2ib10
state: NA
- nid: 10.1.11.93@o2ib10
state: NA
- primary nid: 10.0.10.184@o2ib10
Multi-Rail: True
peer ni:
- nid: 10.0.10.184@o2ib10
state: NA
- nid: 10.2.10.184@o2ib10
state: NA
I'm collecting the debug logs and will upload them shortly. |
| Comment by Shuichi Ihara [ 12/Jan/19 ] |
|
Attached are the debug logs (net/neterr): lctl-dk-es14k-vm1.txt (one of the servers) and lctl-dk-s184-vm1.txt (the client), captured while the problem was reproduced. I ran the test several times; sometimes the NI update worked properly, and sometimes it did not. |
| Comment by Amir Shehata (Inactive) [ 15/Jan/19 ] |
|
There is an issue with the discovery mechanism: if you bring a peer down and then back up with a changed NID list, discovery will not pick up the change, which results in the communication errors you are seeing. This has already been fixed as part of the Multi-Rail Router feature: https://review.whamcloud.com/#/c/33304/10. Is this an urgent issue that would require back-porting that change? |
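For reference, a sketch of how one might verify the behavior once that change is in place; the expected output below is an assumption based on this ticket (the Multi-Rail flag is omitted because its value after rediscovery is not established here), not confirmed output. After the client remounts with the single-interface configuration, the server-side peer entry should list only the remaining NID:

[root@es14k-vm1 ~]# lnetctl peer show
    ...
    - primary nid: 10.0.10.184@o2ib10
      peer ni:
        - nid: 10.0.10.184@o2ib10
          state: NA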