[LU-9914] Dynamic Discovery - discovery hangs if max_interfaces is changed from 200->16 Created: 25/Aug/17 Updated: 29/Jan/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Amir Shehata (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | patch | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
NOTE: I see that after this patch:
Without this patch the problem was being hidden by an immediate failure.
Steps:
Peer 2:
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 192.168.122.30@tcp
          status: up
          interfaces:
              0: eth0
        - nid: 192.168.122.31@tcp
          status: up
          interfaces:
              0: eth1
        - nid: 192.168.122.32@tcp
          status: up
          interfaces:
              0: eth2
        - nid: 192.168.122.33@tcp
          status: up
          interfaces:
              0: eth3
        - nid: 192.168.122.34@tcp
          status: up
          interfaces:
              0: eth4
        - nid: 192.168.122.35@tcp
          status: up
          interfaces:
              0: eth5
        - nid: 192.168.122.36@tcp
          status: up
          interfaces:
              0: eth6
        - nid: 192.168.122.37@tcp
          status: up
          interfaces:
              0: eth7
        - nid: 192.168.122.38@tcp
          status: up
          interfaces:
              0: eth8
        - nid: 192.168.122.39@tcp
          status: up
          interfaces:
              0: eth9
        - nid: 192.168.122.40@tcp
          status: up
          interfaces:
              0: eth10
        - nid: 192.168.122.41@tcp
          status: up
          interfaces:
              0: eth11
        - nid: 192.168.122.42@tcp
          status: up
          interfaces:
              0: eth12
        - nid: 192.168.122.43@tcp
          status: up
          interfaces:
              0: eth13
        - nid: 192.168.122.44@tcp
          status: up
          interfaces:
              0: eth14
        - nid: 192.168.122.45@tcp
          status: up
          interfaces:
              0: eth15
        - nid: 192.168.122.46@tcp
          status: up
          interfaces:
              0: eth16
#peer 1
modprobe lnet
lnetctl lnet configure
lnetctl net add --net tcp --if eth0,eth1
# max_interfaces defaults to 200
lnetctl discover 192.168.122.30@tcp
lnetctl set max_interfaces 16
# discover hangs (I kill it... so it might come back after a while, but haven't waited)
lnetctl discover 192.168.122.30@tcp
|
| Comments |
| Comment by Amir Shehata (Inactive) [ 25/Aug/17 ] |
|
problem is here:

int
lnet_ping_info_validate(struct lnet_ping_info *pinfo)
{
        if (!pinfo)
                return -EINVAL;
        if (pinfo->pi_magic != LNET_PROTO_PING_MAGIC)
                return -EPROTO;
        if (!(pinfo->pi_features & LNET_PING_FEAT_NI_STATUS))
                return -EPROTO;
        /* Loopback is guaranteed to be present */
        if (pinfo->pi_nnis < 1 || pinfo->pi_nnis > lnet_interfaces_max)
                return -ERANGE;
        if (LNET_NETTYP(LNET_NIDNET(LNET_PING_INFO_LONI(pinfo))) != LOLND)
                return -EPROTO;
        return 0;
}

        /*
         * A reply with invalid or corrupted info. Set PING_FAILED to
         * trigger a retry.
         */
        rc = lnet_ping_info_validate(&pbuf->pb_info);
        if (rc) {
                lp->lp_state |= LNET_PEER_PING_FAILED;
                lp->lp_ping_error = 0;
                CDEBUG(D_NET, "Corrupted Ping Reply from %s: %d\n",
                       libcfs_nid2str(lp->lp_primary_nid), rc);
                goto out;
        }

Doesn't look like the state machine is handling the ping failure properly. Basically, the local lnet_interfaces_max is less than the number of interfaces on the far end (16 < 18). So we should get an -ERANGE. Looks like we're stuck in a loop retrying the ping for discover and it keeps failing with the same error:

(peer.c:2112:lnet_discovery_event_reply()) Corrupted Ping Reply from 192.168.122.30@tcp: -34 |
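The range check quoted in the comment above can be tried in user space to see why the result never changes between retries. The following is a minimal sketch, not the kernel code: `nnis_in_range` and its parameters are hypothetical names, and the two arguments stand in for `pi_nnis` and `lnet_interfaces_max`. With the peer from the description (17 tcp NIs plus loopback, so 18 in total), the check passes with the default limit of 200 and fails with 16. On Linux, `-ERANGE` is the `-34` seen in the log.

```c
#include <errno.h>

/*
 * Hypothetical user-space stand-in for the pi_nnis range check in
 * lnet_ping_info_validate(). Not LNet code; names are illustrative.
 */
int nnis_in_range(unsigned int pi_nnis, unsigned int interfaces_max)
{
    /* Loopback is always present, so a valid reply carries at least one NI. */
    if (pi_nnis < 1 || pi_nnis > interfaces_max)
        return -ERANGE;
    return 0;
}
```

Calling `nnis_in_range(18, 200)` returns 0, while `nnis_in_range(18, 16)` returns `-ERANGE`. Because both inputs are fixed by configuration, repeating the ping cannot produce a different outcome.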
| Comment by Olaf Weber [ 25/Aug/17 ] |
|
To be honest, lnet_interfaces_max exists only to avoid hard-coding a limit, and you ought to run with compatible values across the cluster. Meaning that lnet_interfaces_max on each node should be at least the number of interfaces of its peers. Still, what happens here isn't exactly graceful handling of the problematic configuration. My proposal would be to fail discovery of nodes that have more interfaces than lnet_interfaces_max, add some checks to prevent discovery from retrying, and emit an error message indicating that this problem has been encountered.
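One way to picture the proposal is a small decision function that separates a permanent interface-count mismatch from a reply that may simply be corrupted. This is only a sketch under stated assumptions: `enum reply_action`, `classify_reply`, and the message text are hypothetical and do not exist in LNet, and the actual patch may be structured quite differently. It assumes the caller passes the return code of the validation step together with the peer's reported NI count and the local limit.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical outcome of handling a ping reply; not an LNet type. */
enum reply_action {
    REPLY_ACCEPT, /* reply is valid, discovery continues */
    REPLY_RETRY,  /* reply may be corrupted, ping again */
    REPLY_FAIL,   /* permanent mismatch, fail discovery without retrying */
};

/*
 * Illustrative classification of a validation result.
 * validate_rc:    0 or a negative errno from the validation step
 * peer_nnis:      number of NIs the peer reported
 * interfaces_max: local limit (stand-in for lnet_interfaces_max)
 */
enum reply_action classify_reply(int validate_rc, unsigned int peer_nnis,
                                 unsigned int interfaces_max)
{
    if (validate_rc == 0)
        return REPLY_ACCEPT;

    /* Too many peer interfaces: retrying cannot help, so report and stop. */
    if (validate_rc == -ERANGE && peer_nnis > interfaces_max) {
        fprintf(stderr,
                "peer reports %u interfaces, local limit is %u: discovery failed\n",
                peer_nnis, interfaces_max);
        return REPLY_FAIL;
    }

    /* Other validation errors keep the existing retry behaviour. */
    return REPLY_RETRY;
}
```

For the configuration in this ticket, `classify_reply(-ERANGE, 18, 16)` would yield `REPLY_FAIL` and print one error line, while an `-EPROTO` result would still be retried.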
|
| Comment by Gerrit Updater [ 25/Aug/17 ] |
|
Olaf Weber (olaf.weber@hpe.com) uploaded a new patch: https://review.whamcloud.com/28714 |