[LU-9914] Dynamic Discovery - discovery hangs if max_interfaces is changed from 200->16 Created: 25/Aug/17  Updated: 29/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Amir Shehata (Inactive)
Resolution: Unresolved Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

NOTE: I see this after applying this patch:
https://review.whamcloud.com/#/c/28702/

Without this patch the problem was hidden by an immediate failure.

Steps:

Peer 2:
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 192.168.122.30@tcp
          status: up
          interfaces:
              0: eth0
        - nid: 192.168.122.31@tcp
          status: up
          interfaces:
              0: eth1
        - nid: 192.168.122.32@tcp
          status: up
          interfaces:
              0: eth2
        - nid: 192.168.122.33@tcp
          status: up
          interfaces:
              0: eth3
        - nid: 192.168.122.34@tcp
          status: up
          interfaces:
              0: eth4
        - nid: 192.168.122.35@tcp
          status: up
          interfaces:
              0: eth5
        - nid: 192.168.122.36@tcp
          status: up
          interfaces:
              0: eth6
        - nid: 192.168.122.37@tcp
          status: up
          interfaces:
              0: eth7
        - nid: 192.168.122.38@tcp
          status: up
          interfaces:
              0: eth8
        - nid: 192.168.122.39@tcp
          status: up
          interfaces:
              0: eth9
        - nid: 192.168.122.40@tcp
          status: up
          interfaces:
              0: eth10
        - nid: 192.168.122.41@tcp
          status: up
          interfaces:
              0: eth11
        - nid: 192.168.122.42@tcp
          status: up
          interfaces:
              0: eth12
        - nid: 192.168.122.43@tcp
          status: up
          interfaces:
              0: eth13
        - nid: 192.168.122.44@tcp
          status: up
          interfaces:
              0: eth14
        - nid: 192.168.122.45@tcp
          status: up
          interfaces:
              0: eth15
        - nid: 192.168.122.46@tcp
          status: up
          interfaces:
              0: eth16

# Peer 1
modprobe lnet
lnetctl lnet configure
lnetctl net add --net tcp --if eth0,eth1
# max_interfaces defaults to 200
lnetctl discover 192.168.122.30@tcp
lnetctl set max_interfaces 16
# discover hangs (I killed it, so it might return after a while, but I haven't waited)
lnetctl discover 192.168.122.30@tcp


 Comments   
Comment by Amir Shehata (Inactive) [ 25/Aug/17 ]

problem is here:

int
lnet_ping_info_validate(struct lnet_ping_info *pinfo)
{
        if (!pinfo)
                return -EINVAL;
        if (pinfo->pi_magic != LNET_PROTO_PING_MAGIC)
                return -EPROTO;
        if (!(pinfo->pi_features & LNET_PING_FEAT_NI_STATUS))
                return -EPROTO;
        /* Loopback is guaranteed to be present */
        if (pinfo->pi_nnis < 1 || pinfo->pi_nnis > lnet_interfaces_max)
                return -ERANGE;
        if (LNET_NETTYP(LNET_NIDNET(LNET_PING_INFO_LONI(pinfo))) != LOLND)
                return -EPROTO;
        return 0;
}


From lnet_discovery_event_reply() in peer.c (lines 2103-2114):

        /*
         * A reply with invalid or corrupted info. Set PING_FAILED to
         * trigger a retry.
         */
        rc = lnet_ping_info_validate(&pbuf->pb_info);
        if (rc) {
                lp->lp_state |= LNET_PEER_PING_FAILED;
                lp->lp_ping_error = 0;
                CDEBUG(D_NET, "Corrupted Ping Reply from %s: %d\n",
                       libcfs_nid2str(lp->lp_primary_nid), rc);
                goto out;
        }

The state machine doesn't appear to handle this ping failure properly. The local lnet_interfaces_max (16) is less than the number of interfaces reported by the far end (18: 17 tcp NIs plus loopback), so lnet_ping_info_validate() returns -ERANGE.
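To make the arithmetic concrete, here is a minimal standalone sketch of just the range check. check_nni_range() is a hypothetical stand-in written for illustration, not an LNet function, and it takes plain integers instead of a struct lnet_ping_info. On Linux ERANGE is 34, which matches the -34 in the debug log.

```c
#include <errno.h>

/*
 * Hypothetical, simplified stand-in for the range check in
 * lnet_ping_info_validate(). It only models the NI count test.
 */
int check_nni_range(unsigned int pi_nnis, unsigned int interfaces_max)
{
        /* Loopback is always present, so a valid reply has at least one NI. */
        if (pi_nnis < 1 || pi_nnis > interfaces_max)
                return -ERANGE;
        return 0;
}
```

With peer 2 as configured above (17 tcp NIs plus 0@lo), check_nni_range(18, 200) returns 0 and check_nni_range(18, 16) returns -ERANGE. The count reported by the peer does not change between attempts, so every retry fails the same way.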

It looks like discovery is stuck in a loop retrying the ping, which keeps failing with the same error:

(peer.c:2112:lnet_discovery_event_reply()) Corrupted Ping Reply from 192.168.122.30@tcp: -34

Comment by Olaf Weber [ 25/Aug/17 ]

To be honest, lnet_interfaces_max exists only to avoid hard-coding a limit, and you ought to run with compatible values across the cluster, meaning that lnet_interfaces_max on each node should be at least the number of interfaces of its peers.

Still, what happens here isn't exactly graceful handling of the problematic configuration. My proposal would be to fail discovery of nodes that have more interfaces than lnet_interfaces_max, add some checks to prevent discovery from retrying, and emit an error message indicating that this problem has been encountered.
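
One way the proposal could look, as a rough sketch only: treat -ERANGE from validation as a permanent failure instead of a corrupted reply, so that discovery does not retry. Everything here is hypothetical and for illustration (struct peer_sketch, the PEER_* flags, handle_validate_rc()); it is not the real LNet peer state machine and not the content of any uploaded patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical state bits; the real LNet peer flags differ. */
#define PEER_PING_FAILED  (1u << 0)   /* transient: discovery may retry */
#define PEER_DISC_REFUSED (1u << 1)   /* permanent: do not retry */

/* Hypothetical, reduced peer record used only in this sketch. */
struct peer_sketch {
        unsigned int state;
        int ping_error;
};

/*
 * Decide what to do with the result of validating a ping reply.
 * Returns true if discovery should retry the ping.
 */
bool handle_validate_rc(struct peer_sketch *lp, int rc,
                        unsigned int peer_nnis, unsigned int interfaces_max)
{
        if (rc == 0)
                return false;   /* reply is valid, nothing to retry */

        if (rc == -ERANGE) {
                /* Retrying cannot help: the peer reports the same count. */
                lp->state |= PEER_DISC_REFUSED;
                lp->ping_error = rc;
                fprintf(stderr,
                        "peer reports %u NIs, local limit is %u; failing discovery\n",
                        peer_nnis, interfaces_max);
                return false;
        }

        /* Any other validation error is treated as a corrupted reply. */
        lp->state |= PEER_PING_FAILED;
        lp->ping_error = 0;
        return true;
}
```

With this split, the case from this ticket (18 NIs reported, limit 16) fails once with a visible error message, while replies that fail validation for other reasons keep the existing retry behaviour.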

 

Comment by Gerrit Updater [ 25/Aug/17 ]

Olaf Weber (olaf.weber@hpe.com) uploaded a new patch: https://review.whamcloud.com/28714
Subject: LU-9914 lnet: gracefully handle peers with too many NIs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b973f67c227f5b988afb052171cf74cc7a097157
