Dynamic Discovery - discovery hangs if max_interfaces is changed from 200->16

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      NOTE: I see this after applying the following patch:
      https://review.whamcloud.com/#/c/28702/

      Without this patch the problem was hidden by an immediate failure.

      Steps:

      Peer 2:
      net:
          - net type: lo
            local NI(s):
              - nid: 0@lo
                status: up
          - net type: tcp
            local NI(s):
              - nid: 192.168.122.30@tcp
                status: up
                interfaces:
                    0: eth0
              - nid: 192.168.122.31@tcp
                status: up
                interfaces:
                    0: eth1
              - nid: 192.168.122.32@tcp
                status: up
                interfaces:
                    0: eth2
              - nid: 192.168.122.33@tcp
                status: up
                interfaces:
                    0: eth3
              - nid: 192.168.122.34@tcp
                status: up
                interfaces:
                    0: eth4
              - nid: 192.168.122.35@tcp
                status: up
                interfaces:
                    0: eth5
              - nid: 192.168.122.36@tcp
                status: up
                interfaces:
                    0: eth6
              - nid: 192.168.122.37@tcp
                status: up
                interfaces:
                    0: eth7
              - nid: 192.168.122.38@tcp
                status: up
                interfaces:
                    0: eth8
              - nid: 192.168.122.39@tcp
                status: up
                interfaces:
                    0: eth9
              - nid: 192.168.122.40@tcp
                status: up
                interfaces:
                    0: eth10
              - nid: 192.168.122.41@tcp
                status: up
                interfaces:
                    0: eth11
              - nid: 192.168.122.42@tcp
                status: up
                interfaces:
                    0: eth12
              - nid: 192.168.122.43@tcp
                status: up
                interfaces:
                    0: eth13
              - nid: 192.168.122.44@tcp
                status: up
                interfaces:
                    0: eth14
              - nid: 192.168.122.45@tcp
                status: up
                interfaces:
                    0: eth15
              - nid: 192.168.122.46@tcp
                status: up
                interfaces:
                    0: eth16
      
      #peer 1
      modprobe lnet
      lnetctl lnet configure
      lnetctl net add --net tcp --if eth0,eth1
      # max_interfaces defaults to 200
      lnetctl discover 192.168.122.30@tcp
      lnetctl set max_interfaces 16
      # discover hangs (I killed it, so it might come back after a while, but I haven't waited)
      lnetctl discover 192.168.122.30@tcp
      

          Activity

            [LU-9914] Dynamic Discovery - discovery hangs if max_interfaces is changed from 200->16

            Gerrit Updater added a comment:

            Olaf Weber (olaf.weber@hpe.com) uploaded a new patch: https://review.whamcloud.com/28714
            Subject: LU-9914 lnet: gracefully handle peers with too many NIs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b973f67c227f5b988afb052171cf74cc7a097157

            Olaf Weber (Inactive) added a comment:

            To be honest, lnet_interfaces_max exists only to avoid hard-coding a limit, and you ought to run with compatible values across the cluster: lnet_interfaces_max on each node should be at least the number of interfaces of any of its peers.

            Still, what happens here isn't exactly graceful handling of the problematic configuration. My proposal would be to fail discovery of nodes that have more interfaces than lnet_interfaces_max, add some checks to prevent discovery from retrying, and emit an error message indicating that this problem has been encountered.
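
            The proposal above can be sketched in isolation. This is a minimal standalone model, not the content of the uploaded patch and not LNet code: should_retry_discovery() and its parameters are hypothetical names introduced for illustration. It assumes that -ERANGE from the ping reply validation always means "peer has more NIs than the local limit" and that other validation errors stay retryable, as they are today.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of LNet. Given the result of validating
 * a ping reply, decide whether discovery should retry. -ERANGE means the
 * peer advertises more NIs than the local limit allows; repeating the
 * ping cannot change that, so fail discovery and report the reason.
 */
static bool should_retry_discovery(int rc, const char *nid,
				   unsigned int interfaces_max)
{
	if (rc == 0)
		return false;	/* reply was valid; nothing to retry */

	if (rc == -ERANGE) {
		fprintf(stderr,
			"peer %s exceeds the local limit of %u interfaces; "
			"discovery failed\n", nid, interfaces_max);
		return false;	/* permanent until the configuration changes */
	}

	return true;	/* e.g. -EPROTO: treat as a corrupted reply and retry */
}
```

            The point of the design is that the error code alone separates a transient failure from a configuration mismatch, so no extra state is needed to stop the retries.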
            Amir Shehata (Inactive) added a comment (edited):

            The problem is here:

            int
            lnet_ping_info_validate(struct lnet_ping_info *pinfo)
            {
                    if (!pinfo)
                            return -EINVAL;
                    if (pinfo->pi_magic != LNET_PROTO_PING_MAGIC)
                            return -EPROTO;
                    if (!(pinfo->pi_features & LNET_PING_FEAT_NI_STATUS))
                            return -EPROTO;
                    /* Loopback is guaranteed to be present */
                    if (pinfo->pi_nnis < 1 || pinfo->pi_nnis > lnet_interfaces_max)
                            return -ERANGE;
                    if (LNET_NETTYP(LNET_NIDNET(LNET_PING_INFO_LONI(pinfo))) != LOLND)
                            return -EPROTO;
                    return 0;
            }
            
            
                    /* Excerpt from lnet_discovery_event_reply(), peer.c lines 2103-2114 */
                    /*
                     * A reply with invalid or corrupted info. Set PING_FAILED to
                     * trigger a retry.
                     */
                    rc = lnet_ping_info_validate(&pbuf->pb_info);
                    if (rc) {
                            lp->lp_state |= LNET_PEER_PING_FAILED;
                            lp->lp_ping_error = 0;
                            CDEBUG(D_NET, "Corrupted Ping Reply from %s: %d\n",
                                   libcfs_nid2str(lp->lp_primary_nid), rc);
                            goto out;
                    }
            

            It doesn't look like the state machine is handling the ping failure properly. The local lnet_interfaces_max is less than the number of interfaces on the far end (16 < 18, counting the 17 tcp NIDs plus 0@lo), so we should get -ERANGE.
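
            To make the comparison concrete, here is a small sketch of the limit a node would need in order to accept all of its peers. min_interfaces_max() is a hypothetical helper, not an LNet API, and it assumes each peer's NI count includes the loopback NI, as the pi_nnis test in the excerpt does.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not an LNet API: given the NI count advertised by
 * each peer (loopback included), return the smallest local limit that
 * lets every peer's ping reply pass the pi_nnis bounds test.
 */
static unsigned int min_interfaces_max(const unsigned int *peer_nnis,
				       size_t npeers)
{
	unsigned int needed = 1;	/* loopback is always present */

	for (size_t i = 0; i < npeers; i++)
		if (peer_nnis[i] > needed)
			needed = peer_nnis[i];
	return needed;
}
```

            For the configuration in the description, peer 2 advertises 18 NIs, so peer 1 would need a limit of at least 18; the setting of 16 is two short.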

            It looks like we're stuck in a loop retrying the discovery ping, and it keeps failing with the same error:

            (peer.c:2112:lnet_discovery_event_reply()) Corrupted Ping Reply from 192.168.122.30@tcp: -34
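
            The reason retrying cannot help can be shown with a bounded model. This is only a sketch under stated assumptions: validate_nnis() and discover_attempts() are hypothetical names, the model keeps nothing of the real state machine except "retry while validation fails", and the attempt cap exists only so the demonstration terminates.

```c
#include <errno.h>

/* Hypothetical stand-in for the pi_nnis bounds test; not the LNet function. */
static int validate_nnis(unsigned int nnis, unsigned int interfaces_max)
{
	return (nnis < 1 || nnis > interfaces_max) ? -ERANGE : 0;
}

/*
 * Model of "retry until the reply validates", capped at max_attempts.
 * Returns the attempt number that succeeded, or -1 if every attempt
 * failed.
 */
static int discover_attempts(unsigned int peer_nnis,
			     unsigned int interfaces_max,
			     int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		if (validate_nnis(peer_nnis, interfaces_max) == 0)
			return attempt;
		/* Neither input changes between attempts, so neither can the result. */
	}
	return -1;
}
```

            With 18 NIs and a limit of 200 the first attempt succeeds; with a limit of 16 every attempt fails no matter how large the cap is, which matches the repeated -34 (-ERANGE on Linux) in the log.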
            

            People

              ashehata Amir Shehata (Inactive)
              ashehata Amir Shehata (Inactive)