Details

    • Type: Improvement
    • Priority: Major
    • Resolution: Unresolved

    Description

      When a client is decommissioned, it remains indefinitely in the peer list of the servers and generates a stream of console messages like:

      lnet_handle_recovery_reply()) peer NI (10.11.12.13@o2ib) recovery failed with -113
      lnet_handle_recovery_reply()) Skipped 1234 similar messages
      

      It is cleaner to remove the decommissioned client from the list of peers. That avoids having to change the value of lnet_recovery_limit just to silence LNetError messages about a client that no longer exists; moreover, changing that parameter can mask a real problem on an active but faulty node.

      However, removing such peers manually is cumbersome, which is why automatic deletion would be useful.

      This feature could use a parameter that both enables it and sets the delay after which a dead client is considered removable.
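
      As a rough illustration, such a knob could follow the pattern of existing LNet module parameters. The sketch below is hypothetical, not code from any patch: the parameter lnet_peer_del_timeout and the helper are made-up names, while lnet_recovery_limit is the existing parameter.

        #include <linux/module.h>
        #include <linux/time64.h>

        /* Existing LNet parameter: recovery attempts on a peer NI stop after
         * this many seconds (0 = retry forever, the current default). */
        extern unsigned int lnet_recovery_limit;

        /* Hypothetical knob: 0 disables automatic deletion; a non-zero value
         * is the extra delay, after recovery gives up, before the peer NI is
         * removed from the peer list. */
        static unsigned int lnet_peer_del_timeout;
        module_param(lnet_peer_del_timeout, uint, 0644);
        MODULE_PARM_DESC(lnet_peer_del_timeout,
                         "Seconds past the recovery limit before a dead peer is deleted (0 = never delete)");

        /* Hypothetical predicate a monitor thread could evaluate per peer NI. */
        static bool peer_ni_expired(time64_t last_alive, time64_t now)
        {
                return lnet_peer_del_timeout != 0 && lnet_recovery_limit != 0 &&
                       now >= last_alive + lnet_recovery_limit + lnet_peer_del_timeout;
        }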

          Activity

            [LU-17519] Remove dead peers automatically

            cbordage I don't think we came to a final decision regarding peer deletion. 

            The options as I remember were:

            • do not delete peers, just use a reasonable lnet_recovery_limit
            • optionally delete peers as soon as lnet_recovery_limit is reached
            • introduce another module parameter (timeout) which would allow peers to get deleted some time after lnet_recovery_limit is reached
            • hardcode peer deletion timeout to be something after lnet_recovery_limit is reached but make deletion optional
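
            For concreteness, the last two options differ only in the deletion condition. A simplified model follows; all names are hypothetical except lnet_recovery_limit:

              #include <stdbool.h>
              #include <time.h>

              extern unsigned int lnet_recovery_limit;    /* existing knob */
              extern bool peer_del_enabled;               /* hypothetical opt-in */
              extern unsigned int lnet_peer_del_timeout;  /* hypothetical extra delay */

              /* "delete peers as soon as lnet_recovery_limit is reached" */
              static bool delete_at_limit(time_t last_alive, time_t now)
              {
                      return peer_del_enabled && lnet_recovery_limit != 0 &&
                             now >= last_alive + lnet_recovery_limit;
              }

              /* "delete some time after lnet_recovery_limit is reached" */
              static bool delete_after_timeout(time_t last_alive, time_t now)
              {
                      return lnet_peer_del_timeout != 0 && lnet_recovery_limit != 0 &&
                             now >= last_alive + lnet_recovery_limit + lnet_peer_del_timeout;
              }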

            I think the simplest would be to "optionally delete peers as soon as lnet_recovery_limit is reached", if we agree to delete inactive peers at all.

            Please feel free to review the options and voice your opinion. If you have the cycles, it would be great if you could take this on.

            Thanks,

            Serguei

            ssmirnov Serguei Smirnov added a comment

            IIRC, ssmirnov was working on something for this ticket.

            ssmirnov, am I right? If not, I can work on this ticket again.

            cbordage Cyril Bordage added a comment

            What is needed for this ticket to move forward? AFAIK there is still an issue with stale peers not being cleaned up.

            adilger Andreas Dilger added a comment

            adilger I agree with you in general, but I'd like to confirm there are no caveats related to primary NID locking before we decide to erase peer records whose recovery timed out.

            For example, consider a client executing a mount command which lists ":"-separated server NIDs, so that we have servers S1 and S2 with NIDs NS1 and NS2 respectively. In LNet this should result in peer records being created for both S1 and S2, featuring the NS1 and NS2 NIDs. As the peers are created, NS1 and NS2 should get locked as primary NIDs. If S2 then goes down and does not respond to recovery attempts for longer than lnet_recovery_limit, the peer record for S2 will be deleted. Later, S2 may come back online. It is not clear to me whether Lustre trying to talk to S2 in this scenario will result in the same peer record for S2 as was previously created by parsing the mount command, or whether there is a chance that a different S2 NID may get locked as primary.

            Alternatively, we could consider deleting only peers with unlocked primary NIDs when their recovery expires. If I understand the issue correctly, most of the time the recovery errors keep getting reported for clients which went offline. The servers don't lock the client primary NIDs, so this case should be covered.
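
            Assuming the existing LNET_PEER_LOCK_PRIMARY peer state flag is what marks a locked primary NID, that guard could look roughly like the sketch below (both helpers are hypothetical, not actual LNet functions):

              /* Sketch: skip peers whose primary NID is locked (e.g. server
               * NIDs parsed from the mount command line); delete only
               * unlocked peers whose recovery has expired, which covers the
               * typical case of decommissioned clients seen by a server. */
              static void check_peer_for_deletion(struct lnet_peer *lp, time64_t now)
              {
                      if (lp->lp_state & LNET_PEER_LOCK_PRIMARY)
                              return;         /* locked primary NID: keep the record */

                      if (peer_recovery_expired(lp, now)) /* hypothetical predicate */
                              delete_dead_peer(lp);       /* hypothetical helper */
              }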

            ssmirnov Serguei Smirnov added a comment

            ssmirnov is there any benefit to keeping a "dead" peer around after LNet has stopped trying to recover the connection? Is there useful information in the peer connection state that would be lost if it was dropped at this point? Does this trigger RPC errors back to Lustre if there are messages queued for that peer?

            I'm assuming that LNet would retry connecting to the peer if it gets a new or retried Lustre RPC request for that NID, regardless of whether it had given up on reconnecting by itself. So from my POV there is no huge benefit in having LNet retry for a long time; reconnection can be driven by the Lustre-level RPC retries.

            adilger Andreas Dilger added a comment

            adilger routers are supposed to be pinged periodically by the nodes using them, per alive_router_check_interval, regardless of the peer state, so I don't see implications there.

            It is true that the lnet_recovery_limit parameter can be used to halt recovery attempts after the specified timeout. However, the LNet peer itself is not currently deleted when this happens.

            ssmirnov Serguei Smirnov added a comment

            Patch https://review.whamcloud.com/54408 "LU-14654 lnet: Correct peer NI recovery age out calculation" changed peer recovery to be bound by lnet_recovery_limit so that it will not retry indefinitely if this value is non-zero.

            Unfortunately, this value has defaulted to 0 since its introduction in patch https://review.whamcloud.com/39716 "LU-13569 lnet: Introduce lnet_recovery_limit parameter". Is there any reason why it shouldn't be set to some reasonable value (e.g. 300s) so that LNet stops trying to recover an interface after some time? Wouldn't a Lustre-level RPC retry trigger LNet to re-establish a connection to a peer NID if it is still needed?

            Are there any implications for LNet routers no longer retrying connections, or is the expectation that the remote peer establish a new connection after some time (assuming it was restarted)? What if there is a network-level failure longer than lnet_recovery_limit and all the routers stop trying to communicate with each other? Is there something at a higher level that would trigger the establishment of new connections again?
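
            For reference, the bound that patch introduces can be modeled as the check below (a simplified model, not the actual LNet code): a peer NI that has been unreachable for longer than lnet_recovery_limit is no longer re-queued for recovery, and the default of 0 means "retry forever".

              #include <stdbool.h>
              #include <time.h>

              extern unsigned int lnet_recovery_limit; /* 0 = retry forever (default) */

              /* Simplified model of the age-out bound from LU-14654: once a
               * peer NI has been failing for longer than lnet_recovery_limit
               * seconds, recovery stops retrying it. */
              static bool recovery_expired(time_t last_alive, time_t now)
              {
                      return lnet_recovery_limit != 0 &&
                             now - last_alive > lnet_recovery_limit;
              }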

            adilger Andreas Dilger added a comment

            "Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54465
            Subject: LU-17519 lnet: remove dead peer nis automatically
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 11cafaf1701d5df4aa8e843d28ffb6a585e171ac

            gerrit Gerrit Updater added a comment

            People

              Assignee: Cyril Bordage (cbordage)
              Reporter: Cyril Bordage (cbordage)
              Votes: 0
              Watchers: 11
