Details

    • Improvement
    • Resolution: Unresolved
    • Major

    Description

      When a client is decommissioned, it stays in the servers' peer lists indefinitely and generates a continuous stream of messages such as:

      lnet_handle_recovery_reply()) peer NI (10.11.12.13@o2ib) recovery failed with -113
      lnet_handle_recovery_reply()) Skipped 1234 similar messages
      

      Removing the client from the peer list is the cleaner fix. It avoids having to change the value of lnet_recovery_limit just to silence the LNetError messages about the decommissioned client. Moreover, changing that parameter can mask a problem on a node that is still active but faulty.

      However, removing peers manually can be cumbersome, which is why automatic deletion could be useful.

      The feature could be controlled by a parameter that both enables it and sets the delay after which an unreachable client is removed.
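
A minimal sketch of how such an aging pass could work, offered only to illustrate the request and not as a proposed patch. It assumes a single tunable where 0 disables the feature and any other value is the delay in seconds. Every name here (demo_peer, demo_peer_expiry_secs, demo_peer_expired, demo_reap_dead_peers) is hypothetical and is not an existing LNet symbol; the real peer table, locking and reference counting are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified peer record; NOT the real LNet peer structure. */
struct demo_peer {
	const char *nid;     /* e.g. "10.11.12.13@o2ib" */
	int64_t last_alive;  /* time of last successful contact, in seconds */
	bool deleted;        /* set once the peer has been aged out */
};

/*
 * Hypothetical tunable: 0 disables automatic deletion; any other value is
 * the number of seconds a peer may stay unreachable before it is removed.
 */
unsigned int demo_peer_expiry_secs;

/* Return true if this peer has been unreachable for at least the delay. */
bool demo_peer_expired(const struct demo_peer *p, int64_t now)
{
	if (demo_peer_expiry_secs == 0)
		return false;	/* feature disabled */
	if (p->deleted)
		return false;	/* already removed */
	return now - p->last_alive >= (int64_t)demo_peer_expiry_secs;
}

/* Mark every expired peer as deleted and return how many were removed. */
size_t demo_reap_dead_peers(struct demo_peer *peers, size_t n, int64_t now)
{
	size_t reaped = 0;

	for (size_t i = 0; i < n; i++) {
		if (demo_peer_expired(&peers[i], now)) {
			peers[i].deleted = true;
			reaped++;
		}
	}
	return reaped;
}
```

Using one parameter for both the on/off switch and the delay keeps the default behaviour unchanged (0 means peers are never removed automatically), so lnet_recovery_limit would not need to be touched to quiet messages about decommissioned clients.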

          Activity

            [LU-17519] Remove dead peers automatically
            chunter-wc Chris Hunter made changes -
            Link New: This issue is related to DDN-5559 [ DDN-5559 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to DDN-5287 [ DDN-5287 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to LU-14654 [ LU-14654 ]
            adilger Andreas Dilger made changes -
            Priority Original: Minor [ 4 ] New: Major [ 3 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to DDN-5073 [ DDN-5073 ]
            adilger Andreas Dilger made changes -
            Link New: This issue is related to DDN-4781 [ DDN-4781 ]
            adilger Andreas Dilger made changes -
            Description: added the example lnet_handle_recovery_reply() log messages; the remaining text is unchanged.
            pjones Peter Jones made changes -
            Link New: This issue is related to NVDCSE-162 [ NVDCSE-162 ]
            cbordage Cyril Bordage created issue -

            People

              cbordage Cyril Bordage
              cbordage Cyril Bordage
              Votes: 0
              Watchers: 11
