A lot of LNetError: lnet_peer_ni_add_to_recoveryq_locked() messages

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.12.3
    • 2.12.3 RC1 (vanilla) on servers, CentOS 7.6, patched kernel; 2.12.0 + patches on clients
    • 4

    Description

      After upgrading our servers on Fir (Sherlock's /scratch) to Lustre 2.12.3 RC1, we are noticing a lot of these messages on all Lustre servers:

      LNetError: 49537:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.0.10.201@o2ib7 added to recovery queue. Health = 900
      

      The NIDs reported are our Lustre routers, which are still running 2.12.0 + patches (they are on the client clusters).

      Attaching logs from all servers as lnet_recoveryq.log

      This doesn't seem to have an impact on production and so far 2.12.3 RC1 has been just great for us (and we run it without additional patches now!). Thanks!

      Stéphane
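To see which peers account for most of these messages, the attached log can be tallied per NID. The following is a minimal sketch, assuming every relevant line follows the format quoted in the description ("lpni <NID> added to recovery queue"). The function name count_recoveryq_by_nid is hypothetical and not part of Lustre.

```shell
# Count "added to recovery queue" messages per peer NID, most frequent first.
# Assumes the log format shown in the description; the function name is
# illustrative only.
count_recoveryq_by_nid() {
    grep 'added to recovery queue' "$1" \
        | sed -n 's/.*lpni \([^ ]*\) added to recovery queue.*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

Running `count_recoveryq_by_nid lnet_recoveryq.log` prints one line per NID with its message count, which makes it easy to confirm whether only the router NIDs are involved.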

          Activity

            [LU-12886] A lot of LNetError: lnet_peer_ni_add_to_recoveryq_locked() messages
            ofaaland Olaf Faaland added a comment -

            Agreed

            pjones Peter Jones added a comment -

            The https://review.whamcloud.com/#/c/37718/ fix landed to b2_12 some weeks back so I think that it should be ok to close this ticket. Any objections?

            ofaaland Olaf Faaland added a comment - - edited

            Hi Stephane,

            They've pushed a backport for b2_12 to gerrit, and it's been reviewed and passed tests, just needs to land:

            https://review.whamcloud.com/#/c/37718/


            sthiell Stephane Thiell added a comment -

            Hello, we would also appreciate a backport of this patch (that just landed into master) to b2_12 as it is very verbose. Thanks!!

            ashehata Amir Shehata (Inactive) added a comment -

            Hi Olaf,

            It doesn't look like this patch made it into 2.12.4. It still hasn't landed and needs an extra review:

            https://review.whamcloud.com/#/c/37002/

            The "inconsistent" message was reduced here:

            f549927ea633b910a8c788fa970af742b3bf10c1 LU-11981 lnet: clean up error message

            thanks

            amir

            ofaaland Olaf Faaland added a comment -

            Hi Amir,

            What 2.12.4 patch(es) reduce the severity of the "added to recovery queue" messages?

            thanks


            ofaaland Olaf Faaland added a comment -

            It looks to me like LU-11981 patch addresses only:

            "Msg is in inconsistent state, don't perform health checking"

            not

            "added to recovery queue"

            pjones Peter Jones added a comment -

            LU-11981 I assume. It will be in 2.12.4


            ofaaland Olaf Faaland added a comment -

            We've also committed a patch which reduces the severity of these messages so they will not be displayed on the console.

            Hi Amir,
            Which patch?
            Thanks

            ashehata Amir Shehata (Inactive) added a comment -

            Hi Luis,

            You can turn off health on your setup

            lnetctl set health_sensitivity 0
            lnetctl set transaction_timeout 50 # or some value you'd like
            lnetctl set retry_count 0

            We've also committed a patch which reduces the severity of these messages so they will not be displayed on the console.
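Before applying the workaround it may help to record the current values, and to keep in mind that settings changed with `lnetctl set` are runtime-only. The sketch below assumes the installed lnetctl supports `global show`, and that the lnet module exposes parameters named lnet_health_sensitivity, lnet_transaction_timeout and lnet_retry_count. Both are assumptions to confirm on the target system (for example with `modinfo lnet`), and the modprobe file name is only illustrative.

```shell
# Show the current global LNet settings; on versions with LNet health this
# is expected to include health_sensitivity, transaction_timeout and
# retry_count.
lnetctl global show

# Optional, assumption-based: persist the workaround across module reloads.
# Parameter names should be confirmed with `modinfo lnet` before use.
# Example line for /etc/modprobe.d/lnet.conf (illustrative file name):
#   options lnet lnet_health_sensitivity=0 lnet_transaction_timeout=50 lnet_retry_count=0
```

Comparing the `global show` output before and after the three `lnetctl set` commands is a quick way to confirm that the change took effect.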

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: sthiell Stephane Thiell