Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11981

lnet_is_health_check() Msg is in inconsistent state, don't perform health checking (0, 2)

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.4
    • Lustre 2.12.0
    • clients and routers: Lustre 2.12.0_1.chaos
      lustre servers: Lustre 2.10.6_2.chaos

      Linux version 3.10.0-957.1.3.1chaos.ch6.x86_64
      Clients OmniPath <-> routers <-> Servers mlx5
    • 3
    • 9223372036854775807

    Description

      Over the span of about 20 minutes, routers reported the following in their console logs:
      2019-02-19 10:05:02 [330235.278414] LNetError: 33048:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
      2019-02-19 10:05:02 [330235.294305] LNetError: 33048:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1646 previous similar messages

      While the lustre servers were being rebooted.
      (0, 2) corresponds to:
      msg->msg_ev.status == 0 (success)
      msg->msg_health_status == 2 (LNET_MSG_STATUS_LOCAL_DROPPED)

      See https://github.com/LLNL/lustre/releases for contents of 2.12.0_1.chaos.

      Attachments

        Activity

          [LU-11981] lnet_is_health_check() Msg is in inconsistent state, don't perform health checking (0, 2)
          pjones Peter Jones added a comment -

          Landed for 2.12.4. Not needed on master

          pjones Peter Jones added a comment - Landed for 2.12.4. Not needed on master

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37001/
          Subject: LU-11981 lnet: clean up error message
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set:
          Commit: f549927ea633b910a8c788fa970af742b3bf10c1

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37001/ Subject: LU-11981 lnet: clean up error message Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: f549927ea633b910a8c788fa970af742b3bf10c1

          Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37001
          Subject: LU-11981 lnet: clean up error message
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: 96a278829b7a375e05f23b538f8db876a68caa71

          gerrit Gerrit Updater added a comment - Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37001 Subject: LU-11981 lnet: clean up error message Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 96a278829b7a375e05f23b538f8db876a68caa71

          No need to create a new ticket. I would say this scenario is expected, since -125 is ECANCELED. For that we do not bother to adjust the health. The problem is that this message is at error level. We had already changed it to debug level, but that change was part of a bigger patch:

          LU-11477 lnet: handle health for incoming messages

          I'll create a patch just to change this log level to debug and push it on b2_12

          ashehata Amir Shehata (Inactive) added a comment - No need to create a new ticket. I would say this scenario is expected, since -125 is ECANCELED. For that we do not bother to adjust the health. The problem is that this message is at error level. We had already changed it to debug level, but that change was part of a bigger patch: LU-11477 lnet: handle health for incoming messages I'll create a patch just to change this log level to debug and push it on b2_12
          ofaaland Olaf Faaland added a comment - - edited

          Hmm.  We're seeing

          2019-12-11 10:01:08 [  972.859958] LNetError: 28880:0:(lib-msg.c:820:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)

          with Lustre 2.12.3.

          I see now that the values at the end of the message are (-125,0) which are different than the originally reported ones. And the system where I see this is Mellanox IB. New ticket for that?

          ofaaland Olaf Faaland added a comment - - edited Hmm.  We're seeing 2019-12-11 10:01:08 [ 972.859958] LNetError: 28880:0:(lib-msg.c:820:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) with Lustre 2.12.3. I see now that the values at the end of the message are (-125,0) which are different than the originally reported ones. And the system where I see this is Mellanox IB. New ticket for that?

          there are a series of patches which were back ported to b2_12 which resolve some issues including the one reported in this ticket.

          The particular patch which resolves this issue is:

          LU-12199 lnet: Ensure md is detached when msg is not committed

          However, I would suggest moving to 2.12.3 which includes this fix and others.

          ashehata Amir Shehata (Inactive) added a comment - there are a series of patches which were back ported to b2_12 which resolve some issues including the one reported in this ticket. The particular patch which resolves this issue is: LU-12199 lnet: Ensure md is detached when msg is not committed However, I would suggest moving to 2.12.3 which includes this fix and others.

          <poke> Thanks

          ofaaland Olaf Faaland added a comment - <poke> Thanks

          Sorry for the delay. It looks like there is a path in the code where the message is dropped but the message error status is not updated. However, the health status is updated. Leading to the inconsistent message you see. I'll update that path to correctly set the error status in the message

          ashehata Amir Shehata (Inactive) added a comment - Sorry for the delay. It looks like there is a path in the code where the message is dropped but the message error status is not updated. However, the health status is updated. Leading to the inconsistent message you see. I'll update that path to correctly set the error status in the message
          ofaaland Olaf Faaland added a comment -

          Thanks Amir. See dk.opal190.1550688817.txt.gz attached. Look towards the end of the file, the beginning starts before I turned on net logging.

          ofaaland Olaf Faaland added a comment - Thanks Amir. See dk.opal190.1550688817.txt.gz attached. Look towards the end of the file, the beginning starts before I turned on net logging.

          would you be able to turn on net logging

          lctl set_param debug=+"net neterror"

          and capture the logs when you reproduce this message. I have made some changes in this area as part of LU-11477. I want to see if my changes there resolve this particular problem. I can then port it to 2.12.

          ashehata Amir Shehata (Inactive) added a comment - would you be able to turn on net logging lctl set_param debug=+"net neterror" and capture the logs when you reproduce this message. I have made some changes in this area as part of LU-11477 . I want to see if my changes there resolve this particular problem. I can then port it to 2.12.

          People

            ashehata Amir Shehata (Inactive)
            ofaaland Olaf Faaland
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: