Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11981

lnet_is_health_check() Msg is in inconsistent state, don't perform health checking (0, 2)

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.4
    • Lustre 2.12.0
    • clients and routers: Lustre 2.12.0_1.chaos
      lustre servers: Lustre 2.10.6_2.chaos

      Linux version 3.10.0-957.1.3.1chaos.ch6.x86_64
      Clients OmniPath <-> routers <-> Servers mlx5
    • 3
    • 9223372036854775807

    Description

      Over the span of about 20 minutes, routers reported the following in their console logs:
      2019-02-19 10:05:02 [330235.278414] LNetError: 33048:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
      2019-02-19 10:05:02 [330235.294305] LNetError: 33048:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1646 previous similar messages

      While the lustre servers were being rebooted.
      (0, 2) corresponds to:
      msg->msg_ev.status == 0 (success)
      msg->msg_health_status == 2 (LNET_MSG_STATUS_LOCAL_DROPPED)

      See https://github.com/LLNL/lustre/releases for contents of 2.12.0_1.chaos.

      Attachments

        Activity

          [LU-11981] lnet_is_health_check() Msg is in inconsistent state, don't perform health checking (0, 2)

          No need to create a new ticket. I would say this scenario is expected, since -125 is ECANCELED. For that we do not bother to adjust the health. The problem is that this message is at error level. We had already changed it to debug level, but that change was part of a bigger patch:

          LU-11477 lnet: handle health for incoming messages

          I'll create a patch just to change this log level to debug and push it on b2_12

          ashehata Amir Shehata (Inactive) added a comment - No need to create a new ticket. I would say this scenario is expected, since -125 is ECANCELED. For that we do not bother to adjust the health. The problem is that this message is at error level. We had already changed it to debug level, but that change was part of a bigger patch: LU-11477 lnet: handle health for incoming messages I'll create a patch just to change this log level to debug and push it on b2_12
          ofaaland Olaf Faaland added a comment - - edited

          Hmm.  We're seeing

          2019-12-11 10:01:08 [  972.859958] LNetError: 28880:0:(lib-msg.c:820:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)

          with Lustre 2.12.3.

          I see now that the values at the end of the message are (-125,0) which are different than the originally reported ones. And the system where I see this is Mellanox IB. New ticket for that?

          ofaaland Olaf Faaland added a comment - - edited Hmm.  We're seeing 2019-12-11 10:01:08 [ 972.859958] LNetError: 28880:0:(lib-msg.c:820:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) with Lustre 2.12.3. I see now that the values at the end of the message are (-125,0) which are different than the originally reported ones. And the system where I see this is Mellanox IB. New ticket for that?

          there are a series of patches which were back ported to b2_12 which resolve some issues including the one reported in this ticket.

          The particular patch which resolves this issue is:

          LU-12199 lnet: Ensure md is detached when msg is not committed

          However, I would suggest moving to 2.12.3 which includes this fix and others.

          ashehata Amir Shehata (Inactive) added a comment - there are a series of patches which were back ported to b2_12 which resolve some issues including the one reported in this ticket. The particular patch which resolves this issue is: LU-12199 lnet: Ensure md is detached when msg is not committed However, I would suggest moving to 2.12.3 which includes this fix and others.

          <poke> Thanks

          ofaaland Olaf Faaland added a comment - <poke> Thanks

          Sorry for the delay. It looks like there is a path in the code where the message is dropped but the message error status is not updated. However, the health status is updated. Leading to the inconsistent message you see. I'll update that path to correctly set the error status in the message

          ashehata Amir Shehata (Inactive) added a comment - Sorry for the delay. It looks like there is a path in the code where the message is dropped but the message error status is not updated. However, the health status is updated. Leading to the inconsistent message you see. I'll update that path to correctly set the error status in the message
          ofaaland Olaf Faaland added a comment -

          Thanks Amir. See dk.opal190.1550688817.txt.gz attached. Look towards the end of the file, the beginning starts before I turned on net logging.

          ofaaland Olaf Faaland added a comment - Thanks Amir. See dk.opal190.1550688817.txt.gz attached. Look towards the end of the file, the beginning starts before I turned on net logging.

          would you be able to turn on net logging

          lctl set_param debug=+"net neterror"

          and capture the logs when you reproduce this message. I have made some changes in this area as part of LU-11477. I want to see if my changes there resolve this particular problem. I can then port it to 2.12.

          ashehata Amir Shehata (Inactive) added a comment - would you be able to turn on net logging lctl set_param debug=+"net neterror" and capture the logs when you reproduce this message. I have made some changes in this area as part of LU-11477 . I want to see if my changes there resolve this particular problem. I can then port it to 2.12.

          People

            ashehata Amir Shehata (Inactive)
            ofaaland Olaf Faaland
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: