[LU-11981] lnet_is_health_check() Msg is in inconsistent state, don't perform health checking (0, 2)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.4
    • Affects Version/s: Lustre 2.12.0
    • Environment: clients and routers: Lustre 2.12.0_1.chaos
      lustre servers: Lustre 2.10.6_2.chaos

      Linux version 3.10.0-957.1.3.1chaos.ch6.x86_64
      Clients OmniPath <-> routers <-> Servers mlx5
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      While the Lustre servers were being rebooted, routers reported the following in their console logs over a span of about 20 minutes:
      2019-02-19 10:05:02 [330235.278414] LNetError: 33048:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
      2019-02-19 10:05:02 [330235.294305] LNetError: 33048:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1646 previous similar messages

      (0, 2) corresponds to:
      msg->msg_ev.status == 0 (success)
      msg->msg_health_status == 2 (LNET_MSG_STATUS_LOCAL_DROPPED)
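
The decoding above can be illustrated with a small standalone sketch. This is not the Lustre source: the names are hypothetical, only the value 2 (LNET_MSG_STATUS_LOCAL_DROPPED) is taken from this ticket, and the sketch assumes that 0 means "OK" for the health status and that the check flags any pair in which exactly one of the two fields reports success.

```c
#include <stdbool.h>

/* Sketch-only names. The value 2 comes from the ticket
 * (LNET_MSG_STATUS_LOCAL_DROPPED); treating 0 as "OK" is an assumption. */
enum sketch_health_status {
	SKETCH_HEALTH_OK = 0,
	SKETCH_HEALTH_LOCAL_DROPPED = 2,
};

/* Hypothetical check: the pair is inconsistent when exactly one of the
 * two fields reports success. */
static bool sketch_is_inconsistent(int ev_status, int health_status)
{
	bool ev_ok = (ev_status == 0);
	bool health_ok = (health_status == SKETCH_HEALTH_OK);

	return ev_ok != health_ok;
}
```

Under these assumptions, `sketch_is_inconsistent(0, SKETCH_HEALTH_LOCAL_DROPPED)` returns true, which corresponds to the logged pair (0, 2): the event reports success while the health status reports a local drop.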

      See https://github.com/LLNL/lustre/releases for contents of 2.12.0_1.chaos.

        Activity

          ofaaland Olaf Faaland made changes -
          Labels Original: llnl topllnl New: llnl
          pjones Peter Jones made changes -
          Link Original: This issue is related to JFC-27 [ JFC-27 ]
          pjones Peter Jones made changes -
          Link New: This issue is related to JFC-20 [ JFC-20 ]
          pjones Peter Jones made changes -
          Link Original: This issue is related to JFC-21 [ JFC-21 ]
          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.12.4 [ 14690 ]
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          pjones Peter Jones added a comment -

          Landed for 2.12.4. Not needed on master

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37001/
          Subject: LU-11981 lnet: clean up error message
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set:
          Commit: f549927ea633b910a8c788fa970af742b3bf10c1

          pjones Peter Jones made changes -
          Link New: This issue is related to JFC-27 [ JFC-27 ]

          gerrit Gerrit Updater added a comment -

          Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37001
          Subject: LU-11981 lnet: clean up error message
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: 96a278829b7a375e05f23b538f8db876a68caa71

          ashehata Amir Shehata (Inactive) added a comment -

          No need to create a new ticket. I would say this scenario is expected, since -125 is ECANCELED. For that we do not bother to adjust the health. The problem is that this message is at error level. We had already changed it to debug level, but that change was part of a bigger patch:

          LU-11477 lnet: handle health for incoming messages

          I'll create a patch just to change this log level to debug and push it on b2_12
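
The status value mentioned in this comment can be checked with a short standalone sketch. On Linux, errno 125 is `ECANCELED`, so a status of -125 means the operation was canceled. `sketch_status_is_canceled()` is a hypothetical helper, not an LNet function; it only illustrates how a canceled message could be recognized so that it is treated as expected rather than reported as an error.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: kernel-style statuses are negative errno values,
 * so a canceled operation shows up as -ECANCELED (-125 on Linux). */
static bool sketch_status_is_canceled(int status)
{
	return status == -ECANCELED;
}
```

A caller could use such a predicate to skip health adjustment and log at debug level when it returns true; that policy is the one described in the comment, not something this sketch implements.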

          People

            Assignee: ashehata Amir Shehata (Inactive)
            Reporter: ofaaland Olaf Faaland
            Votes: 0
            Watchers: 5
