Details
Type: Bug
Resolution: Fixed
Priority: Minor
Affects Version/s: Lustre 2.12.0
Environment:
clients and routers: Lustre 2.12.0_1.chaos
lustre servers: Lustre 2.10.6_2.chaos
Linux version 3.10.0-957.1.3.1chaos.ch6.x86_64
Clients OmniPath <-> routers <-> Servers mlx5
Severity: 3
Description
Over the span of about 20 minutes, while the Lustre servers were being rebooted, routers reported the following in their console logs:
2019-02-19 10:05:02 [330235.278414] LNetError: 33048:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
2019-02-19 10:05:02 [330235.294305] LNetError: 33048:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1646 previous similar messages
(0, 2) corresponds to:
msg->msg_ev.status == 0 (success)
msg->msg_health_status == 2 (LNET_MSG_STATUS_LOCAL_DROPPED)
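For anyone triaging similar console output, the decoding above can be sketched as a small Python helper. This is only an illustration: the names `LINE_RE`, `HEALTH_NAMES` and `decode` are hypothetical, and the table contains only the value given in this ticket (2) plus 0, which is assumed here to be `LNET_MSG_STATUS_OK`. Any other value is reported as unknown rather than guessed.

```python
import re

# Matches the LNetError line quoted in this ticket. The two integers at the
# end are (msg->msg_ev.status, msg->msg_health_status).
LINE_RE = re.compile(
    r"lnet_is_health_check\(\)\) Msg is in inconsistent state, "
    r"don't perform health checking \((-?\d+), (\d+)\)"
)

# Deliberately partial table: 2 comes from this ticket, 0 is an assumption.
HEALTH_NAMES = {
    0: "LNET_MSG_STATUS_OK",            # assumption, not stated in the ticket
    2: "LNET_MSG_STATUS_LOCAL_DROPPED",  # stated in the ticket
}


def decode(line):
    """Return (event_status, health_status_name), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    status = int(m.group(1))
    health = int(m.group(2))
    return status, HEALTH_NAMES.get(health, "unknown (%d)" % health)
```

Feeding it the first console line from the description yields the pair described above. The "Skipped ... previous similar messages" line does not match and returns None.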
See https://github.com/LLNL/lustre/releases for contents of 2.12.0_1.chaos.
No need to create a new ticket. I would say this scenario is expected, since -125 is ECANCELED. For that we do not bother to adjust the health. The problem is that this message is at error level. We had already changed it to debug level, but that change was part of a bigger patch:
LU-11477 lnet: handle health for incoming messages
I'll create a patch just to change this log level to debug and push it on b2_12.
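As a quick way to check the errno mapping mentioned in the comment above (-125 being ECANCELED on Linux), kernel-style negative status values can be looked up with Python's standard `errno` module. The helper name `describe_status` is hypothetical, and this is only a sketch: the numeric values are platform-specific, so it should be run on a Linux host when checking Linux kernel codes.

```python
import errno


def describe_status(status):
    """Return a symbolic name for a kernel-style status (0 or a negative errno)."""
    if status == 0:
        return "success"
    code = abs(status)
    # errno.errorcode maps the numeric value to its symbolic name on this platform.
    return errno.errorcode.get(code, "unknown errno %d" % code)


print(describe_status(-errno.ECANCELED))
```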