[LU-11981] lnet_is_health_check() Msg is in inconsistent state, don't perform health checking (0, 2) Created: 20/Feb/19  Updated: 30/Jan/20  Resolved: 20/Dec/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.4

Type: Bug
Priority: Minor
Reporter: Olaf Faaland
Assignee: Amir Shehata (Inactive)
Resolution: Fixed
Votes: 0
Labels: llnl
Environment:

clients and routers: Lustre 2.12.0_1.chaos
lustre servers: Lustre 2.10.6_2.chaos

Linux version 3.10.0-957.1.3.1chaos.ch6.x86_64
Clients OmniPath <> routers <> Servers mlx5


Attachments: dk.opal190.1550688817.txt.gz
Issue Links:
Related
Severity: 3

 Description   

Over a span of about 20 minutes, while the Lustre servers were being rebooted, the routers reported the following in their console logs:
2019-02-19 10:05:02 [330235.278414] LNetError: 33048:0:(lib-msg.c:811:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (0, 2)
2019-02-19 10:05:02 [330235.294305] LNetError: 33048:0:(lib-msg.c:811:lnet_is_health_check()) Skipped 1646 previous similar messages

(0, 2) corresponds to:
msg->msg_ev.status == 0 (success)
msg->msg_health_status == 2 (LNET_MSG_STATUS_LOCAL_DROPPED)
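
For triaging these console messages in bulk, the two values can be pulled out of the log line and labeled. The following is a minimal sketch, assuming the message format shown above. The helper names `parse_health_pair` and `describe` and the label table are hypothetical, and only the health-status value 2 (LNET_MSG_STATUS_LOCAL_DROPPED) is taken from this ticket; other values are left unlabeled rather than guessed.

```python
import re

# Matches the trailing "(status, health_status)" pair of the LNetError message.
PAIR_RE = re.compile(
    r"Msg is in inconsistent state, don't perform health checking "
    r"\((-?\d+),\s*(-?\d+)\)"
)

# Only the value confirmed in this ticket is labeled; others stay numeric.
HEALTH_STATUS_NAMES = {2: "LNET_MSG_STATUS_LOCAL_DROPPED"}


def parse_health_pair(line):
    """Return (msg_ev.status, msg_health_status) from a log line, or None."""
    match = PAIR_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def describe(pair):
    """Render the pair using the field names from the ticket."""
    status, hstatus = pair
    hname = HEALTH_STATUS_NAMES.get(hstatus, "unlabeled")
    return f"msg_ev.status={status}, msg_health_status={hstatus} ({hname})"
```

Lines that do not carry the pair, such as the "Skipped ... previous similar messages" line, yield None and can be filtered out before calling `describe`.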

See https://github.com/LLNL/lustre/releases for contents of 2.12.0_1.chaos.



 Comments   
Comment by Amir Shehata (Inactive) [ 20/Feb/19 ]

Would you be able to turn on net logging

lctl set_param debug=+"net neterror"

and capture the logs when you reproduce this message? I have made some changes in this area as part of LU-11477, and I want to see whether they resolve this particular problem. I can then port them to 2.12.

Comment by Olaf Faaland [ 20/Feb/19 ]

Thanks Amir. See the attached dk.opal190.1550688817.txt.gz. Look towards the end of the file; the beginning of the log predates my turning on net logging.

Comment by Amir Shehata (Inactive) [ 06/Mar/19 ]

Sorry for the delay. It looks like there is a path in the code where the message is dropped and the health status is updated, but the message error status is not. That leads to the inconsistent-state message you see. I'll update that path to correctly set the error status in the message.
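
The inconsistency described here can be expressed as a predicate over the two fields. This is a sketch inferred from the two pairs reported in this ticket, (0, 2) and (-125, 0), not a copy of the code in lib-msg.c. It assumes the check fires when exactly one of the two fields indicates a failure, and that a health status of 0 means healthy. The function name is hypothetical.

```python
def is_inconsistent(status, health_status):
    """True when exactly one of the two fields reports a failure.

    status        -- msg->msg_ev.status (0 is success, per the ticket)
    health_status -- msg->msg_health_status (0 assumed to mean healthy)
    """
    event_failed = status != 0
    health_failed = health_status != 0
    # Consistent messages have both fields agreeing: both failed or both fine.
    return event_failed != health_failed
```

Under this assumption, the dropped-message path produces (0, 2): the health status was set, but the event status still reads success.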

Comment by Olaf Faaland [ 25/Nov/19 ]

<poke> Thanks

Comment by Amir Shehata (Inactive) [ 04/Dec/19 ]

There is a series of patches, backported to b2_12, that resolve some issues, including the one reported in this ticket.

The particular patch which resolves this issue is:

LU-12199 lnet: Ensure md is detached when msg is not committed

However, I would suggest moving to 2.12.3 which includes this fix and others.

Comment by Olaf Faaland [ 11/Dec/19 ]

Hmm.  We're seeing

2019-12-11 10:01:08 [  972.859958] LNetError: 28880:0:(lib-msg.c:820:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0)

with Lustre 2.12.3.

I see now that the values at the end of the message are (-125, 0), which differ from the ones originally reported. Also, the system where I see this uses Mellanox IB. New ticket for that?

Comment by Amir Shehata (Inactive) [ 12/Dec/19 ]

No need to create a new ticket. I would say this scenario is expected, since -125 is ECANCELED, and for that case we do not bother to adjust the health. The problem is that this message is logged at error level. We had already changed it to debug level, but that change was part of a bigger patch:

LU-11477 lnet: handle health for incoming messages

I'll create a patch that only changes this log level to debug and push it to b2_12.
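
When scanning logs from releases that still print this message at error level, a filter along these lines may help separate the expected cancellation case from pairs that merit a closer look. It is a sketch under assumptions: 125 is the Linux value of ECANCELED, hardcoded so the result does not depend on the platform running the script, and the function name is hypothetical.

```python
# ECANCELED as numbered on Linux; hardcoded because errno values vary by platform.
LINUX_ECANCELED = 125


def is_expected_cancel(status, health_status):
    """True for the (-ECANCELED, 0) pair described as expected in this ticket."""
    return status == -LINUX_ECANCELED and health_status == 0
```

With this, (-125, 0) is treated as expected noise, while the originally reported (0, 2) is not.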

Comment by Gerrit Updater [ 12/Dec/19 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37001
Subject: LU-11981 lnet: clean up error message
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 96a278829b7a375e05f23b538f8db876a68caa71

Comment by Gerrit Updater [ 20/Dec/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37001/
Subject: LU-11981 lnet: clean up error message
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: f549927ea633b910a8c788fa970af742b3bf10c1

Comment by Peter Jones [ 20/Dec/19 ]

Landed for 2.12.4. Not needed on master.
