[LU-11981] lnet_is_health_check() Msg is in inconsistent state, don't perform health checking (0, 2) Created: 20/Feb/19 Updated: 30/Jan/20 Resolved: 20/Dec/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.12.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
clients and routers: Lustre 2.12.0_1.chaos Linux version 3.10.0-957.1.3.1chaos.ch6.x86_64 |
||
| Severity: | 3 | ||||
| Description |
|
Over the span of about 20 minutes, while the Lustre servers were being rebooted, routers reported the message in this ticket's title in their console logs. See https://github.com/LLNL/lustre/releases for contents of 2.12.0_1.chaos. |
| Comments |
| Comment by Amir Shehata (Inactive) [ 20/Feb/19 ] |
|
Would you be able to turn on net logging (lctl set_param debug=+"net neterror") and capture the logs when you reproduce this message? I have made some changes in this area as part of |
| Comment by Olaf Faaland [ 20/Feb/19 ] |
|
Thanks Amir. See dk.opal190.1550688817.txt.gz attached. Look towards the end of the file; the beginning was captured before I turned on net logging. |
| Comment by Amir Shehata (Inactive) [ 06/Mar/19 ] |
|
Sorry for the delay. It looks like there is a path in the code where the message is dropped but the message error status is not updated. The health status is updated, however, which leads to the inconsistent-state message you see. I'll update that path to correctly set the error status in the message. |
| Comment by Olaf Faaland [ 25/Nov/19 ] |
|
<poke> Thanks |
| Comment by Amir Shehata (Inactive) [ 04/Dec/19 ] |
|
There is a series of patches that were backported to b2_12 and resolve some issues, including the one reported in this ticket. The particular patch which resolves this issue is:
However, I would suggest moving to 2.12.3 which includes this fix and others. |
| Comment by Olaf Faaland [ 11/Dec/19 ] |
|
Hmm. We're seeing 2019-12-11 10:01:08 [ 972.859958] LNetError: 28880:0:(lib-msg.c:820:lnet_is_health_check()) Msg is in inconsistent state, don't perform health checking (-125, 0) with Lustre 2.12.3. I see now that the values at the end of the message are (-125,0) which are different than the originally reported ones. And the system where I see this is Mellanox IB. New ticket for that? |
| Comment by Amir Shehata (Inactive) [ 12/Dec/19 ] |
|
No need to create a new ticket. I would say this scenario is expected, since -125 is ECANCELED. For that we do not bother to adjust the health. The problem is that this message is at error level. We had already changed it to debug level, but that change was part of a bigger patch:
I'll create a patch just to change this log level to debug and push it to b2_12. |
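To check values like the -125 above, a minimal Python sketch can pull the trailing pair out of a console line and look up any negative value as a Linux errno. The helper name decode_pair and the regular expression are hypothetical and not part of Lustre. The sketch assumes Linux errno numbering and does not interpret what each of the two fields means.

```python
import errno
import re

# Hypothetical pattern: matches the "(a, b)" pair at the very end of the line.
PAIR_RE = re.compile(r"\((-?\d+),\s*(-?\d+)\)\s*$")


def decode_pair(line):
    """Return ([a, b], [name_a, name_b]) for the trailing pair, or None.

    A negative value is treated as -errno (kernel convention) and mapped to
    its symbolic name; non-negative values get None because this sketch does
    not try to interpret them.
    """
    match = PAIR_RE.search(line)
    if match is None:
        return None
    values = [int(match.group(1)), int(match.group(2))]
    names = [errno.errorcode.get(-v) if v < 0 else None for v in values]
    return values, names


# Console line quoted in the comment of 11/Dec/19.
line = (
    "LNetError: 28880:0:(lib-msg.c:820:lnet_is_health_check()) "
    "Msg is in inconsistent state, don't perform health checking (-125, 0)"
)
print(decode_pair(line))
```

On a Linux host the first value should resolve to ECANCELED, matching the explanation above. Other platforms number errno values differently, so the lookup may return a different name or None there.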
| Comment by Gerrit Updater [ 12/Dec/19 ] |
|
Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/37001 |
| Comment by Gerrit Updater [ 20/Dec/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37001/ |
| Comment by Peter Jones [ 20/Dec/19 ] |
|
Landed for 2.12.4. Not needed on master |