[LU-12628] LNetError: 44030:0:(lib-msg.c:735:lnet_health_check()) ASSERTION( msg->msg_tx_committed ) failed
Created: 03/Aug/19  Updated: 16/Aug/19  Resolved: 16/Aug/19

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.3
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Chris Horn
Assignee: Chris Horn
Resolution: Duplicate
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

It's possible for a message with msg_rx_committed set to reach the resend block in lnet_health_check() and trip this assert, which expects msg_tx_committed. The assert should be changed to an if-statement, and we should simply return -1 to finalize the message.
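
The proposed change can be sketched roughly as follows. This is a minimal standalone illustration, not the actual Lustre code: struct sketch_msg and sketch_resend_check() are hypothetical stand-ins, and only the two flag names and the -1 return value are taken from this ticket. The meaning of the 0 return value is an assumption made for the sketch.

```c
/*
 * Hypothetical stand-in for the LNet message structure. Only the two
 * flags named in this ticket are modeled; the real struct has many more.
 */
struct sketch_msg {
	unsigned int msg_tx_committed:1;
	unsigned int msg_rx_committed:1;
};

/*
 * Hypothetical guard at the top of the resend block.
 *
 * Before: the code asserted msg->msg_tx_committed, so a message that was
 * only rx-committed caused an LBUG and a kernel panic.
 * After: the condition is checked with an if-statement, and -1 is returned
 * so the caller finalizes the message instead of resending it.
 *
 * Returning 0 to mean "continue with the resend" is an assumption of this
 * sketch.
 */
static int sketch_resend_check(const struct sketch_msg *msg)
{
	if (!msg->msg_tx_committed)
		return -1;	/* not a committed send: finalize, do not resend */

	return 0;		/* committed send: resend may proceed */
}
```

With this shape, a message that has only msg_rx_committed set takes the early return rather than crashing the node.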

[846578.191198] LustreError: 65272:0:(brw_test.c:389:brw_server_rpc_done()) Skipped 12 previous similar messages
[846578.191789] LNetError: 44030:0:(lib-msg.c:735:lnet_health_check()) ASSERTION( msg->msg_tx_committed ) failed: 
[846578.191793] LNetError: 44030:0:(lib-msg.c:735:lnet_health_check()) LBUG
[846578.191795] Pid: 44030, comm: kiblnd_sd_01_00 3.10.0-693.21.1.x3.2.152.x86_64 #1 SMP Mon Feb 25 06:44:43 PST 2019
[846578.191795] Call Trace:
[846578.191824]  [<ffffffff8103a212>] save_stack_trace_tsk+0x22/0x40
[846578.191856]  [<ffffffffc0a3f7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[846578.191868]  [<ffffffffc0a3f87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[846578.191915]  [<ffffffffc0ad2c7e>] lnet_health_check+0x9ae/0x9e0 [lnet]
[846578.191930]  [<ffffffffc0ad2dc5>] lnet_finalize+0x115/0x9c0 [lnet]
[846578.191949]  [<ffffffffc0b8278d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[846578.191958]  [<ffffffffc0b8dc5d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[846578.191965]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[846578.191972]  [<ffffffff816c455d>] ret_from_fork+0x5d/0xb0
[846578.192031]  [<ffffffffffffffff>] 0xffffffffffffffff
[846578.192032] Kernel panic - not syncing: LBUG
[846578.192037] CPU: 13 PID: 44030 Comm: kiblnd_sd_01_00 Tainted: P           OE  ------------   3.10.0-693.21.1.x3.2.152.x86_64 #1
[846578.192038] Hardware name: Seagate SATI-TL/Type2 - Board Product Sati2, BIOS SATI-TL.v0046.0002 01/13/2015
[846578.192040] Call Trace:
[846578.192053]  [<ffffffff816b17c8>] dump_stack+0x19/0x1b
[846578.192057]  [<ffffffff816ab634>] panic+0xe8/0x21f
[846578.192072]  [<ffffffffc0a3f8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[846578.192087]  [<ffffffffc0ad2c7e>] lnet_health_check+0x9ae/0x9e0 [lnet]
[846578.192094]  [<ffffffff810eced2>] ? ktime_get_ts64+0x52/0xf0
[846578.192110]  [<ffffffffc0ad2dc5>] lnet_finalize+0x115/0x9c0 [lnet]
[846578.192119]  [<ffffffffc0b78b52>] ? kiblnd_pool_free_node+0x82/0x170 [ko2iblnd]
[846578.192126]  [<ffffffffc0b8278d>] kiblnd_tx_done+0x10d/0x3e0 [ko2iblnd]
[846578.192135]  [<ffffffffc0b8dc5d>] kiblnd_scheduler+0x89d/0x1180 [ko2iblnd]
[846578.192142]  [<ffffffff810cb0c5>] ? sched_clock_cpu+0x85/0xc0
[846578.192146]  [<ffffffff8102954d>] ? __switch_to+0xcd/0x500
[846578.192149]  [<ffffffff810c7c80>] ? wake_up_state+0x20/0x20
[846578.192156]  [<ffffffffc0b8d3c0>] ? kiblnd_cq_event+0x90/0x90 [ko2iblnd]
[846578.192159]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[846578.192161]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[846578.192164]  [<ffffffff816c455d>] ret_from_fork+0x5d/0xb0
[846578.192167]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40


 Comments   
Comment by Gerrit Updater [ 03/Aug/19 ]

Chris Horn (hornc@cray.com) uploaded a new patch: https://review.whamcloud.com/35686
Subject: LU-12628 lnet: avoid resend for msg_rx_committed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 48e2757533e96e7b53096e1f3526578185600338

Comment by Chris Horn [ 16/Aug/19 ]

The fix for this issue has been rolled into LU-12402. Closing as a duplicate of that ticket.
