[LU-17496] LNet teardown could retry cleanup before asserting Created: 01/Feb/24  Updated: 01/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Shaun Tancheff Assignee: Shaun Tancheff
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LNet teardown could retry cleanup before asserting.

We see this assertion fire in sanity-lnet test 220.

Excerpted from https://testing.whamcloud.com/test_logs/087d6d3d-deca-4831-9337-30fae7338f25/show_text

[17841.535068] Lustre: DEBUG MARKER: == sanity-lnet test 220: Add routes w/default options - check aliveness ========================================================== 23:19:06 (1706570346)
[17841.835785] Lustre: DEBUG MARKER: /usr/sbin/lustre_rmmod
[17842.279424] Key type lgssc unregistered
[17842.319629] LNetError: 6049:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed: 
[17842.320935] LNetError: 6049:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG
[17842.321757] Pid: 6049, comm: lnet_discovery 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 25 09:13:12 EDT 2023
[17842.322978] Call Trace TBD:
[17842.323365] Kernel panic - not syncing: LBUG
[17842.323894] CPU: 0 PID: 6049 Comm: lnet_discovery Kdump: loaded Tainted: G           OE    --------  ---  5.14.0-284.30.1.el9_2.x86_64 #1
[17842.325176] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[17842.325838] Call Trace:
[17842.326178]  <TASK>
[17842.326492]  dump_stack_lvl+0x34/0x48
[17842.326997]  panic+0xf4/0x2c6
[17842.327399]  ? lnet_discovery_event_reply+0xbc0/0xbc0 [lnet]
[17842.328223]  lbug_with_loc.cold+0x18/0x18 [libcfs]
[17842.328869]  lnet_assert_handler_unused+0x9c/0xd0 [lnet]
[17842.329506]  lnet_peer_discovery+0x997/0xaf0 [lnet]
[17842.330111]  ? cpuacct_percpu_seq_show+0x10/0x10
[17842.330680]  ? lnet_peer_data_present+0x580/0x580 [lnet]
[17842.331323]  kthread+0xd9/0x100
[17842.331734]  ? kthread_complete_and_exit+0x20/0x20
[17842.332298]  ret_from_fork+0x22/0x30
[17842.332769]  </TASK>

We could retry the cleanup pass a few times before finally asserting.
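
A minimal userspace sketch of the retry-before-assert idea follows. It is not the patch and does not use the real LNet code: every type and function in it (the `fake_` names) is a hypothetical stand-in, and the rule that each cleanup pass releases at most one lingering MD is an assumption made only to model references that drain slowly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real LNet types or functions. */
typedef void (*fake_handler_t)(void);

struct fake_md {
	fake_handler_t md_handler;	/* NULL once the MD has been released */
};

#define FAKE_MAX_MDS 8
static struct fake_md fake_mds[FAKE_MAX_MDS];

/* Placeholder handler, used only as an identity to compare against. */
static void fake_discovery_handler(void) { }

/* Forget all MDs (setup helper). */
static void fake_reset(void)
{
	for (size_t i = 0; i < FAKE_MAX_MDS; i++)
		fake_mds[i].md_handler = NULL;
}

/* Attach a handler to the first free slot; false if the table is full. */
static bool fake_attach(fake_handler_t handler)
{
	for (size_t i = 0; i < FAKE_MAX_MDS; i++) {
		if (fake_mds[i].md_handler == NULL) {
			fake_mds[i].md_handler = handler;
			return true;
		}
	}
	return false;
}

/* Count the MDs that still reference the handler. */
static size_t fake_handler_users(fake_handler_t handler)
{
	size_t users = 0;

	for (size_t i = 0; i < FAKE_MAX_MDS; i++)
		if (fake_mds[i].md_handler == handler)
			users++;
	return users;
}

/*
 * One cleanup pass. Assumption for illustration: a pass releases at
 * most one lingering MD, so several passes may be needed.
 */
static void fake_cleanup_pass(fake_handler_t handler)
{
	for (size_t i = 0; i < FAKE_MAX_MDS; i++) {
		if (fake_mds[i].md_handler == handler) {
			fake_mds[i].md_handler = NULL;
			return;
		}
	}
}

/*
 * Run up to max_retries cleanup passes, stopping early once the
 * handler is unused. Returns true if no MD references the handler
 * afterwards; only a false result would justify asserting.
 */
static bool fake_cleanup_with_retry(fake_handler_t handler,
				    unsigned int max_retries)
{
	for (unsigned int attempt = 0; attempt < max_retries; attempt++) {
		if (fake_handler_users(handler) == 0)
			break;
		fake_cleanup_pass(handler);
	}
	return fake_handler_users(handler) == 0;
}
```

In this sketch the caller would assert only when `fake_cleanup_with_retry()` returns false, so a handler that is still referenced on the first check gets a bounded number of additional cleanup passes instead of an immediate LBUG.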



 Comments   
Comment by Gerrit Updater [ 01/Feb/24 ]

"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53876
Subject: LU-17496 lnet: retry cleanup during shutdown
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 85806692eed4ee44385883f89caa11e76730e1b6

Generated at Sat Feb 10 03:35:54 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.