[LU-17496] LNet teardown could retry cleanup before asserting Created: 01/Feb/24 Updated: 01/Feb/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Shaun Tancheff | Assignee: | Shaun Tancheff |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
LNet teardown could retry cleanup before asserting.

We see this assertion fire in sanity-lnet test 220.

Excerpted from https://testing.whamcloud.com/test_logs/087d6d3d-deca-4831-9337-30fae7338f25/show_text

[17841.535068] Lustre: DEBUG MARKER: == sanity-lnet test 220: Add routes w/default options - check aliveness ========================================================== 23:19:06 (1706570346)
[17841.835785] Lustre: DEBUG MARKER: /usr/sbin/lustre_rmmod
[17842.279424] Key type lgssc unregistered
[17842.319629] LNetError: 6049:0:(lib-md.c:281:lnet_assert_handler_unused()) ASSERTION( md->md_handler != handler ) failed:
[17842.320935] LNetError: 6049:0:(lib-md.c:281:lnet_assert_handler_unused()) LBUG
[17842.321757] Pid: 6049, comm: lnet_discovery 5.14.0-284.30.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 25 09:13:12 EDT 2023
[17842.322978] Call Trace TBD:
[17842.323365] Kernel panic - not syncing: LBUG
[17842.323894] CPU: 0 PID: 6049 Comm: lnet_discovery Kdump: loaded Tainted: G OE -------- --- 5.14.0-284.30.1.el9_2.x86_64 #1
[17842.325176] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[17842.325838] Call Trace:
[17842.326178] <TASK>
[17842.326492] dump_stack_lvl+0x34/0x48
[17842.326997] panic+0xf4/0x2c6
[17842.327399] ? lnet_discovery_event_reply+0xbc0/0xbc0 [lnet]
[17842.328223] lbug_with_loc.cold+0x18/0x18 [libcfs]
[17842.328869] lnet_assert_handler_unused+0x9c/0xd0 [lnet]
[17842.329506] lnet_peer_discovery+0x997/0xaf0 [lnet]
[17842.330111] ? cpuacct_percpu_seq_show+0x10/0x10
[17842.330680] ? lnet_peer_data_present+0x580/0x580 [lnet]
[17842.331323] kthread+0xd9/0x100
[17842.331734] ? kthread_complete_and_exit+0x20/0x20
[17842.332298] ret_from_fork+0x22/0x30
[17842.332769] </TASK>

The trace shows the LBUG being raised from the lnet_discovery thread during module unload (lustre_rmmod), when lnet_assert_handler_unused() finds an MD that still references the handler. We could retry the cleanup pass a couple of times before finally asserting. |
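The retry-before-assert idea can be sketched in plain userspace C. This is only an illustration of the pattern and not the change uploaded to Gerrit: `struct fake_md`, `handler_unused()`, `cleanup_pass()`, `cleanup_with_retry()`, and the `budget` parameter are all hypothetical names invented here, and the real LNet structures, locking, and any waiting between passes are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an LNet MD; only the handler field is modeled. */
typedef void (*handler_fn)(void *event);

struct fake_md {
	handler_fn md_handler;
};

/* Hypothetical handler used only to populate the example table. */
static void discovery_handler(void *event)
{
	(void)event;
}

/* True when no descriptor in the table still references the handler. */
static bool handler_unused(const struct fake_md *mds, size_t n,
			   handler_fn handler)
{
	for (size_t i = 0; i < n; i++)
		if (mds[i].md_handler == handler)
			return false;
	return true;
}

/*
 * Hypothetical cleanup pass. It detaches at most 'budget' descriptors,
 * which models a pass that cannot finish all the work in one go.
 */
static void cleanup_pass(struct fake_md *mds, size_t n, handler_fn handler,
			 size_t budget)
{
	for (size_t i = 0; i < n && budget > 0; i++) {
		if (mds[i].md_handler == handler) {
			mds[i].md_handler = NULL;
			budget--;
		}
	}
}

/*
 * Run the cleanup pass up to max_attempts times. Returns the number of
 * passes needed, or -1 if the handler is still referenced afterwards.
 * Kernel code would presumably wait between attempts; that is omitted.
 */
static int cleanup_with_retry(struct fake_md *mds, size_t n,
			      handler_fn handler, size_t budget,
			      int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		cleanup_pass(mds, n, handler, budget);
		if (handler_unused(mds, n, handler))
			return attempt;
	}
	return -1;
}
```

A caller would assert only after the retries are exhausted, for example `assert(cleanup_with_retry(mds, n, discovery_handler, 1, 3) != -1);`, so that a transiently busy MD no longer triggers the failure on the first check.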
| Comments |
| Comment by Gerrit Updater [ 01/Feb/24 ] |
|
"Shaun Tancheff <shaun.tancheff@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53876 |