Details
-
Bug
-
Resolution: Unresolved
-
Medium
-
None
-
Lustre 2.18.0
-
None
-
3
Description
It looks like the LU-18162 series of patches, which converted various components to LU devices, introduced a crash on the cleanup path when device allocation fails for any reason.
This was noticed when a bug introduced an attempt to double-register a sysfs name, for mdc in particular.
Trivial reproduction with this patch:
diff --git a/lustre/ldlm/ldlm_resource.c b/lustre/ldlm/ldlm_resource.c
index ff81d55377..2cac972b25 100644
--- a/lustre/ldlm/ldlm_resource.c
+++ b/lustre/ldlm/ldlm_resource.c
@@ -1023,6 +1023,11 @@ struct ldlm_namespace *ldlm_namespace_new(struct obd_device *obd, char *name,
 	ns->ns_lock_cache_policy = LDLM_LOCK_CACHE_LFRU;
 	ns->ns_lock_cache_ops = &ldlm_lfru_cache_ops;
 
+	if (!strncmp(name, "lustre-MDT0000-mdc-", 19)) {
+		CERROR("injected sysfs registration failure for %s\n", name);
+		GOTO(out_hash, rc = -17);
+	}
+
 	rc = ldlm_namespace_sysfs_register(ns);
 	if (rc) {
 		CERROR("%s: cannot initialize ns sysfs: rc = %d\n", name, rc);
This crashes like this:
[ 3782.402249] BUG: unable to handle page fault for address: 000000000000142a
[ 3782.402254] #PF: supervisor read access in kernel mode
[ 3782.402257] #PF: error_code(0x0000) - not-present page
[ 3782.402272] PGD 0 P4D 0
[ 3782.402284] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3782.402290] CPU: 0 PID: 102409 Comm: llog_process_th Kdump: loaded Tainted: G OE ------- --- 5.14.0rocky96-debug #3
[ 3782.402296] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-8.fc42 06/10/2025
[ 3782.402300] RIP: 0010:mdc_device_free+0xc/0x160 [mdc]
[ 3782.402375] Code: fd 00 00 00 00 04 00 e8 22 49 62 ff 48 c7 c7 60 37 6b c1 e8 b6 1a 62 ff 45 31 e4 eb bf 90 0f 1f 44 00 00 41 54 55 4c 8b 66 28 <66> 41 83 bc 24 2a 14 00 00 00 75 21 48 89 f5 48 85 f6 75 49 31 ff
[ 3782.402393] RSP: 0018:ffffb2cbc9f9fb80 EFLAGS: 00010282
[ 3782.402401] RAX: 00000000ffffffef RBX: ffff94f3c433ca10 RCX: ffff94f3c6250000
[ 3782.402406] RDX: 0000000000000000 RSI: ffff94f455f3a300 RDI: ffffb2cbc9f9fbf8
[ 3782.402409] RBP: 00000000ffffffef R08: ffff94f3cc79e000 R09: 0000000080080006
[ 3782.402412] R10: 00000000ffffffef R11: 0000000000000000 R12: 0000000000000000
[ 3782.402415] R13: ffffb2cbc9f9fbf8 R14: ffffffffc16b3f20 R15: ffff94f3c433cdb8
[ 3782.402432] FS: 0000000000000000(0000) GS:ffff94f502000000(0000) knlGS:0000000000000000
[ 3782.402438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3782.402442] CR2: 000000000000142a CR3: 0000000086862006 CR4: 0000000000172ef0
[ 3782.402453] Call Trace:
[ 3782.402469] <TASK>
[ 3782.402476] ? show_trace_log_lvl+0x1e1/0x31b
[ 3782.402495] ? show_trace_log_lvl+0x1e1/0x31b
[ 3782.402516] ? mdc_device_alloc+0x16c/0x260 [mdc]
[ 3782.402587] ? __die_body.cold+0x8/0xd
[ 3782.402599] ? page_fault_oops+0xac/0x150
[ 3782.402609] ? kernelmode_fixup_or_oops+0x84/0x110
[ 3782.402625] ? exc_page_fault+0x6f/0x190
[ 3782.402640] ? asm_exc_page_fault+0x22/0x30
[ 3782.402658] ? mdc_device_free+0xc/0x160 [mdc]
[ 3782.402729] mdc_device_alloc+0x16c/0x260 [mdc]
[ 3782.402793] obd_setup+0x195/0x460 [obdclass]
[ 3782.403137] class_setup+0x607/0x7c0 [obdclass]
[ 3782.403397] class_process_config+0x1837/0x1e50 [obdclass]
[ 3782.403674] ? class_config_llog_handler+0x64a/0x1330 [obdclass]
[ 3782.403943] ? lustre_cfg_init+0x88/0x1a0 [obdclass]
[ 3782.404185] class_config_llog_handler+0x798/0x1330 [obdclass]
[ 3782.404466] llog_process_thread+0xda5/0x1b20 [obdclass]
[ 3782.404722] ? llog_validate+0x380/0x380 [obdclass]
[ 3782.404983] llog_process_thread_daemonize+0x6d/0x90 [obdclass]
[ 3782.405222] kthread+0xf3/0x120
[ 3782.405267] ? kthread_park+0x90/0x90
Here the faulting address 000000000000142a is the offset of cli->cl_mod_rpcs_in_flight within struct obd_device, which is read by the assertion in:
+static struct lu_device *mdc_device_free(const struct lu_env *env,
+					 struct lu_device *lu)
+{
+	struct obd_device *obd = lu->ld_obd;
+	struct client_obd *cli = &obd->u.cli;
+	struct osc_device *osc = lu2osc_dev(lu);
+
+	LASSERT(cli->cl_mod_rpcs_in_flight == 0);
(gdb) p/x &((struct obd_device *)0)->u.cli.cl_mod_rpcs_in_flight
$4 = 0x142a
This tells us that the obd backpointer (lu->ld_obd) is still NULL on the error path taken when registration fails, which is not exactly surprising.
Real crashes in Maloo can be seen here:
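The address arithmetic behind that conclusion can be sketched in plain C. This is only an illustration under stated assumptions: the toy_* structs, their padding, and toy_device_free() are hypothetical stand-ins, not the real Lustre layouts or a proposed fix. It shows that reading a field through a NULL base pointer faults at exactly the field's offset, and what a NULL check ahead of the assertion might look like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the Lustre types. The layouts and padding sizes are
 * hypothetical and do NOT match the real structures. */
struct toy_client_obd {
	char     cl_pad[64];              /* hypothetical padding */
	uint16_t cl_mod_rpcs_in_flight;
};

struct toy_obd_device {
	char                  obd_pad[128]; /* hypothetical padding */
	struct toy_client_obd cli;
};

struct toy_lu_device {
	struct toy_obd_device *ld_obd;    /* NULL if setup failed early */
};

/* Address that would be read for cl_mod_rpcs_in_flight given an obd base.
 * With a NULL base this is just the field offset, which is why the fault
 * address in the oops matches the gdb offset. */
static uintptr_t toy_field_address(const struct toy_obd_device *obd)
{
	return (uintptr_t)obd +
	       offsetof(struct toy_obd_device, cli) +
	       offsetof(struct toy_client_obd, cl_mod_rpcs_in_flight);
}

/* Hypothetical guarded teardown: returns 1 if the obd-dependent part was
 * skipped because the backpointer was never set, 0 otherwise. */
static int toy_device_free(struct toy_lu_device *lu)
{
	struct toy_obd_device *obd = lu->ld_obd;

	if (obd == NULL)
		return 1;

	assert(obd->cli.cl_mod_rpcs_in_flight == 0);
	return 0;
}
```

Calling toy_field_address(NULL) yields the combined field offset, mirroring how CR2 equals the offset printed by gdb. Whether the real fix belongs in mdc_device_free() or in the mdc_device_alloc() error path is a separate design question that this sketch does not answer.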
https://testing.whamcloud.com/test_sets/d631fd20-2d64-44cf-ba8e-0beb74d2ae96
https://testing.whamcloud.com/test_sets/785a27cf-8d9f-49a1-a812-b94a477e3cbe
https://testing.whamcloud.com/test_sets/8c7da0c8-c742-472b-8379-1349d3499372
and so on.
Changing the debug patch to test for OST0000-osc instead causes a similar crash in osc_cleanup_common().