Details
Type: Bug
Resolution: Fixed
Priority: Medium
Severity: 3
Description
Client mounted over o2ib, then unmounted.
No module reload
Discovery disabled, peers deleted manually.
kfi and tcp networks added to client and server.
Client re-mount over o2ib:
mount -t lustre -o retry=2,network=o2ib 172.18.2.8@o2ib:172.18.2.7@o2ib:/lustre /mnt/lustre
Client mount over kfi:
mount -t lustre -o retry=2,network=kfi1 16@kfi1:/lustre /mnt/lustre-kfi1
Client BUG:
[168432.962201] Lustre: 771313:0:(client.c:131:ptlrpc_uuid_to_connection()) cannot find peer 172.18.2.7@o2ib!
[168432.971854] LustreError: 771313:0:(ldlm_lib.c:573:client_obd_setup()) can't add initial connection
[168432.980907] BUG: unable to handle kernel paging request at 000000000000121a
[168432.987948] PGD 0 P4D 0
[168432.990570] Oops: 0000 [#1] SMP NOPTI
[168432.994315] CPU: 23 PID: 771313 Comm: llog_process_th Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.36.1.el8_10.x86_64 #1
[168433.007165] Hardware name: Viking Enterprise Solutions VSSEP1EA/VSSEP1EA, BIOS 10.09.02 10/26/2020
[168433.016196] RIP: 0010:mdc_device_free+0xb/0x160 [mdc]
[168433.021367] Code: 48 c7 05 c4 ff 01 00 00 00 00 00 e8 0f f9 6c ff e9 46 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 53 48 8b 6e 28 <66> 83 bd 1a 12 00 00 00 75 20 48 89 f3 48 85 f6 75 48 31 ff e8 4c
[168433.040191] RSP: 0018:ffffa5072d30fba8 EFLAGS: 00010282
[168433.045502] RAX: 00000000ffffff9b RBX: ffffffffffffff9b RCX: 0000000000000000
[168433.052711] RDX: 0000000000000000 RSI: ffff97607dc58880 RDI: ffffa5072d30fc18
[168433.059922] RBP: 0000000000000000 R08: 0000000080000000 R09: 0000000000000000
[168433.067140] R10: 00000000ffffff9b R11: ffff975f742e43fd R12: ffff975f39cb8000
[168433.074349] R13: ffffa5072d30fc18 R14: 0000000000000000 R15: 0000000000000000
[168433.081560] FS: 0000000000000000(0000) GS:ffff977ead2c0000(0000) knlGS:0000000000000000
[168433.089731] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[168433.095556] CR2: 000000000000121a CR3: 0000002a9e210000 CR4: 0000000000350ee0
[168433.102775] Call Trace:
[168433.105317] ? __die_body+0x1a/0x60
[168433.108900] ? no_context+0x1ba/0x3f0
[168433.112653] ? __bad_area_nosemaphore+0x157/0x180
[168433.117444] ? do_page_fault+0x37/0x12d
[168433.121370] ? page_fault+0x1e/0x30
[168433.124950] ? mdc_device_free+0xb/0x160 [mdc]
[168433.129488] mdc_device_alloc+0x1f5/0x240 [mdc]
[168433.134111] obd_setup+0x224/0x470 [obdclass]
[168433.138738] class_setup+0x5b7/0x760 [obdclass]
[168433.143400] class_process_config+0x120a/0x20b0 [obdclass]
[168433.149023] class_config_llog_handler+0x727/0x11c0 [obdclass]
[168433.154986] llog_process_thread+0xd79/0x1b30 [obdclass]
[168433.160426] ? llog_validate+0x370/0x370 [obdclass]
[168433.165434] llog_process_thread_daemonize+0x70/0x90 [obdclass]
[168433.171482] kthread+0x134/0x150
[168433.174804] ? set_kthread_struct+0x50/0x50
[168433.179073] ret_from_fork+0x35/0x40
RCA from Sonnet:

NULL pointer dereference (crash)

The real kernel BUG is in mdc_device_alloc() (mdc_dev.c:1741): when mdc_setup() fails, it calls mdc_device_free(env, d) as cleanup, but d->ld_obd is still NULL at this point. The framework (obd_setup() in obd_class.h:584) only sets dev->ld_obd = obd after ldto_device_alloc() returns successfully. Since the error occurs inside mdc_device_alloc(), this assignment never happens.

mdc_device_free() then does:

    struct obd_device *obd = lu->ld_obd;        // NULL
    struct client_obd *cli = &obd->u.cli;       // NULL + offset
    LASSERT(cli->cl_mod_rpcs_in_flight == 0);   // → fault at 0x0 + 0x121a = 0x121a

CR2=0x121a is exactly offsetof(struct obd_device, u.cli.cl_mod_rpcs_in_flight) with a NULL base pointer. RBP=0 in the register dump confirms it: the faulting instruction in the Code: line compares a 16-bit value at 0x121a(%rbp), and RBP was just loaded from the lu_device.

osc_device_alloc() has the identical bug (osc/osc_dev.c:206).

Fix

In mdc_device_alloc() (and symmetrically osc_device_alloc()), set d->ld_obd = obd before calling mdc_setup()/osc_setup(), following the pattern of server-side drivers (lod, ofd, osp all set ld_obd in their own alloc functions):

    obd->obd_lu_dev = d;
    d->ld_obd = obd;    // add this
    rc = mdc_setup(obd, cfg);
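
The failure path described in the RCA can be modelled in user space. The sketch below is not Lustre code: the struct layouts are simplified stand-ins (so the offset is not 0x121a), and every mock_* name, the cl_padding field, and the set_ld_obd_early flag are hypothetical. It only illustrates why the fault address equals the member offset when ld_obd is NULL, and how assigning ld_obd before the setup call makes the cleanup path safe. Unlike the real mdc_device_free(), the mock checks for NULL and reports it instead of faulting.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for the Lustre types; the real layouts differ. */
struct client_obd {
	char     cl_padding[64];          /* hypothetical filler */
	uint16_t cl_mod_rpcs_in_flight;
};

struct lu_device;

struct obd_device {
	struct lu_device *obd_lu_dev;
	union {
		struct client_obd cli;
	} u;
};

struct lu_device {
	struct obd_device *ld_obd;
};

/* Address a read of cl_mod_rpcs_in_flight would hit with ld_obd == NULL:
 * NULL base (0) plus the member offset. */
static size_t mock_null_fault_address(void)
{
	return offsetof(struct obd_device, u.cli.cl_mod_rpcs_in_flight);
}

/* Mock teardown. Returns 0 on clean teardown, -EFAULT where the real
 * code would have dereferenced a NULL ld_obd, -EBUSY where the LASSERT
 * would have fired. */
static int mock_device_free(struct lu_device *d)
{
	struct obd_device *obd = d->ld_obd;
	int rc = 0;

	if (obd == NULL)
		rc = -EFAULT;
	else if (obd->u.cli.cl_mod_rpcs_in_flight != 0)
		rc = -EBUSY;

	free(d);
	return rc;
}

/* Mock allocation. setup_rc stands in for the result of mdc_setup();
 * set_ld_obd_early toggles the proposed one-line fix. The result of the
 * cleanup call is stored in *free_rc. */
static int mock_device_alloc(struct obd_device *obd, int setup_rc,
			     int set_ld_obd_early, int *free_rc)
{
	struct lu_device *d;

	assert(obd != NULL && free_rc != NULL);

	d = calloc(1, sizeof(*d));
	if (d == NULL)
		return -ENOMEM;

	obd->obd_lu_dev = d;
	if (set_ld_obd_early)
		d->ld_obd = obd;          /* the proposed fix */

	if (setup_rc != 0) {              /* setup failed: clean up */
		*free_rc = mock_device_free(d);
		obd->obd_lu_dev = NULL;
		return setup_rc;
	}

	d->ld_obd = obd;                  /* framework assigns after success */
	*free_rc = 0;
	return 0;
}
```

Calling mock_device_alloc() with a non-zero setup_rc and set_ld_obd_early = 0 leaves -EFAULT in *free_rc, which corresponds to the crash in the trace; the same call with set_ld_obd_early = 1 leaves 0. After a successful allocation the caller releases the device with mock_device_free(obd->obd_lu_dev).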