[LU-16223] Setting debug_peer_on_timeout=1 can cause kernel NULL pointer deref Created: 07/Oct/22 Updated: 24/Nov/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Åke Sandgren | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Setting debug_peer_on_timeout=1 on a client and then rebooting an LNet router causes this to happen:
===
[Fri Oct 7 15:50:28 2022] LNetError: 246:0:(o2iblnd_cb.c:3044:kiblnd_rejected()) 172.27.243.18@o2ib240 rejected: o2iblnd fatal error
===
The client is running DDN Lustre 2.12.8-ddn9, but I suspect this will be a problem for upstream too. |
| Comments |
| Comment by Lukasz Flis [ 24/Nov/22 ] |
|
Hi, I can confirm the same issue in lustre-2.15.0_RC2_22_g4d93fd7 Story:
[13244209.366690] BUG: unable to handle kernel NULL pointer dereference at 0000000000000034
[13244209.376794] PGD 0 P4D 0
[13244209.381400] Oops: 0000 [#1] SMP NOPTI
[13244209.387109] CPU: 4 PID: 1055796 Comm: ll_ost00_018 Kdump: loaded Tainted: P OE --------- - - 4.18.0-348.7.1.el8_5.x86_64 #1
[13244209.403303] Hardware name: HPE ProLiant DL325 Gen10 Plus/ProLiant DL325 Gen10 Plus, BIOS A43 12/03/2021
[13244209.414802] RIP: 0010:lnet_debug_peer+0xad/0x270 [lnet]
[13244209.422073] Code: a0 dc c0 89 d0 f7 e7 c1 ea 05 39 d6 72 15 81 bb 04 01 00 00 de c0 aa 15 48 c7 c0 6b a0 dc c0 4c 0f 44 e0 4c 8b 9b b8 00 00 00 <44> 8b 49 34 48 b8 00 04 00 00 9d 0f 00 00 48 c7 05 4a d7 04 00 18
[13244209.444733] RSP: 0018:ffffac5b29b0fa90 EFLAGS: 00010246
[13244209.451981] RAX: ffff9ce28c88d200 RBX: ffff9ce28c88ce00 RCX: 0000000000000000
[13244209.461141] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[13244209.470280] RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
[13244209.479386] R10: ebc0de0100000010 R11: 0000000000000000 R12: ffffffffc0dca068
[13244209.488464] R13: 0000000000000001 R14: 0000000000101c34 R15: ffff9ce7e0190948
[13244209.497518] FS: 0000000000000000(0000) GS:ffff9d54feb00000(0000) knlGS:0000000000000000
[13244209.507541] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13244209.515183] CR2: 0000000000000034 CR3: 00000001a2008000 CR4: 0000000000350ee0
[13244209.524211] Call Trace:
[13244209.528568] ptlrpc_expire_one_request+0x3ca/0x560 [ptlrpc]
[13244209.536055] ptlrpc_check_set+0xa13/0x1fe0 [ptlrpc]
[13244209.542817] ptlrpc_set_wait+0x27c/0x730 [ptlrpc]
[13244209.549303] ? finish_wait+0x80/0x80
[13244209.554692] ? ldlm_work_revoke_ast_lock+0x1b0/0x1b0 [ptlrpc]
[13244209.562255] ldlm_run_ast_work+0xda/0x3f0 [ptlrpc]
[13244209.568829] ldlm_handle_conflict_lock+0x6a/0x2e0 [ptlrpc]
[13244209.576078] ldlm_lock_enqueue+0x2cd/0xa80 [ptlrpc]
[13244209.582691] ldlm_handle_enqueue0+0x634/0x1530 [ptlrpc]
[13244209.589634] tgt_enqueue+0xa4/0x210 [ptlrpc]
[13244209.595589] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
[13244209.602315] ? ptlrpc_nrs_req_get_nolock0+0xfb/0x1f0 [ptlrpc]
[13244209.609720] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[13244209.617206] ptlrpc_main+0xc06/0x1560 [ptlrpc]
[13244209.623270] ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
[13244209.629850] kthread+0x116/0x130
[13244209.634562] ? kthread_flush_work_fn+0x10/0x10
[13244209.640479] ret_from_fork+0x22/0x40
[13244209.645485] Modules linked in: binfmt_misc osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) netconsole libcfs(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_umad(OE) dm_service_time intel_rapl_msr intel_rapl_common edac_mce_amd amd_energy kvm_amd kvm irqbypass rapl pcspkr mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm hpilo hpwdt k10temp i2c_piix4 zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) ipmi_ssif icp(POE) zcommon(POE) znvpair(POE) spl(OE) acpi_ipmi ipmi_si ipmi_devintf ses enclosure wmi ipmi_msghandler acpi_tad acpi_power_meter acpi_cpufreq sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel bnxt_en ccp smartpqi mpt3sas(OE) raid_class scsi_transport_sas ib_ipoib(OE) ib_cm(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) psample mlxfw tls pci_hyperv_intf
[13244209.645543] dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod
[13244209.750453] CR2: 0000000000000034
Best Regards
Lukasz Flis
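A note on reading the trace above: the faulting bytes `<44> 8b 49 34` decode to a 32-bit load from `0x34(%rcx)`, RCX is 0, and CR2 is 0x34, so the oops is consistent with lnet_debug_peer reading a member at offset 0x34 through a NULL pointer. Which pointer that is cannot be determined from the trace alone. The user-space sketch below only illustrates the address arithmetic and a guarded read; `struct example_obj`, `field_address` and `read_field_checked` are hypothetical names, not LNet code, and this is not a proposed fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT the real LNet structure. The padding is chosen
 * only so that a 32-bit member lands at offset 0x34, matching CR2. */
struct example_obj {
    unsigned char pad[0x34];
    uint32_t      field_at_0x34;
};

/* Address that a load of field_at_0x34 would touch for a given base value.
 * With base == 0 (a NULL pointer) this is the member offset itself. */
static uintptr_t field_address(uintptr_t base)
{
    return base + offsetof(struct example_obj, field_at_0x34);
}

/* Guarded read: returns 0 and stores the value in *out when obj is
 * non-NULL, returns -1 without touching memory when obj is NULL. */
static int read_field_checked(const struct example_obj *obj, uint32_t *out)
{
    if (obj == NULL)
        return -1;
    *out = obj->field_at_0x34;
    return 0;
}
```

With a base of 0, `field_address` returns 0x34, the same value the kernel reported as the faulting address.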
|