[LU-16223] Setting debug_peer_on_timeout=1 can cause kernel NULL pointer deref Created: 07/Oct/22  Updated: 24/Nov/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Åke Sandgren
Assignee: WC Triage
Resolution: Unresolved
Votes: 1
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Setting debug_peer_on_timeout=1 on a client and then rebooting an LNet router triggers a kernel NULL pointer dereference on the client:

===

[Fri Oct  7 15:50:28 2022] LNetError: 246:0:(o2iblnd_cb.c:3044:kiblnd_rejected()) 172.27.243.18@o2ib240 rejected: o2iblnd fatal error
[Fri Oct  7 15:50:28 2022] Lustre: 3803:0:(client.c:2182:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1665150627/real 1665150627]  req@00000000e4b7f0d8 x1746036694498560/t0(0) o400->stor10-MDT0003-mdc-ffff8a3a1e709800@172.27.1.33@o2ib1:12/10 lens 224/224 e 0 to 1 dl 1665151070 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[Fri Oct  7 15:50:28 2022] BUG: kernel NULL pointer dereference, address: 000000000000003

===

 

The client is running DDN Lustre 2.12.8-ddn9, but I suspect upstream is affected too.



 Comments   
Comment by Lukasz Flis [ 24/Nov/22 ]

Hi,

I can confirm the same issue in lustre-2.15.0_RC2_22_g4d93fd7.

Story:

  • enabled debug_peer_on_timeout via sysfs to debug network timeouts occurring over a long-range link
  • triggered an eviction due to a network error, which caused a kernel panic

 

[13244209.366690] BUG: unable to handle kernel NULL pointer dereference at 0000000000000034
[13244209.376794] PGD 0 P4D 0 
[13244209.381400] Oops: 0000 [#1] SMP NOPTI
[13244209.387109] CPU: 4 PID: 1055796 Comm: ll_ost00_018 Kdump: loaded Tainted: P           OE    --------- -  - 4.18.0-348.7.1.el8_5.x86_64 #1
[13244209.403303] Hardware name: HPE ProLiant DL325 Gen10 Plus/ProLiant DL325 Gen10 Plus, BIOS A43 12/03/2021
[13244209.414802] RIP: 0010:lnet_debug_peer+0xad/0x270 [lnet]
[13244209.422073] Code: a0 dc c0 89 d0 f7 e7 c1 ea 05 39 d6 72 15 81 bb 04 01 00 00 de c0 aa 15 48 c7 c0 6b a0 dc c0 4c 0f 44 e0 4c 8b 9b b8 00 00 00 <44> 8b 49 34 48 b8 00 04 00 00 9d 0f 00 00 48 c7 05 4a d7 04 00 18
[13244209.444733] RSP: 0018:ffffac5b29b0fa90 EFLAGS: 00010246
[13244209.451981] RAX: ffff9ce28c88d200 RBX: ffff9ce28c88ce00 RCX: 0000000000000000
[13244209.461141] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[13244209.470280] RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
[13244209.479386] R10: ebc0de0100000010 R11: 0000000000000000 R12: ffffffffc0dca068
[13244209.488464] R13: 0000000000000001 R14: 0000000000101c34 R15: ffff9ce7e0190948
[13244209.497518] FS:  0000000000000000(0000) GS:ffff9d54feb00000(0000) knlGS:0000000000000000
[13244209.507541] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13244209.515183] CR2: 0000000000000034 CR3: 00000001a2008000 CR4: 0000000000350ee0
[13244209.524211] Call Trace:
[13244209.528568]  ptlrpc_expire_one_request+0x3ca/0x560 [ptlrpc]
[13244209.536055]  ptlrpc_check_set+0xa13/0x1fe0 [ptlrpc]
[13244209.542817]  ptlrpc_set_wait+0x27c/0x730 [ptlrpc]
[13244209.549303]  ? finish_wait+0x80/0x80
[13244209.554692]  ? ldlm_work_revoke_ast_lock+0x1b0/0x1b0 [ptlrpc]
[13244209.562255]  ldlm_run_ast_work+0xda/0x3f0 [ptlrpc]
[13244209.568829]  ldlm_handle_conflict_lock+0x6a/0x2e0 [ptlrpc]
[13244209.576078]  ldlm_lock_enqueue+0x2cd/0xa80 [ptlrpc]
[13244209.582691]  ldlm_handle_enqueue0+0x634/0x1530 [ptlrpc]
[13244209.589634]  tgt_enqueue+0xa4/0x210 [ptlrpc]
[13244209.595589]  tgt_request_handle+0xc93/0x1a40 [ptlrpc]
[13244209.602315]  ? ptlrpc_nrs_req_get_nolock0+0xfb/0x1f0 [ptlrpc]
[13244209.609720]  ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[13244209.617206]  ptlrpc_main+0xc06/0x1560 [ptlrpc]
[13244209.623270]  ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
[13244209.629850]  kthread+0x116/0x130
[13244209.634562]  ? kthread_flush_work_fn+0x10/0x10
[13244209.640479]  ret_from_fork+0x22/0x40
[13244209.645485] Modules linked in: binfmt_misc osp(OE) ofd(OE) lfsck(OE) ost(OE) mgc(OE) osd_zfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) netconsole libcfs(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_umad(OE) dm_service_time intel_rapl_msr intel_rapl_common edac_mce_amd amd_energy kvm_amd kvm irqbypass rapl pcspkr mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm hpilo hpwdt k10temp i2c_piix4 zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zavl(POE) ipmi_ssif icp(POE) zcommon(POE) znvpair(POE) spl(OE) acpi_ipmi ipmi_si ipmi_devintf ses enclosure wmi ipmi_msghandler acpi_tad acpi_power_meter acpi_cpufreq sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel bnxt_en ccp smartpqi mpt3sas(OE) raid_class scsi_transport_sas ib_ipoib(OE) ib_cm(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) psample mlxfw tls pci_hyperv_intf
[13244209.645543]  dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod
[13244209.750453] CR2: 0000000000000034

Best Regards

Lukasz Flis
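
The oops in the comment above can be cross-checked mechanically. The marked bytes in the Code: line (<44> 8b 49 34) encode a 32-bit MOV load with an 8-bit displacement, and the register dump gives the base register value. The Python sketch below is a rough aid, not a general disassembler: it decodes only this one encoding, the helper names (parse_registers, faulting_bytes, decode_mov_load_disp8) are made up for illustration, and the Code: line is shortened to the bytes around the fault.

```python
import re

# Lines copied from the oops above. The Code: line is shortened to the bytes
# around the faulting instruction, which the kernel marks with <..>.
OOPS = """\
RIP: 0010:lnet_debug_peer+0xad/0x270 [lnet]
Code: 4c 0f 44 e0 4c 8b 9b b8 00 00 00 <44> 8b 49 34 48 b8 00 04 00 00 9d 0f 00 00
RAX: ffff9ce28c88d200 RBX: ffff9ce28c88ce00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
CR2: 0000000000000034 CR3: 00000001a2008000 CR4: 0000000000350ee0
"""

# x86-64 register numbering for ModRM.rm when REX.B is clear.
LOW_REGS = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_bytes(text):
    """Return the instruction bytes starting at the <..> marker."""
    code = re.search(r"^Code: (.*)$", text, re.M).group(1).split()
    start = next(i for i, tok in enumerate(code) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in code[start:]]


def decode_mov_load_disp8(insn):
    """Decode '[REX] 8B /r' with mod=01 (base register + disp8) only.

    Returns (base_register_name, displacement). Anything else raises,
    because this sketch does not attempt to be a full decoder.
    """
    pos, rex = 0, 0
    if insn[0] & 0xF0 == 0x40:          # optional REX prefix
        rex, pos = insn[0], 1
    if insn[pos] != 0x8B:               # MOV r, r/m (load from memory)
        raise ValueError("not a MOV load")
    modrm = insn[pos + 1]
    mod, rm = modrm >> 6, modrm & 7
    if mod != 1 or rm == 4 or rex & 1:  # no SIB byte, no REX.B handling
        raise ValueError("addressing form not handled by this sketch")
    disp = insn[pos + 2]
    if disp >= 0x80:                    # sign-extend the 8-bit displacement
        disp -= 0x100
    return LOW_REGS[rm], disp


regs = parse_registers(OOPS)
base, disp = decode_mov_load_disp8(faulting_bytes(OOPS))
fault = regs[base] + disp
print(f"{base}={regs[base]:#x} + {disp:#x} -> {fault:#x}, CR2={regs['CR2']:#x}")
```

The computed address equals CR2, which is consistent with lnet_debug_peer loading a 32-bit field at offset 0x34 through a pointer that was NULL (RCX is zero). Which structure that pointer refers to cannot be determined from the trace alone.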

 
