[LU-17247] BUG: unable to handle kernel NULL pointer dereference in kiblnd_passive_connect Created: 01/Nov/23  Updated: 03/Nov/23  Resolved: 03/Nov/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.16.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Shuichi Ihara Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

master, RHEL8.7


Issue Links:
Duplicate
duplicates LU-17071 o2iblnd: Oops caused by IBLND_REJECT_... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The server crashed with a NULL pointer dereference in kiblnd_passive_connect. The console log follows:

[14161.702631] libcfs: HW NUMA nodes: 1, HW CPU cores: 24, npartitions: 4
[14161.705274] alg: No test for adler32 (adler32-zlib)
[14162.456545] Key type ._llcrypt registered
[14162.457357] Key type .llcrypt registered
[14162.484133] Lustre: Lustre: Build Version: 2.15.58_109_g40074d3
[14162.540341] LNet: Using FastReg for registration
[14162.750736] LNet: Added LNI 10.0.11.209@o2ib12 [32/1024/0/180]
[14162.950680] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[14162.951989] PGD 0 
[14162.952520] Oops: 0000 [#1] SMP NOPTI
[14162.953250] CPU: 22 PID: 201160 Comm: kworker/22:4 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-425.13.1.el8_lustre.ddn17.x86_64 #1
[14162.955184] Hardware name: DDN SFA400NVX2E, BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[14162.956667] Workqueue: ib_cm cm_work_handler [ib_cm]
[14162.957565] RIP: 0010:kiblnd_passive_connect+0x1395/0x1620 [ko2iblnd]
[14162.958644] Code: c7 05 63 81 01 00 00 01 00 00 e8 26 03 f4 ff 48 89 df ba 40 00 00 00 48 89 c6 e8 06 10 f4 ff 45 8b b4 24 24 01 00 00 49 89 c7 <48> 8b 04 25 40 00 00 00 48 8d 58 38 e8 fa 02 f4 ff 48 89 df ba 40
[14162.961535] RSP: 0018:ff7a599b4dca79a0 EFLAGS: 00010246
[14162.962473] RAX: ffffffffc1038f00 RBX: 0005001614010bd1 RCX: 0000000000000000
[14162.963534] LNet: Added LNI 20.1.11.209@o2ib22 [32/1024/0/180]
[14162.963649] RDX: ffffffffc1038f12 RSI: 0000000000000000 RDI: 0000000000000000
[14162.965863] RBP: ff36491ca4dbcc00 R08: 0000000000000001 R09: 0000000000000000
[14162.967015] R10: ffffffffc1038f40 R11: ffffffffc1038f12 R12: ff364925b2ba2a00
[14162.968167] R13: ff36492daa67a5b0 R14: 0000000000000000 R15: ffffffffc1038f00
[14162.969313] FS:  0000000000000000(0000) GS:ff36493e31b80000(0000) knlGS:0000000000000000
[14162.970594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14162.971560] CR2: 0000000000000040 CR3: 0000000f8bc10003 CR4: 0000000000771ee0
[14162.972711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[14162.973846] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[14162.974976] PKRU: 55555554
[14162.975553] Call Trace:
[14162.976085]  ? xas_store+0x56/0x5a0
[14162.976755]  kiblnd_cm_callback+0x3d7/0x1e90 [ko2iblnd]
[14162.977639]  ? __xa_alloc_cyclic+0x49/0xe0
[14162.978375]  cma_cm_event_handler+0x25/0xd0 [rdma_cm]
[14162.979227]  cma_ib_req_handler+0x7d1/0x1260 [rdma_cm]
[14162.980090]  ? update_group_capacity+0x25/0x220
[14162.980872]  cm_process_work+0x22/0xf0 [ib_cm]
[14162.981638]  cm_req_handler+0x7f1/0xf40 [ib_cm]
[14162.982416]  cm_work_handler+0x79c/0xf30 [ib_cm]
[14162.983198]  ? __switch_to+0x10c/0x450
[14162.983872]  ? finish_task_switch+0xaf/0x2e0
[14162.984607]  process_one_work+0x1a7/0x360
[14162.985300]  ? create_worker+0x1a0/0x1a0
[14162.985979]  worker_thread+0x30/0x390
[14162.986623]  ? create_worker+0x1a0/0x1a0
[14162.987292]  kthread+0x10b/0x130
[14162.987874]  ? set_kthread_struct+0x50/0x50
[14162.988577]  ret_from_fork+0x1f/0x40
[14162.989205] Modules linked in: ko2iblnd(OE) ptlrpc(OE+) obdclass(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) sunrpc intel_rapl_msr intel_rapl_common nfit libnvdimm kvm_intel kvm irqbypass iTCO_wdt ppdev iTCO_vendor_support crct10dif_pclmul crc32_pclmul bochs drm_vram_helper drm_ttm_helper ghash_clmulni_intel ttm rapl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr i2c_i801 drm joydev lpc_ich i6300esb parport_pc parport ext4 mbcache jbd2 sr_mod sd_mod cdrom t10_pi sg mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) pci_hyperv_intf ahci tls libahci psample mlxdevm(OE) virtio_net libata bnxt_en crc32c_intel net_failover serio_raw virtio_blk mlx_compat(OE) virtio_scsi failover dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
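
For triage, the key fields of the oops above can be pulled out mechanically. The following is a minimal sketch, not part of any Lustre or kernel tool: the function name `parse_oops` and its regular expressions are illustrative assumptions, and the sample text is an excerpt of the log above with the Code line shortened to the bytes around the `<..>` marker.

```python
import re

def parse_oops(text):
    """Extract fault address, RIP symbol/offset/module, and the bytes at the fault.

    Hypothetical helper for illustration; it only understands the line
    shapes seen in the oops quoted in this ticket.
    """
    info = {}
    m = re.search(r"NULL pointer dereference at ([0-9a-f]+)", text)
    if m:
        info["fault_addr"] = int(m.group(1), 16)
    m = re.search(
        r"RIP: [0-9a-f]+:(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?: \[(\w+)\])?", text
    )
    if m:
        info["symbol"] = m.group(1)
        info["offset"] = int(m.group(2), 16)   # offset of the faulting instruction
        info["size"] = int(m.group(3), 16)     # total size of the function
        info["module"] = m.group(4)
    m = re.search(r"Code: (.*)", text)
    if m:
        tokens = m.group(1).split()
        # The kernel marks the first byte of the faulting instruction with <..>.
        idx = next((i for i, t in enumerate(tokens) if t.startswith("<")), None)
        if idx is not None:
            cleaned = [t.strip("<>") for t in tokens]
            info["fault_bytes"] = bytes(int(t, 16) for t in cleaned[idx:idx + 8])
    return info

# Excerpt of the log in this ticket (Code line shortened).
sample = """\
[14162.950680] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[14162.957565] RIP: 0010:kiblnd_passive_connect+0x1395/0x1620 [ko2iblnd]
[14162.958644] Code: 49 89 c7 <48> 8b 04 25 40 00 00 00 48 8d 58 38
"""

info = parse_oops(sample)
print(info["symbol"], hex(info["offset"]), hex(info["fault_addr"]))
print(info["fault_bytes"].hex(" "))
```

The marked bytes `48 8b 04 25 40 00 00 00` appear to be a 64-bit load from the absolute address 0x40, which is consistent with the CR2 value of 0000000000000040 in the register dump.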


 Comments   
Comment by Shuichi Ihara [ 01/Nov/23 ]

Although the logical interfaces and aliases are not used, I've applied patch https://review.whamcloud.com/#/c/fs/lustre-release/+/52894/ on top of master.

Comment by Andreas Dilger [ 03/Nov/23 ]

"Although the logical interfaces and alias are not used, I've applied patch https://review.whamcloud.com/#/c/fs/lustre-release/+/52894/ against master"

Shuichi, does that patch cause the crash (seems unlikely, given the patch is very small)?

If that is the only patch applied, it looks like this would be based on commit v2_15_58-108-g345a2497d0 "LU-5134 utils: Add parallel option to lctl set_param"?
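
The arithmetic behind that guess can be sketched as follows, assuming the "Build Version" string in the log follows the usual git-describe layout (tag, number of commits since the tag, abbreviated hash). The helper `parse_describe` and the `Describe` type are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple

class Describe(NamedTuple):
    tag: tuple      # (major, minor, patch) of the nearest tag
    commits: int    # commits on top of that tag
    sha: str        # abbreviated commit hash

def parse_describe(s):
    """Parse '2.15.58_109_g40074d3' or 'v2_15_58-108-g345a2497d0' style strings.

    Assumption: both forms carry the same three fields and differ only in
    their separators.
    """
    m = re.fullmatch(r"v?(\d+)[._](\d+)[._](\d+)[-_](\d+)[-_]g([0-9a-f]+)", s)
    if m is None:
        raise ValueError(f"unrecognised version string: {s!r}")
    major, minor, patch, commits, sha = m.groups()
    return Describe((int(major), int(minor), int(patch)), int(commits), sha)

build = parse_describe("2.15.58_109_g40074d3")      # from the console log
base = parse_describe("v2_15_58-108-g345a2497d0")   # commit named in this comment

# One extra commit on the same tag would match a single locally applied patch.
local_patches = build.commits - base.commits
print(build.tag == base.tag, local_patches)
```

Under that assumption, the build is one commit ahead of the named master commit on the same tag, which fits a single locally applied patch.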

Comment by Shuichi Ihara [ 03/Nov/23 ]

I've confirmed that https://review.whamcloud.com/c/fs/lustre-release/+/52202 from LU-17071 solves the problem.
So LU-17247 should be closed as a duplicate of LU-17071.

Generated at Sat Feb 10 03:33:51 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.