
[LU-18275] o2iblnd: unable to handle kernel NULL pointer dereference in kiblnd_cm_callback when receiving RDMA_CM_EVENT_UNREACHABLE

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.5
    • Environment: Lustre server 2.15.5 RoCE
      Lustre MGS 2.15.5 RoCE
      Lustre client 2.15.5 RoCE
    • Severity: 3

    Description

      The Lustre clients and servers are deployed in VMs. The VMs use network-card PF passthrough mode.

      [OS]
      VM Version: qemu-kvm-7.0.0
      OS Version: Rocky 8.10
      Kernel Version: 4.18.0-553.el8_10.x86_64

      [Network Card]
      Client:
      MLX CX6 1*100G RoCE v2
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      Server:
      MLX CX6 2*100G RoCE v2 bond
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      [BUG Info]

      Steps to reproduce:

      • Mount Lustre over a RoCE network
      • Restart the Lustre MDT (mdt0, 10.255.153.128@o2ib)
      • A crash occurs on other Lustre servers

      Server call trace:

      crash> bt
      PID: 568423   TASK: ff4787632aa5c000  CPU: 5    COMMAND: "kworker/u40:0"
       #0 [ff7d728b15e6baa0] machine_kexec at ffffffff8fa6f353
       #1 [ff7d728b15e6baf8] __crash_kexec at ffffffff8fbbaa7a
       #2 [ff7d728b15e6bbb8] crash_kexec at ffffffff8fbbb9b1
       #3 [ff7d728b15e6bbd0] oops_end at ffffffff8fa2d831
       #4 [ff7d728b15e6bbf0] no_context at ffffffff8fa81cf3
       #5 [ff7d728b15e6bc48] __bad_area_nosemaphore at ffffffff8fa8206c
       #6 [ff7d728b15e6bc90] do_page_fault at ffffffff8fa82cf7
       #7 [ff7d728b15e6bcc0] page_fault at ffffffff906011ae
          [exception RIP: kiblnd_cm_callback+2653]
          RIP: ffffffffc0efe00d  RSP: ff7d728b15e6bd70  RFLAGS: 00010246
          RAX: 0000000000000007  RBX: ff7d728b15e6be08  RCX: 0000000000000000
          RDX: ff4787632aa5c000  RSI: ff7d728b15e6be08  RDI: ff47876095213c00
          RBP: ff47876095213c00   R8: 0000000000000000   R9: 006d635f616d6472
          R10: 8080808080808080  R11: 0000000000000000  R12: ff47876095213c00
          R13: 0000000000000000  R14: 0000000000000000  R15: ff47876095213de0
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #8 [ff7d728b15e6bdd8] cma_cm_event_handler at ffffffffc04729a5 [rdma_cm]
       #9 [ff7d728b15e6be00] cma_netevent_work_handler at ffffffffc04786b5 [rdma_cm]
      #10 [ff7d728b15e6be90] process_one_work at ffffffff8fb195e3
      #11 [ff7d728b15e6bed8] worker_thread at ffffffff8fb197d0
      #12 [ff7d728b15e6bf10] kthread at ffffffff8fb20e24
      #13 [ff7d728b15e6bf50] ret_from_fork at ffffffff9060028f 

      Server kernel log:

      [69106.143672] LustreError: 11-0: lustre-MDT000a-osp-MDT0007: operation mds_statfs to node 10.255.153.128@o2ib failed: rc = -107
      [69106.143700] Lustre: lustre-OST0038-osc-MDT0007: Connection to lustre-OST0038 (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69106.145053] LustreError: Skipped 29 previous similar messages
      [69106.145060] Lustre: Skipped 203 previous similar messages
      [69111.263490] Lustre: lustre-OST004e-osc-MDT0007: Connection to lustre-OST004e (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69111.263496] Lustre: Skipped 6 previous similar messages
      [69112.506974] Lustre: lustre-OST0038-osc-MDT0007: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69112.506980] Lustre: Skipped 195 previous similar messages
      [69116.383952] Lustre: lustre-MDT0000-lwp-MDT0007: Connection to lustre-MDT0000 (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69116.383965] Lustre: Skipped 3 previous similar messages
      [69122.528880] Lustre: lustre-MDT000a-lwp-OST0013: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69122.528885] Lustre: Skipped 3 previous similar messages
      [69127.659951] Lustre: lustre-OST0043-osc-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69127.659960] Lustre: Skipped 7 previous similar messages
      [69138.672269] Lustre: lustre-OST002d-osc-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69138.672275] Lustre: Skipped 3 previous similar messages
      [69158.168201] Lustre: lustre-MDT000a-osp-MDT0007: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69158.168206] Lustre: Skipped 1 previous similar message
      [69178.775546] Lustre: lustre-MDT0000-osp-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69178.775554] Lustre: Skipped 11 previous similar messages
      [69178.805333] Lustre: lustre-OST0008: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.805386] Lustre: lustre-OST0013: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.805505] Lustre: lustre-OST001e: deleting orphan objects from 0x0:13846 to 0x0:13921
      [69178.805729] Lustre: lustre-OST003f: deleting orphan objects from 0x0:13822 to 0x0:13857
      [69178.806122] Lustre: lustre-OST004a: deleting orphan objects from 0x0:13855 to 0x0:13889
      [69178.806130] Lustre: lustre-OST0055: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.807518] Lustre: lustre-OST0029: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.807537] Lustre: lustre-OST0034: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.838633] LustreError: 39177:0:(qsd_reint.c:635:qqi_reint_delayed()) lustre-OST0034: Delaying reintegration for qtype:2 until pending updates are flushed.
      [69178.840301] LustreError: 39177:0:(qsd_reint.c:635:qqi_reint_delayed()) Skipped 1 previous similar message
      [69180.358640] LustreError: 37895:0:(qsd_reint.c:635:qqi_reint_delayed()) lustre-OST0013: Delaying reintegration for qtype:2 until pending updates are flushed.
      [69180.359161] LustreError: 37895:0:(qsd_reint.c:635:qqi_reint_delayed()) Skipped 2 previous similar messages
      [69239.967382] BUG: unable to handle kernel NULL pointer dereference at 000000000000004c
      [69239.968927] PGD 0 
      [69239.969144] Oops: 0000 [#1] SMP NOPTI
      [69239.969327] CPU: 5 PID: 568423 Comm: kworker/u40:0 Kdump: loaded Tainted: G           OE     -------- -  - 4.18.0-553.5.1.el8_lustre.x86_64 #1
      [69239.969650] Hardware name: Red Hat KVM, BIOS 1.16.0-4.cl9 04/01/2014
      [69239.969792] Workqueue: rdma_cm cma_netevent_work_handler [rdma_cm]
      [69239.969995] RIP: 0010:kiblnd_cm_callback+0xa5d/0x1ea0 [ko2iblnd]
      [69239.970191] Code: 48 89 05 06 7f 01 00 c7 05 04 7f 01 00 00 00 02 02 48 c7 05 01 7f 01 00 d0 5e f1 c0 e8 ac 66 ee ff e9 b5 f7 ff ff 4c 8b 6f 08 <41> 8b 6d 4c f6 05 e5 d4 f0 ff 01 0f 84 8d 00 00 00 f6 05 dc d4 f0
      [69239.970483] RSP: 0018:ff7d728b15e6bd70 EFLAGS: 00010246
      [69239.970621] RAX: 0000000000000007 RBX: ff7d728b15e6be08 RCX: 0000000000000000
      [69239.970763] RDX: ff4787632aa5c000 RSI: ff7d728b15e6be08 RDI: ff47876095213c00
      [69239.970895] RBP: ff47876095213c00 R08: 0000000000000000 R09: 006d635f616d6472
      [69239.971025] R10: 8080808080808080 R11: 0000000000000000 R12: ff47876095213c00
      [69239.971155] R13: 0000000000000000 R14: 0000000000000000 R15: ff47876095213de0
      [69239.971289] FS:  0000000000000000(0000) GS:ff47877d7f740000(0000) knlGS:0000000000000000
      [69239.971425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [69239.971570] CR2: 000000000000004c CR3: 0000001900e10004 CR4: 0000000000771ee0
      [69239.971708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [69239.971841] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [69239.971971] PKRU: 55555554
      [69239.972099] Call Trace:
      [69239.972267]  ? __die_body+0x1a/0x60
      [69239.972451]  ? no_context+0x1ba/0x3f0
      [69239.972583]  ? __bad_area_nosemaphore+0x16c/0x1c0
      [69239.972705]  ? do_page_fault+0x37/0x12d
      [69239.972826]  ? page_fault+0x1e/0x30
      [69239.972971]  ? kiblnd_cm_callback+0xa5d/0x1ea0 [ko2iblnd]
      [69239.973099]  cma_cm_event_handler+0x25/0xd0 [rdma_cm]
      [69239.973234]  cma_netevent_work_handler+0x75/0xd0 [rdma_cm]
      [69239.973362]  process_one_work+0x1d3/0x390
      [69239.973516]  worker_thread+0x30/0x390
      [69239.973631]  ? process_one_work+0x390/0x390
      [69239.973743]  kthread+0x134/0x150
      [69239.973863]  ? set_kthread_struct+0x50/0x50
      [69239.973976]  ret_from_fork+0x1f/0x40
      [69239.974099] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ldiskfs(OE) mbcache jbd2 ko2iblnd(OE) lnet(OE) libcfs(OE) bonding uio_pci_generic uio vfio_pci vfio_virqfd vfio_iommu_type1 vfio cuse fuse rdma_ucm(OE) ib_ipoib(OE) ib_umad(OE) sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm cirrus drm_shmem_helper kvm_intel kvm irqbypass drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea sysfillrect ghash_clmulni_intel sysimgblt rapl drm i2c_piix4 pcspkr virtio_balloon joydev knem(OE) xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ata_generic mlx5_core(OE) mlxfw(OE) ata_piix psample pci_hyperv_intf tls crc32c_intel virtio_console virtio_blk libata serio_raw mlxdevm(OE) xpmem(OE) nvme_tcp(OE) nvme_rdma(OE) rdma_cm(OE) iw_cm(OE) nvme_fabrics(OE) nvme_core(OE) ib_cm(OE) ib_core(OE) mlx_compat(OE) t10_pi
      [69239.975388] CR2: 000000000000004c 

            Activity


            eaujames Etienne Aujames added a comment:

            LU-18260 o2iblnd: fix race between REJ vs kiblnd_connd
            LU-17480 o2iblnd: add a timeout for rdma_connect
            LU-16184 o2iblnd: fix deadline for tx on peer queue
            LU-17632 o2iblnd: graceful handling of CM_EVENT_CONNECT_ERROR
            LU-17325 o2iblnd: CM_EVENT_UNREACHABLE on established conn
            LU-15885 o2iblnd: fix handling of RDMA_CM_EVENT_UNREACHABLE

            The backports will be part of CEA's next Lustre release.


            eaujames Etienne Aujames added a comment:

            xiyan, thanks for testing the backports. I will tag these backports with LTS15.

            "cmid->context = NULL;" is set in the RDMA_CM_EVENT_DISCONNECTED case (0x4c is the offset of ibc_state within struct kib_conn):

            kiblnd_cm_callback(
            ...
                    case RDMA_CM_EVENT_DISCONNECTED:
                            conn = cmid->context;
                            if (conn->ibc_state < IBLND_CONN_ESTABLISHED) {
                                    CERROR("%s DISCONNECTED\n",
                                           libcfs_nid2str(conn->ibc_peer->ibp_nid));
                                    kiblnd_connreq_done(conn, -ECONNRESET);
                            } else {
                                    kiblnd_close_conn(conn, 0);
                            }
                            kiblnd_conn_decref(conn);
                            cmid->context = NULL;         <-------
                            return 0;
            

            That would mean that the RDMA_CM_EVENT_UNREACHABLE event is received after RDMA_CM_EVENT_DISCONNECTED. A race is possible in that case, but I am not sure how this can happen.

            I also don't know why we set "cmid->context = NULL" in kiblnd_cm_callback().
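
            For illustration only (this is not one of the patches listed above), a minimal sketch of the kind of defensive check in the RDMA_CM_EVENT_UNREACHABLE case (quoted in full further down in this ticket) that would avoid dereferencing a NULL cmid->context if the DISCONNECTED handler has already cleared it; whether such a guard is the right fix depends on the actual race:

                    case RDMA_CM_EVENT_UNREACHABLE:
                            conn = cmid->context;
                            /* hypothetical guard: DISCONNECTED may already have run
                             * and set cmid->context = NULL, so bail out instead of
                             * reading conn->ibc_state at offset 0x4c */
                            if (conn == NULL) {
                                    CNETERR("UNREACHABLE %d on cm_id %p with no conn\n",
                                            event->status, cmid);
                                    return 0;
                            }
                            if (conn->ibc_state == IBLND_CONN_ACTIVE_CONNECT ||
                                conn->ibc_state == IBLND_CONN_PASSIVE_WAIT) {
                                    kiblnd_connreq_done(conn, -ENETDOWN);
                                    kiblnd_conn_decref(conn);
                            }
                            return 0;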


            xiyan Rongyao Peng added a comment:

            After discussing LU-18260 with eaujames, the issue no longer reproduces after applying the following patches:

            LU-18260 o2iblnd: fix race between REJ vs kiblnd_connd
            LU-17480 o2iblnd: add a timeout for rdma_connect
            LU-16184 o2iblnd: fix deadline for tx on peer queue
            LU-17632 o2iblnd: graceful handling of CM_EVENT_CONNECT_ERROR
            LU-17325 o2iblnd: CM_EVENT_UNREACHABLE on established conn
            LU-15885 o2iblnd: fix handling of RDMA_CM_EVENT_UNREACHABLE

            I think the key patches are LU-17480 and LU-16184. Of course, these are only stopgap solutions; we will still need to track down the root cause.


            xiyan Rongyao Peng added a comment:

            I suspect that LU-17480 reduces the probability of this problem because the current RDMA connection timeout is relatively long. As long as the RDMA connection is destroyed before the faulty host is powered back on, the problem is less likely to occur. But we still need to find the root cause.


             

            xiyan Rongyao Peng added a comment:

            mdt0 (10.255.153.128@o2ib) log:

            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6033] device (bond1): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
            Sep 27 15:55:28 node28-lustre systemd-udevd[1467]: Using default interface naming scheme 'rhel-8.0'.
            Sep 27 15:55:28 node28-lustre systemd-udevd[1467]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
            Sep 27 15:55:28 node28-lustre systemd-udevd[1467]: Could not generate persistent MAC address for bond1: No such file or directory
            Sep 27 15:55:28 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_UP): bond1: link is not ready
            Sep 27 15:55:28 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_UP): bond1: link is not ready
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6062] device (bond1): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:28 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_UP): bond1: link is not ready
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6073] policy: auto-activating connection 'bond1' (6c3be7c4-5e37-42f8-941d-dd7fe0919eab)
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6076] device (bond1): Activation: starting connection 'bond1' (6c3be7c4-5e37-42f8-941d-dd7fe0919eab)
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6077] device (bond1): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6079] manager: NetworkManager state is now CONNECTING
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6083] device (bond1): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6085] device (bond1): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:28 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_UP): bond1: link is not ready
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6089] policy: set 'bond1' (bond1) as default for IPv4 routing and DNS
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6099] device (bond1): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6113] device (bond1): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6114] device (bond1): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6117] manager: NetworkManager state is now CONNECTED_SITE
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6120] device (bond1): Activation: successful, device activated.
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6125] manager: NetworkManager state is now CONNECTED_GLOBAL
            Sep 27 15:55:34 node28-lustre NetworkManager[1041]: <info>  [1727423734.5862] manager: startup complete
            Sep 27 15:55:34 node28-lustre systemd[1]: Started Network Manager Wait Online.
            Sep 27 15:55:34 node28-lustre systemd[1]: Reached target Network is Online.
            Sep 27 15:55:34 node28-lustre systemd[1]: Starting Notify NFS peers of a restart...
            Sep 27 15:55:34 node28-lustre systemd[1]: Starting System Logging Service...
            Sep 27 15:55:34 node28-lustre systemd[1]: Starting Crash recovery kernel arming...
            Sep 27 15:55:34 node28-lustre sm-notify[1513]: Version 2.3.3 starting
            Sep 27 15:55:34 node28-lustre systemd[1]: Started Notify NFS peers of a restart.
            Sep 27 15:55:34 node28-lustre rsyslogd[1514]: [origin software="rsyslogd" swVersion="8.2102.0-15.el8" x-pid="1514" x-info="https://www.rsyslog.com"] start
            Sep 27 15:55:34 node28-lustre systemd[1]: Started System Logging Service.
            Sep 27 15:55:34 node28-lustre systemd[1]: Reached target Multi-User System.
            Sep 27 15:55:34 node28-lustre systemd[1]: Started nvme mount.
            Sep 27 15:55:34 node28-lustre systemd[1]: Starting Update UTMP about System Runlevel Changes...
            Sep 27 15:55:34 node28-lustre systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
            Sep 27 15:55:34 node28-lustre systemd[1]: Started Update UTMP about System Runlevel Changes.
            Sep 27 15:55:34 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:34 node28-lustre rsyslogd[1514]: imjournal: journal files changed, reloading...  [v8.2102.0-15.el8 try https://www.rsyslog.com/e/0 ]
            Sep 27 15:55:36 node28-lustre kernel: alg: No test for pkcs1pad(rsa,sha1) (pkcs1pad(rsa-generic,sha1))
            Sep 27 15:55:36 node28-lustre kernel: PKCS7: Message signed outside of X.509 validity window
            Sep 27 15:55:36 node28-lustre kdumpctl[1518]: kdump: kexec: loaded kdump kernel
            Sep 27 15:55:36 node28-lustre kdumpctl[1518]: kdump: Starting kdump: [OK]
            Sep 27 15:55:36 node28-lustre systemd[1]: Started Crash recovery kernel arming.
            Sep 27 15:55:36 node28-lustre systemd[1]: Startup finished in 1.123s (kernel) + 2.553s (initrd) + 9.235s (userspace) = 12.912s.
            Sep 27 15:55:37 node28-lustre kernel: mlx5_core 0000:00:05.0 ens5np0: Link up
            Sep 27 15:55:37 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ens5np0: link becomes ready
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4081] device (ens5np0): carrier: link connected
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4092] device (ens5np0): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4098] policy: auto-activating connection 'bond-slave-ens5np0' (c07ec0c3-8e1d-4ffe-8398-17bd289dac7a)
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4103] device (ens5np0): Activation: starting connection 'bond-slave-ens5np0' (c07ec0c3-8e1d-4ffe-8398-17bd289dac7a)
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4104] device (ens5np0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4107] device (ens5np0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4114] device (ens5np0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre kernel: nvme nvme0: rdma connection establishment failed (-104)
            Sep 27 15:55:37 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:37 node28-lustre kernel: mlx5_core 0000:00:05.0 ens5np0: Link down
            Sep 27 15:55:37 node28-lustre kernel: bond1: (slave ens5np0): Enslaving as a backup interface with a down link
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7306] device (bond1): attached bond port ens5np0
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7307] device (ens5np0): Activation: connection 'bond-slave-ens5np0' enslaved, continuing activation
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7425] device (ens5np0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7439] device (ens5np0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7441] device (ens5np0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7446] device (ens5np0): Activation: successful, device activated.
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7077] device (ens6np1): carrier: link connected
            Sep 27 15:55:38 node28-lustre kernel: mlx5_core 0000:00:06.0 ens6np1: Link up
            Sep 27 15:55:38 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ens6np1: link becomes ready
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7085] device (ens6np1): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7089] policy: auto-activating connection 'bond-slave-ens6np1' (3d58f2ee-50e0-4b49-9fd3-120b58c0d11c)
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7092] device (ens6np1): Activation: starting connection 'bond-slave-ens6np1' (3d58f2ee-50e0-4b49-9fd3-120b58c0d11c)
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7092] device (ens6np1): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7094] device (ens6np1): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7096] device (ens6np1): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:39 node28-lustre kernel: mlx5_core 0000:00:06.0 ens6np1: Link down
            Sep 27 15:55:39 node28-lustre kernel: bond1: (slave ens6np1): Enslaving as a backup interface with a down link
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0186] device (bond1): attached bond port ens6np1
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0186] device (ens6np1): Activation: connection 'bond-slave-ens6np1' enslaved, continuing activation
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0190] device (ens6np1): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0202] device (ens6np1): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0203] device (ens6np1): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0207] device (ens6np1): Activation: successful, device activated.
            Sep 27 15:55:40 node28-lustre systemd[1]: Stopping RDMA Node Description Daemon...
            Sep 27 15:55:40 node28-lustre systemd[1]: rdma-ndd.service: Succeeded.
            Sep 27 15:55:40 node28-lustre systemd[1]: Stopped RDMA Node Description Daemon.
            Sep 27 15:55:40 node28-lustre kernel: nvme nvme0: rdma connection establishment failed (-104)
            Sep 27 15:55:40 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:42 node28-lustre kernel: mlx5_core 0000:00:05.0: lag map active ports: 1, 2
            Sep 27 15:55:42 node28-lustre kernel: mlx5_core 0000:00:05.0: shared_fdb:0 mode:hash
            Sep 27 15:55:42 node28-lustre systemd[1]: /usr/lib/systemd/system/rdma-ndd.service:25: Unknown lvalue 'ProtectKernelLogs' in section 'Service'
            Sep 27 15:55:42 node28-lustre systemd[1]: Starting RDMA Node Description Daemon...
            Sep 27 15:55:42 node28-lustre systemd[1]: Started RDMA Node Description Daemon.
            Sep 27 15:55:43 node28-lustre kernel: nvme nvme0: rdma connection establishment failed (-104)
            Sep 27 15:55:43 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:46 node28-lustre kernel: nvme nvme0: rdma connection establishment failed (-104)
            Sep 27 15:55:47 node28-lustre kernel: mlx5_core 0000:00:05.0 ens5np0: Link up
            Sep 27 15:55:47 node28-lustre NetworkManager[1041]: <info>  [1727423747.0599] device (ens5np0): carrier: link connected
            Sep 27 15:55:47 node28-lustre kernel: mlx5_core 0000:00:06.0 ens6np1: Link down
            Sep 27 15:55:47 node28-lustre kernel: bond1: (slave ens5np0): link status definitely up, 100000 Mbps full duplex
            Sep 27 15:55:47 node28-lustre kernel: bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
            Sep 27 15:55:47 node28-lustre kernel: bond1: active interface up!
            Sep 27 15:55:47 node28-lustre NetworkManager[1041]: <info>  [1727423747.1252] device (bond1): carrier: link connected
            Sep 27 15:55:47 node28-lustre kernel: mlx5_core 0000:00:05.0: lag map active ports: 1
            Sep 27 15:55:48 node28-lustre root[1979]: openibd: Set node_desc for mlx5_bond_0: node28-lustre HCA-1
            Sep 27 15:55:48 node28-lustre kernel: mlx5_core 0000:00:06.0 ens6np1: Link up
            Sep 27 15:55:48 node28-lustre kernel: mlx5_core 0000:00:05.0 ens5np0: Link up
            Sep 27 15:55:48 node28-lustre NetworkManager[1041]: <info>  [1727423748.3576] device (ens6np1): carrier: link connected
            Sep 27 15:55:48 node28-lustre kernel: bond1: (slave ens6np1): link status definitely up, 100000 Mbps full duplex
            Sep 27 15:55:49 node28-lustre systemd[1]: NetworkManager-dispatcher.service: Succeeded.
            Sep 27 15:55:49 node28-lustre kernel: mlx5_core 0000:00:05.0: lag map active ports: 1, 2
            Sep 27 15:55:49 node28-lustre kernel: mlx5_core 0000:00:05.0: lag map active ports: 1
            Sep 27 15:55:50 node28-lustre kernel: mlx5_core 0000:00:05.0: lag map active ports: 1, 2
            Sep 27 15:55:55 node28-lustre systemd[1]: Created slice User Slice of UID 0.
            Sep 27 15:55:55 node28-lustre systemd[1]: Starting User runtime directory /run/user/0...
            Sep 27 15:55:55 node28-lustre systemd-logind[1043]: New session 1 of user root.
            Sep 27 15:55:55 node28-lustre systemd[1]: Started User runtime directory /run/user/0.
            Sep 27 15:55:55 node28-lustre systemd[1]: Starting User Manager for UID 0...
            Sep 27 15:55:56 node28-lustre systemd[1989]: Starting D-Bus User Message Bus Socket.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Reached target Paths.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Reached target Timers.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Listening on D-Bus User Message Bus Socket.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Reached target Sockets.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Reached target Basic System.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Reached target Default.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Startup finished in 24ms.
            Sep 27 15:55:56 node28-lustre systemd[1]: Started User Manager for UID 0.
            Sep 27 15:55:56 node28-lustre systemd[1]: Started Session 1 of user root.
            Sep 27 15:55:56 node28-lustre systemd[1]: Started Session 3 of user root.
            Sep 27 15:55:56 node28-lustre systemd-logind[1043]: New session 3 of user root.
            Sep 27 15:55:56 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:56 node28-lustre kernel: nvme nvme0: creating 14 I/O queues.
            Sep 27 15:55:56 node28-lustre kernel: mlx5_ib: mlx5_set_umr_free_mkey: Translation mode supported only when access_mode is MTT or PA
            Sep 27 15:55:58 node28-lustre systemd[1]: systemd-hostnamed.service: Succeeded.
            Sep 27 15:55:59 node28-lustre kernel: nvme nvme0: mapped 14/0/0 default/read/poll queues.
            Sep 27 15:55:59 node28-lustre kernel: nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress:93c825a4fe519d19", addr 10.255.153.26:4421
            Sep 27 15:55:59 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:59 node28-lustre kernel: nvme nvme1: creating 14 I/O queues.
            Sep 27 15:56:01 node28-lustre systemd[1]: Started Session 4 of user root.
            Sep 27 15:56:02 node28-lustre kernel: nvme nvme1: mapped 14/0/0 default/read/poll queues.
            Sep 27 15:56:02 node28-lustre kernel: nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress:93c825a4fe519d19", addr 10.255.153.133:4421
            Sep 27 15:56:02 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:56:02 node28-lustre kernel: nvme nvme2: creating 14 I/O queues.
            Sep 27 15:56:05 node28-lustre kernel: nvme nvme2: mapped 14/0/0 default/read/poll queues.
            Sep 27 15:56:05 node28-lustre kernel: nvme nvme2: new ctrl: NQN "nqn.2014-08.org.nvmexpress:93c825a4fe519d19", addr 10.255.153.28:4421
            Sep 27 15:56:05 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:56:05 node28-lustre kernel: nvme nvme3: creating 14 I/O queues.
            Sep 27 15:56:08 node28-lustre kernel: nvme nvme3: mapped 14/0/0 default/read/poll queues.
            Sep 27 15:56:08 node28-lustre kernel: nvme nvme3: new ctrl: NQN "nqn.2014-08.org.nvmexpress:93c825a4fe519d19", addr 10.255.153.29:4421

             

            eth info:

            bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
                    inet 10.255.153.128  netmask 255.255.255.0  broadcast 10.255.153.255
                    ether 10:70:fd:5e:03:1c  txqueuelen 1000  (Ethernet)
                    RX packets 1023833  bytes 164896622 (157.2 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 698498  bytes 128080201 (122.1 MiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

            ens5np0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 1500
                    ether 10:70:fd:5e:03:1c  txqueuelen 1000  (Ethernet)
                    RX packets 429410  bytes 62164248 (59.2 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 398857  bytes 72638593 (69.2 MiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

            ens6np1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 1500
                    ether 10:70:fd:5e:03:1c  txqueuelen 1000  (Ethernet)
                    RX packets 594423  bytes 102732374 (97.9 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 299642  bytes 55441854 (52.8 MiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
            lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
                    inet 127.0.0.1  netmask 255.0.0.0
                    inet6 ::1  prefixlen 128  scopeid 0x10<host>
                    loop  txqueuelen 1000  (Local Loopback)
                    RX packets 3251  bytes 934240 (912.3 KiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 3251  bytes 934240 (912.3 KiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0 

            node30(crash):

            [root@node30-lustre 127.0.0.1-2024-09-27-15:55:52]# ll
            total 2358904
            -rw------- 1 root root      52042 Sep 27 15:56 kexec-dmesg.log
            -rw------- 1 root root 2415373763 Sep 27 15:56 vmcore
            -rw------- 1 root root      89731 Sep 27 15:55 vmcore-dmesg.txt
            [root@node30-lustre 127.0.0.1-2024-09-27-15:55:52]# date
            Sat Sep 28 07:20:46 CST 2024
            [root@node30-lustre 127.0.0.1-2024-09-27-15:55:52]# stat vmcore
              File: vmcore
              Size: 2415373763      Blocks: 4717528    IO Block: 4096   regular file
            Device: fd02h/64770d    Inode: 1272797     Links: 1
            Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
            Access: 2024-09-27 16:27:13.394000000 +0800
            Modify: 2024-09-27 15:56:08.100000000 +0800
            Change: 2024-09-27 15:56:08.382000000 +0800
             Birth: 2024-09-27 15:55:53.543000000 +0800 

            Thanks, shadow, you gave me very useful information.

            node30 only detected that the RDMA link to node28 had timed out, but I found that node30 crashed when node28 was powered back on.

            Do you have more detailed information about your problem? We can analyze the correlation together.

             


            shadow Alexey Lyashkov added a comment:

            Did you have any connect error events before the crash?

            It is probably the same issue I saw some time ago.
            Lustre relies on the cmid->qp field to decide whether to disconnect, but MOFED behaves differently.
            CM events may be generated for a cm_id that has no QP but is still in the connecting state, which opens a race where Lustre releases the cm_id while MOFED continues to deliver events for it.
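
            To make the suspected ordering concrete, here is one possible interleaving that is consistent with the oops backtrace (cma_netevent_work_handler -> cma_cm_event_handler -> kiblnd_cm_callback) and with the DISCONNECTED handler quoted above; whether this is exactly the race being described here is an assumption:

            /* hypothetical interleaving, for illustration only:
             *
             *   rdma_cm event path                      rdma_cm netevent worker
             *   ------------------------------          ----------------------------------
             *   kiblnd_cm_callback(DISCONNECTED)
             *     kiblnd_close_conn(conn, 0)
             *     kiblnd_conn_decref(conn)
             *     cmid->context = NULL
             *                                           cma_netevent_work_handler()
             *                                             cma_cm_event_handler(UNREACHABLE)
             *                                               kiblnd_cm_callback(UNREACHABLE)
             *                                                 conn = cmid->context   (now NULL)
             *                                                 conn->ibc_state        -> oops at offset 0x4c
             */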

            xiyan Rongyao Peng added a comment (edited):

            int
            kiblnd_cm_callback(struct rdma_cm_id *cmid, struct rdma_cm_event *event)
            ...
                    case RDMA_CM_EVENT_UNREACHABLE:
                            conn = cmid->context;
                            CNETERR("%s: UNREACHABLE %d cm_id %p conn %p ibc_state: %d\n",
                                    libcfs_nid2str(conn->ibc_peer->ibp_nid),
                                    event->status, cmid, conn, conn->ibc_state);
                            LASSERT(conn->ibc_state != IBLND_CONN_INIT);
                            if (conn->ibc_state == IBLND_CONN_ACTIVE_CONNECT ||
                                conn->ibc_state == IBLND_CONN_PASSIVE_WAIT) {
                                    kiblnd_connreq_done(conn, -ENETDOWN);
                                    kiblnd_conn_decref(conn);
                            }
                            return 0;
            ...

            conn is a NULL pointer here; no other key findings yet.
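
            For reference, a rough sketch of how the faulting offset can be cross-checked in the vmcore with the crash utility (assuming the ko2iblnd debuginfo is available; per the Code: bytes in the oops, the faulting instruction is "mov 0x4c(%r13),%ebp" with R13 == 0, i.e. a read through a NULL conn pointer):

            crash> mod -s ko2iblnd                       # load o2iblnd module debuginfo (path may differ)
            crash> dis -l kiblnd_cm_callback+0xa5d       # shows the faulting "mov 0x4c(%r13),%ebp"
            crash> struct -o kib_conn | grep ibc_state   # confirm ibc_state sits at offset 0x4c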


            People

              Assignee: xiyan Rongyao Peng
              Reporter: xiyan Rongyao Peng
              Votes: 0
              Watchers: 9
