Details
Type: Bug
Resolution: Unresolved
Priority: Major
Labels: None
Affects Version: Lustre 2.15.5
Environment:
Lustre server 2.15.5 RoCE
Lustre MGS 2.15.5 RoCE
Lustre client 2.15.5 RoCE
Description
The Lustre clients and servers are deployed in VMs; the VMs use NIC PF pass-through mode.
[OS]
VM Version: qemu-kvm-7.0.0
OS Version: Rocky 8.10
Kernel Version: 4.18.0-553.el8_10.x86_64
[Network Card]
Client:
MLX CX6 1*100G RoCE v2
MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
Server:
MLX CX6 2*100G RoCE v2 bond
MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
[BUG Info]
Reproducer:
- Mount Lustre on a RoCE network
- Restart the Lustre MDT (mdt0, 10.255.153.128@o2ib)
- A crash occurs on the other Lustre servers
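The steps above can be sketched as a dry-run script. The NIDs and the fsname "lustre" are taken from the report; device paths and mount points are assumptions, and every command is echoed rather than executed:

```shell
#!/bin/sh
# Dry-run sketch of the reproducer. NIDs and fsname come from the report;
# device paths and mount points are assumptions.
run() { echo "+ $*"; }

# 1. Mount Lustre over the RoCE (o2ib) network on a client
run mount -t lustre 10.255.153.128@o2ib:/lustre /mnt/lustre

# 2. Restart the MDT (mdt0) served from 10.255.153.128@o2ib
run umount /mnt/mdt0
run mount -t lustre /dev/mapper/mdt0 /mnt/mdt0

# 3. Watch the other servers: the oops fires in kiblnd_cm_callback
#    (via cma_netevent_work_handler) while peers reconnect
run dmesg -w
```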
Server call trace:
crash> bt
PID: 568423 TASK: ff4787632aa5c000 CPU: 5 COMMAND: "kworker/u40:0"
#0 [ff7d728b15e6baa0] machine_kexec at ffffffff8fa6f353
#1 [ff7d728b15e6baf8] __crash_kexec at ffffffff8fbbaa7a
#2 [ff7d728b15e6bbb8] crash_kexec at ffffffff8fbbb9b1
#3 [ff7d728b15e6bbd0] oops_end at ffffffff8fa2d831
#4 [ff7d728b15e6bbf0] no_context at ffffffff8fa81cf3
#5 [ff7d728b15e6bc48] __bad_area_nosemaphore at ffffffff8fa8206c
#6 [ff7d728b15e6bc90] do_page_fault at ffffffff8fa82cf7
#7 [ff7d728b15e6bcc0] page_fault at ffffffff906011ae
[exception RIP: kiblnd_cm_callback+2653]
RIP: ffffffffc0efe00d RSP: ff7d728b15e6bd70 RFLAGS: 00010246
RAX: 0000000000000007 RBX: ff7d728b15e6be08 RCX: 0000000000000000
RDX: ff4787632aa5c000 RSI: ff7d728b15e6be08 RDI: ff47876095213c00
RBP: ff47876095213c00 R8: 0000000000000000 R9: 006d635f616d6472
R10: 8080808080808080 R11: 0000000000000000 R12: ff47876095213c00
R13: 0000000000000000 R14: 0000000000000000 R15: ff47876095213de0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ff7d728b15e6bdd8] cma_cm_event_handler at ffffffffc04729a5 [rdma_cm]
#9 [ff7d728b15e6be00] cma_netevent_work_handler at ffffffffc04786b5 [rdma_cm]
#10 [ff7d728b15e6be90] process_one_work at ffffffff8fb195e3
#11 [ff7d728b15e6bed8] worker_thread at ffffffff8fb197d0
#12 [ff7d728b15e6bf10] kthread at ffffffff8fb20e24
#13 [ff7d728b15e6bf50] ret_from_fork at ffffffff9060028f
Server kernel log:
[69106.143672] LustreError: 11-0: lustre-MDT000a-osp-MDT0007: operation mds_statfs to node 10.255.153.128@o2ib failed: rc = -107
[69106.143700] Lustre: lustre-OST0038-osc-MDT0007: Connection to lustre-OST0038 (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[69106.145053] LustreError: Skipped 29 previous similar messages
[69106.145060] Lustre: Skipped 203 previous similar messages
[69111.263490] Lustre: lustre-OST004e-osc-MDT0007: Connection to lustre-OST004e (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[69111.263496] Lustre: Skipped 6 previous similar messages
[69112.506974] Lustre: lustre-OST0038-osc-MDT0007: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
[69112.506980] Lustre: Skipped 195 previous similar messages
[69116.383952] Lustre: lustre-MDT0000-lwp-MDT0007: Connection to lustre-MDT0000 (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[69116.383965] Lustre: Skipped 3 previous similar messages
[69122.528880] Lustre: lustre-MDT000a-lwp-OST0013: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
[69122.528885] Lustre: Skipped 3 previous similar messages
[69127.659951] Lustre: lustre-OST0043-osc-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
[69127.659960] Lustre: Skipped 7 previous similar messages
[69138.672269] Lustre: lustre-OST002d-osc-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
[69138.672275] Lustre: Skipped 3 previous similar messages
[69158.168201] Lustre: lustre-MDT000a-osp-MDT0007: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
[69158.168206] Lustre: Skipped 1 previous similar message
[69178.775546] Lustre: lustre-MDT0000-osp-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
[69178.775554] Lustre: Skipped 11 previous similar messages
[69178.805333] Lustre: lustre-OST0008: deleting orphan objects from 0x0:13854 to 0x0:13889
[69178.805386] Lustre: lustre-OST0013: deleting orphan objects from 0x0:13854 to 0x0:13889
[69178.805505] Lustre: lustre-OST001e: deleting orphan objects from 0x0:13846 to 0x0:13921
[69178.805729] Lustre: lustre-OST003f: deleting orphan objects from 0x0:13822 to 0x0:13857
[69178.806122] Lustre: lustre-OST004a: deleting orphan objects from 0x0:13855 to 0x0:13889
[69178.806130] Lustre: lustre-OST0055: deleting orphan objects from 0x0:13854 to 0x0:13889
[69178.807518] Lustre: lustre-OST0029: deleting orphan objects from 0x0:13854 to 0x0:13889
[69178.807537] Lustre: lustre-OST0034: deleting orphan objects from 0x0:13854 to 0x0:13889
[69178.838633] LustreError: 39177:0:(qsd_reint.c:635:qqi_reint_delayed()) lustre-OST0034: Delaying reintegration for qtype:2 until pending updates are flushed.
[69178.840301] LustreError: 39177:0:(qsd_reint.c:635:qqi_reint_delayed()) Skipped 1 previous similar message
[69180.358640] LustreError: 37895:0:(qsd_reint.c:635:qqi_reint_delayed()) lustre-OST0013: Delaying reintegration for qtype:2 until pending updates are flushed.
[69180.359161] LustreError: 37895:0:(qsd_reint.c:635:qqi_reint_delayed()) Skipped 2 previous similar messages
[69239.967382] BUG: unable to handle kernel NULL pointer dereference at 000000000000004c
[69239.968927] PGD 0
[69239.969144] Oops: 0000 [#1] SMP NOPTI
[69239.969327] CPU: 5 PID: 568423 Comm: kworker/u40:0 Kdump: loaded Tainted: G OE -------- - - 4.18.0-553.5.1.el8_lustre.x86_64 #1
[69239.969650] Hardware name: Red Hat KVM, BIOS 1.16.0-4.cl9 04/01/2014
[69239.969792] Workqueue: rdma_cm cma_netevent_work_handler [rdma_cm]
[69239.969995] RIP: 0010:kiblnd_cm_callback+0xa5d/0x1ea0 [ko2iblnd]
[69239.970191] Code: 48 89 05 06 7f 01 00 c7 05 04 7f 01 00 00 00 02 02 48 c7 05 01 7f 01 00 d0 5e f1 c0 e8 ac 66 ee ff e9 b5 f7 ff ff 4c 8b 6f 08 <41> 8b 6d 4c f6 05 e5 d4 f0 ff 01 0f 84 8d 00 00 00 f6 05 dc d4 f0
[69239.970483] RSP: 0018:ff7d728b15e6bd70 EFLAGS: 00010246
[69239.970621] RAX: 0000000000000007 RBX: ff7d728b15e6be08 RCX: 0000000000000000
[69239.970763] RDX: ff4787632aa5c000 RSI: ff7d728b15e6be08 RDI: ff47876095213c00
[69239.970895] RBP: ff47876095213c00 R08: 0000000000000000 R09: 006d635f616d6472
[69239.971025] R10: 8080808080808080 R11: 0000000000000000 R12: ff47876095213c00
[69239.971155] R13: 0000000000000000 R14: 0000000000000000 R15: ff47876095213de0
[69239.971289] FS: 0000000000000000(0000) GS:ff47877d7f740000(0000) knlGS:0000000000000000
[69239.971425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[69239.971570] CR2: 000000000000004c CR3: 0000001900e10004 CR4: 0000000000771ee0
[69239.971708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[69239.971841] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[69239.971971] PKRU: 55555554
[69239.972099] Call Trace:
[69239.972267] ? __die_body+0x1a/0x60
[69239.972451] ? no_context+0x1ba/0x3f0
[69239.972583] ? __bad_area_nosemaphore+0x16c/0x1c0
[69239.972705] ? do_page_fault+0x37/0x12d
[69239.972826] ? page_fault+0x1e/0x30
[69239.972971] ? kiblnd_cm_callback+0xa5d/0x1ea0 [ko2iblnd]
[69239.973099] cma_cm_event_handler+0x25/0xd0 [rdma_cm]
[69239.973234] cma_netevent_work_handler+0x75/0xd0 [rdma_cm]
[69239.973362] process_one_work+0x1d3/0x390
[69239.973516] worker_thread+0x30/0x390
[69239.973631] ? process_one_work+0x390/0x390
[69239.973743] kthread+0x134/0x150
[69239.973863] ? set_kthread_struct+0x50/0x50
[69239.973976] ret_from_fork+0x1f/0x40
[69239.974099] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ldiskfs(OE) mbcache jbd2 ko2iblnd(OE) lnet(OE) libcfs(OE) bonding uio_pci_generic uio vfio_pci vfio_virqfd vfio_iommu_type1 vfio cuse fuse rdma_ucm(OE) ib_ipoib(OE) ib_umad(OE) sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm cirrus drm_shmem_helper kvm_intel kvm irqbypass drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea sysfillrect ghash_clmulni_intel sysimgblt rapl drm i2c_piix4 pcspkr virtio_balloon joydev knem(OE) xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ata_generic mlx5_core(OE) mlxfw(OE) ata_piix psample pci_hyperv_intf tls crc32c_intel virtio_console virtio_blk libata serio_raw mlxdevm(OE) xpmem(OE) nvme_tcp(OE) nvme_rdma(OE) rdma_cm(OE) iw_cm(OE) nvme_fabrics(OE) nvme_core(OE) ib_cm(OE) ib_core(OE) mlx_compat(OE) t10_pi
[69239.975388] CR2: 000000000000004c
Attachments
Issue Links
- is related to: LU-18364 rdma_cm: unable to handle kernel NULL pointer dereference in process_one_work when disconnect (Open)