Description
The Lustre client and server are deployed in VMs. The VMs use the network card in PF (physical function) pass-through mode.
【OS】
VM Version: qemu-kvm-7.0.0
OS Version: Rocky 8.10
Kernel Version: 4.18.0-553.el8_10.x86_64
【Network Card】
Client:
MLX CX6 1*100G RoCE v2
MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
Server:
MLX CX6 2*100G RoCE v2 bond
MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
【BUG Info】
Steps to reproduce:
- Mount Lustre over a RoCE network
- Restart the Lustre server
- The server crashes
Server call trace:
[ 1447.134016] kvm: exiting hardware virtualization
[ 1448.366745] restrack: ------------[ cut here ]------------
[ 1448.366779] infiniband mlx5_0: BUG: RESTRACK detected leak of resources
[ 1448.366808] restrack: Kernel CQ object allocated by ib_core is not freed
[ 1448.366842] restrack: ------------[ cut here ]------------
[ 1448.368635] mlx5_core 0000:00:06.0: Shutdown was called
[ 1449.106504] mlx5_core 0000:00:05.0: Shutdown was called
[ 1449.137742] mlx5_core 0000:00:05.0: mlx5_activate_lag:800:(pid 146): Failed to create LAG port selection(-67)
[ 1449.138260] mlx5_ib.rdma: probe of mlx5_core.rdma.0 failed with error -12
[ 1449.138321] mlx5_ib.rdma: probe of mlx5_core.rdma.1 failed with error -12
[ 1449.138360] mlx5_core 0000:00:05.0: mlx5_create_match_definer:3865:(pid 146): Failed to create match definer (-67)
[ 1449.138422] general protection fault, probably for non-canonical address 0x24eb755e39b8265c: 0000 [#1] SMP NOPTI
[ 1449.138473] CPU: 14 PID: 146 Comm: kworker/u40:14 Kdump: loaded Tainted: G W OE -------- - - 4.18.0-553.5.1.el8_lustre.x86_64 #1
[ 1449.138531] Hardware name: Red Hat KVM, BIOS 1.16.0-4.cl9 04/01/2014
[ 1449.138579] Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
[ 1449.138706] RIP: 0010:mlx5_del_flow_rules+0x16/0x140 [mlx5_core]
[ 1449.138794] Code: 89 d8 e9 a5 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 41 56 31 f6 41 55 49 89 fd 41 54 55 53 48 8b 47 08 <4c> 8b 60 28 4c 89 e7 e8 5e bf ff ff 41 8b 45 00 83 e8 01 78 2c 48
[ 1449.138878] RSP: 0018:ff1c7b6c467b3cc0 EFLAGS: 00010246
[ 1449.138898] RAX: 24eb755e39b8265c RBX: 0000000000000000 RCX: 0000000000000000
[ 1449.138931] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff18e596d900a5d0
[ 1449.138980] RBP: ff18e597287b7c00 R08: 0000000080000000 R09: ff18e5972816a5c0
[ 1449.139014] R10: 0000000000000001 R11: ff1c7b6c467b39e0 R12: ff18e596d8847800
[ 1449.139032] R13: ff18e596d900a5d0 R14: 0000000000000001 R15: ff1c7b6c467b3e48
[ 1449.139064] FS: 0000000000000000(0000) GS:ff18e5b57f980000(0000) knlGS:0000000000000000
[ 1449.139110] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1449.139137] CR2: 000055efdbdd3b10 CR3: 0000000cc9010001 CR4: 0000000000771ee0
[ 1449.139158] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1449.139194] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1449.139239] PKRU: 55555554
[ 1449.139278] Call Trace:
[ 1449.139327] ? __die_body+0x1a/0x60
[ 1449.139360] ? die_addr+0x38/0x51
[ 1449.139371] ? do_general_protection+0x135/0x280
[ 1449.139386] ? general_protection+0x1e/0x30
[ 1449.139401] ? mlx5_del_flow_rules+0x16/0x140 [mlx5_core]
[ 1449.139505] mlx5_lag_destroy_definer+0x40/0x90 [mlx5_core]
[ 1449.139620] mlx5_lag_destroy_definers+0x45/0x80 [mlx5_core]
[ 1449.139721] mlx5_lag_port_sel_create+0x121/0x1c0 [mlx5_core]
[ 1449.139826] mlx5_activate_lag+0xbc/0x1c0 [mlx5_core]
[ 1449.139924] ? kfree+0xd3/0x250
[ 1449.139957] ? mlx5_rescan_drivers_locked+0x129/0x1a0 [mlx5_core]
[ 1449.140051] mlx5_do_bond_work+0x451/0x630 [mlx5_core]
[ 1449.140156] process_one_work+0x1d3/0x390
[ 1449.140200] worker_thread+0x30/0x390
[ 1449.140240] ? process_one_work+0x390/0x390
[ 1449.140267] kthread+0x134/0x150
[ 1449.140279] ? set_kthread_struct+0x50/0x50
[ 1449.140291] ret_from_fork+0x1f/0x40
[ 1449.140308] Modules linked in: bonding uio_pci_generic uio vfio_pci vfio_virqfd vfio_iommu_type1 vfio cuse fuse rdma_ucm(OE) ib_ipoib(OE) ib_umad(OE) sunrpc mlx5_ib(OE) ib_uverbs(OE) intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm kvm_intel cirrus drm_shmem_helper drm_kms_helper kvm syscopyarea sysfillrect sysimgblt irqbypass crct10dif_pclmul crc32_pclmul drm mlx5_core(OE) ghash_clmulni_intel mlxdevm(OE) rapl psample mlxfw(OE) tls pcspkr joydev i2c_piix4 virtio_balloon pci_hyperv_intf knem(OE) xfs libcrc32c ata_generic ata_piix libata crc32c_intel virtio_console virtio_blk serio_raw xpmem(OE) nvme_tcp(OE) nvme_rdma(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_core(OE) nvme_fabrics(OE) nvme_core(OE) mlx_compat(OE) t10_pi
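To make the crash path in the trace above easier to read, here is a minimal Python sketch that extracts only the reliable frames from a kernel call trace, skipping the entries prefixed with "?", which the kernel marks as unreliable. The function name reliable_frames and the SAMPLE_TRACE excerpt are illustrative assumptions, not part of any Lustre or MLNX_OFED tooling; the sample lines are copied from the log above.

```python
import re

# Matches lines such as:
#   [ 1449.139505] mlx5_lag_destroy_definer+0x40/0x90 [mlx5_core]
#   [ 1449.139327] ? __die_body+0x1a/0x60
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+"            # dmesg timestamp
    r"(?P<unreliable>\?\s+)?"         # "?" marks an unreliable frame
    r"(?P<func>[A-Za-z_][\w.]*)"      # symbol name
    r"\+0x[0-9a-f]+/0x[0-9a-f]+"      # offset/size
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"  # optional module name
)


def reliable_frames(log_text):
    """Return (function, module) pairs for reliable frames, innermost first.

    Hypothetical helper: scans the text after "Call Trace:" and stops at
    the first line that is not a stack frame.
    """
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.match(line.strip())
        if match is None:
            break  # end of the trace (e.g. "Modules linked in:")
        if match.group("unreliable"):
            continue  # skip "?" entries
        frames.append((match.group("func"), match.group("module")))
    return frames


# Excerpt of the server call trace from this report.
SAMPLE_TRACE = """\
[ 1449.139278] Call Trace:
[ 1449.139327] ? __die_body+0x1a/0x60
[ 1449.139401] ? mlx5_del_flow_rules+0x16/0x140 [mlx5_core]
[ 1449.139505] mlx5_lag_destroy_definer+0x40/0x90 [mlx5_core]
[ 1449.139620] mlx5_lag_destroy_definers+0x45/0x80 [mlx5_core]
[ 1449.139721] mlx5_lag_port_sel_create+0x121/0x1c0 [mlx5_core]
[ 1449.139826] mlx5_activate_lag+0xbc/0x1c0 [mlx5_core]
[ 1449.139924] ? kfree+0xd3/0x250
[ 1449.140051] mlx5_do_bond_work+0x451/0x630 [mlx5_core]
[ 1449.140156] process_one_work+0x1d3/0x390
[ 1449.140308] Modules linked in: bonding
"""

if __name__ == "__main__":
    for func, module in reliable_frames(SAMPLE_TRACE):
        print(f"{func} [{module}]" if module else func)
```

Read from the bottom up, the reliable frames suggest that the bond work item (mlx5_do_bond_work) called mlx5_activate_lag, and that mlx5_lag_port_sel_create then reached mlx5_lag_destroy_definers, apparently as cleanup after the "Failed to create match definer (-67)" message. The general protection fault itself is reported at RIP mlx5_del_flow_rules+0x16, with RAX holding the non-canonical address 0x24eb755e39b8265c.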