LU-18426: Lustre server crashes in mlx5_del_flow_rules on reboot

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.15.5
    • Lustre server 2.15.5 RoCE
      Lustre MGS 2.15.5 RoCE
      Lustre client 2.15.5 RoCE
    • 3

    Description

      The Lustre client and server are deployed inside VMs, and the VMs use the network card in PF (physical function) pass-through mode.

      【OS】
      VM Version: qemu-kvm-7.0.0
      OS Version: Rocky 8.10
      Kernel Version: 4.18.0-553.el8_10.x86_64

      【Network Card】
      Client:
      MLX CX6 1*100G RoCE v2
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      Server:
      MLX CX6 2*100G RoCE v2 bond
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      【BUG Info】

      Steps to reproduce:

      • Mount Lustre over a RoCE network
      • Restart (reboot) the Lustre server
      • The server crashes during the restart

      Server call trace:

      [ 1447.134016] kvm: exiting hardware virtualization
      [ 1448.366745] restrack: ------------[ cut here ]------------
      [ 1448.366779] infiniband mlx5_0: BUG: RESTRACK detected leak of resources
      [ 1448.366808] restrack: Kernel CQ object allocated by ib_core is not freed
      [ 1448.366842] restrack: ------------[ cut here ]------------
      [ 1448.368635] mlx5_core 0000:00:06.0: Shutdown was called
      [ 1449.106504] mlx5_core 0000:00:05.0: Shutdown was called
      [ 1449.137742] mlx5_core 0000:00:05.0: mlx5_activate_lag:800:(pid 146): Failed to create LAG port selection(-67)
      [ 1449.138260] mlx5_ib.rdma: probe of mlx5_core.rdma.0 failed with error -12
      [ 1449.138321] mlx5_ib.rdma: probe of mlx5_core.rdma.1 failed with error -12
      [ 1449.138360] mlx5_core 0000:00:05.0: mlx5_create_match_definer:3865:(pid 146): Failed to create match definer (-67)
      [ 1449.138422] general protection fault, probably for non-canonical address 0x24eb755e39b8265c: 0000 [#1] SMP NOPTI
      [ 1449.138473] CPU: 14 PID: 146 Comm: kworker/u40:14 Kdump: loaded Tainted: G        W  OE     -------- -  - 4.18.0-553.5.1.el8_lustre.x86_64 #1
      [ 1449.138531] Hardware name: Red Hat KVM, BIOS 1.16.0-4.cl9 04/01/2014
      [ 1449.138579] Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
      [ 1449.138706] RIP: 0010:mlx5_del_flow_rules+0x16/0x140 [mlx5_core]
      [ 1449.138794] Code: 89 d8 e9 a5 fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 41 56 31 f6 41 55 49 89 fd 41 54 55 53 48 8b 47 08 <4c> 8b 60 28 4c 89 e7 e8 5e bf ff ff 41 8b 45 00 83 e8 01 78 2c 48
      [ 1449.138878] RSP: 0018:ff1c7b6c467b3cc0 EFLAGS: 00010246
      [ 1449.138898] RAX: 24eb755e39b8265c RBX: 0000000000000000 RCX: 0000000000000000
      [ 1449.138931] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff18e596d900a5d0
      [ 1449.138980] RBP: ff18e597287b7c00 R08: 0000000080000000 R09: ff18e5972816a5c0
      [ 1449.139014] R10: 0000000000000001 R11: ff1c7b6c467b39e0 R12: ff18e596d8847800
      [ 1449.139032] R13: ff18e596d900a5d0 R14: 0000000000000001 R15: ff1c7b6c467b3e48
      [ 1449.139064] FS:  0000000000000000(0000) GS:ff18e5b57f980000(0000) knlGS:0000000000000000
      [ 1449.139110] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1449.139137] CR2: 000055efdbdd3b10 CR3: 0000000cc9010001 CR4: 0000000000771ee0
      [ 1449.139158] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1449.139194] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1449.139239] PKRU: 55555554
      [ 1449.139278] Call Trace:
      [ 1449.139327]  ? __die_body+0x1a/0x60
      [ 1449.139360]  ? die_addr+0x38/0x51
      [ 1449.139371]  ? do_general_protection+0x135/0x280
      [ 1449.139386]  ? general_protection+0x1e/0x30
      [ 1449.139401]  ? mlx5_del_flow_rules+0x16/0x140 [mlx5_core]
      [ 1449.139505]  mlx5_lag_destroy_definer+0x40/0x90 [mlx5_core]
      [ 1449.139620]  mlx5_lag_destroy_definers+0x45/0x80 [mlx5_core]
      [ 1449.139721]  mlx5_lag_port_sel_create+0x121/0x1c0 [mlx5_core]
      [ 1449.139826]  mlx5_activate_lag+0xbc/0x1c0 [mlx5_core]
      [ 1449.139924]  ? kfree+0xd3/0x250
      [ 1449.139957]  ? mlx5_rescan_drivers_locked+0x129/0x1a0 [mlx5_core]
      [ 1449.140051]  mlx5_do_bond_work+0x451/0x630 [mlx5_core]
      [ 1449.140156]  process_one_work+0x1d3/0x390
      [ 1449.140200]  worker_thread+0x30/0x390
      [ 1449.140240]  ? process_one_work+0x390/0x390
      [ 1449.140267]  kthread+0x134/0x150
      [ 1449.140279]  ? set_kthread_struct+0x50/0x50
      [ 1449.140291]  ret_from_fork+0x1f/0x40
      [ 1449.140308] Modules linked in: bonding uio_pci_generic uio vfio_pci vfio_virqfd vfio_iommu_type1 vfio cuse fuse rdma_ucm(OE) ib_ipoib(OE) ib_umad(OE) sunrpc mlx5_ib(OE) ib_uverbs(OE) intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm kvm_intel cirrus drm_shmem_helper drm_kms_helper kvm syscopyarea sysfillrect sysimgblt irqbypass crct10dif_pclmul crc32_pclmul drm mlx5_core(OE) ghash_clmulni_intel mlxdevm(OE) rapl psample mlxfw(OE) tls pcspkr joydev i2c_piix4 virtio_balloon pci_hyperv_intf knem(OE) xfs libcrc32c ata_generic ata_piix libata crc32c_intel virtio_console virtio_blk serio_raw xpmem(OE) nvme_tcp(OE) nvme_rdma(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_core(OE) nvme_fabrics(OE) nvme_core(OE) mlx_compat(OE) t10_pi
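A few details of the trace can be checked mechanically. Under Linux errno numbering, -67 is ENOLINK and -12 is ENOMEM. The bytes just before the marked fault point in the Code: line (`48 8b 47 08 <4c> 8b 60 28`) decode to a load of RAX from 8(%rdi) followed by a load from 0x28(%rax), so the fault is a dereference of the value in RAX. The sketch below is a minimal triage helper, not part of the original report: the function names are hypothetical, the errno table is a hard-coded subset, and it assumes the usual x86-64 rule that an address is canonical only when its unused upper bits are all zeros or all ones (57 address bits when CR4.LA57, bit 12, is set, otherwise 48).

```python
import re

# Linux errno numbers seen in this trace (hard-coded subset so the sketch
# gives the same answer on any platform).
LINUX_ERRNO = {12: "ENOMEM", 67: "ENOLINK"}

# Two lines copied from the trace above.
SAMPLE = (
    "mlx5_core 0000:00:05.0: mlx5_activate_lag:800:(pid 146): "
    "Failed to create LAG port selection(-67)\n"
    "mlx5_ib.rdma: probe of mlx5_core.rdma.0 failed with error -12\n"
)


def decode_errors(log_text):
    """Return (number, name) for each '(-N)' or 'error -N' in the log text."""
    numbers = [int(m.group(1))
               for m in re.finditer(r"(?:\(|error )-(\d+)", log_text)]
    return [(n, LINUX_ERRNO.get(n, "unknown")) for n in numbers]


def va_bits_from_cr4(cr4):
    """57 virtual-address bits if CR4.LA57 (bit 12) is set, otherwise 48."""
    return 57 if cr4 & (1 << 12) else 48


def is_canonical(addr, va_bits):
    """True if bits 63 down to va_bits-1 of addr are all zeros or all ones."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


if __name__ == "__main__":
    bits = va_bits_from_cr4(0x771EE0)  # CR4 from the register dump
    print(decode_errors(SAMPLE))
    print(bits, is_canonical(0x24EB755E39B8265C, bits))  # RAX at the fault
    print(bits, is_canonical(0xFF18E596D900A5D0, bits))  # RDI (first argument)
```

With the values from this dump, the RAX value is reported as non-canonical while RDI is a valid kernel address. That is consistent with the pointer loaded from the structure at RDI being garbage (for example uninitialized or already freed memory) on the cleanup path that runs after the match definer creation fails, although the trace alone does not prove which.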
       

       

          People

            Assignee: WC Triage (wc-triage)
            Reporter: Yuan Liu (yuan.liu)