LU-18275

o2iblnd: unable to handle kernel NULL pointer dereference in kiblnd_cm_callback when receiving RDMA_CM_EVENT_UNREACHABLE


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.5
    • Environment: Lustre server 2.15.5 RoCE
      Lustre MGS 2.15.5 RoCE
      Lustre client 2.15.5 RoCE
    • Severity: 3

    Description

      The Lustre client and server are deployed within VMs; the VMs use the network card in PF pass-through mode.

      【OS】
      VM Version: qemu-kvm-7.0.0
      OS Version: Rocky 8.10
      Kernel Version: 4.18.0-553.el8_10.x86_64

      【Network Card】
      Client:
      MLX CX6 1*100G RoCE v2
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      Server:
      MLX CX6 2*100G RoCE v2 bond
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      【BUG Info】

      Here is the reproducer:

      • Mount Lustre on a RoCE network
      • Restart the Lustre MDT (mdt0, 10.255.153.128@o2ib)
      • A crash occurs on the other Lustre servers

      Server call trace:

      crash> bt
      PID: 568423   TASK: ff4787632aa5c000  CPU: 5    COMMAND: "kworker/u40:0"
       #0 [ff7d728b15e6baa0] machine_kexec at ffffffff8fa6f353
       #1 [ff7d728b15e6baf8] __crash_kexec at ffffffff8fbbaa7a
       #2 [ff7d728b15e6bbb8] crash_kexec at ffffffff8fbbb9b1
       #3 [ff7d728b15e6bbd0] oops_end at ffffffff8fa2d831
       #4 [ff7d728b15e6bbf0] no_context at ffffffff8fa81cf3
       #5 [ff7d728b15e6bc48] __bad_area_nosemaphore at ffffffff8fa8206c
       #6 [ff7d728b15e6bc90] do_page_fault at ffffffff8fa82cf7
       #7 [ff7d728b15e6bcc0] page_fault at ffffffff906011ae
          [exception RIP: kiblnd_cm_callback+2653]
          RIP: ffffffffc0efe00d  RSP: ff7d728b15e6bd70  RFLAGS: 00010246
          RAX: 0000000000000007  RBX: ff7d728b15e6be08  RCX: 0000000000000000
          RDX: ff4787632aa5c000  RSI: ff7d728b15e6be08  RDI: ff47876095213c00
          RBP: ff47876095213c00   R8: 0000000000000000   R9: 006d635f616d6472
          R10: 8080808080808080  R11: 0000000000000000  R12: ff47876095213c00
          R13: 0000000000000000  R14: 0000000000000000  R15: ff47876095213de0
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #8 [ff7d728b15e6bdd8] cma_cm_event_handler at ffffffffc04729a5 [rdma_cm]
       #9 [ff7d728b15e6be00] cma_netevent_work_handler at ffffffffc04786b5 [rdma_cm]
      #10 [ff7d728b15e6be90] process_one_work at ffffffff8fb195e3
      #11 [ff7d728b15e6bed8] worker_thread at ffffffff8fb197d0
      #12 [ff7d728b15e6bf10] kthread at ffffffff8fb20e24
      #13 [ff7d728b15e6bf50] ret_from_fork at ffffffff9060028f 

      Server kernel log:

      [69106.143672] LustreError: 11-0: lustre-MDT000a-osp-MDT0007: operation mds_statfs to node 10.255.153.128@o2ib failed: rc = -107
      [69106.143700] Lustre: lustre-OST0038-osc-MDT0007: Connection to lustre-OST0038 (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69106.145053] LustreError: Skipped 29 previous similar messages
      [69106.145060] Lustre: Skipped 203 previous similar messages
      [69111.263490] Lustre: lustre-OST004e-osc-MDT0007: Connection to lustre-OST004e (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69111.263496] Lustre: Skipped 6 previous similar messages
      [69112.506974] Lustre: lustre-OST0038-osc-MDT0007: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69112.506980] Lustre: Skipped 195 previous similar messages
      [69116.383952] Lustre: lustre-MDT0000-lwp-MDT0007: Connection to lustre-MDT0000 (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69116.383965] Lustre: Skipped 3 previous similar messages
      [69122.528880] Lustre: lustre-MDT000a-lwp-OST0013: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69122.528885] Lustre: Skipped 3 previous similar messages
      [69127.659951] Lustre: lustre-OST0043-osc-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69127.659960] Lustre: Skipped 7 previous similar messages
      [69138.672269] Lustre: lustre-OST002d-osc-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69138.672275] Lustre: Skipped 3 previous similar messages
      [69158.168201] Lustre: lustre-MDT000a-osp-MDT0007: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69158.168206] Lustre: Skipped 1 previous similar message
      [69178.775546] Lustre: lustre-MDT0000-osp-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69178.775554] Lustre: Skipped 11 previous similar messages
      [69178.805333] Lustre: lustre-OST0008: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.805386] Lustre: lustre-OST0013: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.805505] Lustre: lustre-OST001e: deleting orphan objects from 0x0:13846 to 0x0:13921
      [69178.805729] Lustre: lustre-OST003f: deleting orphan objects from 0x0:13822 to 0x0:13857
      [69178.806122] Lustre: lustre-OST004a: deleting orphan objects from 0x0:13855 to 0x0:13889
      [69178.806130] Lustre: lustre-OST0055: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.807518] Lustre: lustre-OST0029: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.807537] Lustre: lustre-OST0034: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.838633] LustreError: 39177:0:(qsd_reint.c:635:qqi_reint_delayed()) lustre-OST0034: Delaying reintegration for qtype:2 until pending updates are flushed.
      [69178.840301] LustreError: 39177:0:(qsd_reint.c:635:qqi_reint_delayed()) Skipped 1 previous similar message
      [69180.358640] LustreError: 37895:0:(qsd_reint.c:635:qqi_reint_delayed()) lustre-OST0013: Delaying reintegration for qtype:2 until pending updates are flushed.
      [69180.359161] LustreError: 37895:0:(qsd_reint.c:635:qqi_reint_delayed()) Skipped 2 previous similar messages
      [69239.967382] BUG: unable to handle kernel NULL pointer dereference at 000000000000004c
      [69239.968927] PGD 0 
      [69239.969144] Oops: 0000 [#1] SMP NOPTI
      [69239.969327] CPU: 5 PID: 568423 Comm: kworker/u40:0 Kdump: loaded Tainted: G           OE     -------- -  - 4.18.0-553.5.1.el8_lustre.x86_64 #1
      [69239.969650] Hardware name: Red Hat KVM, BIOS 1.16.0-4.cl9 04/01/2014
      [69239.969792] Workqueue: rdma_cm cma_netevent_work_handler [rdma_cm]
      [69239.969995] RIP: 0010:kiblnd_cm_callback+0xa5d/0x1ea0 [ko2iblnd]
      [69239.970191] Code: 48 89 05 06 7f 01 00 c7 05 04 7f 01 00 00 00 02 02 48 c7 05 01 7f 01 00 d0 5e f1 c0 e8 ac 66 ee ff e9 b5 f7 ff ff 4c 8b 6f 08 <41> 8b 6d 4c f6 05 e5 d4 f0 ff 01 0f 84 8d 00 00 00 f6 05 dc d4 f0
      [69239.970483] RSP: 0018:ff7d728b15e6bd70 EFLAGS: 00010246
      [69239.970621] RAX: 0000000000000007 RBX: ff7d728b15e6be08 RCX: 0000000000000000
      [69239.970763] RDX: ff4787632aa5c000 RSI: ff7d728b15e6be08 RDI: ff47876095213c00
      [69239.970895] RBP: ff47876095213c00 R08: 0000000000000000 R09: 006d635f616d6472
      [69239.971025] R10: 8080808080808080 R11: 0000000000000000 R12: ff47876095213c00
      [69239.971155] R13: 0000000000000000 R14: 0000000000000000 R15: ff47876095213de0
      [69239.971289] FS:  0000000000000000(0000) GS:ff47877d7f740000(0000) knlGS:0000000000000000
      [69239.971425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [69239.971570] CR2: 000000000000004c CR3: 0000001900e10004 CR4: 0000000000771ee0
      [69239.971708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [69239.971841] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [69239.971971] PKRU: 55555554
      [69239.972099] Call Trace:
      [69239.972267]  ? __die_body+0x1a/0x60
      [69239.972451]  ? no_context+0x1ba/0x3f0
      [69239.972583]  ? __bad_area_nosemaphore+0x16c/0x1c0
      [69239.972705]  ? do_page_fault+0x37/0x12d
      [69239.972826]  ? page_fault+0x1e/0x30
      [69239.972971]  ? kiblnd_cm_callback+0xa5d/0x1ea0 [ko2iblnd]
      [69239.973099]  cma_cm_event_handler+0x25/0xd0 [rdma_cm]
      [69239.973234]  cma_netevent_work_handler+0x75/0xd0 [rdma_cm]
      [69239.973362]  process_one_work+0x1d3/0x390
      [69239.973516]  worker_thread+0x30/0x390
      [69239.973631]  ? process_one_work+0x390/0x390
      [69239.973743]  kthread+0x134/0x150
      [69239.973863]  ? set_kthread_struct+0x50/0x50
      [69239.973976]  ret_from_fork+0x1f/0x40
      [69239.974099] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ldiskfs(OE) mbcache jbd2 ko2iblnd(OE) lnet(OE) libcfs(OE) bonding uio_pci_generic uio vfio_pci vfio_virqfd vfio_iommu_type1 vfio cuse fuse rdma_ucm(OE) ib_ipoib(OE) ib_umad(OE) sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm cirrus drm_shmem_helper kvm_intel kvm irqbypass drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea sysfillrect ghash_clmulni_intel sysimgblt rapl drm i2c_piix4 pcspkr virtio_balloon joydev knem(OE) xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ata_generic mlx5_core(OE) mlxfw(OE) ata_piix psample pci_hyperv_intf tls crc32c_intel virtio_console virtio_blk libata serio_raw mlxdevm(OE) xpmem(OE) nvme_tcp(OE) nvme_rdma(OE) rdma_cm(OE) iw_cm(OE) nvme_fabrics(OE) nvme_core(OE) ib_cm(OE) ib_core(OE) mlx_compat(OE) t10_pi
      [69239.975388] CR2: 000000000000004c 
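
      Decoding the faulting instruction bytes from the oops: "4c 8b 6f 08" is mov r13,[rdi+0x8], and the bytes at the fault marker, "41 8b 6d 4c", are mov ebp,[r13+0x4c]. With RDI holding the callback's first argument (the struct rdma_cm_id), R13 = 0, and CR2 = 0x4c, this looks like kiblnd_cm_callback loaded cmid->context (offset 0x8 in struct rdma_cm_id), got NULL, and then faulted reading a field at offset 0x4c of the missing connection object. The backtrace also shows the event being delivered from cma_netevent_work_handler, i.e. the netevent path that raises RDMA_CM_EVENT_UNREACHABLE, rather than the normal CM event path. The following is a minimal sketch, not the actual ko2iblnd source or a proposed patch, of where the suspected dereference sits and what a defensive guard might look like; the function name kiblnd_cm_callback_sketch and the guard itself are illustrative assumptions.

      /* Minimal sketch, NOT the real ko2iblnd code: kib_conn is left opaque
       * here (the real definition lives in lnet/klnds/o2iblnd/o2iblnd.h). */
      #include <rdma/rdma_cm.h>

      struct kib_conn;

      static int
      kiblnd_cm_callback_sketch(struct rdma_cm_id *cmid,
                                struct rdma_cm_event *event)
      {
              struct kib_conn *conn;

              switch (event->event) {
              case RDMA_CM_EVENT_UNREACHABLE:
                      /* "mov r13,[rdi+0x8]": in the dump, this load of
                       * cmid->context produced R13 == 0 */
                      conn = cmid->context;
                      if (!conn) {
                              /* hypothetical guard: a netevent-generated
                               * UNREACHABLE can apparently reach a cm_id
                               * whose context is not (or no longer) set;
                               * bail out instead of dereferencing NULL */
                              return 0;
                      }
                      /* the original code goes on to read a conn field;
                       * with conn == NULL that read is the NULL + 0x4c
                       * access reported as CR2: 000000000000004c */
                      break;
              default:
                      break;
              }
              return 0;
      }

      If this reading is correct, a real fix would need to either prevent the netevent path from invoking the handler for a cm_id without a valid kib_conn context, or make kiblnd_cm_callback tolerate a NULL context; the sketch above only marks where such a guard would sit.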

    People

      Assignee: Rongyao Peng (xiyan)
      Reporter: Rongyao Peng (xiyan)
      Votes: 0
      Watchers: 8