Lustre / LU-18364

rdma_cm: unable to handle kernel NULL pointer dereference in process_one_work when disconnect

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.5
    • Environment:
      Lustre server 2.15.5 RoCE
      Lustre MGS 2.15.5 RoCE
      Lustre client 2.15.5 RoCE
    • Severity: 3

    Description

      1. The Lustre client and server are deployed in VMs. The VMs use network card PF pass-through mode.

      【OS】
      VM Version: qemu-kvm-7.0.0
      OS Version: Rocky 8.10
      Kernel Version: 4.18.0-553.el8_10.x86_64

      【Network Card】
      Client:
      MLX CX6 1*100G RoCE v2
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      Server:
      MLX CX6 2*100G RoCE v2 bond
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      【BUG Info】

      The reproducer is as follows (see the example mount command below):

      • Mount Lustre on a RoCE network
      • Trigger a Lustre server reboot
      • A crash occurs on the server
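
      For reference, a typical client mount over an o2ib (RoCE) LNet, assuming the MGS NID is 10.255.40.5@o2ib (an NID taken from the log below; substitute the actual MGS NID, fsname and mount point):

      mount -t lustre 10.255.40.5@o2ib:/lustre /mnt/lustre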

      Server call trace:

      crash> bt
      PID: 144    TASK: ff1f28f603dcc000  CPU: 4   COMMAND: "kworker/u40:12"
       #0 [ff310f004368bbc0] machine_kexec at ffffffffadc6f353
       #1 [ff310f004368bc18] __crash_kexec at ffffffffaddbaa7a
       #2 [ff310f004368bcd8] crash_kexec at ffffffffaddbb9b1
       #3 [ff310f004368bcf0] oops_end at ffffffffadc2d831
       #4 [ff310f004368bd10] no_context at ffffffffadc81cf3
       #5 [ff310f004368bd68] __bad_area_nosemaphore at ffffffffadc8206c
       #6 [ff310f004368bdb0] do_page_fault at ffffffffadc82cf7
       #7 [ff310f004368bde0] page_fault at ffffffffae8011ae
          [exception RIP: process_one_work+46]
          RIP: ffffffffadd1943e  RSP: ff310f004368be98  RFLAGS: 00010046
          RAX: 0000000000000000  RBX: ff1f28f60a7575d8  RCX: ff1f28f6aab70760
          RDX: 00000000fffeae01  RSI: ff1f28f60a7575d8  RDI: ff1f28f603dca840
          RBP: ff1f28f600019400   R8: 00000000000000ad   R9: ff310f004368bb88
          R10: ff310f004368bd68  R11: ff1f28f6cb1550ac  R12: 0000000000000000
          R13: ff1f28f600019420  R14: ff1f28f6000194d0  R15: ff1f28f603dca840
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #8 [ff310f004368bed8] worker_thread at ffffffffadd197d0
       #9 [ff310f004368bf10] kthread at ffffffffadd20e24
      #10 [ff310f004368bf50] ret_from_fork at ffffffffae80028f

       

      Server kernel log:

      [ 50.700202] Lustre: Lustre: Build Version: 2.15.5
      [ 50.717961] LNet: Using FastReg for registration
      [ 50.876539] LNet: Added LNI 10.255.40.5@o2ib [8/256/0/180]
      [ 50.974248] LDISKFS-fs (nvme0n1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [ 52.201495] LDISKFS-fs (nvme0n2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      .............................................
      [ 105.395060] Lustre: lustre-OST000c: deleting orphan objects from 0x400000402:1506 to 0x400000402:1569
      [ 105.396348] Lustre: lustre-OST0003: deleting orphan objects from 0x340000401:6 to 0x340000401:1793
      [ 105.396611] Lustre: lustre-OST000c: deleting orphan objects from 0x0:3000 to 0x0:3041
      ................................................
      [ 162.093229] LustreError: 137-5: lustre-OST0007_UUID: not available for connect from 10.255.102.59@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 162.093412] LustreError: Skipped 3 previous similar messages
      [ 162.276036] hrtimer: interrupt took 5325 ns
      [ 162.320673] LDISKFS-fs warning (device nvme0n14): ldiskfs_multi_mount_protect:331: MMP interval 42 higher than expected, please wait.
      [ 183.775739] LDISKFS-fs warning (device nvme0n14): ldiskfs_multi_mount_protect:344: Device is already active on another node.
      [ 183.775759] LDISKFS-fs warning (device nvme0n14): ldiskfs_multi_mount_protect:344: MMP failure info: last update time: 1728560802, last update node: node2-lustre, last update device: nvme0n14
      [ 183.775924] LustreError: 7105:0:(osd_handler.c:8111:osd_mount()) lustre-OST000d-osd: can't mount /dev/nvme0n14: -22
      [ 183.776234] LustreError: 7105:0:(obd_config.c:774:class_setup()) setup lustre-OST000d-osd failed (-22)
      [ 183.776330] LustreError: 7105:0:(obd_mount.c:200:lustre_start_simple()) lustre-OST000d-osd setup error -22
      [ 183.776495] LustreError: 7105:0:(obd_mount_server.c:1993:server_fill_super()) Unable to start osd on /dev/nvme0n14: -22
      [ 183.776600] LustreError: 7105:0:(super25.c:183:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -22
      [ 184.223017] LDISKFS-fs (nvme0n14): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
      [ 184.354454] Lustre: lustre-OST000d: Imperative Recovery not enabled, recovery window 300-900
      [ 184.354461] Lustre: Skipped 5 previous similar messages
      [ 186.335038] Lustre: 4064:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1728560819/real 0] req@00000000c5c19397 x1812527255153280/t0(0) o400->lustre-MDT0002-lwp-OST000c@10.255.40.6@o2ib:12/10 lens 224/224 e 0 to 1 dl 1728560826 ref 2 fl Rpc:XNr/0/ffffffff rc 0/-1 job:''
      [ 186.335045] Lustre: 4064:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [ 186.335049] Lustre: lustre-MDT0000-lwp-OST000c: Connection to lustre-MDT0000 (at 10.255.40.6@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [ 191.279301] Lustre: lustre-OST000d: Will be in recovery for at least 5:00, or until 4 clients reconnect
      [ 191.279307] Lustre: Skipped 4 previous similar messages
      [ 203.233227] Lustre: lustre-MDT0000-lwp-OST000c: Connection restored to 10.255.40.7@o2ib (at 10.255.40.7@o2ib)
      [ 208.086625] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.255.40.7@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 208.086693] Lustre: lustre-OST000d: Denying connection for new client lustre-MDT0002-mdtlov_UUID (at 10.255.40.7@o2ib), waiting for 4 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 4:42
      [ 208.107410] Lustre: lustre-OST000d: Recovery over after 0:17, of 4 clients 4 recovered and 0 were evicted.
      [ 208.107414] Lustre: Skipped 4 previous similar messages
      [ 208.109912] Lustre: lustre-OST000d: deleting orphan objects from 0x580000402:2050 to 0x580000402:2081
      [ 208.110745] Lustre: lustre-OST000d: deleting orphan objects from 0x580000401:8 to 0x580000401:2017
      [ 208.353096] Lustre: lustre-MDT0000-lwp-OST0009: Connection restored to 10.255.40.7@o2ib (at 10.255.40.7@o2ib)
      [ 208.353099] Lustre: Skipped 1 previous similar message
      [ 208.945247] Lustre: lustre-OST0000: deleting orphan objects from 0x0:3128 to 0x0:3201
      .........................................................................................
      [ 213.409120] Lustre: lustre-MDT0000-lwp-OST0006: Connection restored to 10.255.40.7@o2ib (at 10.255.40.7@o2ib)
      [ 213.409125] Lustre: Skipped 7 previous similar messages
      [ 213.472526] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008

      Activity
            yuan.liu Yuan Liu added a comment - - edited

            Hi eaujames,

            We've found a stable reproduction procedure for the crash issue:
            1. We use only one network card, without bonding.
            2. Run a vdbench read/write test case on the Lustre client.
            3. Construct an ARP update for a Lustre server IP address on the Lustre client.

            For example, if the Lustre client IP is 192.168.122.220 and the Lustre server IP is 192.168.122.115, run "arp -s 192.168.122.115 10:71:fc:69:92:b8 && arp -d 192.168.122.115" on 192.168.122.220, where 10:71:fc:69:92:b8 is a wrong MAC address. The commands are consolidated below.
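
            The same reproduction as shell commands (IP addresses and MAC as in the example above):

                # on the Lustre client 192.168.122.220, while vdbench is running
                arp -s 192.168.122.115 10:71:fc:69:92:b8   # pin the server's ARP entry to a wrong MAC
                arp -d 192.168.122.115                     # delete it again to force a neighbour update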

            The crash stack is below:

                  KERNEL: /usr/lib/debug/lib/modules/4.18.0-553.el8_10.x86_64/vmlinux  [TAINTED]
                DUMPFILE: vmcore  [PARTIAL DUMP]
                    CPUS: 20
                    DATE: Tue Dec  3 14:58:41 CST 2024
                  UPTIME: 00:06:20
            LOAD AVERAGE: 10.14, 2.56, 0.86
                   TASKS: 1076
                NODENAME: rocky8vm3
                 RELEASE: 4.18.0-553.el8_10.x86_64
                 VERSION: #1 SMP Fri May 24 13:05:10 UTC 2024
                 MACHINE: x86_64  (2599 Mhz)
                  MEMORY: 31.4 GB
                   PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008"
                     PID: 607
                 COMMAND: "kworker/u40:28"
                    TASK: ff1e34360b6e0000  [THREAD_INFO: ff1e34360b6e0000]
                     CPU: 1
               STATE: TASK_RUNNING (PANIC)

            crash> bt
            PID: 607      TASK: ff1e34360b6e0000  CPU: 1    COMMAND: "kworker/u40:28"
             #0 [ff4de14b444cbbc0] machine_kexec at ffffffff9c46f2d3
             #1 [ff4de14b444cbc18] __crash_kexec at ffffffff9c5baa5a
             #2 [ff4de14b444cbcd8] crash_kexec at ffffffff9c5bb991
             #3 [ff4de14b444cbcf0] oops_end at ffffffff9c42d811
             #4 [ff4de14b444cbd10] no_context at ffffffff9c481cf3
             #5 [ff4de14b444cbd68] __bad_area_nosemaphore at ffffffff9c48206c
             #6 [ff4de14b444cbdb0] do_page_fault at ffffffff9c482cf7
             #7 [ff4de14b444cbde0] page_fault at ffffffff9d0011ae
                [exception RIP: process_one_work+46]
                RIP: ffffffff9c51944e  RSP: ff4de14b444cbe98  RFLAGS: 00010046
                RAX: 0000000000000000  RBX: ff1e34360734b5d8  RCX: dead000000000200
                RDX: 000000010001393f  RSI: ff1e34360734b5d8  RDI: ff1e343ca7eed5c0
                RBP: ff1e343600019400   R8: ff1e343d37c73bb8   R9: 0000005885358800
                R10: 0000000000000000  R11: ff1e343d37c71dc4  R12: 0000000000000000
                R13: ff1e343600019420  R14: ff1e3436000194d0  R15: ff1e343ca7eed5c0
                ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
             #8 [ff4de14b444cbed8] worker_thread at ffffffff9c5197e0
             #9 [ff4de14b444cbf10] kthread at ffffffff9c520e04
            #10 [ff4de14b444cbf50] ret_from_fork at ffffffff9d00028f 

            Another stack is below:

            [ 1656.060089] list_del corruption. next->prev should be ff4880c9d81b8d48, but was ff4880ccfb2d45e0
            [ 1656.060536] ------------[ cut here ]------------
            [ 1656.060538] kernel BUG at lib/list_debug.c:56!
            [ 1656.060738] invalid opcode: 0000 [#1] SMP NOPTI
            [ 1656.060872] CPU: 4 PID: 606 Comm: kworker/u40:27 Kdump: loaded Tainted: GF          OE     -------- -  - 4.18.0-553.el8_10.x86_64 #1
            [ 1656.061130] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module+el8.9.0+1408+7b966129 04/01/2014
            [ 1656.061261] Workqueue: mlx5_cmd_0000:11:00.0 cmd_work_handler [mlx5_core]
            [ 1656.061457] RIP: 0010:__list_del_entry_valid.cold.1+0x20/0x48
            [ 1656.061586] Code: 45 d4 99 e8 5e 52 c7 ff 0f 0b 48 89 fe 48 89 c2 48 c7 c7 00 46 d4 99 e8 4a 52 c7 ff 0f 0b 48 c7 c7 b0 46 d4 99 e8 3c 52 c7 ff <0f> 0b 48 89 f2 48 89 fe 48 c7 c7 70 46 d4 99 e8 28 52 c7 ff 0f 0b
            [ 1656.061846] RSP: 0018:ff650559444dfe90 EFLAGS: 00010046
            [ 1656.061974] RAX: 0000000000000054 RBX: ff4880c9d81b8d40 RCX: 0000000000000000
            [ 1656.062103] RDX: 0000000000000000 RSI: ff4880cf9731e698 RDI: ff4880cf9731e698
            [ 1656.062238] RBP: ff4880c840019400 R08: 0000000000000000 R09: c0000000ffff7fff
            [ 1656.062366] R10: 0000000000000001 R11: ff650559444dfcb0 R12: ff4880c862647b00
            [ 1656.062492] R13: ff4880c879326540 R14: 0000000000000000 R15: ff4880c9d81b8d48
            [ 1656.062619] FS:  0000000000000000(0000) GS:ff4880cf97300000(0000) knlGS:0000000000000000
            [ 1656.062745] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [ 1656.062868] CR2: 000055cc1af6b000 CR3: 000000084b610006 CR4: 0000000000771ee0
            [ 1656.062996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            [ 1656.063127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
            [ 1656.063250] PKRU: 55555554
            
                  KERNEL: /usr/lib/debug/lib/modules/4.18.0-553.el8_10.x86_64/vmlinux  [TAINTED]
                DUMPFILE: vmcore  [PARTIAL DUMP]
                    CPUS: 20
                    DATE: Fri Nov 29 17:37:31 CST 2024
                  UPTIME: 00:27:35
            LOAD AVERAGE: 350.47, 237.79, 163.91
                   TASKS: 1106
                NODENAME: rocky8vm3
                 RELEASE: 4.18.0-553.el8_10.x86_64
                 VERSION: #1 SMP Fri May 24 13:05:10 UTC 2024
                 MACHINE: x86_64  (2599 Mhz)
                  MEMORY: 31.4 GB
                   PANIC: "kernel BUG at lib/list_debug.c:56!"
                     PID: 606
                 COMMAND: "kworker/u40:27"
                    TASK: ff4880c8793f8000  [THREAD_INFO: ff4880c8793f8000]
                     CPU: 4
               STATE: TASK_RUNNING (PANIC)

            crash> bt
            PID: 606      TASK: ff4880c8793f8000  CPU: 4    COMMAND: "kworker/u40:27"
             #0 [ff650559444dfc28] machine_kexec at ffffffff98a6f2d3
             #1 [ff650559444dfc80] __crash_kexec at ffffffff98bbaa5a
             #2 [ff650559444dfd40] crash_kexec at ffffffff98bbb991
             #3 [ff650559444dfd58] oops_end at ffffffff98a2d811
             #4 [ff650559444dfd78] do_trap at ffffffff98a29a27
             #5 [ff650559444dfdc0] do_invalid_op at ffffffff98a2a766
             #6 [ff650559444dfde0] invalid_op at ffffffff99600da4
                [exception RIP: __list_del_entry_valid.cold.1+32]
                RIP: ffffffff98ef8f98  RSP: ff650559444dfe90  RFLAGS: 00010046
                RAX: 0000000000000054  RBX: ff4880c9d81b8d40  RCX: 0000000000000000
                RDX: 0000000000000000  RSI: ff4880cf9731e698  RDI: ff4880cf9731e698
                RBP: ff4880c840019400   R8: 0000000000000000   R9: c0000000ffff7fff
                R10: 0000000000000001  R11: ff650559444dfcb0  R12: ff4880c862647b00
                R13: ff4880c879326540  R14: 0000000000000000  R15: ff4880c9d81b8d48
                ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
             #7 [ff650559444dfe90] process_one_work at ffffffff98b19557
             #8 [ff650559444dfed8] worker_thread at ffffffff98b197e0
             #9 [ff650559444dff10] kthread at ffffffff98b20e04
            #10 [ff650559444dff50] ret_from_fork at ffffffff9960028f

            This bug seems to be in the rdma_cm module on the MOFED/kernel side. So we tried to reproduce the crash on an NVMe-oF node (the commands are consolidated below):
            1. Mount the NVMe-oF disk: "nvme connect -t rdma -n "nqn.2014-08.org.nvmexpress:67240ebd3fa63ca3" -a 192.168.122.30 -s 4421"
            2. Run a dd write/read test case, for example "dd if=/dev/nvme0n17 of=./test bs=32K count=102400 oflag=direct"
            3. Construct an ARP update: run "arp -s 192.168.122.112 10:71:fe:69:93:b8 && arp -d 192.168.122.112" on the NVMe-oF client.
            4. The crash is reproduced.
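
            The same NVMe-oF reproduction as shell commands (NQN, addresses and device name as in the steps above):

                # on the NVMe-oF client
                nvme connect -t rdma -n "nqn.2014-08.org.nvmexpress:67240ebd3fa63ca3" -a 192.168.122.30 -s 4421
                dd if=/dev/nvme0n17 of=./test bs=32K count=102400 oflag=direct &
                arp -s 192.168.122.112 10:71:fe:69:93:b8 && arp -d 192.168.122.112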

            The issue may involve the following key points:
            1. The RDMA module receives multiple network events simultaneously.
            2. We have observed that during normal ARP updates, one or more events may be generated, making this issue probabilistic.
            3. When both ARP update events and connection termination (conn disconnect) events are received at the same time, it triggers issue LU-18275.

            We are currently in contact with NVIDIA's network technology experts in China. If you have other channels, we could invite them to help solve the issue as well. Do you have any suggestions? Thank you.

            yuan.liu Yuan Liu added a comment -

            Hi eaujames,

            Are the 2nd interfaces of the nodes still accessible on the network?

            We have changed the bond to a single Ethernet port as you mentioned earlier. So only one interface is accessible on the network.

            Can you retry with the sysctl parameters set on all the nodes (Lustre servers, routers and computes) ? 

            We have set the sysctl parameters, and the system still works correctly. But the crash is not easy to reproduce, so we cannot yet determine whether the problem is resolved.

             

            static int cma_netevent_callback(struct notifier_block *self,
                                             unsigned long event, void *ctx)
            {
            ....
                    list_for_each_entry(current_id, &ips_node->id_list, id_list_entry) {
                            if (!memcmp(current_id->id.route.addr.dev_addr.dst_dev_addr,   <-------
                                       neigh->ha, ETH_ALEN))
                                    continue;
                            INIT_WORK(&current_id->id.net_work, cma_netevent_work_handler);    <-----
                            cma_id_get(current_id);
                            queue_work(cma_wq, &current_id->id.net_work);
                    }
            ....
            } 

            The remote GID/MAC address should not change. There are two possible reasons for the if condition to queue the work (i.e. for the two MAC addresses to differ):

            1. current_id->id.route.addr.dev_addr.dst_dev_addr == 00:00:00:00:00:00. I'm not sure whether that can happen.

            2. neigh->ha == 00:00:00:00:00:00. This can happen when the ARP entry has been deleted.

            I'm not sure if my analysis is correct. If it is, do you have any suggestions for solving this kind of problem? (A sketch of the guard this analysis suggests is below.)
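
            If reason 2 is what happens, one conceivable guard (purely a sketch of the idea, not a proposed or reviewed patch) would be to ignore neighbour updates that carry an all-zero hardware address, using is_zero_ether_addr() from <linux/etherdevice.h>:

                /* hypothetical check near the top of cma_netevent_callback() */
                if (is_zero_ether_addr(neigh->ha))
                        return NOTIFY_DONE;  /* ARP entry deleted: no usable MAC to compare */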

             


            eaujames Etienne Aujames added a comment -

            Are the 2nd interfaces of the nodes still accessible on the network?
            Can you retry with the sysctl parameters set on all the nodes (Lustre servers, routers and computes)?

            The remote GID/MAC address should not change. You can play with 'ip neigh flush/ip neigh show/ping' on a node to see if that is the case.

            Also, you can try to identify in the crash dump the IP/MAC address of the remote node causing the issue in "id.route.addr", to verify whether there is an inconsistency.

             struct rdma_addr { 
                     struct sockaddr_storage src_addr; 
                     struct sockaddr_storage dst_addr; 
                     struct rdma_dev_addr dev_addr; 
             }; 
            

            And you can check the "cma_wq" workqueue in the crash dump to understand what went wrong, for example with the commands below.
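
            For reference (crash(8) commands; cma_wq is the static workqueue pointer in drivers/infiniband/core/cma.c, and the address in the second command is whatever the first one prints):

                crash> p cma_wq
                crash> struct workqueue_struct <address printed by "p cma_wq">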

            yuan.liu Yuan Liu added a comment - - edited

            We have changed the bond to a single Ethernet port as you mentioned earlier. The problem is still reproduced. Our kernel sysctl parameters are configured as follows:

            net.ipv4.conf.all.arp_announce = 0
            net.ipv4.conf.all.arp_ignore = 0
            net.ipv4.conf.default.arp_announce = 0
            net.ipv4.conf.default.arp_ignore = 0
            net.ipv4.conf.enp0s5f0np0.arp_announce = 0
            net.ipv4.conf.enp0s5f0np0.arp_ignore = 0
            net.ipv4.conf.enp0s5f1np1.arp_announce = 0
            net.ipv4.conf.enp0s5f1np1.arp_ignore = 0
            net.ipv4.conf.lo.arp_announce = 0
            net.ipv4.conf.lo.arp_ignore = 0
            
            eaujames Etienne Aujames added a comment - - edited

            This won't be fixed. The issue seems to be on the MOFED/kernel side: the RoCE driver does not seem to handle failover correctly for a bonding interface.

            I recommend using Multi-Rail with several interfaces. If you still have the issue, then this analysis is not correct.

            Also, when you use several interfaces on the same subnet, you have to make sure you use the following sysctl parameters:

            net.ipv4.conf.ib0.arp_ignore = 1
            net.ipv4.conf.ib0.arp_announce = 2
             
            net.ipv4.conf.ib1.arp_ignore = 1
            net.ipv4.conf.ib1.arp_announce = 2
            

            This makes sure that only the interface with the configured IP can reply to the ARP request. Otherwise, the node could try to connect to the wrong interface (not the one with the right NID). A sketch of applying these settings persistently follows.
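
            For reference (the file name is hypothetical and ib0/ib1 are the example interface names from above; substitute the actual interfaces):

                # /etc/sysctl.d/90-lustre-arp.conf
                net.ipv4.conf.ib0.arp_ignore = 1
                net.ipv4.conf.ib0.arp_announce = 2
                net.ipv4.conf.ib1.arp_ignore = 1
                net.ipv4.conf.ib1.arp_announce = 2

                # apply without a reboot
                sysctl -p /etc/sysctl.d/90-lustre-arp.conf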

            yuan.liu Yuan Liu added a comment -

            Hi eaujames,

            This seems to be a workqueue corruption (pwq == NULL). It happens when a work is re-added in the workqueue (check this patch).

            The re-add issue has come up several times. Is there a patch to fix this now?

            Looking forward to hearing from you.


            ssmirnov Serguei Smirnov added a comment -

            Thanks Etienne for your analysis, it does look correct to me.
            eaujames Etienne Aujames added a comment - - edited

            Hi xiyan,

            This seems to be a workqueue corruption (pwq == NULL). It happens when a work is re-added in the workqueue (check this patch).

            The QP and kiblnd_conn seem to be already freed; you can verify that with kmem (crash command), for example as below.
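
            A possible check (crash(8); the address is the kib_conn/context pointer 0xff1f28f684ff8200 from the struct dumps later in this ticket; kmem reports the slab cache for the address and whether the object is allocated or free):

                crash> kmem 0xff1f28f684ff8200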

            The event came from "cma_netevent_work_handler"; this seems to be sent on ARP cache updates. And if an entry for a connection is updated, this will generate an UNREACHABLE event (which explains why you received the UNREACHABLE event after disconnecting).

            static int cma_netevent_callback(struct notifier_block *self,
                                             unsigned long event, void *ctx)
            {
                    struct id_table_entry *ips_node = NULL;
                    struct rdma_id_private *current_id;
                    struct neighbour *neigh = ctx;
                    unsigned long flags;
            
                    if (event != NETEVENT_NEIGH_UPDATE)
                            return NOTIFY_DONE;
            ....
                    list_for_each_entry(current_id, &ips_node->id_list, id_list_entry) {
                            if (!memcmp(current_id->id.route.addr.dev_addr.dst_dev_addr,   <-------
                                       neigh->ha, ETH_ALEN))
                                    continue;
                            INIT_WORK(&current_id->id.net_work, cma_netevent_work_handler);    <-----
                            cma_id_get(current_id);
                            queue_work(cma_wq, &current_id->id.net_work);
                    }
            ....
            }
            
            static void cma_netevent_work_handler(struct work_struct *_work)
            {
            ....
                    event.event = RDMA_CM_EVENT_UNREACHABLE;
                    event.status = -ETIMEDOUT;
            ....
            } 

            The "net_work" is not cancel when removing the id. It waits the work to be executed. This seems to be a corner case not handled properly by the MOFED.

            I guess that the root issue here is using bonding (failover) on a RoCE interface. I think this can produce random "flip-flop" of the remote RDMA device, ARP changes, and then the UNREACHABLE events.

            The recommended way to use several RDMA interfaces with Lustre is the Multi-Rail feature. If you want failover interfaces, there is UDSP (but that feature is new in 2.15). Note that I am not an expert in those fields. (A Multi-Rail configuration sketch follows.)
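
            For context, a Multi-Rail setup sketch with lnetctl (the interface names ib0/ib1 are illustrative; consult the Lustre manual for the full procedure):

                lnetctl lnet configure
                lnetctl net add --net o2ib --if ib0,ib1
                lnetctl net show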

             

            xiyan Rongyao Peng added a comment - - edited

            static void process_one_work(struct worker *worker, struct work_struct *work)
            __releases(&pool->lock)
            __acquires(&pool->lock)
            {
                    struct pool_workqueue *pwq = get_work_pwq(work);
                    struct worker_pool *pool = worker->pool;
                    /* per the analysis above, pwq is NULL in the crash, so this
                     * dereference of pwq->wq faults at process_one_work+46
                     * (fault address 0x8) */
                    bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;
                    int work_color;
                    struct worker *collision;
            #ifdef CONFIG_LOCKDEP
                    /*
                     * It is permissible to free the struct work_struct from
                     * inside the function that is called from it, this we need to
                     * take into account for lockdep too.  To avoid bogus "held
                     * lock freed" warnings as well as problems when looking into
                     * work->lockdep_map, make a copy and use that here.
                     */
                    struct lockdep_map lockdep_map;
                    ...
            }

            We analyzed the stack information and found the following:

            crash> struct work_struct ff1f28f60a7575d8
            struct work_struct {
              data = {
                counter = 1280
              },
              entry = {
                next = 0xff1f28f60a7575e0,
                prev = 0xff1f28f60a7575e0
              },
              func = 0xffffffffc0392640 <cma_netevent_work_handler>,
              {
                bdi_wb_backptr = 0x0,
                rh_kabi_hidden_111 = {
                  rh_reserved1 = 0
                },
                {<No data fields>}
              },
              rh_reserved2 = 0,
              rh_reserved3 = 0,
              rh_reserved4 = 0
            }

            crash> struct worker ff1f28f603dca840
            struct worker {
              {
                entry = {
                  next = 0x0,
                  prev = 0x0
                },
                hentry = {
                  next = 0x0,
                  pprev = 0x0
                }
              },
              current_work = 0x0,
              current_func = 0x0,
              current_pwq = 0x0,
              scheduled = {
                next = 0xff1f28f603dca868,
                prev = 0xff1f28f603dca868
              },
              task = 0xff1f28f603dcc000,
              pool = 0xff1f28f600019400,
              node = {
                next = 0xff1f28f603dde1c8,
                prev = 0xff1f28f6aba3f908
              },
              last_active = 4294879534,
              flags = 128,
              id = 12,
              sleeping = 0,
              desc = "rdma_cm\000nbound\000\060:05.0\000\061",
              rescue_wq = 0x0,
              last_func = 0xffffffffc0392640 <cma_netevent_work_handler>
            }

            crash> struct rdma_cm_id ff1f28f60a757400
            struct rdma_cm_id {
              device = 0xff1f28f6146cc000,
              context = 0xff1f28f684ff8200,
              qp = 0xff1f28f698be1000,
              event_handler = 0xffffffffc0f0c720 <kiblnd_cm_callback>,
              .........................................................................................
                  dev_type = 1,
                  bound_dev_if = 5,
                  transport = RDMA_TRANSPORT_IB,
                  net = 0xffffffffafb404c0 <init_net>,
                  sgid_attr = 0xff1f28f657c0b548,
                  network = RDMA_NETWORK_IB,
                  hoplimit = 64
                }
              },
              path_rec = 0xff1f28f688383ba0,
              path_rec_inbound = 0x0,
              path_rec_outbound = 0x0,
              num_pri_alt_paths = 1
              },
              ps = RDMA_PS_TCP,
              qp_type = IB_QPT_RC,
              port_num = 1,
              ..................................................................
            }

             

            crash> struct kib_conn 0xff1f28f684ff8200
            struct kib_conn {
              ibc_sched = 0xff1f28f60898cb00,
              ibc_peer = 0xff1f28f6882da9c0,
              ibc_hdev = 0xff1f28f676b4aa00,
              ibc_list = {
                next = 0xdead000000000100,
                prev = 0xdead000000000200
              },
              ibc_sched_list = {
                next = 0xdead000000000100,
                prev = 0xdead000000000200
              },
              ibc_version = 18,
              ibc_reconnect = 0,
              ibc_incarnation = 1728558198980131,
              ibc_refcount = {
                counter = 1
              },
              ibc_state = 5,
              ibc_nsends_posted = 0,
              ibc_noops_posted = 0,
              ibc_credits = 0,
              ibc_outstanding_credits = 0,
              ibc_reserved_credits = 8,
              ibc_comms_error = -5,
              ibc_queue_depth = 8,
              ibc_max_frags = 257,
              ibc_waits = 0,
              ibc_nrx = 0,
              ibc_scheduled = 0,
              ibc_ready = 0,
              ibc_last_send = 179167595058,
              ibc_connd_list = {
                next = 0xff1f28f684ff8280,
                prev = 0xff1f28f684ff8280
              },
              ibc_early_rxs = {
                next = 0xff1f28f684ff8290,
                prev = 0xff1f28f684ff8290
              },
              ibc_tx_noops = {
                next = 0xff1f28f684ff82a0,
                prev = 0xff1f28f684ff82a0
              },
              ibc_tx_queue = {
                next = 0xff1f28f684ff82b0,
                prev = 0xff1f28f684ff82b0
              },
              ibc_tx_queue_nocred = {
                next = 0xff1f28f684ff82c0,
                prev = 0xff1f28f684ff82c0
              },
              ibc_tx_queue_rsrvd = {
                next = 0xff1f28f684ff82d0,
                prev = 0xff1f28f684ff82d0
              },
              ibc_active_txs = {
                next = 0xff1f28f684ff82e0,
                prev = 0xff1f28f684ff82e0
              ...
            }

            When we ran a test case that reboots a Lustre server, another Lustre server crashed when umount was triggered.

            In another situation, after rebooting the Lustre server, another Lustre server crashes after a while.

            This problem exists regardless of whether the following patches are applied:

            LU-18260 o2iblnd: fix race between REJ vs kiblnd_connd
            LU-17480 o2iblnd: add a timeout for rdma_connect
            LU-16184 o2iblnd: fix deadline for tx on peer queue
            LU-17632 o2iblnd: graceful handling of CM_EVENT_CONNECT_ERROR
            LU-17325 o2iblnd: CM_EVENT_UNREACHABLE on established conn
            LU-15885 o2iblnd: fix handling of RDMA_CM_EVENT_UNREACHABLE

             


            People

              Assignee: Rongyao Peng (xiyan)
              Reporter: Rongyao Peng (xiyan)
              Votes: 0
              Watchers: 6