
[LU-18275] o2iblnd: unable to handle kernel NULL pointer dereference in kiblnd_cm_callback when receiving RDMA_CM_EVENT_UNREACHABLE

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.5
    • Environment: Lustre server 2.15.5 RoCE
      Lustre MGS 2.15.5 RoCE
      Lustre client 2.15.5 RoCE
    • Severity: 3

    Description

      The Lustre clients and servers are deployed in VMs. The VMs use network-card PF passthrough mode.

      [OS]
      VM Version: qemu-kvm-7.0.0
      OS Version: Rocky 8.10
      Kernel Version: 4.18.0-553.el8_10.x86_64

      [Network Card]
      Client:
      MLX CX6 1*100G RoCE v2
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      Server:
      MLX CX6 2*100G RoCE v2 bond
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      [BUG Info]

      Steps to reproduce:

      • Mount Lustre over a RoCE network
      • Restart the Lustre MDT (mdt0, 10.255.153.128@o2ib)
      • A crash occurs on other Lustre servers

      Server call trace:

      crash> bt
      PID: 568423   TASK: ff4787632aa5c000  CPU: 5    COMMAND: "kworker/u40:0"
       #0 [ff7d728b15e6baa0] machine_kexec at ffffffff8fa6f353
       #1 [ff7d728b15e6baf8] __crash_kexec at ffffffff8fbbaa7a
       #2 [ff7d728b15e6bbb8] crash_kexec at ffffffff8fbbb9b1
       #3 [ff7d728b15e6bbd0] oops_end at ffffffff8fa2d831
       #4 [ff7d728b15e6bbf0] no_context at ffffffff8fa81cf3
       #5 [ff7d728b15e6bc48] __bad_area_nosemaphore at ffffffff8fa8206c
       #6 [ff7d728b15e6bc90] do_page_fault at ffffffff8fa82cf7
       #7 [ff7d728b15e6bcc0] page_fault at ffffffff906011ae
          [exception RIP: kiblnd_cm_callback+2653]
          RIP: ffffffffc0efe00d  RSP: ff7d728b15e6bd70  RFLAGS: 00010246
          RAX: 0000000000000007  RBX: ff7d728b15e6be08  RCX: 0000000000000000
          RDX: ff4787632aa5c000  RSI: ff7d728b15e6be08  RDI: ff47876095213c00
          RBP: ff47876095213c00   R8: 0000000000000000   R9: 006d635f616d6472
          R10: 8080808080808080  R11: 0000000000000000  R12: ff47876095213c00
          R13: 0000000000000000  R14: 0000000000000000  R15: ff47876095213de0
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #8 [ff7d728b15e6bdd8] cma_cm_event_handler at ffffffffc04729a5 [rdma_cm]
       #9 [ff7d728b15e6be00] cma_netevent_work_handler at ffffffffc04786b5 [rdma_cm]
      #10 [ff7d728b15e6be90] process_one_work at ffffffff8fb195e3
      #11 [ff7d728b15e6bed8] worker_thread at ffffffff8fb197d0
      #12 [ff7d728b15e6bf10] kthread at ffffffff8fb20e24
      #13 [ff7d728b15e6bf50] ret_from_fork at ffffffff9060028f 

      Server kernel log:

      [69106.143672] LustreError: 11-0: lustre-MDT000a-osp-MDT0007: operation mds_statfs to node 10.255.153.128@o2ib failed: rc = -107
      [69106.143700] Lustre: lustre-OST0038-osc-MDT0007: Connection to lustre-OST0038 (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69106.145053] LustreError: Skipped 29 previous similar messages
      [69106.145060] Lustre: Skipped 203 previous similar messages
      [69111.263490] Lustre: lustre-OST004e-osc-MDT0007: Connection to lustre-OST004e (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69111.263496] Lustre: Skipped 6 previous similar messages
      [69112.506974] Lustre: lustre-OST0038-osc-MDT0007: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69112.506980] Lustre: Skipped 195 previous similar messages
      [69116.383952] Lustre: lustre-MDT0000-lwp-MDT0007: Connection to lustre-MDT0000 (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69116.383965] Lustre: Skipped 3 previous similar messages
      [69122.528880] Lustre: lustre-MDT000a-lwp-OST0013: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69122.528885] Lustre: Skipped 3 previous similar messages
      [69127.659951] Lustre: lustre-OST0043-osc-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69127.659960] Lustre: Skipped 7 previous similar messages
      [69138.672269] Lustre: lustre-OST002d-osc-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69138.672275] Lustre: Skipped 3 previous similar messages
      [69158.168201] Lustre: lustre-MDT000a-osp-MDT0007: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69158.168206] Lustre: Skipped 1 previous similar message
      [69178.775546] Lustre: lustre-MDT0000-osp-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69178.775554] Lustre: Skipped 11 previous similar messages
      [69178.805333] Lustre: lustre-OST0008: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.805386] Lustre: lustre-OST0013: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.805505] Lustre: lustre-OST001e: deleting orphan objects from 0x0:13846 to 0x0:13921
      [69178.805729] Lustre: lustre-OST003f: deleting orphan objects from 0x0:13822 to 0x0:13857
      [69178.806122] Lustre: lustre-OST004a: deleting orphan objects from 0x0:13855 to 0x0:13889
      [69178.806130] Lustre: lustre-OST0055: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.807518] Lustre: lustre-OST0029: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.807537] Lustre: lustre-OST0034: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.838633] LustreError: 39177:0:(qsd_reint.c:635:qqi_reint_delayed()) lustre-OST0034: Delaying reintegration for qtype:2 until pending updates are flushed.
      [69178.840301] LustreError: 39177:0:(qsd_reint.c:635:qqi_reint_delayed()) Skipped 1 previous similar message
      [69180.358640] LustreError: 37895:0:(qsd_reint.c:635:qqi_reint_delayed()) lustre-OST0013: Delaying reintegration for qtype:2 until pending updates are flushed.
      [69180.359161] LustreError: 37895:0:(qsd_reint.c:635:qqi_reint_delayed()) Skipped 2 previous similar messages
      [69239.967382] BUG: unable to handle kernel NULL pointer dereference at 000000000000004c
      [69239.968927] PGD 0 
      [69239.969144] Oops: 0000 [#1] SMP NOPTI
      [69239.969327] CPU: 5 PID: 568423 Comm: kworker/u40:0 Kdump: loaded Tainted: G           OE     -------- -  - 4.18.0-553.5.1.el8_lustre.x86_64 #1
      [69239.969650] Hardware name: Red Hat KVM, BIOS 1.16.0-4.cl9 04/01/2014
      [69239.969792] Workqueue: rdma_cm cma_netevent_work_handler [rdma_cm]
      [69239.969995] RIP: 0010:kiblnd_cm_callback+0xa5d/0x1ea0 [ko2iblnd]
      [69239.970191] Code: 48 89 05 06 7f 01 00 c7 05 04 7f 01 00 00 00 02 02 48 c7 05 01 7f 01 00 d0 5e f1 c0 e8 ac 66 ee ff e9 b5 f7 ff ff 4c 8b 6f 08 <41> 8b 6d 4c f6 05 e5 d4 f0 ff 01 0f 84 8d 00 00 00 f6 05 dc d4 f0
      [69239.970483] RSP: 0018:ff7d728b15e6bd70 EFLAGS: 00010246
      [69239.970621] RAX: 0000000000000007 RBX: ff7d728b15e6be08 RCX: 0000000000000000
      [69239.970763] RDX: ff4787632aa5c000 RSI: ff7d728b15e6be08 RDI: ff47876095213c00
      [69239.970895] RBP: ff47876095213c00 R08: 0000000000000000 R09: 006d635f616d6472
      [69239.971025] R10: 8080808080808080 R11: 0000000000000000 R12: ff47876095213c00
      [69239.971155] R13: 0000000000000000 R14: 0000000000000000 R15: ff47876095213de0
      [69239.971289] FS:  0000000000000000(0000) GS:ff47877d7f740000(0000) knlGS:0000000000000000
      [69239.971425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [69239.971570] CR2: 000000000000004c CR3: 0000001900e10004 CR4: 0000000000771ee0
      [69239.971708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [69239.971841] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [69239.971971] PKRU: 55555554
      [69239.972099] Call Trace:
      [69239.972267]  ? __die_body+0x1a/0x60
      [69239.972451]  ? no_context+0x1ba/0x3f0
      [69239.972583]  ? __bad_area_nosemaphore+0x16c/0x1c0
      [69239.972705]  ? do_page_fault+0x37/0x12d
      [69239.972826]  ? page_fault+0x1e/0x30
      [69239.972971]  ? kiblnd_cm_callback+0xa5d/0x1ea0 [ko2iblnd]
      [69239.973099]  cma_cm_event_handler+0x25/0xd0 [rdma_cm]
      [69239.973234]  cma_netevent_work_handler+0x75/0xd0 [rdma_cm]
      [69239.973362]  process_one_work+0x1d3/0x390
      [69239.973516]  worker_thread+0x30/0x390
      [69239.973631]  ? process_one_work+0x390/0x390
      [69239.973743]  kthread+0x134/0x150
      [69239.973863]  ? set_kthread_struct+0x50/0x50
      [69239.973976]  ret_from_fork+0x1f/0x40
      [69239.974099] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ldiskfs(OE) mbcache jbd2 ko2iblnd(OE) lnet(OE) libcfs(OE) bonding uio_pci_generic uio vfio_pci vfio_virqfd vfio_iommu_type1 vfio cuse fuse rdma_ucm(OE) ib_ipoib(OE) ib_umad(OE) sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm cirrus drm_shmem_helper kvm_intel kvm irqbypass drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea sysfillrect ghash_clmulni_intel sysimgblt rapl drm i2c_piix4 pcspkr virtio_balloon joydev knem(OE) xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ata_generic mlx5_core(OE) mlxfw(OE) ata_piix psample pci_hyperv_intf tls crc32c_intel virtio_console virtio_blk libata serio_raw mlxdevm(OE) xpmem(OE) nvme_tcp(OE) nvme_rdma(OE) rdma_cm(OE) iw_cm(OE) nvme_fabrics(OE) nvme_core(OE) ib_cm(OE) ib_core(OE) mlx_compat(OE) t10_pi
      [69239.975388] CR2: 000000000000004c 

            Activity


            eaujames Etienne Aujames added a comment:

            LU-18260 o2iblnd: fix race between REJ vs kiblnd_connd
            LU-17480 o2iblnd: add a timeout for rdma_connect
            LU-16184 o2iblnd: fix deadline for tx on peer queue
            LU-17632 o2iblnd: graceful handling of CM_EVENT_CONNECT_ERROR
            LU-17325 o2iblnd: CM_EVENT_UNREACHABLE on established conn
            LU-15885 o2iblnd: fix handling of RDMA_CM_EVENT_UNREACHABLE

            The backports will be part of CEA's next Lustre release.


            eaujames Etienne Aujames added a comment:

            xiyan, thanks for testing the backports. I will tag these backports with LTS15.

            "cmid->context = NULL;" is set in the RDMA_CM_EVENT_DISCONNECTED case (0x4c is the offset of ibc_state within struct kib_conn):

            kiblnd_cm_callback(
            ...
                    case RDMA_CM_EVENT_DISCONNECTED:
                            conn = cmid->context;
                            if (conn->ibc_state < IBLND_CONN_ESTABLISHED) {
                                    CERROR("%s DISCONNECTED\n",
                                           libcfs_nid2str(conn->ibc_peer->ibp_nid));
                                    kiblnd_connreq_done(conn, -ECONNRESET);
                            } else {
                                    kiblnd_close_conn(conn, 0);
                            }
                            kiblnd_conn_decref(conn);
                            cmid->context = NULL;         <-------
                            return 0;
            

            That would mean that the RDMA_CM_EVENT_UNREACHABLE event is received after RDMA_CM_EVENT_DISCONNECTED. A race is possible in that case, but I am not sure how this can happen.

            I also don't know why we set "cmid->context = NULL" in kiblnd_cm_callback().
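
            For illustration only (this is not one of the patches listed above), a minimal sketch of the kind of defensive check in the RDMA_CM_EVENT_UNREACHABLE case (quoted in full further down in this ticket) that would avoid dereferencing a NULL cmid->context if the DISCONNECTED handler has already cleared it; whether such a guard is the right fix depends on the actual race:

                    case RDMA_CM_EVENT_UNREACHABLE:
                            conn = cmid->context;
                            /* hypothetical guard: DISCONNECTED may already have run
                             * and set cmid->context = NULL, so bail out instead of
                             * reading conn->ibc_state at offset 0x4c */
                            if (conn == NULL) {
                                    CNETERR("UNREACHABLE %d on cm_id %p with no conn\n",
                                            event->status, cmid);
                                    return 0;
                            }
                            if (conn->ibc_state == IBLND_CONN_ACTIVE_CONNECT ||
                                conn->ibc_state == IBLND_CONN_PASSIVE_WAIT) {
                                    kiblnd_connreq_done(conn, -ENETDOWN);
                                    kiblnd_conn_decref(conn);
                            }
                            return 0;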


            xiyan Rongyao Peng added a comment:

            After discussing LU-18260 with eaujames, the issue no longer reproduces after applying the following patches:

            LU-18260 o2iblnd: fix race between REJ vs kiblnd_connd
            LU-17480 o2iblnd: add a timeout for rdma_connect
            LU-16184 o2iblnd: fix deadline for tx on peer queue
            LU-17632 o2iblnd: graceful handling of CM_EVENT_CONNECT_ERROR
            LU-17325 o2iblnd: CM_EVENT_UNREACHABLE on established conn
            LU-15885 o2iblnd: fix handling of RDMA_CM_EVENT_UNREACHABLE

            I think the key patches are LU-17480 and LU-16184. Of course, these are only stopgap solutions; we will still need to track down the root cause.


            xiyan Rongyao Peng added a comment:

            I suspect that LU-17480 reduces the probability of this problem because the current RDMA connection timeout is relatively long. As long as the RDMA connection is destroyed before the faulty host is powered back on, the problem is less likely to occur. But we still need to find the root cause.


             

            xiyan Rongyao Peng added a comment:

            mdt0 (10.255.153.128@o2ib) log:

            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6033] device (bond1): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
            Sep 27 15:55:28 node28-lustre systemd-udevd[1467]: Using default interface naming scheme 'rhel-8.0'.
            Sep 27 15:55:28 node28-lustre systemd-udevd[1467]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
            Sep 27 15:55:28 node28-lustre systemd-udevd[1467]: Could not generate persistent MAC address for bond1: No such file or directory
            Sep 27 15:55:28 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_UP): bond1: link is not ready
            Sep 27 15:55:28 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_UP): bond1: link is not ready
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6062] device (bond1): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:28 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_UP): bond1: link is not ready
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6073] policy: auto-activating connection 'bond1' (6c3be7c4-5e37-42f8-941d-dd7fe0919eab)
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6076] device (bond1): Activation: starting connection 'bond1' (6c3be7c4-5e37-42f8-941d-dd7fe0919eab)
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6077] device (bond1): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6079] manager: NetworkManager state is now CONNECTING
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6083] device (bond1): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6085] device (bond1): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:28 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_UP): bond1: link is not ready
            Sep 27 15:55:28 node28-lustre NetworkManager[1041]: <info>  [1727423728.6089] policy: set 'bond1' (bond1) as default for IPv4 routing and DNS
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6099] device (bond1): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6113] device (bond1): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6114] device (bond1): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6117] manager: NetworkManager state is now CONNECTED_SITE
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6120] device (bond1): Activation: successful, device activated.
            Sep 27 15:55:30 node28-lustre NetworkManager[1041]: <info>  [1727423730.6125] manager: NetworkManager state is now CONNECTED_GLOBAL
            Sep 27 15:55:34 node28-lustre NetworkManager[1041]: <info>  [1727423734.5862] manager: startup complete
            Sep 27 15:55:34 node28-lustre systemd[1]: Started Network Manager Wait Online.
            Sep 27 15:55:34 node28-lustre systemd[1]: Reached target Network is Online.
            Sep 27 15:55:34 node28-lustre systemd[1]: Starting Notify NFS peers of a restart...
            Sep 27 15:55:34 node28-lustre systemd[1]: Starting System Logging Service...
            Sep 27 15:55:34 node28-lustre systemd[1]: Starting Crash recovery kernel arming...
            Sep 27 15:55:34 node28-lustre sm-notify[1513]: Version 2.3.3 starting
            Sep 27 15:55:34 node28-lustre systemd[1]: Started Notify NFS peers of a restart.
            Sep 27 15:55:34 node28-lustre rsyslogd[1514]: [origin software="rsyslogd" swVersion="8.2102.0-15.el8" x-pid="1514" x-info="https://www.rsyslog.com"] start
            Sep 27 15:55:34 node28-lustre systemd[1]: Started System Logging Service.
            Sep 27 15:55:34 node28-lustre systemd[1]: Reached target Multi-User System.
            Sep 27 15:55:34 node28-lustre systemd[1]: Started nvme mount.
            Sep 27 15:55:34 node28-lustre systemd[1]: Starting Update UTMP about System Runlevel Changes...
            Sep 27 15:55:34 node28-lustre systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
            Sep 27 15:55:34 node28-lustre systemd[1]: Started Update UTMP about System Runlevel Changes.
            Sep 27 15:55:34 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:34 node28-lustre rsyslogd[1514]: imjournal: journal files changed, reloading...  [v8.2102.0-15.el8 try https://www.rsyslog.com/e/0 ]
            Sep 27 15:55:36 node28-lustre kernel: alg: No test for pkcs1pad(rsa,sha1) (pkcs1pad(rsa-generic,sha1))
            Sep 27 15:55:36 node28-lustre kernel: PKCS7: Message signed outside of X.509 validity window
            Sep 27 15:55:36 node28-lustre kdumpctl[1518]: kdump: kexec: loaded kdump kernel
            Sep 27 15:55:36 node28-lustre kdumpctl[1518]: kdump: Starting kdump: [OK]
            Sep 27 15:55:36 node28-lustre systemd[1]: Started Crash recovery kernel arming.
            Sep 27 15:55:36 node28-lustre systemd[1]: Startup finished in 1.123s (kernel) + 2.553s (initrd) + 9.235s (userspace) = 12.912s.
            Sep 27 15:55:37 node28-lustre kernel: mlx5_core 0000:00:05.0 ens5np0: Link up
            Sep 27 15:55:37 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ens5np0: link becomes ready
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4081] device (ens5np0): carrier: link connected
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4092] device (ens5np0): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4098] policy: auto-activating connection 'bond-slave-ens5np0' (c07ec0c3-8e1d-4ffe-8398-17bd289dac7a)
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4103] device (ens5np0): Activation: starting connection 'bond-slave-ens5np0' (c07ec0c3-8e1d-4ffe-8398-17bd289dac7a)
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4104] device (ens5np0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4107] device (ens5np0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.4114] device (ens5np0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre kernel: nvme nvme0: rdma connection establishment failed (-104)
            Sep 27 15:55:37 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:37 node28-lustre kernel: mlx5_core 0000:00:05.0 ens5np0: Link down
            Sep 27 15:55:37 node28-lustre kernel: bond1: (slave ens5np0): Enslaving as a backup interface with a down link
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7306] device (bond1): attached bond port ens5np0
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7307] device (ens5np0): Activation: connection 'bond-slave-ens5np0' enslaved, continuing activation
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7425] device (ens5np0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7439] device (ens5np0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7441] device (ens5np0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:37 node28-lustre NetworkManager[1041]: <info>  [1727423737.7446] device (ens5np0): Activation: successful, device activated.
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7077] device (ens6np1): carrier: link connected
            Sep 27 15:55:38 node28-lustre kernel: mlx5_core 0000:00:06.0 ens6np1: Link up
            Sep 27 15:55:38 node28-lustre kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ens6np1: link becomes ready
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7085] device (ens6np1): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7089] policy: auto-activating connection 'bond-slave-ens6np1' (3d58f2ee-50e0-4b49-9fd3-120b58c0d11c)
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7092] device (ens6np1): Activation: starting connection 'bond-slave-ens6np1' (3d58f2ee-50e0-4b49-9fd3-120b58c0d11c)
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7092] device (ens6np1): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7094] device (ens6np1): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:38 node28-lustre NetworkManager[1041]: <info>  [1727423738.7096] device (ens6np1): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:39 node28-lustre kernel: mlx5_core 0000:00:06.0 ens6np1: Link down
            Sep 27 15:55:39 node28-lustre kernel: bond1: (slave ens6np1): Enslaving as a backup interface with a down link
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0186] device (bond1): attached bond port ens6np1
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0186] device (ens6np1): Activation: connection 'bond-slave-ens6np1' enslaved, continuing activation
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0190] device (ens6np1): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0202] device (ens6np1): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0203] device (ens6np1): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
            Sep 27 15:55:39 node28-lustre NetworkManager[1041]: <info>  [1727423739.0207] device (ens6np1): Activation: successful, device activated.
            Sep 27 15:55:40 node28-lustre systemd[1]: Stopping RDMA Node Description Daemon...
            Sep 27 15:55:40 node28-lustre systemd[1]: rdma-ndd.service: Succeeded.
            Sep 27 15:55:40 node28-lustre systemd[1]: Stopped RDMA Node Description Daemon.
            Sep 27 15:55:40 node28-lustre kernel: nvme nvme0: rdma connection establishment failed (-104)
            Sep 27 15:55:40 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:42 node28-lustre kernel: mlx5_core 0000:00:05.0: lag map active ports: 1, 2
            Sep 27 15:55:42 node28-lustre kernel: mlx5_core 0000:00:05.0: shared_fdb:0 mode:hash
            Sep 27 15:55:42 node28-lustre systemd[1]: /usr/lib/systemd/system/rdma-ndd.service:25: Unknown lvalue 'ProtectKernelLogs' in section 'Service'
            Sep 27 15:55:42 node28-lustre systemd[1]: Starting RDMA Node Description Daemon...
            Sep 27 15:55:42 node28-lustre systemd[1]: Started RDMA Node Description Daemon.
            Sep 27 15:55:43 node28-lustre kernel: nvme nvme0: rdma connection establishment failed (-104)
            Sep 27 15:55:43 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:46 node28-lustre kernel: nvme nvme0: rdma connection establishment failed (-104)
            Sep 27 15:55:47 node28-lustre kernel: mlx5_core 0000:00:05.0 ens5np0: Link up
            Sep 27 15:55:47 node28-lustre NetworkManager[1041]: <info>  [1727423747.0599] device (ens5np0): carrier: link connected
            Sep 27 15:55:47 node28-lustre kernel: mlx5_core 0000:00:06.0 ens6np1: Link down
            Sep 27 15:55:47 node28-lustre kernel: bond1: (slave ens5np0): link status definitely up, 100000 Mbps full duplex
            Sep 27 15:55:47 node28-lustre kernel: bond1: Warning: No 802.3ad response from the link partner for any adapters in the bond
            Sep 27 15:55:47 node28-lustre kernel: bond1: active interface up!
            Sep 27 15:55:47 node28-lustre NetworkManager[1041]: <info>  [1727423747.1252] device (bond1): carrier: link connected
            Sep 27 15:55:47 node28-lustre kernel: mlx5_core 0000:00:05.0: lag map active ports: 1
            Sep 27 15:55:48 node28-lustre root[1979]: openibd: Set node_desc for mlx5_bond_0: node28-lustre HCA-1
            Sep 27 15:55:48 node28-lustre kernel: mlx5_core 0000:00:06.0 ens6np1: Link up
            Sep 27 15:55:48 node28-lustre kernel: mlx5_core 0000:00:05.0 ens5np0: Link up
            Sep 27 15:55:48 node28-lustre NetworkManager[1041]: <info>  [1727423748.3576] device (ens6np1): carrier: link connected
            Sep 27 15:55:48 node28-lustre kernel: bond1: (slave ens6np1): link status definitely up, 100000 Mbps full duplex
            Sep 27 15:55:49 node28-lustre systemd[1]: NetworkManager-dispatcher.service: Succeeded.
            Sep 27 15:55:49 node28-lustre kernel: mlx5_core 0000:00:05.0: lag map active ports: 1, 2
            Sep 27 15:55:49 node28-lustre kernel: mlx5_core 0000:00:05.0: lag map active ports: 1
            Sep 27 15:55:50 node28-lustre kernel: mlx5_core 0000:00:05.0: lag map active ports: 1, 2
            Sep 27 15:55:55 node28-lustre systemd[1]: Created slice User Slice of UID 0.
            Sep 27 15:55:55 node28-lustre systemd[1]: Starting User runtime directory /run/user/0...
            Sep 27 15:55:55 node28-lustre systemd-logind[1043]: New session 1 of user root.
            Sep 27 15:55:55 node28-lustre systemd[1]: Started User runtime directory /run/user/0.
            Sep 27 15:55:55 node28-lustre systemd[1]: Starting User Manager for UID 0...
            Sep 27 15:55:56 node28-lustre systemd[1989]: Starting D-Bus User Message Bus Socket.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Reached target Paths.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Reached target Timers.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Listening on D-Bus User Message Bus Socket.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Reached target Sockets.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Reached target Basic System.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Reached target Default.
            Sep 27 15:55:56 node28-lustre systemd[1989]: Startup finished in 24ms.
            Sep 27 15:55:56 node28-lustre systemd[1]: Started User Manager for UID 0.
            Sep 27 15:55:56 node28-lustre systemd[1]: Started Session 1 of user root.
            Sep 27 15:55:56 node28-lustre systemd[1]: Started Session 3 of user root.
            Sep 27 15:55:56 node28-lustre systemd-logind[1043]: New session 3 of user root.
            Sep 27 15:55:56 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:56 node28-lustre kernel: nvme nvme0: creating 14 I/O queues.
            Sep 27 15:55:56 node28-lustre kernel: mlx5_ib: mlx5_set_umr_free_mkey: Translation mode supported only when access_mode is MTT or PA
            Sep 27 15:55:58 node28-lustre systemd[1]: systemd-hostnamed.service: Succeeded.
            Sep 27 15:55:59 node28-lustre kernel: nvme nvme0: mapped 14/0/0 default/read/poll queues.
            Sep 27 15:55:59 node28-lustre kernel: nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress:93c825a4fe519d19", addr 10.255.153.26:4421
            Sep 27 15:55:59 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:55:59 node28-lustre kernel: nvme nvme1: creating 14 I/O queues.
            Sep 27 15:56:01 node28-lustre systemd[1]: Started Session 4 of user root.
            Sep 27 15:56:02 node28-lustre kernel: nvme nvme1: mapped 14/0/0 default/read/poll queues.
            Sep 27 15:56:02 node28-lustre kernel: nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress:93c825a4fe519d19", addr 10.255.153.133:4421
            Sep 27 15:56:02 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:56:02 node28-lustre kernel: nvme nvme2: creating 14 I/O queues.
            Sep 27 15:56:05 node28-lustre kernel: nvme nvme2: mapped 14/0/0 default/read/poll queues.
            Sep 27 15:56:05 node28-lustre kernel: nvme nvme2: new ctrl: NQN "nqn.2014-08.org.nvmexpress:93c825a4fe519d19", addr 10.255.153.28:4421
            Sep 27 15:56:05 node28-lustre kernel: nvme_fabrics: ctrl_loss_tmo < 0 will reconnect forever
            Sep 27 15:56:05 node28-lustre kernel: nvme nvme3: creating 14 I/O queues.
            Sep 27 15:56:08 node28-lustre kernel: nvme nvme3: mapped 14/0/0 default/read/poll queues.
            Sep 27 15:56:08 node28-lustre kernel: nvme nvme3: new ctrl: NQN "nqn.2014-08.org.nvmexpress:93c825a4fe519d19", addr 10.255.153.29:4421

             

            eth info:

            bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
                    inet 10.255.153.128  netmask 255.255.255.0  broadcast 10.255.153.255
                    ether 10:70:fd:5e:03:1c  txqueuelen 1000  (Ethernet)
                    RX packets 1023833  bytes 164896622 (157.2 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 698498  bytes 128080201 (122.1 MiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

            ens5np0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 1500
                    ether 10:70:fd:5e:03:1c  txqueuelen 1000  (Ethernet)
                    RX packets 429410  bytes 62164248 (59.2 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 398857  bytes 72638593 (69.2 MiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

            ens6np1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 1500
                    ether 10:70:fd:5e:03:1c  txqueuelen 1000  (Ethernet)
                    RX packets 594423  bytes 102732374 (97.9 MiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 299642  bytes 55441854 (52.8 MiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
            lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
                    inet 127.0.0.1  netmask 255.0.0.0
                    inet6 ::1  prefixlen 128  scopeid 0x10<host>
                    loop  txqueuelen 1000  (Local Loopback)
                    RX packets 3251  bytes 934240 (912.3 KiB)
                    RX errors 0  dropped 0  overruns 0  frame 0
                    TX packets 3251  bytes 934240 (912.3 KiB)
                    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0 

            node30(crash):

            [root@node30-lustre 127.0.0.1-2024-09-27-15:55:52]# ll
            total 2358904
            -rw------- 1 root root      52042 Sep 27 15:56 kexec-dmesg.log
            -rw------- 1 root root 2415373763 Sep 27 15:56 vmcore
            -rw------- 1 root root      89731 Sep 27 15:55 vmcore-dmesg.txt
            [root@node30-lustre 127.0.0.1-2024-09-27-15:55:52]# date
            Sat Sep 28 07:20:46 CST 2024
            [root@node30-lustre 127.0.0.1-2024-09-27-15:55:52]# stat vmcore
              File: vmcore
              Size: 2415373763      Blocks: 4717528    IO Block: 4096   regular file
            Device: fd02h/64770d    Inode: 1272797     Links: 1
            Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
            Access: 2024-09-27 16:27:13.394000000 +0800
            Modify: 2024-09-27 15:56:08.100000000 +0800
            Change: 2024-09-27 15:56:08.382000000 +0800
             Birth: 2024-09-27 15:55:53.543000000 +0800 

            Thanks, shadow, you gave me very useful information.

            node30 only detected that the RDMA link to node28 had timed out, but I found that node30 crashed when node28 was powered back on.

            Do you have more detailed information about your problem? We can analyze the correlation together.

             


            shadow Alexey Lyashkov added a comment:

            Did you have any connect error events before the crash?

            It is probably the same issue I saw some time ago.
            Lustre relies on the cmid->qp field to decide whether to disconnect, but MOFED behaves differently.
            CM events may be generated for a cm_id that has no QP but is still in the connecting state, which opens a race where Lustre releases the cm_id while MOFED continues to deliver events for it.
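
            To make the suspected ordering concrete, here is one possible interleaving that is consistent with the oops backtrace (cma_netevent_work_handler -> cma_cm_event_handler -> kiblnd_cm_callback) and with the DISCONNECTED handler quoted above; whether this is exactly the race being described here is an assumption:

            /* hypothetical interleaving, for illustration only:
             *
             *   rdma_cm event path                      rdma_cm netevent worker
             *   ------------------------------          ----------------------------------
             *   kiblnd_cm_callback(DISCONNECTED)
             *     kiblnd_close_conn(conn, 0)
             *     kiblnd_conn_decref(conn)
             *     cmid->context = NULL
             *                                           cma_netevent_work_handler()
             *                                             cma_cm_event_handler(UNREACHABLE)
             *                                               kiblnd_cm_callback(UNREACHABLE)
             *                                                 conn = cmid->context   (now NULL)
             *                                                 conn->ibc_state        -> oops at offset 0x4c
             */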

            xiyan Rongyao Peng added a comment (edited):

            int
            kiblnd_cm_callback(struct rdma_cm_id *cmid, struct rdma_cm_event *event)
            ...
                    case RDMA_CM_EVENT_UNREACHABLE:
                            conn = cmid->context;
                            CNETERR("%s: UNREACHABLE %d cm_id %p conn %p ibc_state: %d\n",
                                    libcfs_nid2str(conn->ibc_peer->ibp_nid),
                                    event->status, cmid, conn, conn->ibc_state);
                            LASSERT(conn->ibc_state != IBLND_CONN_INIT);
                            if (conn->ibc_state == IBLND_CONN_ACTIVE_CONNECT ||
                                conn->ibc_state == IBLND_CONN_PASSIVE_WAIT) {
                                    kiblnd_connreq_done(conn, -ENETDOWN);
                                    kiblnd_conn_decref(conn);
                            }
                            return 0;
            ...

            conn is a NULL pointer here; no other key findings yet.
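
            For reference, a rough sketch of how the faulting offset can be cross-checked in the vmcore with the crash utility (assuming the ko2iblnd debuginfo is available; per the Code: bytes in the oops, the faulting instruction is "mov 0x4c(%r13),%ebp" with R13 == 0, i.e. a read through a NULL conn pointer):

            crash> mod -s ko2iblnd                       # load o2iblnd module debuginfo (path may differ)
            crash> dis -l kiblnd_cm_callback+0xa5d       # shows the faulting "mov 0x4c(%r13),%ebp"
            crash> struct -o kib_conn | grep ibc_state   # confirm ibc_state sits at offset 0x4c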


            People

              Assignee: xiyan Rongyao Peng
              Reporter: xiyan Rongyao Peng
              Votes: 0
              Watchers: 9
