LU-18275

o2iblnd: unable to handle kernel NULL pointer dereference in kiblnd_cm_callback when receiving RDMA_CM_EVENT_UNREACHABLE


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.5
    • Environment: Lustre server 2.15.5 RoCE
      Lustre MGS 2.15.5 RoCE
      Lustre client 2.15.5 RoCE
    • Severity: 3

    Description

      The Lustre client and server are deployed within VMs; the VMs use the network card in PF pass-through mode.

      【OS】
      VM Version: qemu-kvm-7.0.0
      OS Version: Rocky 8.10
      Kernel Version: 4.18.0-553.el8_10.x86_64

      【Network Card】
      Client:
      MLX CX6 1*100G RoCE v2
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      Server:
      MLX CX6 2*100G RoCE v2 bond
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      【BUG Info】

      Here is the reproducer:

      • Mount Lustre on a RoCE network
      • Restart the Lustre MDT (mdt0, 10.255.153.128@o2ib)
      • A crash occurs on the other Lustre servers

      Server call trace:

      crash> bt
      PID: 568423   TASK: ff4787632aa5c000  CPU: 5    COMMAND: "kworker/u40:0"
       #0 [ff7d728b15e6baa0] machine_kexec at ffffffff8fa6f353
       #1 [ff7d728b15e6baf8] __crash_kexec at ffffffff8fbbaa7a
       #2 [ff7d728b15e6bbb8] crash_kexec at ffffffff8fbbb9b1
       #3 [ff7d728b15e6bbd0] oops_end at ffffffff8fa2d831
       #4 [ff7d728b15e6bbf0] no_context at ffffffff8fa81cf3
       #5 [ff7d728b15e6bc48] __bad_area_nosemaphore at ffffffff8fa8206c
       #6 [ff7d728b15e6bc90] do_page_fault at ffffffff8fa82cf7
       #7 [ff7d728b15e6bcc0] page_fault at ffffffff906011ae
          [exception RIP: kiblnd_cm_callback+2653]
          RIP: ffffffffc0efe00d  RSP: ff7d728b15e6bd70  RFLAGS: 00010246
          RAX: 0000000000000007  RBX: ff7d728b15e6be08  RCX: 0000000000000000
          RDX: ff4787632aa5c000  RSI: ff7d728b15e6be08  RDI: ff47876095213c00
          RBP: ff47876095213c00   R8: 0000000000000000   R9: 006d635f616d6472
          R10: 8080808080808080  R11: 0000000000000000  R12: ff47876095213c00
          R13: 0000000000000000  R14: 0000000000000000  R15: ff47876095213de0
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #8 [ff7d728b15e6bdd8] cma_cm_event_handler at ffffffffc04729a5 [rdma_cm]
       #9 [ff7d728b15e6be00] cma_netevent_work_handler at ffffffffc04786b5 [rdma_cm]
      #10 [ff7d728b15e6be90] process_one_work at ffffffff8fb195e3
      #11 [ff7d728b15e6bed8] worker_thread at ffffffff8fb197d0
      #12 [ff7d728b15e6bf10] kthread at ffffffff8fb20e24
      #13 [ff7d728b15e6bf50] ret_from_fork at ffffffff9060028f 

      Server kernel log:

      [69106.143672] LustreError: 11-0: lustre-MDT000a-osp-MDT0007: operation mds_statfs to node 10.255.153.128@o2ib failed: rc = -107
      [69106.143700] Lustre: lustre-OST0038-osc-MDT0007: Connection to lustre-OST0038 (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69106.145053] LustreError: Skipped 29 previous similar messages
      [69106.145060] Lustre: Skipped 203 previous similar messages
      [69111.263490] Lustre: lustre-OST004e-osc-MDT0007: Connection to lustre-OST004e (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69111.263496] Lustre: Skipped 6 previous similar messages
      [69112.506974] Lustre: lustre-OST0038-osc-MDT0007: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69112.506980] Lustre: Skipped 195 previous similar messages
      [69116.383952] Lustre: lustre-MDT0000-lwp-MDT0007: Connection to lustre-MDT0000 (at 10.255.153.128@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [69116.383965] Lustre: Skipped 3 previous similar messages
      [69122.528880] Lustre: lustre-MDT000a-lwp-OST0013: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69122.528885] Lustre: Skipped 3 previous similar messages
      [69127.659951] Lustre: lustre-OST0043-osc-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69127.659960] Lustre: Skipped 7 previous similar messages
      [69138.672269] Lustre: lustre-OST002d-osc-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69138.672275] Lustre: Skipped 3 previous similar messages
      [69158.168201] Lustre: lustre-MDT000a-osp-MDT0007: Connection restored to 10.255.153.129@o2ib (at 10.255.153.129@o2ib)
      [69158.168206] Lustre: Skipped 1 previous similar message
      [69178.775546] Lustre: lustre-MDT0000-osp-MDT0007: Connection restored to 10.255.153.233@o2ib (at 10.255.153.233@o2ib)
      [69178.775554] Lustre: Skipped 11 previous similar messages
      [69178.805333] Lustre: lustre-OST0008: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.805386] Lustre: lustre-OST0013: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.805505] Lustre: lustre-OST001e: deleting orphan objects from 0x0:13846 to 0x0:13921
      [69178.805729] Lustre: lustre-OST003f: deleting orphan objects from 0x0:13822 to 0x0:13857
      [69178.806122] Lustre: lustre-OST004a: deleting orphan objects from 0x0:13855 to 0x0:13889
      [69178.806130] Lustre: lustre-OST0055: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.807518] Lustre: lustre-OST0029: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.807537] Lustre: lustre-OST0034: deleting orphan objects from 0x0:13854 to 0x0:13889
      [69178.838633] LustreError: 39177:0:(qsd_reint.c:635:qqi_reint_delayed()) lustre-OST0034: Delaying reintegration for qtype:2 until pending updates are flushed.
      [69178.840301] LustreError: 39177:0:(qsd_reint.c:635:qqi_reint_delayed()) Skipped 1 previous similar message
      [69180.358640] LustreError: 37895:0:(qsd_reint.c:635:qqi_reint_delayed()) lustre-OST0013: Delaying reintegration for qtype:2 until pending updates are flushed.
      [69180.359161] LustreError: 37895:0:(qsd_reint.c:635:qqi_reint_delayed()) Skipped 2 previous similar messages
      [69239.967382] BUG: unable to handle kernel NULL pointer dereference at 000000000000004c
      [69239.968927] PGD 0 
      [69239.969144] Oops: 0000 [#1] SMP NOPTI
      [69239.969327] CPU: 5 PID: 568423 Comm: kworker/u40:0 Kdump: loaded Tainted: G           OE     -------- -  - 4.18.0-553.5.1.el8_lustre.x86_64 #1
      [69239.969650] Hardware name: Red Hat KVM, BIOS 1.16.0-4.cl9 04/01/2014
      [69239.969792] Workqueue: rdma_cm cma_netevent_work_handler [rdma_cm]
      [69239.969995] RIP: 0010:kiblnd_cm_callback+0xa5d/0x1ea0 [ko2iblnd]
      [69239.970191] Code: 48 89 05 06 7f 01 00 c7 05 04 7f 01 00 00 00 02 02 48 c7 05 01 7f 01 00 d0 5e f1 c0 e8 ac 66 ee ff e9 b5 f7 ff ff 4c 8b 6f 08 <41> 8b 6d 4c f6 05 e5 d4 f0 ff 01 0f 84 8d 00 00 00 f6 05 dc d4 f0
      [69239.970483] RSP: 0018:ff7d728b15e6bd70 EFLAGS: 00010246
      [69239.970621] RAX: 0000000000000007 RBX: ff7d728b15e6be08 RCX: 0000000000000000
      [69239.970763] RDX: ff4787632aa5c000 RSI: ff7d728b15e6be08 RDI: ff47876095213c00
      [69239.970895] RBP: ff47876095213c00 R08: 0000000000000000 R09: 006d635f616d6472
      [69239.971025] R10: 8080808080808080 R11: 0000000000000000 R12: ff47876095213c00
      [69239.971155] R13: 0000000000000000 R14: 0000000000000000 R15: ff47876095213de0
      [69239.971289] FS:  0000000000000000(0000) GS:ff47877d7f740000(0000) knlGS:0000000000000000
      [69239.971425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [69239.971570] CR2: 000000000000004c CR3: 0000001900e10004 CR4: 0000000000771ee0
      [69239.971708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [69239.971841] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [69239.971971] PKRU: 55555554
      [69239.972099] Call Trace:
      [69239.972267]  ? __die_body+0x1a/0x60
      [69239.972451]  ? no_context+0x1ba/0x3f0
      [69239.972583]  ? __bad_area_nosemaphore+0x16c/0x1c0
      [69239.972705]  ? do_page_fault+0x37/0x12d
      [69239.972826]  ? page_fault+0x1e/0x30
      [69239.972971]  ? kiblnd_cm_callback+0xa5d/0x1ea0 [ko2iblnd]
      [69239.973099]  cma_cm_event_handler+0x25/0xd0 [rdma_cm]
      [69239.973234]  cma_netevent_work_handler+0x75/0xd0 [rdma_cm]
      [69239.973362]  process_one_work+0x1d3/0x390
      [69239.973516]  worker_thread+0x30/0x390
      [69239.973631]  ? process_one_work+0x390/0x390
      [69239.973743]  kthread+0x134/0x150
      [69239.973863]  ? set_kthread_struct+0x50/0x50
      [69239.973976]  ret_from_fork+0x1f/0x40
      [69239.974099] Modules linked in: ofd(OE) ost(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ldiskfs(OE) mbcache jbd2 ko2iblnd(OE) lnet(OE) libcfs(OE) bonding uio_pci_generic uio vfio_pci vfio_virqfd vfio_iommu_type1 vfio cuse fuse rdma_ucm(OE) ib_ipoib(OE) ib_umad(OE) sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm cirrus drm_shmem_helper kvm_intel kvm irqbypass drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea sysfillrect ghash_clmulni_intel sysimgblt rapl drm i2c_piix4 pcspkr virtio_balloon joydev knem(OE) xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ata_generic mlx5_core(OE) mlxfw(OE) ata_piix psample pci_hyperv_intf tls crc32c_intel virtio_console virtio_blk libata serio_raw mlxdevm(OE) xpmem(OE) nvme_tcp(OE) nvme_rdma(OE) rdma_cm(OE) iw_cm(OE) nvme_fabrics(OE) nvme_core(OE) ib_cm(OE) ib_core(OE) mlx_compat(OE) t10_pi
      [69239.975388] CR2: 000000000000004c 
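
      Decoding the faulting instruction bytes from the oops: "4c 8b 6f 08" is mov r13,[rdi+0x8], and the bytes at the fault marker, "41 8b 6d 4c", are mov ebp,[r13+0x4c]. With RDI holding the callback's first argument (the struct rdma_cm_id), R13 = 0, and CR2 = 0x4c, this looks like kiblnd_cm_callback loaded cmid->context (offset 0x8 in struct rdma_cm_id), got NULL, and then faulted reading a field at offset 0x4c of the missing connection object. The backtrace also shows the event being delivered from cma_netevent_work_handler, i.e. the netevent path that raises RDMA_CM_EVENT_UNREACHABLE, rather than the normal CM event path. The following is a minimal sketch, not the actual ko2iblnd source or a proposed patch, of where the suspected dereference sits and what a defensive guard might look like; the function name kiblnd_cm_callback_sketch and the guard itself are illustrative assumptions.

      /* Minimal sketch, NOT the real ko2iblnd code: kib_conn is left opaque
       * here (the real definition lives in lnet/klnds/o2iblnd/o2iblnd.h). */
      #include <rdma/rdma_cm.h>

      struct kib_conn;

      static int
      kiblnd_cm_callback_sketch(struct rdma_cm_id *cmid,
                                struct rdma_cm_event *event)
      {
              struct kib_conn *conn;

              switch (event->event) {
              case RDMA_CM_EVENT_UNREACHABLE:
                      /* "mov r13,[rdi+0x8]": in the dump, this load of
                       * cmid->context produced R13 == 0 */
                      conn = cmid->context;
                      if (!conn) {
                              /* hypothetical guard: a netevent-generated
                               * UNREACHABLE can apparently reach a cm_id
                               * whose context is not (or no longer) set;
                               * bail out instead of dereferencing NULL */
                              return 0;
                      }
                      /* the original code goes on to read a conn field;
                       * with conn == NULL that read is the NULL + 0x4c
                       * access reported as CR2: 000000000000004c */
                      break;
              default:
                      break;
              }
              return 0;
      }

      If this reading is correct, a real fix would need to either prevent the netevent path from invoking the handler for a cm_id without a valid kib_conn context, or make kiblnd_cm_callback tolerate a NULL context; the sketch above only marks where such a guard would sit.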

    People

      Assignee: Rongyao Peng (xiyan)
      Reporter: Rongyao Peng (xiyan)
      Votes: 0
      Watchers: 8