Details
Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 2.15.5
Environment:
Lustre server 2.15.5 RoCE
Lustre MGS 2.15.5 RoCE
Lustre client 2.15.5 RoCE
Severity: 3
Description
The Lustre client and server are deployed inside VMs; the VMs use the network card in PF pass-through mode.
【OS】
VM Version: qemu-kvm-7.0.0
OS Version: Rocky 8.10
Kernel Version: 4.18.0-553.el8_10.x86_64
【Network Card】
Client:
MLX CX6 1*100G RoCE v2
MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
Server:
MLX CX6 2*100G RoCE v2 bond
MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
【BUG Info】
Steps to reproduce:
- Mount Lustre over a RoCE network
- Restart the Lustre server
- A crash (LBUG) occurs on the client
Client call trace:
[97665.762774] Workqueue: ib_cm cm_work_handler [ib_cm]
[97665.762977] Call Trace:
[97665.763182] <TASK>
[97665.763374] dump_stack_lvl+0x34/0x48
[97665.763586] panic+0x100/0x2d2
[97665.763756] lbug_with_loc.cold+0x18/0x18 [libcfs]
[97665.763945] kiblnd_cm_callback+0x108d/0x10b0 [ko2iblnd]
[97665.764116] cma_cm_event_handler+0x1e/0xb0 [rdma_cm]
[97665.764279] cma_ib_handler+0x8d/0x2e0 [rdma_cm]
[97665.764439] cm_process_work+0x22/0x190 [ib_cm]
[97665.764597] ? cm_queue_work_unlock+0x2a/0xd0 [ib_cm]
[97665.764751] cm_rej_handler+0xdf/0x260 [ib_cm]
[97665.764909] cm_work_handler+0x47f/0x4d0 [ib_cm]
[97665.765059] process_one_work+0x1e8/0x390
[97665.765203] worker_thread+0x53/0x3d0
[97665.765350] ? process_one_work+0x390/0x390
[97665.765488] kthread+0x124/0x150
[97665.765626] ? set_kthread_struct+0x50/0x50
[97665.765761] ret_from_fork+0x1f/0x30
[97665.765901] </TASK>
Client kernel log:
[94741.917887] Lustre: 3442:0:(client.c:2289:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1726811270/real 1726811270] req@0000000017c656bd x1810605440940608/t0(0) o400->lustre-MDT0011-mdc-ff395c3864984000@10.255.153.131@o2ib:12/10 lens 224/224 e 1 to 1 dl 1726811302 ref 1 fl Rpc:XQr/c0/ffffffff rc 0/-1 job:''
[94869.917879] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Timed out tx for 10.255.153.124@o2ib: 3 seconds
[94869.918229] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Skipped 195 previous similar messages
[95091.934608] LustreError: 11-0: lustre-OST004d-osc-ff395c3864984000: operation ost_connect to node 10.255.153.126@o2ib failed: rc = -30
[95091.935234] LustreError: Skipped 178 previous similar messages
[95471.901885] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Timed out tx for 10.255.153.123@o2ib: 1 seconds
[95471.904650] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Skipped 205 previous similar messages
[95694.879000] LustreError: 11-0: lustre-OST004d-osc-ff395c3864984000: operation ost_connect to node 10.255.153.126@o2ib failed: rc = -30
[95694.879355] LustreError: Skipped 173 previous similar messages
[96075.869883] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Timed out tx for 10.255.153.123@o2ib: 0 seconds
[96075.870246] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Skipped 198 previous similar messages
[96294.942490] LustreError: 11-0: lustre-OST0022-osc-ff395c3864984000: operation ost_connect to node 10.255.153.128@o2ib failed: rc = -19
[96294.942795] LustreError: Skipped 175 previous similar messages
[96681.885884] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Timed out tx for 10.255.153.124@o2ib: 1 seconds
[96681.886250] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Skipped 197 previous similar messages
[96899.871468] LustreError: 11-0: lustre-OST004d-osc-ff395c3864984000: operation ost_connect to node 10.255.153.126@o2ib failed: rc = -30
[96899.871784] LustreError: Skipped 176 previous similar messages
[97283.869887] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Timed out tx for 10.255.153.123@o2ib: 1 seconds
[97283.870258] LNet: 3430:0:(o2iblnd_cb.c:3442:kiblnd_check_conns()) Skipped 199 previous similar messages
[97501.918467] LustreError: 11-0: lustre-OST004d-osc-ff395c3864984000: operation ost_connect to node 10.255.153.126@o2ib failed: rc = -30
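The call trace shows the LBUG firing inside kiblnd_cm_callback() while ib_cm is delivering a rejection event (cm_rej_handler), i.e. an RDMA CM event arrives for a connection the client has likely already started tearing down while the server was restarting. For reference only, the sketch below is a hypothetical, user-space librdmacm-style handler (not the actual o2iblnd code; demo_conn, demo_conn_fail and demo_cm_handler are made-up names) illustrating the defensive pattern the related tickets LU-17632/LU-17689 describe: treat unexpected REJECTED/UNREACHABLE/CONNECT_ERROR events as a connection failure instead of asserting.

/*
 * Hypothetical sketch, not the actual o2iblnd source: handle late or
 * unexpected RDMA CM events by failing the connection rather than
 * asserting, which is where the client panics in the trace above.
 */
#include <stdio.h>
#include <rdma/rdma_cma.h>

/* Placeholder for per-connection state hung off cm_id->context. */
struct demo_conn {
	int failed;		/* set once the connection is declared dead */
};

static void demo_conn_fail(struct demo_conn *conn, int status)
{
	/* Real code would also close the QP and flush queued work. */
	conn->failed = 1;
	fprintf(stderr, "connection failed, CM status %d\n", status);
}

static int demo_cm_handler(struct rdma_cm_id *cmid,
			   struct rdma_cm_event *event)
{
	struct demo_conn *conn = cmid->context;

	switch (event->event) {
	case RDMA_CM_EVENT_ESTABLISHED:
		/* normal connection-established path */
		return 0;

	case RDMA_CM_EVENT_REJECTED:
	case RDMA_CM_EVENT_UNREACHABLE:
	case RDMA_CM_EVENT_CONNECT_ERROR:
		/*
		 * These can arrive after the peer restarted or after the
		 * local side already gave up on the connection.  Tear the
		 * connection down instead of treating it as "impossible".
		 */
		demo_conn_fail(conn, event->status);
		return 0;

	case RDMA_CM_EVENT_DISCONNECTED:
		demo_conn_fail(conn, 0);
		return 0;

	default:
		/* Log-and-ignore beats a hard assertion for stray events. */
		fprintf(stderr, "ignoring unexpected CM event %d\n",
			event->event);
		return 0;
	}
}

In the reported crash, the corresponding branch in kiblnd_cm_callback() appears to hit an LBUG assertion instead, turning a recoverable CM race during server restart into a client panic.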
Attachments
Issue Links
is related to:
- LU-18275 o2iblnd: unable to handle kernel NULL pointer dereference in kiblnd_cm_callback when receiving RDMA_CM_EVENT_UNREACHABLE (Open)
- LU-15885 o2iblnd: RDMA_CM_EVENT_UNREACHABLE may be received after conn clean-up (Resolved)
- LU-17325 o2iblnd: graceful handling of CM_EVENT_UNREACHABLE on established connection (Resolved)
- LU-17480 lustre_rmmod hangs if a lnet route is down (Resolved)
- LU-17632 o2iblnd: graceful handling of unexpected CM_EVENT_CONNECT_ERROR (Resolved)
- LU-17689 o2iblnd: handle unexpected network data gracefully (Resolved)