Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.15.5
- Labels: None
- Environment:
Lustre server 2.15.5 RoCE
Lustre MGS 2.15.5 RoCE
Lustre client 2.15.5 RoCE
- Severity: 3
- Rank (Obsolete): 9223372036854775807
Description
The Lustre clients and servers are deployed in VMs. The VMs use PF (physical function) pass-through for the network card.
【OS】
VM Version: qemu-kvm-7.0.0
OS Version: Rocky 8.10
Kernel Version: 4.18.0-553.el8_10.x86_64
【Network Card】
Client:
MLX CX6 1*100G RoCE v2
MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
Server:
MLX CX6 2*100G RoCE v2 multi-rail
MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64
【BUG Info】
Steps to reproduce:
- Mount Lustre over a RoCE network
- Inject a repeated failure of a single network port on the Lustre server (50 s cycle: 20 s down, 30 s up)
- The client is evicted because the lock callback times out
Test Command:
- Client: Vdbench random read and write
- Server: for i in {1..1000}; do ifconfig ens6f0np0 down; sleep 20; ifconfig ens6f0np0 up; sleep 30; done
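The server-side loop above can also be wrapped in a small function that timestamps each transition, which makes it easier to line the port flaps up against the Lustre debug logs. This is only a sketch: `flap_port` and the `DRY_RUN` switch are hypothetical names, and it uses `ip link set` in place of the `ifconfig` calls from the original command.

```shell
# Hypothetical helper: flap one server port in a fixed down/up cycle.
# Usage: flap_port IFACE DOWN_SECS UP_SECS CYCLES
# Set DRY_RUN=1 to print the commands instead of running them.
flap_port() {
    local iface=$1 down_secs=$2 up_secs=$3 cycles=$4
    local run=""
    [ "${DRY_RUN:-0}" = "1" ] && run="echo"
    local i=1
    while [ "$i" -le "$cycles" ]; do
        # Timestamp each transition so it can be matched against the debug logs.
        echo "$(date '+%F %T') cycle $i: $iface down"
        $run ip link set dev "$iface" down
        $run sleep "$down_secs"
        echo "$(date '+%F %T') cycle $i: $iface up"
        $run ip link set dev "$iface" up
        $run sleep "$up_secs"
        i=$((i + 1))
    done
}

# Example matching the timing in this ticket (needs root on the server):
#   flap_port ens6f0np0 20 30 1000
```

Running it once with `DRY_RUN=1` first shows exactly which commands would be issued without touching the interface.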
【Log Info】
Network port failure at 19:57:09; the client was evicted at 19:59:02:
0010000:00020000:16.0:2025-05-19 19:59:02:0:5800:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 103s: evicting client at 10.255.153.142@o2ib ns: filter-CeaPFS-OST0002_UUID lock: 000000002f285d1c/0xd9e9f38857f9e386 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x60000400000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 282 pid: 7295 timeout: 13543 lvb_type: 1
OST: Two extent locks exist on the same resource with the same range and mode, both pointing to the same client lock handle (remote: 0xb088d4fae4619a96), and the server sends an LDLM_BL_CALLBACK to the OSC for each.
00010000:00010000:8.0:2025-05-19 19:57:19:0:7652:0:(ldlm_lockd.c:1017:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-CeaPFS-OST0002_UUID lock: 0000000014179b64/0xd9e9f38857f9e44a lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x50000000000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 232 pid: 7295 timeout: 0 lvb_type: 1
00010000:00010000:8.0:2025-05-19 19:57:19:0:7652:0:(ldlm_lockd.c:499:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: filter-CeaPFS-OST0002_UUID lock: 0000000014179b64/0xd9e9f38857f9e44a lrc: 4/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x70000400000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 232 pid: 7295 timeout: 13543 lvb_type: 1
00010000:00010000:8.0:2025-05-19 19:57:19:0:7652:0:(ldlm_lockd.c:1017:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-CeaPFS-OST0002_UUID lock: 000000002f285d1c/0xd9e9f38857f9e386 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x50000000000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 232 pid: 7295 timeout: 0 lvb_type: 1
00010000:00010000:8.0:2025-05-19 19:57:19:0:7652:0:(ldlm_lockd.c:499:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: filter-CeaPFS-OST0002_UUID lock: 000000002f285d1c/0xd9e9f38857f9e386 lrc: 4/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x70000400000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 232 pid: 7295 timeout: 13543 lvb_type: 1
00010000:00010000:9.0:2025-05-19 19:57:19:0:5794:0:(ldlm_lockd.c:1832:ldlm_request_cancel()) ### server cancels blocked lock after 0s ns: filter-CeaPFS-OST0002_UUID lock: 0000000014179b64/0xd9e9f38857f9e44a lrc: 4/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x60000400000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 233 pid: 7295 timeout: 13543 lvb_type: 1
00010000:00010000:9.0:2025-05-19 19:57:19:0:5794:0:(ldlm_lockd.c:573:ldlm_del_waiting_lock()) ### removed ns: filter-CeaPFS-OST0002_UUID lock: 0000000014179b64/0xd9e9f38857f9e44a lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x50000400000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 233 pid: 7295 timeout: 13543 lvb_type: 1
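To look for the same pattern elsewhere in an OST debug log, one option is to group the "server preparing blocking AST" lines by their `remote:` handle and report any client handle that is referenced by more than one server lock handle. The sketch below assumes the text log format shown above; `find_duplicate_bl_ast` is a hypothetical helper name, not a Lustre tool.

```shell
# Hypothetical helper: list client lock handles ("remote:") that received
# blocking ASTs from more than one distinct server lock handle.
# Reads an OST debug log in text form from the files given, or from stdin.
find_duplicate_bl_ast() {
    awk '
    /server preparing blocking AST/ {
        cookie = ""; remote = ""
        for (i = 1; i < NF; i++) {
            # "lock:" is followed by <pointer>/<handle>; keep the handle.
            if ($i == "lock:")   { split($(i + 1), p, "/"); cookie = p[2] }
            if ($i == "remote:") { remote = $(i + 1) }
        }
        if (cookie != "" && remote != "" && !((remote, cookie) in seen)) {
            seen[remote, cookie] = 1
            n[remote]++
            list[remote] = list[remote] " " cookie
        }
    }
    END {
        for (r in n)
            if (n[r] > 1)
                print "remote " r " <-" list[r]
    }' "$@"
}
```

On the OST lines quoted in this ticket, this should report the remote handle 0xb088d4fae4619a96 together with the two server handles ending in e44a and e386.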
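OSC: The client receives two LDLM_BL_CALLBACKs for the same lock (0xb088d4fae4619a96) but sends LDLM_CANCEL to the OST only once. The client lock's remote handle is 0xd9e9f38857f9e44a, which matches the one lock the server cancels; the other server lock (0xd9e9f38857f9e386) is the one whose callback timer later expires.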
00000100:00010000:35.0F:2025-05-19 19:57:19:0:2793:0:(service.c:2315:ptlrpc_server_handle_request()) got req 1832535926691328
00000100:00010000:37.0F:2025-05-19 19:57:19:0:2794:0:(service.c:2315:ptlrpc_server_handle_request()) got req 1832535926691392
00010000:00010000:21.0:2025-05-19 19:57:19:0:6997:0:(ldlm_lockd.c:1931:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: CeaPFS-OST0002-osc-ff28bc47121ca000 lock: 00000000f4d1844e/0xb088d4fae4619a96 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x800420000000000 nid: local remote: 0xd9e9f38857f9e44a expref: -99 pid: 11788 timeout: 0 lvb_type: 1
00010000:00010000:31.0:2025-05-19 19:57:19:0:3901:0:(ldlm_lockd.c:1931:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: CeaPFS-OST0002-osc-ff28bc47121ca000 lock: 00000000f4d1844e/0xb088d4fae4619a96 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x800420000000000 nid: local remote: 0xd9e9f38857f9e44a expref: -99 pid: 11788 timeout: 0 lvb_type: 1
00010000:00010000:21.0:2025-05-19 19:57:19:0:6997:0:(ldlm_lockd.c:1945:ldlm_handle_bl_callback()) ### Lock 00000000f4d1844e already unused, calling callback (000000006203d3eb) ns: CeaPFS-OST0002-osc-ff28bc47121ca000 lock: 00000000f4d1844e/0xb088d4fae4619a96 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x800420400000000 nid: local remote: 0xd9e9f38857f9e44a expref: -99 pid: 11788 timeout: 0 lvb_type: 1
00010000:00010000:31.0:2025-05-19 19:57:19:0:3901:0:(ldlm_lockd.c:1945:ldlm_handle_bl_callback()) ### Lock 00000000f4d1844e already unused, calling callback (000000006203d3eb) ns: CeaPFS-OST0002-osc-ff28bc47121ca000 lock: 00000000f4d1844e/0xb088d4fae4619a96 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x800420400000000 nid: local remote: 0xd9e9f38857f9e44a expref: -99 pid: 11788 timeout: 0 lvb_type: 1
00010000:00010000:21.0:2025-05-19 19:57:19:0:6997:0:(ldlm_request.c:1302:ldlm_cancel_pack()) ### packing 00000000cf61bb32 ns: CeaPFS-OST0002-osc-ff28bc47121ca000 lock: 00000000f4d1844e/0xb088d4fae4619a96 lrc: 2/0,0 mode: -/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0>1073741823) gid 0 flags: 0x804c69400000000 nid: local remote: 0xd9e9f38857f9e44a expref: -99 pid: 11788 timeout: 0 lvb_type: 1
00000100:00010000:21.0:2025-05-19 19:57:19:0:2719:0:(niobuf.c:944:ptl_send_rpc()) @@@ send flags=0 req@00000000cf61bb32 x1832541655502144/t0(0) o103->CeaPFS-OST0002-osc-ff28bc47121ca000@10.255.153.172@o2ib:17/18 lens 328/224 e 0 to 0 dl 1747655901 ref 2 fl Rpc:r/0/ffffffff rc 0/-1 job:'' timeout: 52
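Taken together, the traces suggest checking which server lock handles were sent a blocking AST but never show a matching "server cancels blocked lock" line, since those are the locks whose callback timer can expire. A minimal sketch, assuming the OST text log format shown above; `find_uncancelled_locks` is a hypothetical helper name.

```shell
# Hypothetical helper: print server lock handles that got a blocking AST
# but have no "server cancels blocked lock" line in the same log.
# Reads an OST debug log in text form from the files given, or from stdin.
find_uncancelled_locks() {
    awk '
    # Return the handle part of the "lock: <pointer>/<handle>" field.
    function cookie_of(   i, p) {
        for (i = 1; i < NF; i++)
            if ($i == "lock:") { split($(i + 1), p, "/"); return p[2] }
        return ""
    }
    /server preparing blocking AST/ { c = cookie_of(); if (c != "") sent[c] = 1 }
    /server cancels blocked lock/   { c = cookie_of(); if (c != "") cancelled[c] = 1 }
    END {
        for (c in sent)
            if (!(c in cancelled))
                print c
    }' "$@"
}
```

A handle printed here is only a candidate: the cancel may simply fall outside the captured log window, so the result needs to be checked against the eviction message.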
Issue Links
- is related to LU-16064 RPC from evicted client can corrupt data (In Progress)