Lustre / LU-19032

ldlm: duplicate locking leads to client eviction

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.15.5
    • None
    • Lustre server 2.15.5 RoCE
      Lustre MGS 2.15.5 RoCE
      Lustre client 2.15.5 RoCE
    • 3

    Description

      The Lustre client and server are each deployed inside a VM. The VMs use network card PF (physical function) pass-through mode.

      【OS】
      VM Version: qemu-kvm-7.0.0
      OS Version: Rocky 8.10
      Kernel Version: 4.18.0-553.el8_10.x86_64

      【Network Card】
      Client:
      MLX CX6 1*100G RoCE v2
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

      Server:
      MLX CX6 2*100G RoCE v2 multi-rail
      MLNX_OFED_LINUX-23.10-3.2.2.0-rhel8.10-x86_64

       

      【BUG Info】

      Steps to reproduce:

      • Mount Lustre over a RoCE network.
      • Simulate a failure of a single network port on the Lustre server (port down for 20 s, then up for 30 s, giving a 50 s failure/recovery cycle).
      • Result: the client is evicted because the lock callback times out.

       

      Test Command:

      • Client: Vdbench random read and write
      • Server: for i in {1..1000}; do ifconfig ens6f0np0 down; sleep 20; ifconfig ens6f0np0 up; sleep 30; done

       

      【Log Info】

      The network port failure occurred at 19:57:09.

      00010000:00020000:16.0:2025-05-19 19:59:02:0:5800:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 103s: evicting client at 10.255.153.142@o2ib ns: filter-CeaPFS-OST0002_UUID lock: 000000002f285d1c/0xd9e9f38857f9e386 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x60000400000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 282 pid: 7295 timeout: 13543 lvb_type: 1
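The "after 103s" in the eviction message can be cross-checked against the other timestamps in this ticket. The snippet below is only a back-of-the-envelope sketch using values copied from the report; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Values copied from the eviction message in this report.
evicted_at = datetime(2025, 5, 19, 19, 59, 2)
waited = timedelta(seconds=103)

# Time at which the callback timer must have started.
armed_at = evicted_at - waited
print(armed_at.time())  # 19:57:19

# Port failure time stated in this report.
port_down_at = datetime(2025, 5, 19, 19, 57, 9)
print((armed_at - port_down_at).total_seconds())  # 10.0
```

The computed start time, 19:57:19, matches the timestamp of the ldlm_add_waiting_lock() entries for the same lock, about 10 s after the reported port failure.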

       

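      OST: There are two extent locks (handles 0xd9e9f38857f9e44a and 0xd9e9f38857f9e386) on the same resource, with the same extent range, the same mode, and the same remote (client) handle 0xb088d4fae4619a96. The OST sends an LDLM_BL_CALLBACK to the OSC for each of them.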

      00010000:00010000:8.0:2025-05-19 19:57:19:0:7652:0:(ldlm_lockd.c:1017:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-CeaPFS-OST0002_UUID lock: 0000000014179b64/0xd9e9f38857f9e44a lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x50000000000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 232 pid: 7295 timeout: 0 lvb_type: 1

      00010000:00010000:8.0:2025-05-19 19:57:19:0:7652:0:(ldlm_lockd.c:499:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: filter-CeaPFS-OST0002_UUID lock: 0000000014179b64/0xd9e9f38857f9e44a lrc: 4/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x70000400000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 232 pid: 7295 timeout: 13543 lvb_type: 1

      00010000:00010000:8.0:2025-05-19 19:57:19:0:7652:0:(ldlm_lockd.c:1017:ldlm_server_blocking_ast()) ### server preparing blocking AST ns: filter-CeaPFS-OST0002_UUID lock: 000000002f285d1c/0xd9e9f38857f9e386 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x50000000000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 232 pid: 7295 timeout: 0 lvb_type: 1

      00010000:00010000:8.0:2025-05-19 19:57:19:0:7652:0:(ldlm_lockd.c:499:ldlm_add_waiting_lock()) ### adding to wait list(timeout: 100, AT: on) ns: filter-CeaPFS-OST0002_UUID lock: 000000002f285d1c/0xd9e9f38857f9e386 lrc: 4/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x70000400000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 232 pid: 7295 timeout: 13543 lvb_type: 1

      00010000:00010000:9.0:2025-05-19 19:57:19:0:5794:0:(ldlm_lockd.c:1832:ldlm_request_cancel()) ### server cancels blocked lock after 0s ns: filter-CeaPFS-OST0002_UUID lock: 0000000014179b64/0xd9e9f38857f9e44a lrc: 4/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x60000400000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 233 pid: 7295 timeout: 13543 lvb_type: 1

      00010000:00010000:9.0:2025-05-19 19:57:19:0:5794:0:(ldlm_lockd.c:573:ldlm_del_waiting_lock()) ### removed ns: filter-CeaPFS-OST0002_UUID lock: 0000000014179b64/0xd9e9f38857f9e44a lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x50000400000020 nid: 10.255.153.142@o2ib remote: 0xb088d4fae4619a96 expref: 233 pid: 7295 timeout: 13543 lvb_type: 1
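The duplicate-lock observation can be checked mechanically against a debug log. The following is a minimal sketch, not part of Lustre: it assumes the log lines keep the field layout shown above, `parse` and `find_duplicates` are hypothetical helper names, and the two sample strings are copies of the "adding to wait list" entries with the prefix before `lock:` and the fields after `remote:` trimmed.

```python
import re
from collections import defaultdict

# Trimmed copies of the two "adding to wait list" entries from the OST log.
LINES = [
    "lock: 0000000014179b64/0xd9e9f38857f9e44a lrc: 4/0,0 mode: PR/PR "
    "res: [0x640000400:0x189ff:0x0].0x0 rrc: 4 type: EXT "
    "[0->18446744073709551615] (req 0->1073741823) gid 0 "
    "flags: 0x70000400000020 nid: 10.255.153.142@o2ib "
    "remote: 0xb088d4fae4619a96",
    "lock: 000000002f285d1c/0xd9e9f38857f9e386 lrc: 4/0,0 mode: PR/PR "
    "res: [0x640000400:0x189ff:0x0].0x0 rrc: 4 type: EXT "
    "[0->18446744073709551615] (req 0->1073741823) gid 0 "
    "flags: 0x70000400000020 nid: 10.255.153.142@o2ib "
    "remote: 0xb088d4fae4619a96",
]

# One pattern per field, so field order in the line does not matter.
FIELDS = {
    "handle": re.compile(r"lock: [0-9a-f]+/(0x[0-9a-f]+)"),
    "mode": re.compile(r"mode: (\S+)"),
    "res": re.compile(r"res: (\S+)"),
    "extent": re.compile(r"type: EXT \[(\d+->\d+)\]"),
    "remote": re.compile(r"remote: (0x[0-9a-f]+)"),
}


def parse(line):
    """Return the lock fields of one log line, or None if any is missing."""
    record = {}
    for name, pattern in FIELDS.items():
        match = pattern.search(line)
        if match is None:
            return None
        record[name] = match.group(1)
    return record


def find_duplicates(lines):
    """Map (res, extent, mode, remote) to server handles when more than one exists."""
    groups = defaultdict(set)
    for line in lines:
        record = parse(line)
        if record is None:
            continue
        key = (record["res"], record["extent"], record["mode"], record["remote"])
        groups[key].add(record["handle"])
    return {key: sorted(handles) for key, handles in groups.items() if len(handles) > 1}


dups = find_duplicates(LINES)
for key, handles in dups.items():
    print(key, "->", handles)
```

For the two entries above, this reports a single group with two server-side handles, which is the duplicate described in this ticket. Lines that lack any of the five fields are skipped rather than treated as errors.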

       
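      OSC: The client receives two LDLM_BL_CALLBACKs for the same client lock (0xb088d4fae4619a96) but sends LDLM_CANCEL to the OST only once. The client lock records remote handle 0xd9e9f38857f9e44a, and that is the lock the OST cancels; the OST's second lock, 0xd9e9f38857f9e386, is never cancelled, so its callback timer expires and the client is evicted.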

      00000100:00010000:35.0F:2025-05-19 19:57:19:0:2793:0:(service.c:2315:ptlrpc_server_handle_request()) got req 1832535926691328

      00000100:00010000:37.0F:2025-05-19 19:57:19:0:2794:0:(service.c:2315:ptlrpc_server_handle_request()) got req 1832535926691392

      00010000:00010000:21.0:2025-05-19 19:57:19:0:6997:0:(ldlm_lockd.c:1931:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: CeaPFS-OST0002-osc-ff28bc47121ca000 lock: 00000000f4d1844e/0xb088d4fae4619a96 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x800420000000000 nid: local remote: 0xd9e9f38857f9e44a expref: -99 pid: 11788 timeout: 0 lvb_type: 1

      00010000:00010000:31.0:2025-05-19 19:57:19:0:3901:0:(ldlm_lockd.c:1931:ldlm_handle_bl_callback()) ### client blocking AST callback handler ns: CeaPFS-OST0002-osc-ff28bc47121ca000 lock: 00000000f4d1844e/0xb088d4fae4619a96 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x800420000000000 nid: local remote: 0xd9e9f38857f9e44a expref: -99 pid: 11788 timeout: 0 lvb_type: 1

      00010000:00010000:21.0:2025-05-19 19:57:19:0:6997:0:(ldlm_lockd.c:1945:ldlm_handle_bl_callback()) ### Lock 00000000f4d1844e already unused, calling callback (000000006203d3eb) ns: CeaPFS-OST0002-osc-ff28bc47121ca000 lock: 00000000f4d1844e/0xb088d4fae4619a96 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x800420400000000 nid: local remote: 0xd9e9f38857f9e44a expref: -99 pid: 11788 timeout: 0 lvb_type: 1

      00010000:00010000:31.0:2025-05-19 19:57:19:0:3901:0:(ldlm_lockd.c:1945:ldlm_handle_bl_callback()) ### Lock 00000000f4d1844e already unused, calling callback (000000006203d3eb) ns: CeaPFS-OST0002-osc-ff28bc47121ca000 lock: 00000000f4d1844e/0xb088d4fae4619a96 lrc: 3/0,0 mode: PR/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x800420400000000 nid: local remote: 0xd9e9f38857f9e44a expref: -99 pid: 11788 timeout: 0 lvb_type: 1

      00010000:00010000:21.0:2025-05-19 19:57:19:0:6997:0:(ldlm_request.c:1302:ldlm_cancel_pack()) ### packing 00000000cf61bb32 ns: CeaPFS-OST0002-osc-ff28bc47121ca000 lock: 00000000f4d1844e/0xb088d4fae4619a96 lrc: 2/0,0 mode: -/PR res: [0x640000400:0x189ff:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->1073741823) gid 0 flags: 0x804c69400000000 nid: local remote: 0xd9e9f38857f9e44a expref: -99 pid: 11788 timeout: 0 lvb_type: 1

      00000100:00010000:21.0:2025-05-19 19:57:19:0:2719:0:(niobuf.c:944:ptl_send_rpc()) @@@ send flags=0 req@00000000cf61bb32 x1832541655502144/t0(0) o103->CeaPFS-OST0002-osc-ff28bc47121ca000@10.255.153.172@o2ib:17/18 lens 328/224 e 0 to 0 dl 1747655901 ref 2 fl Rpc:r/0/ffffffff rc 0/-1 job:'' timeout: 52
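The mismatch between callbacks and cancels can be summarised with simple set arithmetic. This is an illustrative sketch only: the handles are copied from the logs in this report, the variable names are hypothetical, and it assumes the cancel identifies the server lock by the `remote:` handle stored in the client lock.

```python
# Server-side handles put on the waiting list at 19:57:19 (OST log).
waiting_on_server = {"0xd9e9f38857f9e44a", "0xd9e9f38857f9e386"}

# 'remote:' handle of the client lock packed into the single LDLM_CANCEL (OSC log).
cancelled_by_client = {"0xd9e9f38857f9e44a"}

# Locks whose callback timer keeps running on the OST.
still_waiting = waiting_on_server - cancelled_by_client
print(still_waiting)  # {'0xd9e9f38857f9e386'}

# Handle named in the "lock callback timer expired" message.
evicted_lock = "0xd9e9f38857f9e386"
print(evicted_lock in still_waiting)  # True
```

The one handle left waiting is the same handle named in the eviction message.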

            People

              Assignee: WC Triage (wc-triage)
              Reporter: Rongyao Peng (xiyan)