added a comment - I tried to isolate the file open/locked issue by clearing the debug buffer on server and client and using path2fid to get the fid in order to filter debug logs for operations on the specific file.
The server-side log is r2u31n1-ldebug.txt
The client-side log is r2u05n1-ldebug.txt
State of the LFS when this test was done: Previously run workflow (single client) left file in odd lock state . Client left mounted. Cleared debug buffers on client and server. Ran a test script that cleared debug buffers, ran the lock test program (attached here as wholocked.c) then immediately dumped the server and client debug buffers to files (attached).
Client is running 2.15.5 with the following LUs applied: 17692, 17589, 17871.
Server is running stock, unmodified 2.15.5
FID:
0x200000bd2:0xae:0x0 : /mnt/lustre/test/uid___A002_Xf99bb0_X839e_targets.selfcal.ms/table.lock (file that remains locked)
The client reports the flock LR mistmatch:
00000080:00020000:4.0:1724452804.927656:0:4365:0:(file.c:4899:ll_file_flock()) Flock LR mismatch! inode=[0x200000bd2:0xae:0x0], flags=0x80000, mode=2, pid=2890/4365, start=0/0, end=0/9223372036854775807,type=0/2
From the client-side:
00010000:00010000:4.0:1724452804.927366:0:4365:0:(ldlm_request.c:1029:ldlm_cli_enqueue()) ### client-side enqueue START, flags 0x80000 ns: testlfs-MDT0000-mdc-ffff95167d4ff800 lock: 000000002316aef8/0x43f859ce2434db3f lrc: 3/0,1 mode: --/PW res: [0x200000bd2:0xae:0x0].0xc rrc: 2 type: FLK pid: 4365 [0->9223372036854775807] flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 4365 timeout: 0
00010000:00010000:4.0:1724452804.927373:0:4365:0:(ldlm_request.c:1117:ldlm_cli_enqueue()) ### sending request ns: testlfs-MDT0000-mdc-ffff95167d4ff800 lock: 000000002316aef8/0x43f859ce2434db3f lrc: 3/0,1 mode: --/PW res: [0x200000bd2:0xae:0x0].0xc rrc: 2 type: FLK pid: 4365 [0->9223372036854775807] flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 4365 timeout: 0
00010000:00010000:4.0:1724452804.927611:0:4365:0:(ldlm_request.c:717:ldlm_cli_enqueue_fini()) ### server returned different mode PR ns: testlfs-MDT0000-mdc-ffff95167d4ff800 lock: 000000002316aef8/0x43f859ce2434db3f lrc: 4/0,1 mode: --/PW res: [0x200000bd2:0xae:0x0].0xc rrc: 2 type: FLK pid: 4365 [0->9223372036854775807] flags: 0x80000 nid: local remote: 0xd1c4c0f5acc3a572 expref: -99 pid: 4365 timeout: 0
00010000:00010000:4.0:1724452804.927613:0:4365:0:(ldlm_request.c:745:ldlm_cli_enqueue_fini()) ### client-side enqueue, new policy data ns: testlfs-MDT0000-mdc-ffff95167d4ff800 lock: 000000002316aef8/0x43f859ce2434db3f lrc: 4/0,1 mode: --/PR res: [0x200000bd2:0xae:0x0].0xc rrc: 2 type: FLK pid: 2890 [0->0] flags: 0x80000 nid: local remote: 0xd1c4c0f5acc3a572 expref: -99 pid: 4365 timeout: 0
00010000:00010000:4.0:1724452804.927625:0:4365:0:(ldlm_flock.c:797:ldlm_flock_completion_ast()) ### client-side enqueue granted ns: testlfs-MDT0000-mdc-ffff95167d4ff800 lock: 000000002316aef8/0x43f859ce2434db3f lrc: 4/0,1 mode: PR/PR res: [0x200000bd2:0xae:0x0].0xc rrc: 2 type: FLK pid: 2890 [0->0] flags: 0x10000000080000 nid: local remote: 0xd1c4c0f5acc3a572 expref: -99 pid: 4365 timeout: 0
00010000:00010000:4.0:1724452804.927628:0:4365:0:(ldlm_flock.c:133:ldlm_flock_destroy()) ### ldlm_flock_destroy(mode: 2, flags: 0x800000000) ns: testlfs-MDT0000-mdc-ffff95167d4ff800 lock: 000000002316aef8/0x43f859ce2434db3f lrc: 4/0,1 mode: PR/PR res: [0x200000bd2:0xae:0x0].0xc rrc: 2 type: FLK pid: 2890 [0->0] flags: 0x10000000080000 nid: local remote: 0xd1c4c0f5acc3a572 expref: -99 pid: 4365 timeout: 0
00010000:00010000:4.0:1724452804.927631:0:4365:0:(ldlm_lock.c:820:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PW) ns: testlfs-MDT0000-mdc-ffff95167d4ff800 lock: 000000002316aef8/0x43f859ce2434db3f lrc: 4/0,1 mode: PR/PR res: [0x200000bd2:0xae:0x0].0xc rrc: 2 type: FLK pid: 2890 [0->0] flags: 0x10002400080000 nid: local remote: 0xd1c4c0f5acc3a572 expref: -99 pid: 4365 timeout: 0
00010000:00010000:4.0:1724452804.927639:0:4365:0:(ldlm_request.c:799:ldlm_cli_enqueue_fini()) ### client-side enqueue END ns: testlfs-MDT0000-mdc-ffff95167d4ff800 lock: 000000002316aef8/0x43f859ce2434db3f lrc: 2/0,0 mode: PR/PR res: [0x200000bd2:0xae:0x0].0xc rrc: 2 type: FLK pid: 2890 [0->0] flags: 0x4002400080000 nid: local remote: 0xd1c4c0f5acc3a572 expref: -99 pid: 4365 timeout: 0
00010000:00010000:4.0:1724452804.927642:0:4365:0:(ldlm_lock.c:216:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: testlfs-MDT0000-mdc-ffff95167d4ff800 lock: 000000002316aef8/0x43f859ce2434db3f lrc: 0/0,0 mode: PR/PR res: [0x200000bd2:0xae:0x0].0xc rrc: 2 type: FLK pid: 2890 [0->0] flags: 0x4002400080000 nid: local remote: 0xd1c4c0f5acc3a572 expref: -99 pid: 4365 timeout: 0
00000080:00020000:4.0:1724452804.927656:0:4365:0:(file.c:4899:ll_file_flock()) Flock LR mismatch! inode=[0x200000bd2:0xae:0x0], flags=0x80000, mode=2, pid=2890/4365, start=0/0, end=0/9223372036854775807,type=0/2
00000080:00200000:4.0:1724452804.927707:0:4365:0:(file.c:406:ll_file_release()) VFS Op:inode=[0x200000bd2:0xae:0x0](000000006578d164)
00800000:00000002:4.0:1724452804.927713:0:4365:0:(lmv_obd.c:3572:lmv_lock_match()) Lock match for [0x200000bd2:0xae:0x0]
00800000:00000002:4.0:1724452804.927723:0:4365:0:(lmv_obd.c:1457:lmv_close()) CLOSE [0x200000bd2:0xae:0x0]
00000002:00000002:4.0:1724452804.927725:0:4365:0:(mdc_request.c:904:mdc_close()) testlfs-MDT0000-mdc-ffff95167d4ff800: [0x200000bd2:0xae:0x0] file closed with intent: 0
From the server-side (combined MDS/OSS, single server LFS, one MDT, two OSTs)
00000004:00000002:7.0:1724452804.926688:0:27747:0:(mdt_open.c:1353:mdt_reint_open()) I am going to open [0x200000bd2:0x9:0x0]/(->[0x200000bd2:0xae:0x0]) cr_flag=0240000003 mode=0100000 msg_flag=0x0
00000004:00000040:7.0:1724452804.926699:0:27747:0:(mdt_handler.c:3427:mdt_object_find()) Find object for [0x200000bd2:0xae:0x0]
00000004:00000002:7.0:1724452804.926722:0:27747:0:(mdt_som.c:91:mdt_get_som()) [0x200000bd2:0xae:0x0]: Reading som attrs: valid: 4, size: 417, blocks: 41
00000004:00000002:7.0:1724452804.926732:0:27747:0:(mdt_open.c:875:mdt_object_open_lock()) normal open:[0x200000bd2:0xae:0x0] lease count: 0, lm: 16
00010000:00010000:7.0:1724452804.926747:0:27747:0:(ldlm_lock.c:766:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(CR) ns: mdt-testlfs-MDT0000_UUID lock: 00000000785d9258/0xd1c4c0f5acc3a55d lrc: 3/1,0 mode: --/CR res: [0x200000bd2:0xae:0x0].0x0 bits 0x0/0x0 rrc: 3 type: IBT gid 0 flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 27747 timeout: 0 lvb_type: 0
00010000:00000040:7.0:1724452804.926759:0:27747:0:(ldlm_resource.c:1786:ldlm_resource_dump()) --- Resource: [0x200000bd2:0xae:0x0].0x0 (00000000ef53ad7f) refcount = 2
00010000:00000040:7.0:1724452804.926761:0:27747:0:(ldlm_resource.c:1792:ldlm_resource_dump()) ### ### ns: mdt-testlfs-MDT0000_UUID lock: 00000000b9ddf5ad/0xd1c4c0f5acc3a4ae lrc: 2/0,0 mode: CR/CR res: [0x200000bd2:0xae:0x0].0x0 bits 0x9/0x0 rrc: 3 type: IBT gid 0 flags: 0x40200000000000 nid: 10.0.50.13@o2ib remote: 0x43f859ce2434dac8 expref: 131 pid: 27747 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1724452804.926764:0:27747:0:(ldlm_lock.c:1064:ldlm_granted_list_add_lock()) ### About to add lock: ns: mdt-testlfs-MDT0000_UUID lock: 00000000785d9258/0xd1c4c0f5acc3a55d lrc: 3/1,0 mode: CR/CR res: [0x200000bd2:0xae:0x0].0x0 bits 0x9/0x0 rrc: 3 type: IBT gid 0 flags: 0x50210000000000 nid: local remote: 0x0 expref: -99 pid: 27747 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1724452804.926775:0:27747:0:(ldlm_request.c:531:ldlm_cli_enqueue_local()) ### client-side local enqueue handler, new lock created ns: mdt-testlfs-MDT0000_UUID lock: 00000000785d9258/0xd1c4c0f5acc3a55d lrc: 3/1,0 mode: CR/CR res: [0x200000bd2:0xae:0x0].0x0 bits 0x9/0x0 rrc: 3 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 27747 timeout: 0 lvb_type: 0
00000004:00000002:7.0:1724452804.926785:0:27747:0:(mdt_open.c:898:mdt_object_open_lock()) testlfs-MDT0000: Requested bits lock:[0x200000bd2:0xae:0x0], ibits = 0x9/0x9, open_flags = 0240000003, try_layout = 1 : rc = 0
00000004:00000002:7.0:1724452804.926792:0:27747:0:(mdt_handler.c:869:mdt_pack_attr2body()) [0x200000bd2:0xae:0x0]: nlink=1, mode=100644, valid=0x1100000000002f8f
00000004:00200000:7.0:1724452804.926793:0:27747:0:(mdt_handler.c:912:mdt_pack_attr2body()) [0x200000bd2:0xae:0x0]: returning size 417
00000004:00002000:7.0:1724452804.926815:0:27747:0:(mdt_open.c:300:mdt_mfd_set_mode()) [0x200000bd2:0xae:0x0] Change mfd open_flags 0 -> 0240000003.
00010000:00010000:7.0:1724452804.926927:0:27747:0:(ldlm_lockd.c:1484:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply (err=0, rc=0) ns: mdt-testlfs-MDT0000_UUID lock: 00000000785d9258/0xd1c4c0f5acc3a55d lrc: 3/0,0 mode: CR/CR res: [0x200000bd2:0xae:0x0].0x0 bits 0x9/0x0 rrc: 3 type: IBT gid 0 flags: 0x40200000000000 nid: 10.0.50.13@o2ib remote: 0x43f859ce2434db31 expref: 132 pid: 27747 timeout: 0 lvb_type: 0
00010000:00010000:7.0:1724452804.926933:0:27747:0:(ldlm_inodebits.c:94:ldlm_reprocess_inodebits_queue()) --- Reprocess resource [0x200000bd2:0xae:0x0].0x0 (00000000ef53ad7f)
00010000:00010000:22.0:1724452804.927464:0:196958:0:(ldlm_lock.c:216:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-testlfs-MDT0000_UUID lock: 00000000785d9258/0xd1c4c0f5acc3a55d lrc: 0/0,0 mode: --/CR res: [0x200000bd2:0xae:0x0].0x0 bits 0x9/0x0 rrc: 4 type: IBT gid 0 flags: 0x44a01000000000 nid: 10.0.50.13@o2ib remote: 0x43f859ce2434db31 expref: 132 pid: 27747 timeout: 0 lvb_type: 0
00010000:00010000:22.0:1724452804.927477:0:196958:0:(ldlm_inodebits.c:94:ldlm_reprocess_inodebits_queue()) --- Reprocess resource [0x200000bd2:0xae:0x0].0x0 (00000000ef53ad7f)
00010000:00010000:7.0:1724452804.927519:0:27747:0:(ldlm_lockd.c:1318:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-testlfs-MDT0000_UUID lock: 000000000b90a0e3/0xd1c4c0f5acc3a56b lrc: 2/0,0 mode: --/PW res: [0x200000bd2:0xae:0x0].0xc rrc: 10 type: FLK pid: 0 [0->0] flags: 0x40000000000000 nid: local remote: 0x43f859ce2434db38 expref: -99 pid: 27747 timeout: 0
00010000:00010000:7.0:1724452804.927534:0:27747:0:(ldlm_flock.c:133:ldlm_flock_destroy()) ### ldlm_flock_destroy(mode: 2, flags: 0x40000) ns: mdt-testlfs-MDT0000_UUID lock: 000000000b90a0e3/0xd1c4c0f5acc3a56b lrc: 3/0,0 mode: --/PW res: [0x200000bd2:0xae:0x0].0xc rrc: 10 type: FLK pid: 4365 [0->9223372036854775807] flags: 0x50000000000000 nid: 10.0.50.13@o2ib remote: 0x43f859ce2434db38 expref: 133 pid: 27747 timeout: 0
00010000:00010000:7.0:1724452804.927548:0:27747:0:(ldlm_lockd.c:1484:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply (err=-11, rc=-11) ns: mdt-testlfs-MDT0000_UUID lock: 000000000b90a0e3/0xd1c4c0f5acc3a56b lrc: 1/0,0 mode: --/PW res: [0x200000bd2:0xae:0x0].0xc rrc: 10 type: FLK pid: 4365 [0->9223372036854775807] flags: 0x44000000000000 nid: 10.0.50.13@o2ib remote: 0x43f859ce2434db38 expref: 133 pid: 27747 timeout: 0
00010000:00010000:7.0:1724452804.927566:0:27747:0:(ldlm_lock.c:216:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-testlfs-MDT0000_UUID lock: 000000000b90a0e3/0xd1c4c0f5acc3a56b lrc: 0/0,0 mode: --/PW res: [0x200000bd2:0xae:0x0].0xc rrc: 10 type: FLK pid: 4365 [0->9223372036854775807] flags: 0x44801000000000 nid: 10.0.50.13@o2ib remote: 0x43f859ce2434db38 expref: 132 pid: 27747 timeout: 0
00010000:00010000:7.0:1724452804.927867:0:27747:0:(ldlm_lockd.c:1318:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: mdt-testlfs-MDT0000_UUID lock: 00000000534d6628/0xd1c4c0f5acc3a572 lrc: 2/0,0 mode: --/PW res: [0x200000bd2:0xae:0x0].0xc rrc: 10 type: FLK pid: 0 [0->0] flags: 0x40000000000000 nid: local remote: 0x43f859ce2434db3f expref: -99 pid: 27747 timeout: 0
00010000:00010000:7.0:1724452804.927876:0:27747:0:(ldlm_flock.c:133:ldlm_flock_destroy()) ### ldlm_flock_destroy(mode: 2, flags: 0x80000) ns: mdt-testlfs-MDT0000_UUID lock: 00000000534d6628/0xd1c4c0f5acc3a572 lrc: 3/0,0 mode: --/PW res: [0x200000bd2:0xae:0x0].0xc rrc: 10 type: FLK pid: 4365 [0->9223372036854775807] flags: 0x50000000080000 nid: 10.0.50.13@o2ib remote: 0x43f859ce2434db3f expref: 132 pid: 27747 timeout: 0
00010000:00010000:7.0:1724452804.927885:0:27747:0:(ldlm_lockd.c:1484:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply (err=0, rc=0) ns: mdt-testlfs-MDT0000_UUID lock: 00000000534d6628/0xd1c4c0f5acc3a572 lrc: 1/0,0 mode: --/PR res: [0x200000bd2:0xae:0x0].0xc rrc: 10 type: FLK pid: 2890 [0->0] flags: 0x44000000080000 nid: 10.0.50.13@o2ib remote: 0x43f859ce2434db3f expref: 132 pid: 27747 timeout: 0
00010000:00010000:7.0:1724452804.927888:0:27747:0:(ldlm_lock.c:216:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: mdt-testlfs-MDT0000_UUID lock: 00000000534d6628/0xd1c4c0f5acc3a572 lrc: 0/0,0 mode: --/PR res: [0x200000bd2:0xae:0x0].0xc rrc: 10 type: FLK pid: 2890 [0->0] flags: 0x44000000080000 nid: 10.0.50.13@o2ib remote: 0x43f859ce2434db3f expref: 132 pid: 27747 timeout: 0
00000004:00000002:25.0:1724452804.928285:0:56733:0:(mdt_open.c:2403:mdt_mfd_close()) testlfs-MDT0000: closing file handle [0x200000bd2:0xae:0x0] with intent: 0
00000004:00000002:25.0:1724452804.928308:0:56733:0:(mdd_object.c:1244:mdd_attr_set()) testlfs-MDD0000: no valid attribute on [0x200000bd2:0xae:0x0], previous was 0x7
Patch: https://review.whamcloud.com/c/fs/lustre-release/+/56087