Description
When a server thread is blocked on a DLM lock, it is often difficult to see what the other threads in the system are doing with that lock and why it is being held for so long. The stack dumps show only the waiting threads:
lfs02-n29 kernel: Pid: 16214, comm: ll_ost11_002 3.10.0-1160.45.1.el7.x86_64
lfs02-n29 kernel: Call Trace:
lfs02-n29 kernel: [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc]
lfs02-n29 kernel: [<0>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
lfs02-n29 kernel: [<0>] ofd_destroy_by_fid+0x1d1/0x500 [ofd]
lfs02-n29 kernel: [<0>] ofd_destroy_hdl+0x267/0xa00 [ofd]
lfs02-n29 kernel: [<0>] tgt_request_handle+0x7f3/0x1760 [ptlrpc]
lfs02-n29 kernel: [<0>] ptlrpc_server_handle_request+0x253/0xb30 [ptlrpc]
lfs02-n29 kernel: [<0>] ptlrpc_main+0xb3c/0x14d0 [ptlrpc]
lfs02-n29 kernel: Pid: 16274, comm: ll_ost14_005 3.10.0-1160.45.1.el7.x86_64
lfs02-n29 kernel: Call Trace:
lfs02-n29 kernel: [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc]
lfs02-n29 kernel: [<0>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
lfs02-n29 kernel: [<0>] ofd_destroy_by_fid+0x1d1/0x500 [ofd]
lfs02-n29 kernel: [<0>] ofd_destroy_hdl+0x267/0xa00 [ofd]
lfs02-n29 kernel: [<0>] tgt_request_handle+0x7f3/0x1760 [ptlrpc]
lfs02-n29 kernel: [<0>] ptlrpc_server_handle_request+0x253/0xb30 [ptlrpc]
lfs02-n29 kernel: [<0>] ptlrpc_main+0xb3c/0x14d0 [ptlrpc]
Later, these threads time out while waiting for the lock to be granted (ldlm_expired_completion_wait()). The timeout message dumps the resource FID for the request, but gives no information about which thread is holding the lock:
lfs02-n29 kernel: LustreError: 16214:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1670282438, 300s ago); not entering recovery in server code, just going back to sleep ns: filter-lfs02-OST0038_UUID lock: ffff984ae2538480/0x4bc7fa9cb92be450 lrc: 3/0,1 mode: --/PW res: [0x800000410:0x813652:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 16214 timeout: 0 lvb_type: 0
lfs02-n29 kernel: LustreError: 16274:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1670282473, 300s ago); not entering recovery in server code, just going back to sleep ns: filter-lfs02-OST0039_UUID lock: ffff98394b735680/0x4bc7fa9cb94f8af9 lrc: 3/0,1 mode: --/PW res: [0x84000040c:0x8b52d3:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 16274 timeout: 0 lvb_type: 0
Since we know this is a local lock holder ("not entering recovery in server code"), it should be possible to add LDLM_ERROR() printing of the conflicting locks held on that resource, and to call libcfs_debug_dumpstack() for the PID(s) that are holding those lock(s).
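To make the idea concrete, here is a rough userspace sketch of the scan this would need: walk the locks already granted on the resource, pick out the ones that conflict with the timed-out waiter, and report their owning PIDs. This is not Lustre code. All names (toy_lock, toy_conflicts(), toy_report_holders()) are hypothetical, the compatibility masks are an assumption based on the classic DLM mode table, and fprintf() stands in for where a real patch would presumably use LDLM_ERROR() and libcfs_debug_dumpstack().

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy lock modes as bit flags, named after the modes seen in the log. */
enum toy_mode {
	TOY_EX = 1, TOY_PW = 2, TOY_PR = 4, TOY_CW = 8, TOY_CR = 16, TOY_NL = 32
};

/* Hypothetical stand-in for a lock: mode, inclusive extent, owner PID. */
struct toy_lock {
	enum toy_mode mode;
	uint64_t start, end;
	int pid;
};

/* Modes compatible with 'm' (assumed: classic DLM compatibility table). */
static unsigned int toy_compat_mask(enum toy_mode m)
{
	switch (m) {
	case TOY_EX: return TOY_NL;
	case TOY_PW: return TOY_NL | TOY_CR;
	case TOY_PR: return TOY_NL | TOY_CR | TOY_PR;
	case TOY_CW: return TOY_NL | TOY_CR | TOY_CW;
	case TOY_CR: return TOY_NL | TOY_CR | TOY_CW | TOY_PR | TOY_PW;
	case TOY_NL: return ~0u;
	}
	return 0;
}

/* Two locks conflict if their modes are incompatible and extents overlap. */
static bool toy_conflicts(const struct toy_lock *a, const struct toy_lock *b)
{
	bool modes_ok = (toy_compat_mask(a->mode) & (unsigned int)b->mode) != 0;
	bool overlap = a->start <= b->end && b->start <= a->end;

	return !modes_ok && overlap;
}

/*
 * Scan the granted locks for ones that block 'waiter'. Each conflicting
 * lock is printed, and its PID is stored in 'pids' while room remains.
 * Returns the total number of conflicting locks found.
 */
static size_t toy_report_holders(const struct toy_lock *waiter,
				 const struct toy_lock *granted,
				 size_t ngranted, int *pids, size_t max_pids)
{
	size_t found = 0;

	assert(waiter != NULL);
	for (size_t i = 0; i < ngranted; i++) {
		const struct toy_lock *g = &granted[i];

		if (!toy_conflicts(waiter, g))
			continue;
		/* A real patch would log the lock and dump the holder's stack. */
		fprintf(stderr,
			"pid %d blocked by mode 0x%x [%" PRIu64 "->%" PRIu64 "] held by pid %d\n",
			waiter->pid, (unsigned int)g->mode, g->start, g->end,
			g->pid);
		if (pids != NULL && found < max_pids)
			pids[found] = g->pid;
		found++;
	}
	return found;
}
```

With a PW waiter over the full extent (as in the log above), a granted PR or PW lock on any overlapping range would be reported, while a granted CR lock would not. In the server the scan would have to run under the resource lock, and the holder's PID may belong to a thread that has since exited, so the stack dump would need to tolerate a missing task.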