[LU-16375] dump more information for threads blocked on local DLM locks - Whamcloud Community JIRA

Details

Type: Improvement
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: None
Labels:
- debug
- lug24dd
- medium

Description

When a server thread is blocked on a DLM lock, it is often difficult to see what the other threads in the system are doing with that lock, and why it is being held for a long time:

lfs02-n29 kernel: Pid: 16214, comm: ll_ost11_002 3.10.0-1160.45.1.el7.x86_64
lfs02-n29 kernel: Call Trace:
lfs02-n29 kernel: [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc]
lfs02-n29 kernel: [<0>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
lfs02-n29 kernel: [<0>] ofd_destroy_by_fid+0x1d1/0x500 [ofd]
lfs02-n29 kernel: [<0>] ofd_destroy_hdl+0x267/0xa00 [ofd]
lfs02-n29 kernel: [<0>] tgt_request_handle+0x7f3/0x1760 [ptlrpc]
lfs02-n29 kernel: [<0>] ptlrpc_server_handle_request+0x253/0xb30 [ptlrpc]
lfs02-n29 kernel: [<0>] ptlrpc_main+0xb3c/0x14d0 [ptlrpc]
lfs02-n29 kernel: Pid: 16274, comm: ll_ost14_005 3.10.0-1160.45.1.el7.x86_64
lfs02-n29 kernel: Call Trace:
lfs02-n29 kernel: [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc]
lfs02-n29 kernel: [<0>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
lfs02-n29 kernel: [<0>] ofd_destroy_by_fid+0x1d1/0x500 [ofd]
lfs02-n29 kernel: [<0>] ofd_destroy_hdl+0x267/0xa00 [ofd]
lfs02-n29 kernel: [<0>] tgt_request_handle+0x7f3/0x1760 [ptlrpc]
lfs02-n29 kernel: [<0>] ptlrpc_server_handle_request+0x253/0xb30 [ptlrpc]
lfs02-n29 kernel: [<0>] ptlrpc_main+0xb3c/0x14d0 [ptlrpc]

Later, these threads time out while waiting in ldlm_completion_ast(). The timeout message from ldlm_expired_completion_wait() dumps the lock and resource FID for the waiting request, but gives no information about which thread is holding the conflicting lock:

lfs02-n29 kernel: LustreError: 16214:0:(ldlm_request.c:124:ldlm_expired_completion_wait())
    ### lock timed out (enqueued at 1670282438, 300s ago); not entering recovery in
    server code, just going back to sleep ns: filter-lfs02-OST0038_UUID
    lock: ffff984ae2538480/0x4bc7fa9cb92be450 lrc: 3/0,1 mode: --/PW
    res: [0x800000410:0x813652:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615]
    (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local
    remote: 0x0 expref: -99 pid: 16214 timeout: 0 lvb_type: 0
lfs02-n29 kernel: LustreError: 16274:0:(ldlm_request.c:124:ldlm_expired_completion_wait())
    ### lock timed out (enqueued at 1670282473, 300s ago); not entering recovery in
    server code, just going back to sleep ns: filter-lfs02-OST0039_UUID
    lock: ffff98394b735680/0x4bc7fa9cb94f8af9 lrc: 3/0,1 mode: --/PW
    res: [0x84000040c:0x8b52d3:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615]
    (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local
    remote: 0x0 expref: -99 pid: 16274 timeout: 0 lvb_type: 0
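Until the server prints the holder itself, the two kinds of messages above have to be matched up by hand. A minimal triage sketch follows, assuming the log text has the layout shown in this ticket (continuation lines of a message do not contain " kernel: "). The helper names `join_records` and `parse_log` are hypothetical and not part of any Lustre tool. It only pairs each timed-out PID with its thread name, namespace and resource; it cannot identify the lock holder, which is exactly the information this ticket asks for.

```python
import re

# Sample copied from the log excerpts in this ticket: one thread header and
# one timed-out lock message wrapped over several lines.
SAMPLE = """\
lfs02-n29 kernel: Pid: 16214, comm: ll_ost11_002 3.10.0-1160.45.1.el7.x86_64
lfs02-n29 kernel: Call Trace:
lfs02-n29 kernel: [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc]
lfs02-n29 kernel: LustreError: 16214:0:(ldlm_request.c:124:ldlm_expired_completion_wait())
    ### lock timed out (enqueued at 1670282438, 300s ago); not entering recovery in
    server code, just going back to sleep ns: filter-lfs02-OST0038_UUID
    lock: ffff984ae2538480/0x4bc7fa9cb92be450 lrc: 3/0,1 mode: --/PW
    res: [0x800000410:0x813652:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615]
    (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local
    remote: 0x0 expref: -99 pid: 16214 timeout: 0 lvb_type: 0
"""

THREAD_RE = re.compile(r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+)")
TIMEOUT_RE = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\(ldlm_request\.c:\d+:ldlm_expired_completion_wait\(\)\)"
)
FIELD_RES = {
    "enqueued_at": re.compile(r"enqueued at (\d+)"),
    "namespace": re.compile(r"\bns: (\S+)"),
    "mode": re.compile(r"\bmode: (\S+)"),
    "resource": re.compile(r"\bres: (\[[^\]]+\])"),
}


def join_records(text):
    """Merge wrapped continuation lines into the message they belong to."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if " kernel: " in line or not records:
            records.append(line.strip())
        else:
            records[-1] += " " + line.strip()
    return records


def parse_log(text):
    """Return one dict per timed-out lock, annotated with the thread name."""
    comms = {}
    timeouts = []
    for record in join_records(text):
        thread = THREAD_RE.search(record)
        if thread:
            comms[int(thread["pid"])] = thread["comm"]
            continue
        timeout = TIMEOUT_RE.search(record)
        if timeout:
            entry = {"pid": int(timeout["pid"])}
            for name, regex in FIELD_RES.items():
                found = regex.search(record)
                entry[name] = found.group(1) if found else None
            timeouts.append(entry)
    for entry in timeouts:
        entry["comm"] = comms.get(entry["pid"])
    return timeouts


if __name__ == "__main__":
    for entry in parse_log(SAMPLE):
        print(entry)
```
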

Since we know this is a local lock holder ("not entering recovery in server code"), it should be possible to add LDLM_ERROR() printing of the conflicting locks held on that resource, and to call libcfs_debug_dumpstack() for the PID(s) that are holding the lock(s).
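The selection step in this proposal can be illustrated with a small user-space model. This is a sketch under stated assumptions, not Lustre code: the `Lock` class, `conflicts()` and `find_blockers()` are hypothetical names, only the PR and PW modes are modelled, and the rule "overlapping extents conflict when either lock is PW" is a simplification of the real LDLM compatibility matrix. In the kernel, each lock returned here would be the candidate for an LDLM_ERROR() dump and a libcfs_debug_dumpstack() of its PID.

```python
from dataclasses import dataclass

EXTENT_MAX = 2**64 - 1  # matches the 0->18446744073709551615 extent in the logs


@dataclass
class Lock:
    """Toy stand-in for an extent lock on a single resource (hypothetical)."""
    pid: int       # thread that enqueued the lock
    mode: str      # "PR" or "PW" only; other modes are not modelled
    start: int     # first byte covered by the extent
    end: int       # last byte covered by the extent
    granted: bool  # True if held, False if still waiting


def conflicts(a, b):
    """Simplified rule: overlapping extents conflict if either lock is PW."""
    overlap = a.start <= b.end and b.start <= a.end
    return overlap and "PW" in (a.mode, b.mode)


def find_blockers(waiter, resource_locks):
    """Return the granted locks on the resource that block the waiter."""
    return [
        lock for lock in resource_locks
        if lock.granted and lock is not waiter and conflicts(waiter, lock)
    ]


if __name__ == "__main__":
    # PID 16214 is from the ticket; PID 4242 is an invented holder.
    waiter = Lock(16214, "PW", 0, EXTENT_MAX, granted=False)
    locks = [waiter, Lock(4242, "PR", 0, 4095, granted=True)]
    for lock in find_blockers(waiter, locks):
        print(f"pid {waiter.pid} is blocked by pid {lock.pid} "
              f"holding {lock.mode} [{lock.start}->{lock.end}]")
```

Waiting locks are deliberately excluded from the result, since only granted locks can be the reason a request is stuck.
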

Issue Links

- is duplicated by: LU-16625 improved Lustre thread debugging (In Progress)
- is related to: LU-17540 sync and delay before LBUG() calls panic() (Resolved)
- is related to: LU-14858 kernfs tree to dump/traverse ldlm lock resources for debug (Open)

Activity

Gerrit Updater added a comment - 05/Dec/24 5:54 AM

"Arshad Hussain <arshad.hussain@aeoncomputing.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57296
Subject: LU-16375 ldlm: dump more info for threads blocked on local DLM locks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9f6d52c8b915136f816ccbf55cf5735905242250
