[LU-14858] kernfs tree to dump/traverse ldlm locks Created: 16/Jul/21  Updated: 04/Feb/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: WC Triage
Resolution: Unresolved Votes: 2
Labels: ldlm

Issue Links:
Related
is related to LU-6529 Server side lock limits to avoid unne... Closed
is related to LU-14859 cancel client DLM locks from the server Open
is related to LU-16375 dump more information for threads blo... Open

 Description   

It would be very useful for debugging to have a parameter tree under ldlm.namespaces.<namespace>.resources.*.* (also linking the <namespace> under osc|mdc.*.ldlm.resources.* or similar) that allows dumping all of the LDLM lock resources, along with the current locks on each resource. Essentially, there should be a ...resources.<resid> file for each resource (where <resid> has the form of a FID plus hash, <seq>:<oid>:<ver>.<hash>), so that "lctl list_param" could be used on the client or server to list all of the resources with active locks. The content of each .<resid> file would be the list of locks on that resource with their parameters, one lock per line in YAML format, like the following (shown wrapped for convenience, but each lock should be printed on a single line):

ldlm.namespaces.testfs-MDT0000-mdc-ffff.resources.0x200000007:0x1:0x0.0=
       - { lockh: 0x2e1ae7d1e49694a8, remote: 0x2e1ae7d1e496949a,
           type: IBT, req: PW, grant: PR, flags: 0x40210000000000,
           nid: 192.168.10.35@tcp, bits: 0x13, try_bits: 0x0 }
ldlm.namespaces.testfs-OST0000-osc-ffff.resources.0x10001000:0x2:0x0.0=
       - { lockh: 0x2e1ae84397894a4, remote: 0x2e1ae7d17364748c,
           type: EXT, req: PW, grant: PW, flags: 0x40210000014000,
           nid: 192.168.10.36@tcp, start: 0, end: 1048575, gid: 0 }
       - { lockh: 0x2e1ae84397794b4, remote: 0x2e1ae7d173688430,
           type: EXT, req: PW, grant: PW, flags: 0x40210000014000,
           nid: 192.168.10.36@tcp, start: 10485760, end: 11534335, gid: 0 }
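
As a rough illustration of how a script might consume the proposed parameter names, the following Python sketch splits a name into its namespace and <resid> components. The parameter tree does not exist yet, so the layout is taken only from the example above; the function and class names are hypothetical, and treating <hash> as a hexadecimal number is an assumption.

```python
from typing import NamedTuple


class ResourceParam(NamedTuple):
    """Components of a proposed ldlm resource parameter name (hypothetical)."""
    namespace: str
    seq: int
    oid: int
    ver: int
    res_hash: int


def parse_resource_param(name: str) -> ResourceParam:
    """Split 'ldlm.namespaces.<namespace>.resources.<resid>' into its parts."""
    prefix = "ldlm.namespaces."
    marker = ".resources."
    name = name.rstrip("=")  # tolerate the trailing '=' that lctl prints
    if not name.startswith(prefix) or marker not in name:
        raise ValueError(f"not a resource parameter: {name!r}")
    # Assumption: the namespace name never contains '.resources.' itself.
    namespace, resid = name[len(prefix):].split(marker, 1)
    # <resid> is <seq>:<oid>:<ver>.<hash>; the hash follows the last dot.
    fid, _, hash_part = resid.rpartition(".")
    seq, oid, ver = (int(part, 16) for part in fid.split(":"))
    # Assumption: the hash is printed in hexadecimal like the FID fields.
    return ResourceParam(namespace, seq, oid, ver, int(hash_part, 16))


param = parse_resource_param(
    "ldlm.namespaces.testfs-MDT0000-mdc-ffff.resources.0x200000007:0x1:0x0.0="
)
print(param.namespace, hex(param.seq), param.oid, param.ver, param.res_hash)
```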

The output should have common fields at the start and type-specific fields at the end, to simplify parsing by scripts (e.g. awk can find the common fields by position).
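
To make the "common fields first" idea concrete, here is a minimal Python sketch that parses one unwrapped lock line and separates the common fields from the type-specific ones. It is only a sketch under assumptions: the set of common fields (lockh through nid) is inferred from the example output above rather than from any agreed format, and the helper names are hypothetical.

```python
# Assumption: these seven fields are common to all lock types, as in the
# example output; anything after them is treated as type-specific.
COMMON_FIELDS = ("lockh", "remote", "type", "req", "grant", "flags", "nid")


def parse_lock_line(line):
    """Parse one '- { key: value, ... }' lock line into a dict of strings."""
    body = line.strip()
    if not (body.startswith("- {") and body.endswith("}")):
        raise ValueError(f"not a lock line: {line!r}")
    fields = {}
    for item in body[3:-1].split(","):
        # Split on the first ':' only, so values may contain further colons.
        key, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"malformed field: {item!r}")
        fields[key.strip()] = value.strip()
    return fields


def split_fields(fields):
    """Return (common, type_specific) dicts, preserving field order."""
    common = {k: v for k, v in fields.items() if k in COMMON_FIELDS}
    specific = {k: v for k, v in fields.items() if k not in COMMON_FIELDS}
    return common, specific


line = ("- { lockh: 0x2e1ae7d1e49694a8, remote: 0x2e1ae7d1e496949a, "
        "type: IBT, req: PW, grant: PR, flags: 0x40210000000000, "
        "nid: 192.168.10.35@tcp, bits: 0x13, try_bits: 0x0 }")
fields = parse_lock_line(line)
common, specific = split_fields(fields)
print(common["type"], specific)
```

Because the common fields always come first, a positional consumer such as awk could read the same leading columns for every lock type without looking at the keys.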

It would be useful to show both granted and blocked locks for each resource, so it is possible to see how much contention there is on each resource.
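
A monitoring script could then summarize contention per resource. The sketch below is in Python and assumes, since the proposal does not say how blocked locks would be marked, that a lock counts as granted when its req and grant modes match and as waiting otherwise. The input is a dict of already-parsed lock lines, using the req/grant values from the example output; the function name is hypothetical.

```python
def summarize_contention(locks_by_resource):
    """Count granted and waiting locks for each resource.

    Assumption: a lock is granted when its requested mode equals its
    granted mode; any other lock is counted as waiting.
    """
    summary = {}
    for resid, locks in locks_by_resource.items():
        granted = sum(1 for lock in locks if lock["req"] == lock["grant"])
        summary[resid] = {"granted": granted, "waiting": len(locks) - granted}
    return summary


# req/grant pairs taken from the example output in the description.
locks_by_resource = {
    "0x200000007:0x1:0x0.0": [{"req": "PW", "grant": "PR"}],
    "0x10001000:0x2:0x0.0": [
        {"req": "PW", "grant": "PW"},
        {"req": "PW", "grant": "PW"},
    ],
}
summary = summarize_contention(locks_by_resource)
for resid, counts in summary.items():
    print(resid, counts)
```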

This would also be useful on the MDS for dumping all of the flock locks in the system for debugging/monitoring.

Writing "clear" or "0" into one of the .<resid> files would either cancel all locks on the resource (if on the client and no references are held), or send a blocking callback to the clients holding locks on that resource (if on the server). We might refine this to allow cancelling a specific local or remote lock handle on that resource by writing the specific local or peer <lockh> into .<resid>.
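
For illustration, the write interface described above might be driven through "lctl set_param" as in the Python sketch below. The resources.<resid> parameter is only proposed here and does not exist, so the command is built and printed, not executed; the function name and the handle validation are hypothetical.

```python
import re
import shlex
from typing import List, Optional


def cancel_command(namespace: str, resid: str,
                   handle: Optional[str] = None) -> List[str]:
    """Build the lctl command that would cancel locks on one resource.

    With no handle, "clear" is written to cancel every lock on the resource;
    with a handle, only that local or peer <lockh> would be targeted.
    """
    if handle is not None and not re.fullmatch(r"0x[0-9a-fA-F]+", handle):
        raise ValueError(f"not a lock handle: {handle!r}")
    value = handle if handle is not None else "clear"
    # Hypothetical parameter path, following the layout proposed above.
    param = f"ldlm.namespaces.{namespace}.resources.{resid}"
    return ["lctl", "set_param", f"{param}={value}"]


cmd_all = cancel_command("testfs-MDT0000-mdc-ffff", "0x200000007:0x1:0x0.0")
cmd_one = cancel_command("testfs-MDT0000-mdc-ffff", "0x200000007:0x1:0x0.0",
                         handle="0x2e1ae7d1e49694a8")
print(shlex.join(cmd_all))
print(shlex.join(cmd_one))
```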



 Comments   
Comment by Andreas Dilger [ 16/Jul/21 ]

It would also be useful to expose the last_activity timestamp for each lock line, and add a new activity_count: counter (for the whole resource) that shows the number of times that resource was referenced since it was created (e.g. via new locks and lock prolong operations). That would allow debugging how active a particular object is on the client or server, which is currently quite difficult to do without dumping verbose logs and processing them.

Comment by Nir Talmor [ 04/Feb/24 ]

Hi Andreas,
It would be great to have this option for our customers.
Thanks

Generated at Sat Feb 10 03:13:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.