[LU-14858] kernfs tree to dump/traverse ldlm locks Created: 16/Jul/21 Updated: 04/Feb/24 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | ldlm |
| Issue Links: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
It would be very useful for debugging to have a parameter tree under ldlm.namespaces.<namespace>.resources.*.* (and also linking the <namespace> under osc|mdc.*.ldlm.resources.* or similar) that would allow dumping all of the LDLM lock resources, along with the current locks on each resource. Essentially, there should be a ...resources.<resid> file for each resource (where <resid> is in the form of a FID, <seq>:<oid>:<ver>.<hash>), and "lctl list_param" could be used on the client or server to list all of the resources with active locks.

The content of each .<resid> file would be the list of locks on that resource (one lock per line in YAML format), along with the lock parameters, like the following (shown wrapped for convenience, but each lock should be printed on a single line):

ldlm.namespaces.testfs-MDT0000-mdc-ffff.resources.0x200000007:0x1:0x0.0=
- { lockh: 0x2e1ae7d1e49694a8, remote: 0x2e1ae7d1e496949a,
  type: IBT, req: PW, grant: PR, flags: 0x40210000000000,
  nid: 192.168.10.35@tcp, bits: 0x13, try_bits: 0x0 }
ldlm.namespaces.testfs-OST0000-osc-ffff.resources.0x10001000:0x2:0x0.0=
- { lockh: 0x2e1ae84397894a4, remote: 0x2e1ae7d17364748c,
  type: EXT, req: PW, grant: PW, flags: 0x40210000014000,
  nid: 192.168.10.36@tcp, start: 0, end: 1048575, gid: 0 }
- { lockh: 0x2e1ae84397794b4, remote: 0x2e1ae7d173688430,
  type: EXT, req: PW, grant: PW, flags: 0x40210000014000,
  nid: 192.168.10.36@tcp, start: 10485760, end: 11534335, gid: 0 }

The output should have the common fields at the start and the type-specific fields at the end, to simplify parsing by scripts (e.g. awk can find fields by position). It would be useful to show both granted and blocked locks for each resource, so it is possible to see how much contention there is on each resource. This would also be useful on the MDS for dumping all of the flock locks in the system for debugging/monitoring.

Writing "clear" or "0" into one of the .<resid> files would either cancel all locks on the resource (if on the client and no references are held), or send a blocking callback to the clients holding locks on that resource (if on the server). This might be refined to allow cancelling a specific lock on the resource by writing its local or peer <lockh> into .<resid>. A usage sketch follows. |
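For illustration, a minimal sketch of how the proposed interface might be used from the shell. None of this exists today: the parameter names and the "clear" write are the proposal above, and the awk field positions assume one lock per line with the common fields printed first, as described:

# list all resources with active locks in the MDC namespaces
lctl list_param ldlm.namespaces.*-mdc-*.resources.*

# dump the locks on a single resource
lctl get_param ldlm.namespaces.testfs-MDT0000-mdc-*.resources.0x200000007:0x1:0x0.0

# common fields come first, so scripts can select by position,
# e.g. print the local handle ($4) and granted mode ($12) of every lock
lctl get_param -n ldlm.namespaces.*.resources.* |
awk '/^- /{ gsub(/,/, ""); print $4, $12 }'

# cancel all locks on the resource (on a client), or send blocking
# callbacks to the clients holding locks on it (on a server)
lctl set_param ldlm.namespaces.testfs-MDT0000-mdc-*.resources.0x200000007:0x1:0x0.0=clear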
| Comments |
| Comment by Andreas Dilger [ 16/Jul/21 ] |
|
It would also be useful to expose the last_activity timestamp for each lock line, and add a new activity_count: counter (for the whole resource) that shows the number of times that resource was referenced since it was created (e.g. via new locks and lock prolong operations). That would allow debugging how active a particular object is on the client or server, which is currently quite difficult to do without dumping verbose logs and processing them. |
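A sketch of what the extended output might look like, with activity_count as a per-resource line and last_activity added to each lock. The field placement and the values (count 42, a Unix timestamp) are purely illustrative, and the lock line is again wrapped for readability:

ldlm.namespaces.testfs-MDT0000-mdc-ffff.resources.0x200000007:0x1:0x0.0=
activity_count: 42
- { lockh: 0x2e1ae7d1e49694a8, remote: 0x2e1ae7d1e496949a,
  type: IBT, req: PW, grant: PR, flags: 0x40210000000000,
  nid: 192.168.10.35@tcp, bits: 0x13, try_bits: 0x0,
  last_activity: 1626448800 }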
| Comment by Nir Talmor [ 04/Feb/24 ] |
|
Hi Andreas |