Description
It would be very useful for debugging to have a parameter tree under ldlm.namespaces.<namespace>.resources.*.* (also linking the <namespace> under osc|mdc.*.ldlm.resources.* or similar) that allows dumping all of the LDLM lock resources, along with the current locks on each resource. Essentially, there should be a ...resources.<resid> file for each resource (where <resid> is in the form of a FID <seq>:<oid>:<ver>.<hash>), and "lctl list_param" could be used on the client or server to list all of the resources with active locks. The content of each .<resid> file would be the list of locks on that resource, one lock per line in YAML format, together with the lock parameters, for example (wrapped here for readability, but each lock should be printed on a single line):
ldlm.namespaces.testfs-MDT0000-mdc-ffff.resources.0x200000007:0x1:0x0.0=
- { lockh: 0x2e1ae7d1e49694a8, remote: 0x2e1ae7d1e496949a,
type: IBT, req: PW, grant: PR, flags: 0x40210000000000,
nid: 192.168.10.35@tcp, bits: 0x13, try_bits: 0x0 }
ldlm.namespaces.testfs-OST0000-osc-ffff.resources.0x10001000:0x2:0x0.0=
- { lockh: 0x2e1ae84397894a4, remote: 0x2e1ae7d17364748c,
type: EXT, req: PW, grant: PW, flags: 0x40210000014000,
nid: 192.168.10.36@tcp, start: 0, end: 1048575, gid: 0 }
- { lockh: 0x2e1ae84397794b4, remote: 0x2e1ae7d173688430,
type: EXT, req: PW, grant: PW, flags: 0x40210000014000,
nid: 192.168.10.36@tcp, start: 10485760, end: 11534335, gid: 0 }
The output should have common fields at the start and type-specific fields at the end, to simplify parsing by scripts (e.g., awk can find fields by position).
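As a sketch of such positional parsing (the one-line-per-lock format and field positions below are taken from the example above and are only a proposal, not an existing interface):

```shell
# A sample lock line in the proposed one-line-per-lock YAML format (this
# interface does not exist yet; the line is copied from the example above).
line='- { lockh: 0x2e1ae84397894a4, remote: 0x2e1ae7d17364748c, type: EXT, req: PW, grant: PW, flags: 0x40210000014000, nid: 192.168.10.36@tcp, start: 0, end: 1048575, gid: 0 }'

# Because the common fields come first, awk can pick values by position:
# after stripping commas, $4 = lock handle, $8 = lock type, $12 = granted mode.
echo "$line" | awk '{ gsub(/,/, ""); print $4, $8, $12 }'
# prints: 0x2e1ae84397894a4 EXT PW
```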
It would be useful to show both granted and blocked locks for each resource, so it is possible to see how much contention there is on each resource. The main intention would be to allow debugging why a client or server thread is blocked on a DLM lock by dumping the current holder(s) and waiter(s) on a particular lock resource, and then following the trail to the node(s) and thread PIDs holding a lock on the resource to see why it is not releasing the lock.
This would also be useful on the MDS for dumping all of the flock locks in the system for debugging/monitoring.
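The debugging flow described above might look like the following. Note that "lctl list_param" and "lctl get_param" are existing commands, but the resources.* parameter tree is only proposed here, so the paths are hypothetical:

```shell
# List all resources that currently have locks (proposed parameter tree):
lctl list_param 'ldlm.namespaces.*.resources.*'

# Dump the granted and blocked locks on one contended resource, then follow
# the nid: fields to the node(s) holding the lock to see why it is not
# being released:
lctl get_param 'ldlm.namespaces.testfs-OST0000-osc-*.resources.0x10001000:0x2:0x0.0'
```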
Writing "clear" or "0" into one of the .<resid> files would either cancel all locks on the resource (on the client, if no references are held) or send a blocking callback to the clients holding locks on that resource (on the server). We might refine this to allow cancelling a specific local or remote lock handle on that resource by writing the specific local or peer <lockh> into .<resid>.
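A sketch of the proposed cancellation interface (entirely hypothetical, since neither the writes nor the per-handle refinement exist today):

```shell
# Hypothetical: cancel all locks on a resource from the client
# (only possible if no references are held on the locks):
lctl set_param 'ldlm.namespaces.testfs-MDT0000-mdc-*.resources.0x200000007:0x1:0x0.0=clear'

# Hypothetical refinement: cancel one specific lock by writing its
# local <lockh> into the resource file:
lctl set_param 'ldlm.namespaces.testfs-MDT0000-mdc-*.resources.0x200000007:0x1:0x0.0=0x2e1ae7d1e49694a8'
```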
Issue Links
- is duplicated by LU-3662 Allow reading from ldlm.dump_namespaces or new ldlm.namespaces.*.dump_locks file (Resolved)
- is related to LU-16375 dump more information for threads blocked on local DLM locks (Open)
- is related to LU-19768 add jobstats, brw_stats, others to netlink/YAML output (In Progress)
- is related to LU-6529 Server side lock limits to avoid unnecessary memory exhaustion (Closed)
- is related to LU-9680 Improve the user land to kernel space interface for lustre (Open)
- is related to LU-14859 cancel client DLM locks from the server (Open)