Lustre / LU-14858

kernfs tree to dump/traverse ldlm lock resources for debug


Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor

    Description

      It would be very useful for debugging to have a parameter tree under ldlm.namespaces.<namespace>.resources.*.* (and also linking the <namespace> under osc|mdc.*.ldlm.resources.* or similar) that allows dumping all of the LDLM lock resources, along with the current locks on each resource. Essentially, there should be a ...resources.<resid> file for each resource (where <resid> is in the form of a FID <seq>:<oid>:<ver>.<hash>), and "lctl list_param" could be used on the client or server to list all of the resources with active locks. The content of each .<resid> file would be the list of locks on that resource (one lock per line, in YAML format) together with the lock parameters, for example (wrapped here for readability, but each lock should be printed on a single line):

      ldlm.namespaces.testfs-MDT0000-mdc-ffff.resources.0x200000007:0x1:0x0.0=
             - { lockh: 0x2e1ae7d1e49694a8, remote: 0x2e1ae7d1e496949a,
                 type: IBT, req: PW, grant: PR, flags: 0x40210000000000,
                 nid: 192.168.10.35@tcp, bits: 0x13, try_bits: 0x0 }
      ldlm.namespaces.testfs-OST0000-osc-ffff.resources.0x10001000:0x2:0x0.0=
             - { lockh: 0x2e1ae84397894a4, remote: 0x2e1ae7d17364748c,
                 type: EXT, req: PW, grant: PW, flags: 0x40210000014000,
                 nid: 192.168.10.36@tcp, start: 0, end: 1048575, gid: 0 }
             - { lockh: 0x2e1ae84397794b4, remote: 0x2e1ae7d173688430,
                 type: EXT, req: PW, grant: PW, flags: 0x40210000014000,
                 nid: 192.168.10.36@tcp, start: 10485760, end: 11534335, gid: 0 }
      

      The output should have common fields at the start and type-specific fields at the end, to simplify parsing by scripts (e.g. awk can find fields by position).
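      As a sketch of such script-side parsing, the one-line flow-YAML records shown above could be split positionally without a full YAML parser. This is a hypothetical helper (parse_lock_line is not existing Lustre tooling); the sample record is taken from the example output above:

```python
def parse_lock_line(line):
    """Parse one proposed '- { key: val, key: val, ... }' lock record
    into a dict, preserving field order (hypothetical format from the
    description above, not an existing Lustre interface)."""
    body = line.strip().lstrip("-").strip().strip("{}").strip()
    fields = {}
    for pair in body.split(","):
        key, _, value = pair.partition(":")
        fields[key.strip()] = value.strip()
    return fields

# Sample record from the example output above, joined onto one line
# (as the real output would print it):
sample = ("- { lockh: 0x2e1ae7d1e49694a8, remote: 0x2e1ae7d1e496949a, "
          "type: IBT, req: PW, grant: PR, flags: 0x40210000000000, "
          "nid: 192.168.10.35@tcp, bits: 0x13, try_bits: 0x0 }")
lock = parse_lock_line(sample)
print(lock["type"], lock["grant"])  # fields parse by name...
print(list(lock)[:3])               # ...and the common-first order is stable
```

      Because the common fields always come first, a positional tool like awk and a keyed parser like this one agree on where each field lives.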

      It would be useful to show both granted and blocked locks for each resource, so it is possible to see how much contention there is on each resource. The main intention would be to allow debugging why a client or server thread is blocked on a DLM lock by dumping the current holder(s) and waiter(s) on a particular lock resource, and then following the trail to the node(s) and thread PIDs holding a lock on the resource to see why it is not releasing the lock.

      This would also be useful on the MDS for dumping all of the flock locks in the system for debugging/monitoring.

      Writing "clear" or "0" into one of the .<resid> files would either cancel all locks on the resource (on a client, if no references are held) or send a blocking callback to the clients holding locks on that resource (on a server). This might be refined to allow cancelling a single lock on the resource by writing its local or peer <lockh> into .<resid>.
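      The write-side dispatch could look something like the following minimal sketch of the proposed semantics. classify_resid_write is a hypothetical name, and a real implementation would live in the kernel param handler, not in Python:

```python
def classify_resid_write(value):
    """Interpret a string written to a .<resid> file under the proposed
    semantics: 'clear' or '0' acts on every lock on the resource (cancel
    on a client, blocking callback on a server), while a hex lock handle
    targets one specific local or peer lock.  Hypothetical helper."""
    value = value.strip()
    if value in ("clear", "0"):
        return ("all", None)
    try:
        return ("handle", int(value, 16))
    except ValueError:
        raise ValueError("expected 'clear', '0', or a hex <lockh>") from None

print(classify_resid_write("clear"))
print(classify_resid_write("0x2e1ae7d1e49694a8"))
```

      Checking for "clear"/"0" before attempting the hex parse keeps "0" acting on the whole resource rather than being read as handle zero.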


      People

        Assignee: Arshad Hussain (arshad512)
        Reporter: Andreas Dilger (adilger)
        Votes: 2
        Watchers: 12
