LU-16375

dump more information for threads blocked on local DLM locks


Details

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Minor

Description

      When a server thread is blocked on a DLM lock, it is often difficult to see what the other threads in the system are doing with that lock, and why it has been held for so long:

      lfs02-n29 kernel: Pid: 16214, comm: ll_ost11_002 3.10.0-1160.45.1.el7.x86_64
      lfs02-n29 kernel: Call Trace:
      lfs02-n29 kernel: [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc]
      lfs02-n29 kernel: [<0>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
      lfs02-n29 kernel: [<0>] ofd_destroy_by_fid+0x1d1/0x500 [ofd]
      lfs02-n29 kernel: [<0>] ofd_destroy_hdl+0x267/0xa00 [ofd]
      lfs02-n29 kernel: [<0>] tgt_request_handle+0x7f3/0x1760 [ptlrpc]
      lfs02-n29 kernel: [<0>] ptlrpc_server_handle_request+0x253/0xb30 [ptlrpc]
      lfs02-n29 kernel: [<0>] ptlrpc_main+0xb3c/0x14d0 [ptlrpc]
      lfs02-n29 kernel: Pid: 16274, comm: ll_ost14_005 3.10.0-1160.45.1.el7.x86_64
      lfs02-n29 kernel: Call Trace:
      lfs02-n29 kernel: [<0>] ldlm_completion_ast+0x777/0x9d0 [ptlrpc]
      lfs02-n29 kernel: [<0>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
      lfs02-n29 kernel: [<0>] ofd_destroy_by_fid+0x1d1/0x500 [ofd]
      lfs02-n29 kernel: [<0>] ofd_destroy_hdl+0x267/0xa00 [ofd]
      lfs02-n29 kernel: [<0>] tgt_request_handle+0x7f3/0x1760 [ptlrpc]
      lfs02-n29 kernel: [<0>] ptlrpc_server_handle_request+0x253/0xb30 [ptlrpc]
      lfs02-n29 kernel: [<0>] ptlrpc_main+0xb3c/0x14d0 [ptlrpc]
      

      and then later these threads time out waiting for the lock to be granted, which dumps the resource FID for their request, but no information about which thread is actually holding the conflicting lock:

      lfs02-n29 kernel: LustreError: 16214:0:(ldlm_request.c:124:ldlm_expired_completion_wait())
          ### lock timed out (enqueued at 1670282438, 300s ago); not entering recovery in
          server code, just going back to sleep ns: filter-lfs02-OST0038_UUID
          lock: ffff984ae2538480/0x4bc7fa9cb92be450 lrc: 3/0,1 mode: --/PW
          res: [0x800000410:0x813652:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615]
          (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local
          remote: 0x0 expref: -99 pid: 16214 timeout: 0 lvb_type: 0
      lfs02-n29 kernel: LustreError: 16274:0:(ldlm_request.c:124:ldlm_expired_completion_wait())
          ### lock timed out (enqueued at 1670282473, 300s ago); not entering recovery in
          server code, just going back to sleep ns: filter-lfs02-OST0039_UUID
          lock: ffff98394b735680/0x4bc7fa9cb94f8af9 lrc: 3/0,1 mode: --/PW
          res: [0x84000040c:0x8b52d3:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615]
          (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local
          remote: 0x0 expref: -99 pid: 16274 timeout: 0 lvb_type: 0
      

      Since we know the lock holder is local ("not entering recovery in server code"), it should be possible to add LDLM_ERROR() printing of the conflicting locks held on that resource, and libcfs_debug_dumpstack() for the PID(s) that are holding those locks.
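
      A minimal sketch of how this could look, as a hypothetical helper called from ldlm_expired_completion_wait() in the local-lock case. The helper name ldlm_dump_conflicting_holders() is invented here, and the exact conflict test and task lookup are untested assumptions; the fields and functions used (lr_granted, l_pid, lockmode_compat(), LDLM_ERROR(), libcfs_debug_dumpstack()) do exist in current Lustre, but this is a sketch, not a patch:

      /*
       * Hypothetical helper (not in Lustre today): on a local lock
       * timeout, print every conflicting granted lock on the same
       * resource and dump the stack of the task holding it.
       */
      static void ldlm_dump_conflicting_holders(struct ldlm_lock *lock)
      {
              struct ldlm_resource *res = lock->l_resource;
              struct ldlm_lock *lck;

              lock_res(res);
              list_for_each_entry(lck, &res->lr_granted, l_res_link) {
                      /* skip ourselves and locks with compatible modes
                       * (a real patch would also check extent overlap)
                       */
                      if (lck == lock ||
                          lockmode_compat(lck->l_granted_mode,
                                          lock->l_req_mode))
                              continue;

                      LDLM_ERROR(lck, "conflicting granted lock, holder pid %d",
                                 lck->l_pid);

                      if (lck->l_pid) {
                              struct pid *pid = find_get_pid(lck->l_pid);
                              struct task_struct *tsk;

                              if (!pid)
                                      continue;
                              tsk = get_pid_task(pid, PIDTYPE_PID);
                              if (tsk) {
                                      libcfs_debug_dumpstack(tsk);
                                      put_task_struct(tsk);
                              }
                              put_pid(pid);
                      }
              }
              unlock_res(res);
      }

      The natural call site would be next to the existing "lock timed out" LDLM_ERROR() in ldlm_expired_completion_wait(), inside the branch that already knows there is no remote holder, so the extra output only appears for local conflicts.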


People

    Assignee: Arshad Hussain (arshad512)
    Reporter: Andreas Dilger (adilger)
    Votes: 0
    Watchers: 11
