Lustre / LU-14379

ls hangs on client and console log message LNet: Service thread pid 25241 was inactive performing mdt_reint_rename


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Environment: zfs-0.7,
      kernel-3.10.0-1160.4.1.1chaos.ch6.x86_64,
      lustre-2.12.5_10.llnl-3.ch6.x86_64

    Description

      On the morning of Jan 27th, we got reports of directory listings ("ls") hanging; the affected directories were on MDT5 and MDT7.

      The console logs of MDT5 and MDT7 both reported repeated watchdog dumps, all with very similar stacks. The first one on MDT7 appeared at Thu Jan 21 12:01:33 2021:

      # From zinc8 dmesg log
      LNet: Service thread pid 25241 was inactive for 200.13s. The thread might be hung, or it might only be slow and will resume later. D
      Pid: 25241, comm: mdt01_049 3.10.0-1160.4.1.1chaos.ch6.x86_64 #1 SMP Fri Oct 9 17:56:20 PDT 2020
      Call Trace:
       [<ffffffffc141a460>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
       [<ffffffffc141be2f>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
       [<ffffffffc141ef3e>] ldlm_cli_enqueue+0x40e/0x920 [ptlrpc]
       [<ffffffffc1982342>] osp_md_object_lock+0x162/0x2d0 [osp]
       [<ffffffffc1895194>] lod_object_lock+0xf4/0x780 [lod]
       [<ffffffffc1916ace>] mdd_object_lock+0x3e/0xe0 [mdd]
       [<ffffffffc17ae681>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
       [<ffffffffc17aec1a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
       [<ffffffffc17c407e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
       [<ffffffffc17c6400>] mdt_reint_rename+0x2c0/0x2900 [mdt]
       [<ffffffffc17cf113>] mdt_reint_rec+0x83/0x210 [mdt]
       [<ffffffffc17ab303>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
       [<ffffffffc17b6b37>] mdt_reint+0x67/0x140 [mdt] 
       [<ffffffffc14b8b1a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
       [<ffffffffc145d80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
       [<ffffffffc1461bfd>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
       [<ffffffffadecafc1>] kthread+0xd1/0xe0
       [<ffffffffae5c1ff7>] ret_from_fork_nospec_end+0x0/0x39
       [<ffffffffffffffff>] 0xffffffffffffffff
      LustreError: dumping log to /tmp/lustre-log.1611259694.25241 
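
      The binary debug log named in that last console line can be decoded on the MDS with lctl and searched for the hung thread's activity. A minimal sketch, assuming the dump file is still present on zinc8 (the .txt output path is illustrative):

      # Convert the binary debug dump to readable text.
      lctl debug_file /tmp/lustre-log.1611259694.25241 /tmp/lustre-log.1611259694.25241.txt
      # Pull out the lines logged by the hung service thread (pid 25241).
      grep ':25241:' /tmp/lustre-log.1611259694.25241.txt | less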

      The remaining dumps took a slightly different path within ldlm_cli_enqueue, blocked in ptlrpc_set_wait waiting on the enqueue RPC rather than in ldlm_completion_ast:

      ptlrpc_set_wait+0x4d8/0x800 [ptlrpc]
      ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      ldlm_cli_enqueue+0x3d2/0x920 [ptlrpc]
      osp_md_object_lock+0x162/0x2d0 [osp]
      lod_object_lock+0xf4/0x780 [lod]
      mdd_object_lock+0x3e/0xe0 [mdd]
      mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
      mdt_remote_object_lock+0x2a/0x30 [mdt]
      mdt_rename_lock+0xbe/0x4b0 [mdt]
      mdt_reint_rename+0x2c0/0x2900 [mdt]
      mdt_reint_rec+0x83/0x210 [mdt]
      mdt_reint_internal+0x6e3/0xaf0 [mdt]
      mdt_reint+0x67/0x140 [mdt]
      tgt_request_handle+0xada/0x1570 [ptlrpc]
      ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      ptlrpc_main+0xc4d/0x2280 [ptlrpc]
      kthread+0xd1/0xe0
      ret_from_fork_nospec_end+0x0/0x39
      0xffffffffffffffff
      

      I can provide mdt7 debug logs and an ldlm namespace dump, a core dump from mdt5, and dmesg logs for both MDTs.
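
      For reference, a sketch of how such artifacts are typically gathered with the standard lctl debug facilities (output paths are illustrative, not the exact files offered above):

      # On the MDS: dump the Lustre kernel debug buffer to a text file.
      lctl dk /tmp/mdt7-debug-log.txt
      # Dump all LDLM namespaces into the debug buffer, then extract them.
      lctl set_param ldlm.dump_namespaces=1
      lctl dk /tmp/mdt7-ldlm-namespaces.txt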


            People

              Assignee: Lai Siyao (laisiyao)
              Reporter: Olaf Faaland (ofaaland)
