Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Versions: Lustre 2.7.0, Lustre 2.5.3
    • Fix Version: None
    • Severity: 3

    Description

  An LNet service thread hung on the MDT; the watchdog dumped the following stack traces:

      Jan 20 13:42:10 nbp8-mds1 kernel: LNet: Service thread pid 17862 was inactive for 424.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Jan 20 13:42:10 nbp8-mds1 kernel: Pid: 17862, comm: mdt00_057
      Jan 20 13:42:14 nbp8-mds1 kernel: 
      Jan 20 13:42:14 nbp8-mds1 kernel: Call Trace:
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:14 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7cde4>] mdt_object_lock0+0x394/0xb30 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7d644>] mdt_object_lock+0x14/0x20 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7d801>] mdt_object_find_lock+0x61/0x170 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0ea982c>] mdt_reint_open+0x88c/0x21a0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa04fb5d6>] ? upcall_cache_get_entry+0x296/0x880 [libcfs]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0652ef0>] ? lu_ucred+0x20/0x30 [obdclass]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e71935>] ? mdt_ucred+0x15/0x20 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e8e51c>] ? mdt_root_squash+0x2c/0x3f0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e92481>] mdt_reint_rec+0x41/0xe0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e77ed3>] mdt_reint_internal+0x4c3/0x780 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7845e>] mdt_intent_reint+0x1ee/0x410 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e75c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e76106>] mdt_enqueue+0x46/0xe0 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0e7aada>] mdt_handle_common+0x52a/0x1470 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa0eb74a5>] mds_regular_handle+0x15/0x20 [mdt]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa04f08d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07c0a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Jan 20 13:42:17 nbp8-mds1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      
      Jan 20 13:42:18 nbp8-mds1 kernel: Pid: 17780, comm: mdt00_044
      Jan 20 13:42:18 nbp8-mds1 kernel: 
      Jan 20 13:42:18 nbp8-mds1 kernel: Call Trace:
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e7cc06>] mdt_object_lock0+0x1b6/0xb30 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e7d644>] mdt_object_lock+0x14/0x20 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e85b8e>] mdt_getattr_name_lock+0x8fe/0x19d0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07ba7b4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e86ef9>] mdt_intent_getattr+0x299/0x480 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e75c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e76106>] mdt_enqueue+0x46/0xe0 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0e7aada>] mdt_handle_common+0x52a/0x1470 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa0eb74a5>] mds_regular_handle+0x15/0x20 [mdt]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff81061ff2>] ? default_wake_function+0x12/0x20
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Jan 20 13:42:18 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Jan 20 13:42:19 nbp8-mds1 kernel: 
      Jan 20 13:42:19 nbp8-mds1 kernel: Pid: 16840, comm: mdt03_017
      Jan 20 13:42:19 nbp8-mds1 kernel: 
      Jan 20 13:42:19 nbp8-mds1 kernel: Call Trace:
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffffa076a885>] ? _ldlm_lock_debug+0x2d5/0x660 [ptlrpc]
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffff8155a9c2>] schedule_timeout+0x192/0x2e0
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffff81083300>] ? process_timeout+0x0/0x10
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffffa078af70>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
      Jan 20 13:42:19 nbp8-mds1 kernel: [<ffffffffa078f7a1>] ldlm_completion_ast+0x4b1/0x920 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffff81061fe0>] ? default_wake_function+0x0/0x20
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa078ef00>] ldlm_cli_enqueue_local+0x1f0/0x5e0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e7cc06>] mdt_object_lock0+0x1b6/0xb30 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e72de0>] ? mdt_blocking_ast+0x0/0x2a0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa078f2f0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e7d644>] mdt_object_lock+0x14/0x20 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e85b8e>] mdt_getattr_name_lock+0x8fe/0x19d0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07df766>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07ba7b4>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e86ef9>] mdt_intent_getattr+0x299/0x480 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e75c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa076f2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0798ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e76106>] mdt_enqueue+0x46/0xe0 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0e7aada>] mdt_handle_common+0x52a/0x1470 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa0eb74a5>] mds_regular_handle+0x15/0x20 [mdt]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07c80c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa04f08d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07c0a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07ca89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffffa07c9da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Jan 20 13:42:20 nbp8-mds1 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      This looks the same as LU-7232.

      I can upload debug logs if needed.
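
      For reference, the binary debug dump that the watchdog writes (the attached lustre-log.1453326130.17862.gz) can be decoded to text with lctl debug_file. A minimal sketch — the output file name is an assumption, not from this ticket:

      ```shell
      # Sketch: decode the binary Lustre debug dump left by the service-thread watchdog.
      gunzip lustre-log.1453326130.17862.gz
      # lctl debug_file <binary-dump> <text-output>
      lctl debug_file lustre-log.1453326130.17862 lustre-log.17862.txt
      less lustre-log.17862.txt
      ```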

      Attachments

        1. lustre-log.1453326130.17862.gz
          0.3 kB
        2. mds.dmesg
          237 kB
        3. mds.processtrace
          2.01 MB

        Issue Links

          Activity

            [LU-7692] LNet: Service thread Hung

            mhanafi Mahmoud Hanafi added a comment - It was not fixed by LU-7372. We upgraded to 2.7 and the issue has not been seen again.

            spiechurski Sebastien Piechurski added a comment - Just to make sure: is this no longer an issue because it was fixed by patch 17853 for LU-7372, or simply because it was never seen again (perhaps it was fixed on the application side)?
            pjones Peter Jones added a comment - OK, thanks Mahmoud.

            mhanafi Mahmoud Hanafi added a comment - Please close this case. It is no longer an issue.

            mhanafi Mahmoud Hanafi added a comment - We haven't tried disabling ELC because the filesystem is in production. We know that this is triggered when clients are writing/reading a large number of files and then get disconnected due to a network issue.

            jay Jinshan Xiong (Inactive) added a comment - Wait, does that imply disabling ELC actually helped?

            jay Jinshan Xiong (Inactive) added a comment - I need to understand the problem first and then propose a solution.

            mhanafi Mahmoud Hanafi added a comment - Can you provide any additional info on a permanent solution?
            jay Jinshan Xiong (Inactive) added a comment - - edited

            From the backtrace, ALL MDT threads were blocked at LDLM lock enqueue. This implies a deadlock: the DLM lock blocking those MDT threads may have been canceled by ELC, so the cancel was piggy-backed on an RPC request, but that RPC request couldn't be served due to the lack of free MDT threads. This is quite possible given the behavior of that user's application.

            Let's try disabling ELC by running this:

            lctl set_param ldlm.namespaces.*-mdc-*.early_lock_cancel=0
            

            on all client nodes.

            This may have some negative impact on performance, so definitely do not try it at peak hours.
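
            A sketch of checking and toggling the setting; the pdsh fan-out and host list are assumptions, not from this ticket:

            ```shell
            # Sketch: inspect and disable Early Lock Cancel (ELC) on the MDC namespaces.
            lctl get_param ldlm.namespaces.*-mdc-*.early_lock_cancel    # current value (1 = enabled)
            lctl set_param ldlm.namespaces.*-mdc-*.early_lock_cancel=0  # disable ELC
            # Hypothetical fan-out to all client nodes:
            # pdsh -w client[001-100] 'lctl set_param ldlm.namespaces.*-mdc-*.early_lock_cancel=0'
            # Re-enable once testing is done:
            lctl set_param ldlm.namespaces.*-mdc-*.early_lock_cancel=1
            ```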

            One thing I have noticed is this message from the mds:

            <6>Lustre: nbp8-MDT0000: Recovery over after 2:29, of 12307 clients 12307 recovered and 0 were evicted.
            

            12,307 clients were recovered in 2:29 minutes — this is really impressive.
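
            As a back-of-the-envelope sanity check on that number (an illustrative calculation, not from the ticket):

            ```python
            # Recovery throughput implied by the MDS message above.
            clients = 12307
            recovery_seconds = 2 * 60 + 29          # "Recovery over after 2:29"
            rate = clients / recovery_seconds
            print(f"{rate:.1f} clients/s")          # roughly 82.6 clients recovered per second
            ```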


            mhanafi Mahmoud Hanafi added a comment - Attaching the process trace and dmesg output:

            mds.processtrace
            mds.dmesg

            jay Jinshan Xiong (Inactive) added a comment - Hi Jay,

            Can you please extract the log of all processes from the crash dump and post it here?

            This doesn't look like a duplicate of LU-7372.

            From the backtrace, the MDT thread is doing an open; it is also conflicting with a PR lock with LOOKUP|UPDATE|LAYOUT bits, so the evicted client had opened the file for execution. However, I don't know what type of lock the stuck MDT thread was requesting.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: