Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13331

LNet: Service thread pid 23940 was inactive for 756.26s; slow/unusable filesystem

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.12.2
    • None
    • OS: CentOS Linux release 7.6.1810
      Kernel: 3.10.0-957.10.1.el7_lustre.x86_64
      OPA DRIVER: IntelOPA-IFS.RHEL67-x86_64.10.6.1.0.2
      lustre version: 2.12.2(clients&server)
    • 3
    • 9223372036854775807

    Description

      yesterday morning, the filesystem is hung, users can not login . 
      on the master mds we found
      1)ldlm_bl_06 process used 100% cpu
      2) I can see plenty of these on the MDS

      [Wed Mar 4 11:50:45 2020] LNet: Service thread pid 22287 was inactive for 200.50s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [Wed Mar 4 11:50:45 2020] LNet: Skipped 2 previous similar messages
      [Wed Mar 4 11:50:45 2020] Pid: 22287, comm: mdt01_006 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Sun May 26 21:48:35 UTC 2019
      [Wed Mar 4 11:50:45 2020] Call Trace:
      [Wed Mar 4 11:50:45 2020] [<ffffffffc177bd11>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [Wed Mar 4 11:50:45 2020] [<ffffffffc177caac>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      [Wed Mar 4 11:50:45 2020] [<ffffffffc1b08e7b>] mdt_rename_lock+0x24b/0x4b0 [mdt]
      [Wed Mar 4 11:50:45 2020] [<ffffffffc1b0b055>] mdt_reint_rename+0x2d5/0x28e0 [mdt]
      [Wed Mar 4 11:50:45 2020] [<ffffffffc1b13fa3>] mdt_reint_rec+0x83/0x210 [mdt]
      [Wed Mar 4 11:50:45 2020] [<ffffffffc1af21b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [Wed Mar 4 11:50:45 2020] [<ffffffffc1afd497>] mdt_reint+0x67/0x140 [mdt]
      [Wed Mar 4 11:50:45 2020] [<ffffffffc18181da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [Wed Mar 4 11:50:45 2020] [<ffffffffc17bd80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [Wed Mar 4 11:50:45 2020] [<ffffffffc17c113c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [Wed Mar 4 11:50:45 2020] [<ffffffff86cc1c71>] kthread+0xd1/0xe0
      [Wed Mar 4 11:50:45 2020] [<ffffffff87375c37>] ret_from_fork_nospec_end+0x0/0x39
      [Wed Mar 4 11:50:45 2020] [<ffffffffffffffff>] 0xffffffffffffffff
      [Wed Mar 4 11:50:45 2020] LustreError: dumping log to /tmp/lustre-log.1583293917.22287
      [Wed Mar 4 11:50:46 2020] Lustre: 23889:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      req@ffff974059b52400 x1645558396906544/t0(0) o36->c005a9c0-5f7b-3d89-e73f-3927a52e3780@10.10.12.193@o2ib:33/0 lens 552/2888 e 3 to 0 dl 1583293923 ref 2 fl Interpret:/0/0 rc 0/0
      [Wed Mar 4 11:50:46 2020] Lustre: 23889:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 23 previous similar messages
      [Wed Mar 4 11:50:48 2020] Lustre: 23889:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      req@ffff973d487ee780 x1640595607717776/t0(0) o36->55d78ce2-0dfb-0ab1-66c2-1b76b41fa7ae@10.10.12.75@o2ib:35/0 lens 512/2888 e 3 to 0 dl 1583293925 ref 2 fl Interpret:/0/0 rc 0/0
      [Wed Mar 4 11:50:48 2020] Lustre: 23889:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
      [Wed Mar 4 11:52:04 2020] LNet: Service thread pid 23963 was inactive for 1202.44s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [Wed Mar 4 11:52:04 2020] Pid: 23963, comm: mdt00_121 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Sun May 26 21:48:35 UTC 2019
      [Wed Mar 4 11:52:04 2020] Call Trace:
      [Wed Mar 4 11:52:04 2020] [<ffffffffc177bd9d>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc177caac>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc1af6458>] mdt_object_local_lock+0x438/0xb20 [mdt]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc1af6bb0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc1af6ec0>] mdt_object_lock+0x20/0x30 [mdt]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc1b201e8>] mdt_reint_open+0xc58/0x27d0 [mdt]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc1b13fa3>] mdt_reint_rec+0x83/0x210 [mdt]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc1af21b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc1afe972>] mdt_intent_open+0x82/0x3a0 [mdt]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc1afca18>] mdt_intent_policy+0x2e8/0xd00 [mdt]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc1762d26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc178b587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc1813882>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc18181da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc17bd80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [Wed Mar 4 11:52:04 2020] [<ffffffffc17c113c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [Wed Mar 4 11:52:04 2020] [<ffffffff86cc1c71>] kthread+0xd1/0xe0
      [Wed Mar 4 11:52:04 2020] [<ffffffff87375c37>] ret_from_fork_nospec_end+0x0/0x39
      [Wed Mar 4 11:52:04 2020] [<ffffffffffffffff>] 0xffffffffffffffff
      [Wed Mar 4 11:52:04 2020] LustreError: dumping log to /tmp/lustre-log.1583293996.23963
      [Wed Mar 4 11:52:27 2020] Lustre: 23899:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
      req@ffff9722ee14a880 x1640598478220640/t0(0) o36->f3ac6915-c5e4-e041-09ac-cc9be8e437cc@10.10.12.80@o2ib:134/0 lens 504/2888 e 0 to 0 dl 1583294024 ref 2 fl Interpret:/0/0 rc 0/0
      [Wed Mar 4 11:52:39 2020] Lustre: 23887:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
      req@ffff970b04b8cc80 x1659681631099232/t0(0) o36->31d8584f-45e2-51a6-2860-a2e01fed9e33@10.10.11.102@o2ib:146/0 lens 536/2888 e 0 to 0 dl 1583294036 ref 2 fl Interpret:/0/0 rc 0/0
      [Wed Mar 4 11:52:39 2020] Lustre: 23887:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
      [Wed Mar 4 11:52:47 2020] LNet: Service thread pid 23933 was inactive for 757.61s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [Wed Mar 4 11:52:47 2020] Pid: 23933, comm: mdt01_098 3.10.0-957.10.1.el7_lustre.x86_64 #1 SMP Sun May 26 21:48:35 UTC 2019
      [Wed Mar 4 11:52:47 2020] Call Trace:
      [Wed Mar 4 11:52:47 2020] [<ffffffffc177bd9d>] ldlm_completion_ast+0x63d/0x920 [ptlrpc]
      [Wed Mar 4 11:52:47 2020] [<ffffffffc177caac>] ldlm_cli_enqueue_local+0x23c/0x870 [ptlrpc]
      [Wed Mar 4 11:52:47 2020] [<ffffffffc1b08e7b>] mdt_rename_lock+0x24b/0x4b0 [mdt]
      [Wed Mar 4 11:52:47 2020] [<ffffffffc1b0b055>] mdt_reint_rename+0x2d5/0x28e0 [mdt]
      [Wed Mar 4 11:52:47 2020] [<ffffffffc1b13fa3>] mdt_reint_rec+0x83/0x210 [mdt]
      [Wed Mar 4 11:52:47 2020] [<ffffffffc1af21b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [Wed Mar 4 11:52:47 2020] [<ffffffffc1afd497>] mdt_reint+0x67/0x140 [mdt]
      [Wed Mar 4 11:52:47 2020] [<ffffffffc18181da>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [Wed Mar 4 11:52:47 2020] [<ffffffffc17bd80b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [Wed Mar 4 11:52:47 2020] [<ffffffffc17c113c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
      [Wed Mar 4 11:52:47 2020] [<ffffffff86cc1c71>] kthread+0xd1/0xe0
      [Wed Mar 4 11:52:47 2020] [<ffffffff87375c37>] ret_from_fork_nospec_end+0x0/0x39
      [Wed Mar 4 11:52:47 2020] [<ffffffffffffffff>] 0xffffffffffffffff
      [Wed Mar 4 11:52:47 2020] LustreError: dumping log to /tmp/lustre-log.1583294040.23933
      [Wed Mar 4 11:53:06 2020] Lustre: 22294:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      req@ffff9740f7c8e300 x1658466851517344/t0(0) o36->6df80542-1d99-efd6-1a0d-51dc49211bc2@10.10.11.119@o2ib:173/0 lens 544/2888 e 1 to 0 dl 1583294063 ref 2 fl Interpret:/0/0 rc 0/0

       

      reboot master mds or take over the master mds can solve this issue.

      Any help would be appreciated. Thanks!

      Attachments

        1. lustre-log.1583293917.22287.decode
          15.43 MB
          Xiao Zhenggang
        2. lustre-log.1583293996.23963.decode
          1.02 MB
          Xiao Zhenggang
        3. lustre-log.1583294040.23933.decode
          615 kB
          Xiao Zhenggang

        Activity

          People

            wc-triage WC Triage
            xiaozg Xiao Zhenggang
            Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated: