Lustre / LU-15148

Service thread pid 6006 was inactive for 200.10s; filesystem hangs for specific MDT but eventually works


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.7
    • Environment: TOSS 3.7-11
      3.10.0-1160.36.2.1chaos.ch6.x86_64
      lustre 2.12.7_2.llnl
      zfs-0.7.11-9.8llnl.ch6.x86_64
    • Severity: 3

    Description

      A user was initially unable to ls his home directory: the ls hung for a long time but eventually completed. He then had the same problem with a sub-directory called MDT3, which is configured to have its metadata on MDT3. After the initial ls, which took several minutes, subsequent ls attempts returned quickly.
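
      For reference, the directory's MDT placement can be checked with lfs getdirstripe, and a directory pinned to a specific MDT index is normally created with lfs mkdir -i. A minimal sketch; the mount point /p/lquake and the path are illustrative, not the user's actual ones:

      # show the MDT layout of the directory (illustrative path)
      $ lfs getdirstripe /p/lquake/<user>/MDT3
      # a directory whose metadata lives on MDT index 3 is typically created with:
      $ lfs mkdir -i 3 /p/lquake/<user>/MDT3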

      The host for MDT3 is jet4, and jet4 was the only server generating the following log messages.
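
      The target-to-server mapping can be confirmed by listing the Lustre devices on the server itself; a quick check, assuming direct access to jet4:

      # on jet4: list local Lustre devices; lquake-MDT0003 should appear as an mdt device
      $ lctl dl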

      Here's the stack trace from the first time it happened:

      2021-10-20 12:06:19 [3552825.135826] LNet: Service thread pid 21159 was inactive for 200.13s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2021-10-20 12:06:19 [3552825.156754] Pid: 21159, comm: mdt01_059 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
      2021-10-20 12:06:19 [3552825.168961] Call Trace:
      2021-10-20 12:06:19 [3552825.172738]  [<ffffffffc13352e1>] top_trans_wait_result+0xa9/0x158 [ptlrpc]
      2021-10-20 12:06:19 [3552825.181613]  [<ffffffffc13156d9>] top_trans_stop+0x4e9/0xa70 [ptlrpc]
      2021-10-20 12:06:19 [3552825.189848]  [<ffffffffc16bee2c>] lod_trans_stop+0x25c/0x340 [lod]
      2021-10-20 12:06:19 [3552825.197769]  [<ffffffffc1771b5e>] mdd_trans_stop+0x2e/0x174 [mdd]
      2021-10-20 12:06:19 [3552825.205594]  [<ffffffffc1755b99>] mdd_create+0x11a9/0x14a0 [mdd]
      2021-10-20 12:06:19 [3552825.213311]  [<ffffffffc160fc54>] mdt_create+0xb54/0x1090 [mdt]
      2021-10-20 12:06:19 [3552825.220936]  [<ffffffffc16102fb>] mdt_reint_create+0x16b/0x360 [mdt]
      2021-10-20 12:06:19 [3552825.229030]  [<ffffffffc1612db3>] mdt_reint_rec+0x83/0x210 [mdt]
      2021-10-20 12:06:19 [3552825.236714]  [<ffffffffc15ee4b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      2021-10-20 12:06:19 [3552825.244990]  [<ffffffffc15f9cc7>] mdt_reint+0x67/0x140 [mdt]
      2021-10-20 12:06:19 [3552825.252289]  [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      2021-10-20 12:06:19 [3552825.260986]  [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      2021-10-20 12:06:19 [3552825.270534]  [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
      2021-10-20 12:06:19 [3552825.278536]  [<ffffffffba8cb201>] kthread+0xd1/0xe0
      2021-10-20 12:06:19 [3552825.284919]  [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
      2021-10-20 12:06:19 [3552825.292863]  [<ffffffffffffffff>] 0xffffffffffffffff

       

      After the first time, the stack trace changed to:

      2021-10-21 16:27:57 [3654921.183865] LNet: Service thread pid 6006 was inactive for 200.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2021-10-21 16:27:57 [3654921.204706] Pid: 6006, comm: mdt00_034 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
      2021-10-21 16:27:57 [3654921.216836] Call Trace:
      2021-10-21 16:27:57 [3654921.220584]  [<ffffffffc1262650>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
      2021-10-21 16:27:57 [3654921.229352]  [<ffffffffc12638c1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      2021-10-21 16:27:57 [3654921.238377]  [<ffffffffc15f331b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      2021-10-21 16:27:57 [3654921.246994]  [<ffffffffc15f39a0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      2021-10-21 16:27:57 [3654921.255788]  [<ffffffffc15f4371>] mdt_getattr_name_lock+0x121/0x1df0 [mdt]
      2021-10-21 16:27:57 [3654921.264477]  [<ffffffffc15fc5c5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      2021-10-21 16:27:57 [3654921.272769]  [<ffffffffc15f17ea>] mdt_intent_opc+0x1ba/0xb50 [mdt]
      2021-10-21 16:27:57 [3654921.280652]  [<ffffffffc15f9aa4>] mdt_intent_policy+0x1a4/0x360 [mdt]
      2021-10-21 16:27:57 [3654921.288834]  [<ffffffffc1249586>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
      2021-10-21 16:27:57 [3654921.297322]  [<ffffffffc1271176>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
      2021-10-21 16:27:57 [3654921.306178]  [<ffffffffc12fbe72>] tgt_enqueue+0x62/0x210 [ptlrpc]
      2021-10-21 16:27:57 [3654921.313963]  [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      2021-10-21 16:27:57 [3654921.322597]  [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      2021-10-21 16:27:57 [3654921.332123]  [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
      2021-10-21 16:27:57 [3654921.340086]  [<ffffffffba8cb201>] kthread+0xd1/0xe0
      2021-10-21 16:27:57 [3654921.346479]  [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
      2021-10-21 16:27:57 [3654921.354429]  [<ffffffffffffffff>] 0xffffffffffffffff

      There are a bunch of these in the logs from the last few days, and the processes still exist; their stacks are still stuck in the same state as the dumps.
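
      If it helps, the live kernel stacks of those threads can be re-checked directly on jet4 (a sketch, using pid 6006 from the trace above):

      # dump the current kernel stack of the stuck service thread
      $ cat /proc/6006/stack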

        

      The problem was noticed this morning, and the full logs for that ls attempt are:

      2021-10-22 08:09:57 [3711440.339065] LNet: Service thread pid 21278 was inactive for 200.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2021-10-22 08:09:57 [3711440.360090] Pid: 21278, comm: mdt00_123 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
      2021-10-22 08:09:57 [3711440.372373] Call Trace:
      2021-10-22 08:09:57 [3711440.376201]  [<ffffffffc1262650>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
      2021-10-22 08:09:57 [3711440.385068]  [<ffffffffc12638c1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      2021-10-22 08:09:57 [3711440.394166]  [<ffffffffc15f331b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      2021-10-22 08:09:57 [3711440.402898]  [<ffffffffc15f39a0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      2021-10-22 08:09:57 [3711440.411797]  [<ffffffffc15f4371>] mdt_getattr_name_lock+0x121/0x1df0 [mdt]
      2021-10-22 08:09:57 [3711440.420608]  [<ffffffffc15fc5c5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      2021-10-22 08:09:57 [3711440.429022]  [<ffffffffc15f17ea>] mdt_intent_opc+0x1ba/0xb50 [mdt]
      2021-10-22 08:09:57 [3711440.437045]  [<ffffffffc15f9aa4>] mdt_intent_policy+0x1a4/0x360 [mdt]
      2021-10-22 08:09:57 [3711440.445338]  [<ffffffffc1249586>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
      2021-10-22 08:09:57 [3711440.453922]  [<ffffffffc1271176>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
      2021-10-22 08:09:57 [3711440.462899]  [<ffffffffc12fbe72>] tgt_enqueue+0x62/0x210 [ptlrpc]
      2021-10-22 08:09:57 [3711440.470790]  [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      2021-10-22 08:09:57 [3711440.479509]  [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      2021-10-22 08:09:57 [3711440.489097]  [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
      2021-10-22 08:09:57 [3711440.497117]  [<ffffffffba8cb201>] kthread+0xd1/0xe0
      2021-10-22 08:09:57 [3711440.503545]  [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
      2021-10-22 08:09:57 [3711440.511516]  [<ffffffffffffffff>] 0xffffffffffffffff
      2021-10-22 08:09:57 [3711440.518045] LustreError: dumping log to /tmp/lustre-log.1634915397.21278
      2021-10-22 08:11:36 [3711539.849250] LustreError: 21278:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1634915196, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lquake-MDT0003_UUID lock: ffff9f5eade79b00/0xc43735e83ac402f0 lrc: 3/1,0 mode: --/PR res: [0x4c00c7f52:0x5:0x0].0x0 bits 0x12/0x0 rrc: 22 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21278 timeout: 0 lvb_type: 0
      2021-10-22 08:16:31 [3711834.827788] Lustre: 21121:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      2021-10-22 08:16:31 [3711834.827788]   req@ffff9f25e8a75100 x1714259348772800/t0(0) o101->98d2fd76-058a-227d-d9be-65ffdb9bb578@192.168.128.124@o2ib18:291/0 lens 576/3272 e 22 to 0 dl 1634915796 ref 2 fl Interpret:/0/0 rc 0/0
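
      Two follow-ups that might help with triage, sketched with values from the log above: the binary debug dump can be converted to text with lctl debug_file, and the contended lock resource (a FID) can be mapped back to a pathname with lfs fid2path. The client mount point /p/lquake is an assumption:

      # on jet4: convert the binary Lustre debug dump to readable text
      $ lctl debug_file /tmp/lustre-log.1634915397.21278 /tmp/lustre-log.1634915397.21278.txt
      # on a client: resolve the contended resource FID from the lock-timeout message
      $ lfs fid2path /p/lquake "[0x4c00c7f52:0x5:0x0]"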
      

       

      There are some similar messages on jet4 from before 2021-10-20, but not many and not as frequent. Other servers in jet show very few such messages; most have none this month.
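
      For comparing frequency across the jet servers, something like the following could be used (the host range and log path are assumptions):

      # count hung-thread warnings per server (host list is illustrative)
      $ pdsh -w jet[1-8] "grep -c 'Service thread pid .* was inactive' /var/log/messages" | sort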

       

       

      Attachments

        Activity

          People

            Assignee: Oleg Drokin (green)
            Reporter: Gian-Carlo Defazio (defazio)