LU-15148: Service thread pid 6006 was inactive for 200.10s; filesystem hangs for specific MDT but eventually works

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: Lustre 2.12.7
    • Environment: TOSS 3.7-11
      3.10.0-1160.36.2.1chaos.ch6.x86_64
      lustre 2.12.7_2.llnl
      zfs-0.7.11-9.8llnl.ch6.x86_64
    • Severity: 3

    Description

      A user was first unable to ls his home directory. ls hung for a long time, but eventually worked. He then had the same issue with a sub-directory called MDT3, which is set up to have its metadata on MDT3. After the initial ls, which took several minutes, subsequent ls attempts returned quickly.
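      For reference, a directory whose metadata lives on a specific MDT is normally pinned to that MDT at creation time. A minimal sketch with the standard lfs tools (the path is illustrative; the actual creation command for this directory is not recorded in the ticket):

      # Create a directory whose metadata is served by MDT index 3
      # (illustrative path under the user's home directory).
      lfs mkdir -i 3 $HOME/MDT3

      # Confirm which MDT index holds the directory's metadata.
      lfs getdirstripe $HOME/MDT3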

      The host for MDT3 is jet4, and jet4 was the only server generating the following log messages.

      Here's the stack trace from the first time it happened.

      2021-10-20 12:06:19 [3552825.135826] LNet: Service thread pid 21159 was inactive for 200.13s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2021-10-20 12:06:19 [3552825.156754] Pid: 21159, comm: mdt01_059 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
      2021-10-20 12:06:19 [3552825.168961] Call Trace:
      2021-10-20 12:06:19 [3552825.172738]  [<ffffffffc13352e1>] top_trans_wait_result+0xa9/0x158 [ptlrpc]
      2021-10-20 12:06:19 [3552825.181613]  [<ffffffffc13156d9>] top_trans_stop+0x4e9/0xa70 [ptlrpc]
      2021-10-20 12:06:19 [3552825.189848]  [<ffffffffc16bee2c>] lod_trans_stop+0x25c/0x340 [lod]
      2021-10-20 12:06:19 [3552825.197769]  [<ffffffffc1771b5e>] mdd_trans_stop+0x2e/0x174 [mdd]
      2021-10-20 12:06:19 [3552825.205594]  [<ffffffffc1755b99>] mdd_create+0x11a9/0x14a0 [mdd]
      2021-10-20 12:06:19 [3552825.213311]  [<ffffffffc160fc54>] mdt_create+0xb54/0x1090 [mdt]
      2021-10-20 12:06:19 [3552825.220936]  [<ffffffffc16102fb>] mdt_reint_create+0x16b/0x360 [mdt]
      2021-10-20 12:06:19 [3552825.229030]  [<ffffffffc1612db3>] mdt_reint_rec+0x83/0x210 [mdt]
      2021-10-20 12:06:19 [3552825.236714]  [<ffffffffc15ee4b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      2021-10-20 12:06:19 [3552825.244990]  [<ffffffffc15f9cc7>] mdt_reint+0x67/0x140 [mdt]
      2021-10-20 12:06:19 [3552825.252289]  [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      2021-10-20 12:06:19 [3552825.260986]  [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      2021-10-20 12:06:19 [3552825.270534]  [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
      2021-10-20 12:06:19 [3552825.278536]  [<ffffffffba8cb201>] kthread+0xd1/0xe0
      2021-10-20 12:06:19 [3552825.284919]  [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
      2021-10-20 12:06:19 [3552825.292863]  [<ffffffffffffffff>] 0xffffffffffffffff

       

      After the first time, the stack trace changed to:

      2021-10-21 16:27:57 [3654921.183865] LNet: Service thread pid 6006 was inactive for 200.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2021-10-21 16:27:57 [3654921.204706] Pid: 6006, comm: mdt00_034 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
      2021-10-21 16:27:57 [3654921.216836] Call Trace:
      2021-10-21 16:27:57 [3654921.220584]  [<ffffffffc1262650>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
      2021-10-21 16:27:57 [3654921.229352]  [<ffffffffc12638c1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      2021-10-21 16:27:57 [3654921.238377]  [<ffffffffc15f331b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      2021-10-21 16:27:57 [3654921.246994]  [<ffffffffc15f39a0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      2021-10-21 16:27:57 [3654921.255788]  [<ffffffffc15f4371>] mdt_getattr_name_lock+0x121/0x1df0 [mdt]
      2021-10-21 16:27:57 [3654921.264477]  [<ffffffffc15fc5c5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      2021-10-21 16:27:57 [3654921.272769]  [<ffffffffc15f17ea>] mdt_intent_opc+0x1ba/0xb50 [mdt]
      2021-10-21 16:27:57 [3654921.280652]  [<ffffffffc15f9aa4>] mdt_intent_policy+0x1a4/0x360 [mdt]
      2021-10-21 16:27:57 [3654921.288834]  [<ffffffffc1249586>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
      2021-10-21 16:27:57 [3654921.297322]  [<ffffffffc1271176>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
      2021-10-21 16:27:57 [3654921.306178]  [<ffffffffc12fbe72>] tgt_enqueue+0x62/0x210 [ptlrpc]
      2021-10-21 16:27:57 [3654921.313963]  [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      2021-10-21 16:27:57 [3654921.322597]  [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      2021-10-21 16:27:57 [3654921.332123]  [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
      2021-10-21 16:27:57 [3654921.340086]  [<ffffffffba8cb201>] kthread+0xd1/0xe0
      2021-10-21 16:27:57 [3654921.346479]  [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
      2021-10-21 16:27:57 [3654921.354429]  [<ffffffffffffffff>] 0xffffffffffffffff

      There are a number of these in the logs from the last few days, and the processes still exist, with their stacks still stuck in the same state as the dump.
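      A quick way to confirm that on the MDS, assuming the pid from a dump (e.g. 6006 above) is still present, is to read the thread's kernel stack directly:

      # Show the current kernel stack and wait channel of the stuck service thread
      # (pid 6006 is taken from the trace above).
      cat /proc/6006/stack
      ps -o pid,stat,wchan:32,comm -p 6006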

        

      The problem was noticed this morning, and the full logs for that ls attempt are:

      2021-10-22 08:09:57 [3711440.339065] LNet: Service thread pid 21278 was inactive for 200.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2021-10-22 08:09:57 [3711440.360090] Pid: 21278, comm: mdt00_123 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
      2021-10-22 08:09:57 [3711440.372373] Call Trace:
      2021-10-22 08:09:57 [3711440.376201]  [<ffffffffc1262650>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
      2021-10-22 08:09:57 [3711440.385068]  [<ffffffffc12638c1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
      2021-10-22 08:09:57 [3711440.394166]  [<ffffffffc15f331b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
      2021-10-22 08:09:57 [3711440.402898]  [<ffffffffc15f39a0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      2021-10-22 08:09:57 [3711440.411797]  [<ffffffffc15f4371>] mdt_getattr_name_lock+0x121/0x1df0 [mdt]
      2021-10-22 08:09:57 [3711440.420608]  [<ffffffffc15fc5c5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      2021-10-22 08:09:57 [3711440.429022]  [<ffffffffc15f17ea>] mdt_intent_opc+0x1ba/0xb50 [mdt]
      2021-10-22 08:09:57 [3711440.437045]  [<ffffffffc15f9aa4>] mdt_intent_policy+0x1a4/0x360 [mdt]
      2021-10-22 08:09:57 [3711440.445338]  [<ffffffffc1249586>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
      2021-10-22 08:09:57 [3711440.453922]  [<ffffffffc1271176>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
      2021-10-22 08:09:57 [3711440.462899]  [<ffffffffc12fbe72>] tgt_enqueue+0x62/0x210 [ptlrpc]
      2021-10-22 08:09:57 [3711440.470790]  [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      2021-10-22 08:09:57 [3711440.479509]  [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      2021-10-22 08:09:57 [3711440.489097]  [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
      2021-10-22 08:09:57 [3711440.497117]  [<ffffffffba8cb201>] kthread+0xd1/0xe0
      2021-10-22 08:09:57 [3711440.503545]  [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
      2021-10-22 08:09:57 [3711440.511516]  [<ffffffffffffffff>] 0xffffffffffffffff
      2021-10-22 08:09:57 [3711440.518045] LustreError: dumping log to /tmp/lustre-log.1634915397.21278
      2021-10-22 08:11:36 [3711539.849250] LustreError: 21278:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1634915196, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lquake-MDT0003_UUID lock: ffff9f5eade79b00/0xc43735e83ac402f0 lrc: 3/1,0 mode: --/PR res: [0x4c00c7f52:0x5:0x0].0x0 bits 0x12/0x0 rrc: 22 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21278 timeout: 0 lvb_type: 0
      2021-10-22 08:16:31 [3711834.827788] Lustre: 21121:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      2021-10-22 08:16:31 [3711834.827788]   req@ffff9f25e8a75100 x1714259348772800/t0(0) o101->98d2fd76-058a-227d-d9be-65ffdb9bb578@192.168.128.124@o2ib18:291/0 lens 576/3272 e 22 to 0 dl 1634915796 ref 2 fl Interpret:/0/0 rc 0/0
      

       

      There are some similar messages on jet4 from before 2021-10-20, but not very many and not as frequent. There are very few similar messages on the other machines in jet; most have none this month.

       

       

      Attachments

        Activity


          defazio Gian-Carlo Defazio added a comment -

          We reran the tests that we believe resulted in the first error, but were unable to reproduce it.

          The differences were that the debug levels rpctrace dlmtrace ha vfstrace were set, and that one node that had failed over had since been restored. That failover happened over a month ago, well before 2021-10-20, when this issue seems to have started.

          defazio Gian-Carlo Defazio added a comment -

          For my records, the local ticket is TOSS-5346.
          green Oleg Drokin added a comment -

          So can we assume you have a way to reliably hit this? Would it be feasible to increase debug levels on the client and servers just before this first access happens and then dump lustre debug logs when complete?

          It's strange that there are no evictions or anything in the logs, and the lock being held is a server-side lock.

          The extra debug levels we want are: rpctrace dlmtrace ha vfstrace

           

          Please also increase debug_mb to 1000, and once done, dump the log with lctl dk. Ideally you'll do this on all MDS and OST servers and on the client that initiated that first request.
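          A minimal sketch of what that could look like on each server and on the client, assuming the stock lctl interface in Lustre 2.12 (the output file name is illustrative):

          # Add the requested debug flags on top of the current debug mask.
          lctl set_param debug=+rpctrace
          lctl set_param debug=+dlmtrace
          lctl set_param debug=+ha
          lctl set_param debug=+vfstrace

          # Increase the debug buffer size to 1000 MB.
          lctl set_param debug_mb=1000

          # ...reproduce the hung ls, then dump the accumulated debug log...
          lctl dk /tmp/lustre-debug.$(hostname).$(date +%s).log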


          defazio Gian-Carlo Defazio added a comment -

          What do you mean by the create->trans stop path?

          Here's the result of grepping for stack dumps in October:

          2021-10-01 02:24:35 [1876351.651648] LustreError: dumping log to /tmp/lustre-log.1633080275.21241
          2021-10-01 15:40:12 [1924088.038346] LustreError: dumping log to /tmp/lustre-log.1633128012.12376
          2021-10-04 14:46:28 [2180059.422390] LustreError: dumping log to /tmp/lustre-log.1633383988.31179
          2021-10-05 04:12:51 [2228441.432712] LustreError: dumping log to /tmp/lustre-log.1633432371.19181
          2021-10-20 12:06:19 [3552825.299378] LustreError: dumping log to /tmp/lustre-log.1634756779.21159
          2021-10-21 01:33:05 [3601230.956954] LustreError: dumping log to /tmp/lustre-log.1634805185.21311
          2021-10-21 01:34:45 [3601330.752850] LustreError: dumping log to /tmp/lustre-log.1634805285.21311
          2021-10-21 07:47:19 [3623683.794004] LustreError: dumping log to /tmp/lustre-log.1634827639.16605
          2021-10-21 08:21:17 [3625722.539271] LustreError: dumping log to /tmp/lustre-log.1634829677.16601
          2021-10-21 08:58:26 [3627950.722608] LustreError: dumping log to /tmp/lustre-log.1634831906.21124
          2021-10-21 09:41:07 [3630512.212157] LustreError: dumping log to /tmp/lustre-log.1634834467.31952
          2021-10-21 12:45:17 [3641561.995392] LustreError: dumping log to /tmp/lustre-log.1634845517.21154
          2021-10-21 12:57:34 [3642298.747487] LustreError: dumping log to /tmp/lustre-log.1634846254.21276
          2021-10-21 14:34:54 [3648138.512019] LustreError: dumping log to /tmp/lustre-log.1634852094.21240
          2021-10-21 16:27:57 [3654921.360927] LustreError: dumping log to /tmp/lustre-log.1634858877.6006
          2021-10-22 08:09:57 [3711440.518045] LustreError: dumping log to /tmp/lustre-log.1634915397.21278
          2021-10-22 08:39:53 [3713236.067411] LustreError: dumping log to /tmp/lustre-log.1634917192.16602
          2021-10-22 09:18:54 [3715577.400409] LustreError: dumping log to /tmp/lustre-log.1634919534.21261
          

          By the first stack trace, I mean the one that occurs at 2021-10-20 12:06:19 and is the first stack trace shown in the description.

          Starting at 2021-10-21 01:33:05, that stack trace and all subsequent traces look like the second stack trace shown in the description.
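          The list above could be reproduced with a search along these lines (the console log path is an assumption; adjust it to wherever jet4's console log is kept):

          # List the stack-dump events for October from the server console log
          # (hypothetical path).
          grep 'LustreError: dumping log to' /var/log/conman/console.jet4 | grep '2021-10'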

          green Oleg Drokin added a comment -

          Just to confirm, when you say "After the first time, the stack trace changed to" does that mean the very first stack trace is always the one in the create->trans stop path?


          People

            Assignee: green Oleg Drokin
            Reporter: defazio Gian-Carlo Defazio