[LU-15148] Service thread pid 6006 was inactive for 200.10s; filesystem hangs for specific MDT but eventually works Created: 22/Oct/21 Updated: 06/Nov/21 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Gian-Carlo Defazio | Assignee: | Oleg Drokin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
TOSS 3.7-11 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
A user was first unable to ls his home directory. ls hung for a long time, but eventually worked. He then had the same issue with a sub-directory called MDT3, which is set to have its metadata on MDT3. After the initial ls, which took several minutes, subsequent ls attempts returned quickly. The host for MDT3 is jet4, and jet4 was the only server generating the following log messages. Here's the stack trace from the first time it happened:

2021-10-20 12:06:19 [3552825.135826] LNet: Service thread pid 21159 was inactive for 200.13s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2021-10-20 12:06:19 [3552825.156754] Pid: 21159, comm: mdt01_059 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
2021-10-20 12:06:19 [3552825.168961] Call Trace:
2021-10-20 12:06:19 [3552825.172738] [<ffffffffc13352e1>] top_trans_wait_result+0xa9/0x158 [ptlrpc]
2021-10-20 12:06:19 [3552825.181613] [<ffffffffc13156d9>] top_trans_stop+0x4e9/0xa70 [ptlrpc]
2021-10-20 12:06:19 [3552825.189848] [<ffffffffc16bee2c>] lod_trans_stop+0x25c/0x340 [lod]
2021-10-20 12:06:19 [3552825.197769] [<ffffffffc1771b5e>] mdd_trans_stop+0x2e/0x174 [mdd]
2021-10-20 12:06:19 [3552825.205594] [<ffffffffc1755b99>] mdd_create+0x11a9/0x14a0 [mdd]
2021-10-20 12:06:19 [3552825.213311] [<ffffffffc160fc54>] mdt_create+0xb54/0x1090 [mdt]
2021-10-20 12:06:19 [3552825.220936] [<ffffffffc16102fb>] mdt_reint_create+0x16b/0x360 [mdt]
2021-10-20 12:06:19 [3552825.229030] [<ffffffffc1612db3>] mdt_reint_rec+0x83/0x210 [mdt]
2021-10-20 12:06:19 [3552825.236714] [<ffffffffc15ee4b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
2021-10-20 12:06:19 [3552825.244990] [<ffffffffc15f9cc7>] mdt_reint+0x67/0x140 [mdt]
2021-10-20 12:06:19 [3552825.252289] [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
2021-10-20 12:06:19 [3552825.260986] [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2021-10-20 12:06:19 [3552825.270534] [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
2021-10-20 12:06:19 [3552825.278536] [<ffffffffba8cb201>] kthread+0xd1/0xe0
2021-10-20 12:06:19 [3552825.284919] [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
2021-10-20 12:06:19 [3552825.292863] [<ffffffffffffffff>] 0xffffffffffffffff
After the first time, the stack trace changed to:

2021-10-21 16:27:57 [3654921.183865] LNet: Service thread pid 6006 was inactive for 200.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2021-10-21 16:27:57 [3654921.204706] Pid: 6006, comm: mdt00_034 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
2021-10-21 16:27:57 [3654921.216836] Call Trace:
2021-10-21 16:27:57 [3654921.220584] [<ffffffffc1262650>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
2021-10-21 16:27:57 [3654921.229352] [<ffffffffc12638c1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
2021-10-21 16:27:57 [3654921.238377] [<ffffffffc15f331b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
2021-10-21 16:27:57 [3654921.246994] [<ffffffffc15f39a0>] mdt_object_lock_internal+0x70/0x360 [mdt]
2021-10-21 16:27:57 [3654921.255788] [<ffffffffc15f4371>] mdt_getattr_name_lock+0x121/0x1df0 [mdt]
2021-10-21 16:27:57 [3654921.264477] [<ffffffffc15fc5c5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
2021-10-21 16:27:57 [3654921.272769] [<ffffffffc15f17ea>] mdt_intent_opc+0x1ba/0xb50 [mdt]
2021-10-21 16:27:57 [3654921.280652] [<ffffffffc15f9aa4>] mdt_intent_policy+0x1a4/0x360 [mdt]
2021-10-21 16:27:57 [3654921.288834] [<ffffffffc1249586>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
2021-10-21 16:27:57 [3654921.297322] [<ffffffffc1271176>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
2021-10-21 16:27:57 [3654921.306178] [<ffffffffc12fbe72>] tgt_enqueue+0x62/0x210 [ptlrpc]
2021-10-21 16:27:57 [3654921.313963] [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
2021-10-21 16:27:57 [3654921.322597] [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2021-10-21 16:27:57 [3654921.332123] [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
2021-10-21 16:27:57 [3654921.340086] [<ffffffffba8cb201>] kthread+0xd1/0xe0
2021-10-21 16:27:57 [3654921.346479] [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
2021-10-21 16:27:57 [3654921.354429] [<ffffffffffffffff>] 0xffffffffffffffff

There are a number of these in the logs from the last few days, and the processes still exist with their stacks stuck in the same state as the dump.
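As a quick check, the kernel stack of a stuck service thread can be read directly on jet4, e.g. for the pid in the dump above:

    # On jet4: the kernel stack of the stuck mdt00_034 thread (pid 6006 from the dump above)
    # should still show the same frames as the LNet stack dump
    cat /proc/6006/stack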
The problem was noticed this morning, and the full logs for that ls attempt are:

2021-10-22 08:09:57 [3711440.339065] LNet: Service thread pid 21278 was inactive for 200.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2021-10-22 08:09:57 [3711440.360090] Pid: 21278, comm: mdt00_123 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
2021-10-22 08:09:57 [3711440.372373] Call Trace:
2021-10-22 08:09:57 [3711440.376201] [<ffffffffc1262650>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
2021-10-22 08:09:57 [3711440.385068] [<ffffffffc12638c1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
2021-10-22 08:09:57 [3711440.394166] [<ffffffffc15f331b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
2021-10-22 08:09:57 [3711440.402898] [<ffffffffc15f39a0>] mdt_object_lock_internal+0x70/0x360 [mdt]
2021-10-22 08:09:57 [3711440.411797] [<ffffffffc15f4371>] mdt_getattr_name_lock+0x121/0x1df0 [mdt]
2021-10-22 08:09:57 [3711440.420608] [<ffffffffc15fc5c5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
2021-10-22 08:09:57 [3711440.429022] [<ffffffffc15f17ea>] mdt_intent_opc+0x1ba/0xb50 [mdt]
2021-10-22 08:09:57 [3711440.437045] [<ffffffffc15f9aa4>] mdt_intent_policy+0x1a4/0x360 [mdt]
2021-10-22 08:09:57 [3711440.445338] [<ffffffffc1249586>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
2021-10-22 08:09:57 [3711440.453922] [<ffffffffc1271176>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
2021-10-22 08:09:57 [3711440.462899] [<ffffffffc12fbe72>] tgt_enqueue+0x62/0x210 [ptlrpc]
2021-10-22 08:09:57 [3711440.470790] [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
2021-10-22 08:09:57 [3711440.479509] [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2021-10-22 08:09:57 [3711440.489097] [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
2021-10-22 08:09:57 [3711440.497117] [<ffffffffba8cb201>] kthread+0xd1/0xe0
2021-10-22 08:09:57 [3711440.503545] [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
2021-10-22 08:09:57 [3711440.511516] [<ffffffffffffffff>] 0xffffffffffffffff
2021-10-22 08:09:57 [3711440.518045] LustreError: dumping log to /tmp/lustre-log.1634915397.21278
2021-10-22 08:11:36 [3711539.849250] LustreError: 21278:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1634915196, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lquake-MDT0003_UUID lock: ffff9f5eade79b00/0xc43735e83ac402f0 lrc: 3/1,0 mode: --/PR res: [0x4c00c7f52:0x5:0x0].0x0 bits 0x12/0x0 rrc: 22 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21278 timeout: 0 lvb_type: 0
2021-10-22 08:16:31 [3711834.827788] Lustre: 21121:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
2021-10-22 08:16:31 [3711834.827788] req@ffff9f25e8a75100 x1714259348772800/t0(0) o101->98d2fd76-058a-227d-d9be-65ffdb9bb578@192.168.128.124@o2ib18:291/0 lens 576/3272 e 22 to 0 dl 1634915796 ref 2 fl Interpret:/0/0 rc 0/0
There are some similar messages on jet4 from before 2021-10-20, but far fewer and less frequent. There are very few such messages on the other machines in jet; most have none this month.
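For reference, the MDT3 sub-directory's placement on MDT index 3 can be confirmed with lfs; a minimal sketch, with an illustrative path (this is how such a directory is typically created and checked, not necessarily how it was created here):

    # Show which MDT holds the directory's metadata (path is illustrative)
    lfs getdirstripe /path/to/home/MDT3

    # A directory pinned to MDT index 3 is normally created like this
    lfs mkdir -i 3 /path/to/home/MDT3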
|
| Comments |
| Comment by Oleg Drokin [ 25/Oct/21 ] |
|
Just to confirm, when you say "After the first time, the stack trace changed to" does that mean the very first stack trace is always the one in the create->trans stop path? |
| Comment by Gian-Carlo Defazio [ 25/Oct/21 ] |
|
What do you mean by the create->trans stop path?

Here's me grepping for stack dumps in October:

2021-10-01 02:24:35 [1876351.651648] LustreError: dumping log to /tmp/lustre-log.1633080275.21241
2021-10-01 15:40:12 [1924088.038346] LustreError: dumping log to /tmp/lustre-log.1633128012.12376
2021-10-04 14:46:28 [2180059.422390] LustreError: dumping log to /tmp/lustre-log.1633383988.31179
2021-10-05 04:12:51 [2228441.432712] LustreError: dumping log to /tmp/lustre-log.1633432371.19181
2021-10-20 12:06:19 [3552825.299378] LustreError: dumping log to /tmp/lustre-log.1634756779.21159
2021-10-21 01:33:05 [3601230.956954] LustreError: dumping log to /tmp/lustre-log.1634805185.21311
2021-10-21 01:34:45 [3601330.752850] LustreError: dumping log to /tmp/lustre-log.1634805285.21311
2021-10-21 07:47:19 [3623683.794004] LustreError: dumping log to /tmp/lustre-log.1634827639.16605
2021-10-21 08:21:17 [3625722.539271] LustreError: dumping log to /tmp/lustre-log.1634829677.16601
2021-10-21 08:58:26 [3627950.722608] LustreError: dumping log to /tmp/lustre-log.1634831906.21124
2021-10-21 09:41:07 [3630512.212157] LustreError: dumping log to /tmp/lustre-log.1634834467.31952
2021-10-21 12:45:17 [3641561.995392] LustreError: dumping log to /tmp/lustre-log.1634845517.21154
2021-10-21 12:57:34 [3642298.747487] LustreError: dumping log to /tmp/lustre-log.1634846254.21276
2021-10-21 14:34:54 [3648138.512019] LustreError: dumping log to /tmp/lustre-log.1634852094.21240
2021-10-21 16:27:57 [3654921.360927] LustreError: dumping log to /tmp/lustre-log.1634858877.6006
2021-10-22 08:09:57 [3711440.518045] LustreError: dumping log to /tmp/lustre-log.1634915397.21278
2021-10-22 08:39:53 [3713236.067411] LustreError: dumping log to /tmp/lustre-log.1634917192.16602
2021-10-22 09:18:54 [3715577.400409] LustreError: dumping log to /tmp/lustre-log.1634919534.21261

By first stack trace I mean the one that occurs at 2021-10-20 12:06:19, which is the first stack trace shown in the description. Starting at 2021-10-21 01:33:05, that stack trace and all subsequent traces look like the second stack trace shown in the description. |
| Comment by Oleg Drokin [ 27/Oct/21 ] |
|
So can we assume you have a way to reliably hit this? Would it be feasible to increase debug levels on the client and servers just before this first access happens, and then dump the Lustre debug logs when it completes? It's strange that there are no evictions or anything in the logs, and the lock being held is a server-side lock. The extra debug levels we want are: rpctrace dlmtrace ha vfstrace
Please also increase debug_mb to 1000, and once done, dump the log with lctl dk. Ideally you'll do this on all MDS and OST servers, and on the client that initiated that first request.
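In rough terms, on each MDS/OSS and on the client, assuming the standard lctl parameters (the output path is only an example):

    # Before reproducing the slow ls: enable the extra debug flags
    lctl set_param debug=+rpctrace
    lctl set_param debug=+dlmtrace
    lctl set_param debug=+ha
    lctl set_param debug=+vfstrace
    # Enlarge the debug buffer and start from an empty one
    lctl set_param debug_mb=1000
    lctl clear

    # After the slow ls completes: dump the debug buffer to a file
    lctl dk /tmp/lustre-debug-$(hostname).log
|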
| Comment by Gian-Carlo Defazio [ 27/Oct/21 ] |
|
For my records, the local ticket is TOSS-5346 |
| Comment by Gian-Carlo Defazio [ 28/Oct/21 ] |
|
We reran the tests that we believe triggered the first error, but were unable to reproduce it. The differences were that the debug levels rpctrace dlmtrace ha vfstrace were set, and that a node that had failed over had been restored. That failover happened over a month ago, well before 2021-10-20 when this issue seems to have started. |