[LU-15148] Service thread pid 6006 was inactive for 200.10s; filesystem hangs for specific MDT but eventually works Created: 22/Oct/21  Updated: 06/Nov/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Gian-Carlo Defazio Assignee: Oleg Drokin
Resolution: Unresolved Votes: 0
Labels: llnl
Environment:

TOSS 3.7-11
3.10.0-1160.36.2.1chaos.ch6.x86_64
lustre 2.12.7_2.llnl
zfs-0.7.11-9.8llnl.ch6.x86_64


Attachments: File console.jet4    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

A user was initially unable to ls his home directory: the ls hung for a long time but eventually completed. He then hit the same issue on a sub-directory named MDT3, which is configured to have its metadata on MDT3. After the initial ls, which took several minutes, subsequent ls attempts returned quickly.

The host for MDT3 is jet4, and jet4 was the only server generating the following log messages.
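
For context, the directory's MDT placement can be confirmed from a client with something like the following (the path is illustrative, not the user's actual directory):

# show the directory's MDT/stripe layout
lfs getdirstripe /path/to/home/MDT3
# a directory pinned to MDT3 like this is typically created with
lfs mkdir -i 3 /path/to/home/MDT3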

Here's the stack trace from the first time it happened.

2021-10-20 12:06:19 [3552825.135826] LNet: Service thread pid 21159 was inactive for 200.13s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2021-10-20 12:06:19 [3552825.156754] Pid: 21159, comm: mdt01_059 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
2021-10-20 12:06:19 [3552825.168961] Call Trace:
2021-10-20 12:06:19 [3552825.172738]  [<ffffffffc13352e1>] top_trans_wait_result+0xa9/0x158 [ptlrpc]
2021-10-20 12:06:19 [3552825.181613]  [<ffffffffc13156d9>] top_trans_stop+0x4e9/0xa70 [ptlrpc]
2021-10-20 12:06:19 [3552825.189848]  [<ffffffffc16bee2c>] lod_trans_stop+0x25c/0x340 [lod]
2021-10-20 12:06:19 [3552825.197769]  [<ffffffffc1771b5e>] mdd_trans_stop+0x2e/0x174 [mdd]
2021-10-20 12:06:19 [3552825.205594]  [<ffffffffc1755b99>] mdd_create+0x11a9/0x14a0 [mdd]
2021-10-20 12:06:19 [3552825.213311]  [<ffffffffc160fc54>] mdt_create+0xb54/0x1090 [mdt]
2021-10-20 12:06:19 [3552825.220936]  [<ffffffffc16102fb>] mdt_reint_create+0x16b/0x360 [mdt]
2021-10-20 12:06:19 [3552825.229030]  [<ffffffffc1612db3>] mdt_reint_rec+0x83/0x210 [mdt]
2021-10-20 12:06:19 [3552825.236714]  [<ffffffffc15ee4b3>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
2021-10-20 12:06:19 [3552825.244990]  [<ffffffffc15f9cc7>] mdt_reint+0x67/0x140 [mdt]
2021-10-20 12:06:19 [3552825.252289]  [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
2021-10-20 12:06:19 [3552825.260986]  [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2021-10-20 12:06:19 [3552825.270534]  [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
2021-10-20 12:06:19 [3552825.278536]  [<ffffffffba8cb201>] kthread+0xd1/0xe0
2021-10-20 12:06:19 [3552825.284919]  [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
2021-10-20 12:06:19 [3552825.292863]  [<ffffffffffffffff>] 0xffffffffffffffff

 

After the first time, the stack trace changed to:

2021-10-21 16:27:57 [3654921.183865] LNet: Service thread pid 6006 was inactive for 200.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2021-10-21 16:27:57 [3654921.204706] Pid: 6006, comm: mdt00_034 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
2021-10-21 16:27:57 [3654921.216836] Call Trace:
2021-10-21 16:27:57 [3654921.220584]  [<ffffffffc1262650>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
2021-10-21 16:27:57 [3654921.229352]  [<ffffffffc12638c1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
2021-10-21 16:27:57 [3654921.238377]  [<ffffffffc15f331b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
2021-10-21 16:27:57 [3654921.246994]  [<ffffffffc15f39a0>] mdt_object_lock_internal+0x70/0x360 [mdt]
2021-10-21 16:27:57 [3654921.255788]  [<ffffffffc15f4371>] mdt_getattr_name_lock+0x121/0x1df0 [mdt]
2021-10-21 16:27:57 [3654921.264477]  [<ffffffffc15fc5c5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
2021-10-21 16:27:57 [3654921.272769]  [<ffffffffc15f17ea>] mdt_intent_opc+0x1ba/0xb50 [mdt]
2021-10-21 16:27:57 [3654921.280652]  [<ffffffffc15f9aa4>] mdt_intent_policy+0x1a4/0x360 [mdt]
2021-10-21 16:27:57 [3654921.288834]  [<ffffffffc1249586>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
2021-10-21 16:27:57 [3654921.297322]  [<ffffffffc1271176>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
2021-10-21 16:27:57 [3654921.306178]  [<ffffffffc12fbe72>] tgt_enqueue+0x62/0x210 [ptlrpc]
2021-10-21 16:27:57 [3654921.313963]  [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
2021-10-21 16:27:57 [3654921.322597]  [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2021-10-21 16:27:57 [3654921.332123]  [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
2021-10-21 16:27:57 [3654921.340086]  [<ffffffffba8cb201>] kthread+0xd1/0xe0
2021-10-21 16:27:57 [3654921.346479]  [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
2021-10-21 16:27:57 [3654921.354429]  [<ffffffffffffffff>] 0xffffffffffffffff

There are a number of these in the logs from the last few days; the processes still exist and their stacks remain stuck in the same state as the dump.
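
For what it's worth, the stuck state can be confirmed on jet4 by reading a thread's kernel stack directly, e.g. for the pid in the trace above:

# stack still matches the dumped trace (waiting in ldlm_completion_ast)
cat /proc/6006/stack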

  

The problem was noticed this morning, and the full logs for that ls attempt are:

2021-10-22 08:09:57 [3711440.339065] LNet: Service thread pid 21278 was inactive for 200.48s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2021-10-22 08:09:57 [3711440.360090] Pid: 21278, comm: mdt00_123 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
2021-10-22 08:09:57 [3711440.372373] Call Trace:
2021-10-22 08:09:57 [3711440.376201]  [<ffffffffc1262650>] ldlm_completion_ast+0x440/0x870 [ptlrpc]
2021-10-22 08:09:57 [3711440.385068]  [<ffffffffc12638c1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
2021-10-22 08:09:57 [3711440.394166]  [<ffffffffc15f331b>] mdt_object_local_lock+0x50b/0xb20 [mdt]
2021-10-22 08:09:57 [3711440.402898]  [<ffffffffc15f39a0>] mdt_object_lock_internal+0x70/0x360 [mdt]
2021-10-22 08:09:57 [3711440.411797]  [<ffffffffc15f4371>] mdt_getattr_name_lock+0x121/0x1df0 [mdt]
2021-10-22 08:09:57 [3711440.420608]  [<ffffffffc15fc5c5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
2021-10-22 08:09:57 [3711440.429022]  [<ffffffffc15f17ea>] mdt_intent_opc+0x1ba/0xb50 [mdt]
2021-10-22 08:09:57 [3711440.437045]  [<ffffffffc15f9aa4>] mdt_intent_policy+0x1a4/0x360 [mdt]
2021-10-22 08:09:57 [3711440.445338]  [<ffffffffc1249586>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
2021-10-22 08:09:57 [3711440.453922]  [<ffffffffc1271176>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
2021-10-22 08:09:57 [3711440.462899]  [<ffffffffc12fbe72>] tgt_enqueue+0x62/0x210 [ptlrpc]
2021-10-22 08:09:57 [3711440.470790]  [<ffffffffc130266a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
2021-10-22 08:09:57 [3711440.479509]  [<ffffffffc12a55fb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2021-10-22 08:09:57 [3711440.489097]  [<ffffffffc12a99ed>] ptlrpc_main+0xc4d/0x2280 [ptlrpc]
2021-10-22 08:09:57 [3711440.497117]  [<ffffffffba8cb201>] kthread+0xd1/0xe0
2021-10-22 08:09:57 [3711440.503545]  [<ffffffffbafc3ff7>] ret_from_fork_nospec_end+0x0/0x39
2021-10-22 08:09:57 [3711440.511516]  [<ffffffffffffffff>] 0xffffffffffffffff
2021-10-22 08:09:57 [3711440.518045] LustreError: dumping log to /tmp/lustre-log.1634915397.21278
2021-10-22 08:11:36 [3711539.849250] LustreError: 21278:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1634915196, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-lquake-MDT0003_UUID lock: ffff9f5eade79b00/0xc43735e83ac402f0 lrc: 3/1,0 mode: --/PR res: [0x4c00c7f52:0x5:0x0].0x0 bits 0x12/0x0 rrc: 22 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 21278 timeout: 0 lvb_type: 0
2021-10-22 08:16:31 [3711834.827788] Lustre: 21121:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
2021-10-22 08:16:31 [3711834.827788]   req@ffff9f25e8a75100 x1714259348772800/t0(0) o101->98d2fd76-058a-227d-d9be-65ffdb9bb578@192.168.128.124@o2ib18:291/0 lens 576/3272 e 22 to 0 dl 1634915796 ref 2 fl Interpret:/0/0 rc 0/0
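
If it's useful, the resource FID from the lock-timeout line above can be translated back to a pathname on a client that has the filesystem mounted (the filesystem name lquake is taken from the log):

# map the locked resource back to a path
lfs fid2path lquake '[0x4c00c7f52:0x5:0x0]'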

 

There are some similar messages on jet4 from before 2021-10-20, but not many and not as frequent. There are very few similar messages on the other machines in jet; most have none this month.

 

 



 Comments   
Comment by Oleg Drokin [ 25/Oct/21 ]

Just to confirm, when you say "After the first time, the stack trace changed to" does that mean the very first stack trace is always the one in the create->trans stop path?

Comment by Gian-Carlo Defazio [ 25/Oct/21 ]

What do you mean by create->trans stop path?

Here's me grepping for stack dumps in October

2021-10-01 02:24:35 [1876351.651648] LustreError: dumping log to /tmp/lustre-log.1633080275.21241
2021-10-01 15:40:12 [1924088.038346] LustreError: dumping log to /tmp/lustre-log.1633128012.12376
2021-10-04 14:46:28 [2180059.422390] LustreError: dumping log to /tmp/lustre-log.1633383988.31179
2021-10-05 04:12:51 [2228441.432712] LustreError: dumping log to /tmp/lustre-log.1633432371.19181
2021-10-20 12:06:19 [3552825.299378] LustreError: dumping log to /tmp/lustre-log.1634756779.21159
2021-10-21 01:33:05 [3601230.956954] LustreError: dumping log to /tmp/lustre-log.1634805185.21311
2021-10-21 01:34:45 [3601330.752850] LustreError: dumping log to /tmp/lustre-log.1634805285.21311
2021-10-21 07:47:19 [3623683.794004] LustreError: dumping log to /tmp/lustre-log.1634827639.16605
2021-10-21 08:21:17 [3625722.539271] LustreError: dumping log to /tmp/lustre-log.1634829677.16601
2021-10-21 08:58:26 [3627950.722608] LustreError: dumping log to /tmp/lustre-log.1634831906.21124
2021-10-21 09:41:07 [3630512.212157] LustreError: dumping log to /tmp/lustre-log.1634834467.31952
2021-10-21 12:45:17 [3641561.995392] LustreError: dumping log to /tmp/lustre-log.1634845517.21154
2021-10-21 12:57:34 [3642298.747487] LustreError: dumping log to /tmp/lustre-log.1634846254.21276
2021-10-21 14:34:54 [3648138.512019] LustreError: dumping log to /tmp/lustre-log.1634852094.21240
2021-10-21 16:27:57 [3654921.360927] LustreError: dumping log to /tmp/lustre-log.1634858877.6006
2021-10-22 08:09:57 [3711440.518045] LustreError: dumping log to /tmp/lustre-log.1634915397.21278
2021-10-22 08:39:53 [3713236.067411] LustreError: dumping log to /tmp/lustre-log.1634917192.16602
2021-10-22 09:18:54 [3715577.400409] LustreError: dumping log to /tmp/lustre-log.1634919534.21261
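
For reference, the listing above comes from a grep along these lines against the attached console log (the exact file path may differ):

grep 'LustreError: dumping log to' console.jet4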

By the first stack trace I mean the one that occurs at 2021-10-20 12:06:19, which is the first stack trace shown in the description.

Starting at 2021-10-21 01:33:05, that stack trace and all subsequent traces look like the second stack trace shown in the description.

Comment by Oleg Drokin [ 27/Oct/21 ]

So can we assume you have a way to reliably hit this? Would it be feasible to increase debug levels on the client and servers just before this first access happens and then dump lustre debug logs when complete?

It's strange that there are no evictions or anything in the logs, and the lock being held is a server-side lock.

The extra debug levels we want are: rpctrace dlmtrace ha vfstrace

 

Please also increase debug_mb to 1000, and once done, dump the log with lctl dk. Ideally you'll do this on all MDS and OST servers and on the client that initiated that first request.
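
In other words, something along these lines on each MDS/OSS and on the client (the dump file name is just an example):

# add the requested debug flags to the current mask
lctl set_param debug=+rpctrace
lctl set_param debug=+dlmtrace
lctl set_param debug=+ha
lctl set_param debug=+vfstrace
# enlarge the debug buffer to 1000 MB
lctl set_param debug_mb=1000
# ... reproduce the hanging ls ...
# then dump the accumulated debug log on each node
lctl dk > /tmp/lustre-debug-$(hostname).log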

Comment by Gian-Carlo Defazio [ 27/Oct/21 ]

For my records, the local ticket is TOSS-5346

Comment by Gian-Carlo Defazio [ 28/Oct/21 ]

We reran the tests that we believe triggered the first error, but were unable to reproduce it.

The differences were that the debug levels rpctrace dlmtrace ha vfstrace were set, and that a node that had failed over had been restored. That failover happened over a month ago, well before 2021-10-20, when this issue seems to have started.
