[LU-14611] racer test 1 hangs in ls/locking Created: 13/Apr/21  Updated: 16/Mar/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7, Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: James A Simmons
Resolution: Unresolved Votes: 0
Labels: ORNL

Issue Links:
Related
is related to LU-11358 racer test 1 hangs in locking with DNE Open
is related to LU-12354 MDT threads stuck at ldlm_completion_ast Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There are a variety of racer test_1 hangs that look similar to existing tickets, but whose call traces don't quite match. I'm opening this ticket to capture racer hangs that don't match existing tickets but share similar call traces on the MDS. The root cause may be the same as in the existing tickets, but the symptoms look different.

All of these tests have call traces similar to those in the following tickets, but none is an exact match:
Not LU-11751, because there are no DoM locks in the traces.
Not LU-12037, because DNE is not in use and no osp_md locks appear in the traces.
Not LU-10852, because no lod_object or osp_md locks appear in the traces.
Not LU-11359, because mdt_dom_discard_data does not appear in the call traces.
Not LU-11358, because DNE is not in use.
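
For reference, a minimal sketch of how racer test 1 is typically launched when reproducing locally; the checkout path and DURATION value are illustrative assumptions, not taken from the sessions below:

# Hedged reproduction sketch: assumes a lustre-release checkout with the
# standard test framework already configured and the filesystem mounted.
cd lustre/tests
# Run only racer test 1; DURATION (seconds) is an illustrative value.
ONLY=1 DURATION=900 bash racer.sh
# Alternatively, via auster:
#   ./auster -v racer --only 1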

For Lustre 2.14.51 with CentOS 8.3 clients and servers, no DNE, at https://testing.whamcloud.com/test_sets/205c9572-7e2a-4a12-ac67-0e717b44ee7c

On the MDS console

[66773.230283] Lustre: mdt00_030: service thread pid 3250671 was inactive for 63.465 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[66773.230306] Pid: 3250667, comm: mdt00_027 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Tue Mar 23 05:51:49 UTC 2021
[66773.233745] Lustre: Skipped 1 previous similar message
[66773.235770] Call Trace TBD:
[66773.236421] [<0>] ldlm_completion_ast+0x789/0x8e0 [ptlrpc]
[66773.238637] [<0>] ldlm_cli_enqueue_local+0x2f9/0x830 [ptlrpc]
[66773.239982] [<0>] mdt_object_local_lock+0x506/0xb00 [mdt]
[66773.241078] [<0>] mdt_object_lock_internal+0x183/0x430 [mdt]
[66773.242236] [<0>] mdt_getattr_name_lock+0x843/0x1a00 [mdt]
[66773.243362] [<0>] mdt_intent_getattr+0x260/0x430 [mdt]
[66773.244433] [<0>] mdt_intent_opc+0x44d/0xa80 [mdt]
[66773.245423] [<0>] mdt_intent_policy+0x1f6/0x380 [mdt]
[66773.246510] [<0>] ldlm_lock_enqueue+0x4c1/0x9f0 [ptlrpc]
[66773.247761] [<0>] ldlm_handle_enqueue0+0x61a/0x16d0 [ptlrpc]
[66773.248962] [<0>] tgt_enqueue+0xa4/0x1f0 [ptlrpc]
[66773.250007] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[66773.251146] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[66773.252430] [<0>] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
[66773.253459] [<0>] kthread+0x112/0x130
[66773.254221] [<0>] ret_from_fork+0x35/0x40

On the OSS console

[66664.337262] Pid: 3425654, comm: ll_ost_io00_007 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Tue Mar 23 05:51:49 UTC 2021
[66664.340289] Call Trace TBD:
[66664.341777] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[66664.343472] [<0>] osd_trans_start+0x50c/0x530 [osd_ldiskfs]
[66664.345227] [<0>] ofd_commitrw_write+0x5bf/0x1990 [ofd]
[66664.346282] [<0>] ofd_commitrw+0x30e/0x970 [ofd]
[66664.348215] [<0>] tgt_brw_write+0x11f6/0x21b0 [ptlrpc]
[66664.349686] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[66664.351150] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[66664.353326] [<0>] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
[66664.354606] [<0>] kthread+0x112/0x130
[66664.355857] [<0>] ret_from_fork+0x35/0x40

We see something similar in interop testing with 2.14.51.42 clients and 2.13.0 servers at https://testing.whamcloud.com/test_sets/b709fcd0-f46b-4347-b768-dfc01efd3131
MDS console

[  959.417633] format at mdt_io.c:215:mdt_rw_hpreq_check doesn't end in newline
[ 1026.377566] Lustre: mdt00_030: service thread pid 14229 was inactive for 64.034 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 1026.377589] Lustre: mdt00_018: service thread pid 14206 was inactive for 64.037 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 1026.377609] Pid: 14206, comm: mdt00_018 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
[ 1026.377610] Call Trace:
[ 1026.379313]  [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 1026.379445]  [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
[ 1026.379939]  [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
[ 1026.379991]  [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[ 1026.380005]  [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
[ 1026.380017]  [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 1026.380050]  [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
[ 1026.380063]  [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
[ 1026.380116]  [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[ 1026.380190]  [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[ 1026.380676]  [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 1026.380773]  [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 1026.380822]  [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 1026.380885]  [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 1026.381081]  [<ffffffffa32c50d1>] kthread+0xd1/0xe0
[ 1026.381185]  [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
[ 1026.382196]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 1026.382222] Pid: 14204, comm: mdt00_016 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
[ 1026.382222] Call Trace:
[ 1026.382265]  [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 1026.382297]  [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
[ 1026.382311]  [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
[ 1026.382323]  [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[ 1026.382336]  [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
[ 1026.382348]  [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 1026.382360]  [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
[ 1026.382372]  [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
[ 1026.382403]  [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[ 1026.382437]  [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[ 1026.382483]  [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 1026.382533]  [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 1026.382573]  [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 1026.382610]  [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 1026.382614]  [<ffffffffa32c50d1>] kthread+0xd1/0xe0
[ 1026.382616]  [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
[ 1026.382624]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 1026.382631] Pid: 14198, comm: mdt00_010 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
[ 1026.382631] Call Trace:
[ 1026.382671]  [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 1026.382704]  [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
[ 1026.382717]  [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
[ 1026.382730]  [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[ 1026.382742]  [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
[ 1026.382754]  [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 1026.382767]  [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
[ 1026.382778]  [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
[ 1026.382810]  [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[ 1026.382852]  [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[ 1026.382893]  [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 1026.382935]  [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 1026.382972]  [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 1026.383031]  [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 1026.383036]  [<ffffffffa32c50d1>] kthread+0xd1/0xe0
[ 1026.383039]  [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
[ 1026.383045]  [<ffffffffffffffff>] 0xffffffffffffffff
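
When one of these hangs is caught live, the following can help capture extra state from the MDS before recovery. This is a hedged sketch of standard lctl/sysrq usage; the output file name is an illustrative assumption.

# Count LDLM locks per namespace to see where they are piling up.
lctl get_param ldlm.namespaces.*.lock_count
# Dump the Lustre kernel debug buffer for later analysis.
lctl dk /tmp/mds-debug.log
# Dump stack traces of all kernel threads to the console/dmesg; this
# captures every blocked mdt00_* service thread in one shot.
echo t > /proc/sysrq-trigger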


 Comments   
Comment by James A Simmons [ 06/Mar/23 ]

We also have a production system, not using DNE or DoM, that is now showing this issue. The patch for LU-16389 was recently added to our clients, which brought this problem to the surface. I don't blame the LU-16389 patch as the source, since this problem has been around for a long time. Our settings are:

at_max = 600

timeout = 100

Any suggestions to work around this?
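
For reference, a sketch of how those values are usually inspected and adjusted at runtime. Whether at_max is reachable through lctl or only as a ptlrpc module parameter depends on the Lustre version, so treat this as an assumption rather than a prescription:

# Check current values; on some versions at_max lives under
# /sys/module/ptlrpc/parameters rather than the lctl parameter tree.
lctl get_param timeout at_max
# Set at runtime (the values are the ones quoted above).
lctl set_param timeout=100
lctl set_param at_max=600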
