Details
- Type: Bug
- Resolution: Won't Fix
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.12.7, Lustre 2.15.0
- Severity: 3
Description
There are a variety of racer test_1 hangs that look familiar from existing tickets, but whose call traces don't match any of them. I'm opening this ticket to capture racer hangs that don't match existing tickets yet all show similar call traces on the MDS. The root cause may be the same as in those tickets, but the symptoms look different.
All of these tests have call traces similar to, but not matching, the following tickets (a triage sketch for checking these markers follows the list):
- Not LU-11751, because there are no DoM locks in the traces.
- Not LU-12037, because DNE is not in use and there are no osp_md locks in the traces.
- Not LU-10852, because there are no lod_object or osp_md locks in the traces.
- Not LU-11359, because there is no mdt_dom_discard_data in the call trace.
- Not LU-11358, because DNE is not in use.
For Lustre 2.14.51, CentOS 8.3 client/server, no DNE, at https://testing.whamcloud.com/test_sets/205c9572-7e2a-4a12-ac67-0e717b44ee7c, the MDS console shows:
[66773.230283] Lustre: mdt00_030: service thread pid 3250671 was inactive for 63.465 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[66773.230306] Pid: 3250667, comm: mdt00_027 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Tue Mar 23 05:51:49 UTC 2021
[66773.233745] Lustre: Skipped 1 previous similar message
[66773.235770] Call Trace TBD:
[66773.236421] [<0>] ldlm_completion_ast+0x789/0x8e0 [ptlrpc]
[66773.238637] [<0>] ldlm_cli_enqueue_local+0x2f9/0x830 [ptlrpc]
[66773.239982] [<0>] mdt_object_local_lock+0x506/0xb00 [mdt]
[66773.241078] [<0>] mdt_object_lock_internal+0x183/0x430 [mdt]
[66773.242236] [<0>] mdt_getattr_name_lock+0x843/0x1a00 [mdt]
[66773.243362] [<0>] mdt_intent_getattr+0x260/0x430 [mdt]
[66773.244433] [<0>] mdt_intent_opc+0x44d/0xa80 [mdt]
[66773.245423] [<0>] mdt_intent_policy+0x1f6/0x380 [mdt]
[66773.246510] [<0>] ldlm_lock_enqueue+0x4c1/0x9f0 [ptlrpc]
[66773.247761] [<0>] ldlm_handle_enqueue0+0x61a/0x16d0 [ptlrpc]
[66773.248962] [<0>] tgt_enqueue+0xa4/0x1f0 [ptlrpc]
[66773.250007] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[66773.251146] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[66773.252430] [<0>] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
[66773.253459] [<0>] kthread+0x112/0x130
[66773.254221] [<0>] ret_from_fork+0x35/0x40
On the OSS console:
[66664.337262] Pid: 3425654, comm: ll_ost_io00_007 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Tue Mar 23 05:51:49 UTC 2021
[66664.340289] Call Trace TBD:
[66664.341777] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[66664.343472] [<0>] osd_trans_start+0x50c/0x530 [osd_ldiskfs]
[66664.345227] [<0>] ofd_commitrw_write+0x5bf/0x1990 [ofd]
[66664.346282] [<0>] ofd_commitrw+0x30e/0x970 [ofd]
[66664.348215] [<0>] tgt_brw_write+0x11f6/0x21b0 [ptlrpc]
[66664.349686] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[66664.351150] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[66664.353326] [<0>] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
[66664.354606] [<0>] kthread+0x112/0x130
[66664.355857] [<0>] ret_from_fork+0x35/0x40
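In both dumps the top frame is the telling one: the MDS threads sit in ldlm_completion_ast waiting for a local lock enqueue, while the OSS thread waits in osd_trans_start to open a transaction. When scanning many console logs for the same pattern, grouping the watchdog stack dumps by their top frame makes this visible quickly. Below is a minimal, hypothetical Python sketch, assuming only the "Pid: ..., comm: ..." / call-trace format shown in the excerpts above; the helper name top_frames is an illustrative invention.

#!/usr/bin/env python3
# Hypothetical helper: tally the first (topmost) frame of each watchdog
# stack dump in a console log, e.g. ldlm_completion_ast vs osd_trans_start.
import re
import sys
from collections import Counter

PID_RE = re.compile(r"Pid: (\d+), comm: (\S+)")
FRAME_RE = re.compile(r"\[<[0-9a-fx]+>\]\s+(\S+)")

def top_frames(path):
    """Return a Counter keyed by the first frame after each 'Pid:' header."""
    counts = Counter()
    want_frame = False
    with open(path, errors="replace") as log:
        for line in log:
            if PID_RE.search(line):
                want_frame = True  # next frame line is the top of this stack
                continue
            if want_frame:
                m = FRAME_RE.search(line)
                if m:
                    counts[m.group(1)] += 1
                    want_frame = False
    return counts

if __name__ == "__main__":
    for frame, n in top_frames(sys.argv[1]).most_common():
        print(f"{n:4d}  {frame}")

Run against the logs above, this would report the MDS dumps clustered on ldlm_completion_ast and the OSS dump on libcfs_call_trace/osd_trans_start.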
We see something similar for interop testing:
For 2.14.51.42 clients / 2.13.0 servers at https://testing.whamcloud.com/test_sets/b709fcd0-f46b-4347-b768-dfc01efd3131, the MDS console shows:
[ 959.417633] format at mdt_io.c:215:mdt_rw_hpreq_check doesn't end in newline
[ 1026.377566] Lustre: mdt00_030: service thread pid 14229 was inactive for 64.034 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 1026.377589] Lustre: mdt00_018: service thread pid 14206 was inactive for 64.037 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 1026.377609] Pid: 14206, comm: mdt00_018 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
[ 1026.377610] Call Trace:
[ 1026.379313] [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 1026.379445] [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
[ 1026.379939] [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
[ 1026.379991] [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[ 1026.380005] [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
[ 1026.380017] [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 1026.380050] [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
[ 1026.380063] [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
[ 1026.380116] [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[ 1026.380190] [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[ 1026.380676] [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 1026.380773] [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 1026.380822] [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 1026.380885] [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 1026.381081] [<ffffffffa32c50d1>] kthread+0xd1/0xe0
[ 1026.381185] [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
[ 1026.382196] [<ffffffffffffffff>] 0xffffffffffffffff
[ 1026.382222] Pid: 14204, comm: mdt00_016 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
[ 1026.382222] Call Trace:
[ 1026.382265] [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 1026.382297] [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
[ 1026.382311] [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
[ 1026.382323] [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[ 1026.382336] [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
[ 1026.382348] [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 1026.382360] [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
[ 1026.382372] [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
[ 1026.382403] [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[ 1026.382437] [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[ 1026.382483] [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 1026.382533] [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 1026.382573] [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 1026.382610] [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 1026.382614] [<ffffffffa32c50d1>] kthread+0xd1/0xe0
[ 1026.382616] [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
[ 1026.382624] [<ffffffffffffffff>] 0xffffffffffffffff
[ 1026.382631] Pid: 14198, comm: mdt00_010 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
[ 1026.382631] Call Trace:
[ 1026.382671] [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 1026.382704] [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
[ 1026.382717] [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
[ 1026.382730] [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[ 1026.382742] [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
[ 1026.382754] [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 1026.382767] [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
[ 1026.382778] [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
[ 1026.382810] [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[ 1026.382852] [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[ 1026.382893] [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 1026.382935] [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 1026.382972] [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 1026.383031] [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 1026.383036] [<ffffffffa32c50d1>] kthread+0xd1/0xe0
[ 1026.383039] [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
[ 1026.383045] [<ffffffffffffffff>] 0xffffffffffffffff