[LU-14611] racer test 1 hangs in ls/locking Created: 13/Apr/21 Updated: 16/Mar/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.7, Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | James A Simmons |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | ORNL |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
There are a variety of racer test_1 hangs that look similar to existing tickets, but whose call traces don't quite match any of them. I'm opening this ticket to capture the racer hangs that don't match an existing ticket but all show similar call traces on the MDS. The root cause may well be the same as one of the existing tickets, even though the traces look different. All of these failures have call traces similar to the ones below.

For Lustre 2.14.51, CentOS 8.3 client/server, no DNE, at https://testing.whamcloud.com/test_sets/205c9572-7e2a-4a12-ac67-0e717b44ee7c

On the MDS console:

[66773.230283] Lustre: mdt00_030: service thread pid 3250671 was inactive for 63.465 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[66773.230306] Pid: 3250667, comm: mdt00_027 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Tue Mar 23 05:51:49 UTC 2021
[66773.233745] Lustre: Skipped 1 previous similar message
[66773.235770] Call Trace TBD:
[66773.236421] [<0>] ldlm_completion_ast+0x789/0x8e0 [ptlrpc]
[66773.238637] [<0>] ldlm_cli_enqueue_local+0x2f9/0x830 [ptlrpc]
[66773.239982] [<0>] mdt_object_local_lock+0x506/0xb00 [mdt]
[66773.241078] [<0>] mdt_object_lock_internal+0x183/0x430 [mdt]
[66773.242236] [<0>] mdt_getattr_name_lock+0x843/0x1a00 [mdt]
[66773.243362] [<0>] mdt_intent_getattr+0x260/0x430 [mdt]
[66773.244433] [<0>] mdt_intent_opc+0x44d/0xa80 [mdt]
[66773.245423] [<0>] mdt_intent_policy+0x1f6/0x380 [mdt]
[66773.246510] [<0>] ldlm_lock_enqueue+0x4c1/0x9f0 [ptlrpc]
[66773.247761] [<0>] ldlm_handle_enqueue0+0x61a/0x16d0 [ptlrpc]
[66773.248962] [<0>] tgt_enqueue+0xa4/0x1f0 [ptlrpc]
[66773.250007] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[66773.251146] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[66773.252430] [<0>] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
[66773.253459] [<0>] kthread+0x112/0x130
[66773.254221] [<0>] ret_from_fork+0x35/0x40

On the OSS console:

[66664.337262] Pid: 3425654, comm: ll_ost_io00_007 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Tue Mar 23 05:51:49 UTC 2021
[66664.340289] Call Trace TBD:
[66664.341777] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
[66664.343472] [<0>] osd_trans_start+0x50c/0x530 [osd_ldiskfs]
[66664.345227] [<0>] ofd_commitrw_write+0x5bf/0x1990 [ofd]
[66664.346282] [<0>] ofd_commitrw+0x30e/0x970 [ofd]
[66664.348215] [<0>] tgt_brw_write+0x11f6/0x21b0 [ptlrpc]
[66664.349686] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
[66664.351150] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
[66664.353326] [<0>] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
[66664.354606] [<0>] kthread+0x112/0x130
[66664.355857] [<0>] ret_from_fork+0x35/0x40

We see something similar for interop testing:

[ 959.417633] format at mdt_io.c:215:mdt_rw_hpreq_check doesn't end in newline
[ 1026.377566] Lustre: mdt00_030: service thread pid 14229 was inactive for 64.034 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 1026.377589] Lustre: mdt00_018: service thread pid 14206 was inactive for 64.037 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 1026.377609] Pid: 14206, comm: mdt00_018 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
[ 1026.377610] Call Trace:
[ 1026.379313] [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 1026.379445] [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
[ 1026.379939] [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
[ 1026.379991] [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[ 1026.380005] [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
[ 1026.380017] [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 1026.380050] [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
[ 1026.380063] [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
[ 1026.380116] [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[ 1026.380190] [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[ 1026.380676] [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 1026.380773] [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 1026.380822] [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 1026.380885] [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 1026.381081] [<ffffffffa32c50d1>] kthread+0xd1/0xe0
[ 1026.381185] [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
[ 1026.382196] [<ffffffffffffffff>] 0xffffffffffffffff
[ 1026.382222] Pid: 14204, comm: mdt00_016 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
[ 1026.382222] Call Trace:
[ 1026.382265] [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 1026.382297] [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
[ 1026.382311] [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
[ 1026.382323] [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[ 1026.382336] [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
[ 1026.382348] [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 1026.382360] [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
[ 1026.382372] [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
[ 1026.382403] [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[ 1026.382437] [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[ 1026.382483] [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 1026.382533] [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 1026.382573] [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 1026.382610] [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 1026.382614] [<ffffffffa32c50d1>] kthread+0xd1/0xe0
[ 1026.382616] [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
[ 1026.382624] [<ffffffffffffffff>] 0xffffffffffffffff
[ 1026.382631] Pid: 14198, comm: mdt00_010 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
[ 1026.382631] Call Trace:
[ 1026.382671] [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 1026.382704] [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
[ 1026.382717] [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
[ 1026.382730] [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
[ 1026.382742] [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
[ 1026.382754] [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
[ 1026.382767] [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
[ 1026.382778] [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
[ 1026.382810] [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
[ 1026.382852] [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
[ 1026.382893] [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 1026.382935] [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 1026.382972] [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 1026.383031] [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 1026.383036] [<ffffffffa32c50d1>] kthread+0xd1/0xe0
[ 1026.383039] [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
[ 1026.383045] [<ffffffffffffffff>] 0xffffffffffffffff
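If this reproduces again, it may help to capture the LDLM state and full task stacks on the hung MDS before recovering it, so the blocked enqueue can be matched to whichever thread holds the conflicting lock. A minimal collection sketch, assuming lctl is available on the server; the /tmp file names are placeholders:

# Dump the Lustre kernel debug buffer before it wraps
lctl dk /tmp/lustre-mds-debug.log

# Snapshot LDLM namespace state (lock and resource counts per namespace)
lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.resource_count

# Dump all task stacks to the kernel log for comparison with the watchdog traces
echo t > /proc/sysrq-trigger
dmesg > /tmp/mds-task-stacks.log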
| Comments |
| Comment by James A Simmons [ 06/Mar/23 ] |
|
We also have a production system, not using DNE or DoM, that is now showing this issue. The patch for LU-16389 was recently added to our clients, which brought this problem to the surface. I don't blame the LU-16389 patch as the source, since this problem has been around for a long time.

Our settings are at_max = 600 and timeout = 100.

Any suggestions to work around this?
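Not a fix for the underlying lock-up, but for reference, a sketch of how these tunables are usually checked and adjusted with lctl; the values shown are just the ones quoted above, not recommendations:

# Show the current adaptive-timeout ceiling and obd timeout
lctl get_param at_max timeout

# Adjust them temporarily on a running node
lctl set_param at_max=600
lctl set_param timeout=100

# On clients, dropping unused LDLM locks sometimes clears a lock pile-up
lctl set_param ldlm.namespaces.*.lru_size=clear

Raising at_max/timeout only gives the watchdog and clients more patience; it won't resolve the blocked ldlm_completion_ast itself.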