Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14611

racer test 1 hangs in ls/locking

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.12.7, Lustre 2.15.0
    • 3
    • 9223372036854775807

    Description

      There are a variety of racer test_1 hangs that look familiar to existing tickets, but the call traces don’t match. I’m opening this ticket to capture these racer hangs that don’t match existing tickets but all have similar call traces on the MDS. The root cause of this/these issues may be the same as existing tickets, but look different.

      All of these tests have similar call traces to the following tickets:
      Not LU-11751 because there are not dom locks in the traces.
      Not LU-12037 because not using DNE and don’t see osp_md locks in traces
      Not LU-10852 because no lod_object nor osp_md locks in traces
      Similar to:
      Not LU-11359 because no mdt_dom_discard_data in call trace
      Not LU-11358 because not using DNE

      For Lustre 2.14.51 CentOS 8.3 client/server no DNE at https://testing.whamcloud.com/test_sets/205c9572-7e2a-4a12-ac67-0e717b44ee7c

      [66773.230283] Lustre: mdt00_030: service thread pid 3250671 was inactive for 63.465 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [66773.230306] Pid: 3250667, comm: mdt00_027 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Tue Mar 23 05:51:49 UTC 2021
      [66773.233745] Lustre: Skipped 1 previous similar message
      [66773.235770] Call Trace TBD:
      [66773.236421] [<0>] ldlm_completion_ast+0x789/0x8e0 [ptlrpc]
      [66773.238637] [<0>] ldlm_cli_enqueue_local+0x2f9/0x830 [ptlrpc]
      [66773.239982] [<0>] mdt_object_local_lock+0x506/0xb00 [mdt]
      [66773.241078] [<0>] mdt_object_lock_internal+0x183/0x430 [mdt]
      [66773.242236] [<0>] mdt_getattr_name_lock+0x843/0x1a00 [mdt]
      [66773.243362] [<0>] mdt_intent_getattr+0x260/0x430 [mdt]
      [66773.244433] [<0>] mdt_intent_opc+0x44d/0xa80 [mdt]
      [66773.245423] [<0>] mdt_intent_policy+0x1f6/0x380 [mdt]
      [66773.246510] [<0>] ldlm_lock_enqueue+0x4c1/0x9f0 [ptlrpc]
      [66773.247761] [<0>] ldlm_handle_enqueue0+0x61a/0x16d0 [ptlrpc]
      [66773.248962] [<0>] tgt_enqueue+0xa4/0x1f0 [ptlrpc]
      [66773.250007] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
      [66773.251146] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
      [66773.252430] [<0>] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
      [66773.253459] [<0>] kthread+0x112/0x130
      [66773.254221] [<0>] ret_from_fork+0x35/0x40
      

      On the OSS console

      [66664.337262] Pid: 3425654, comm: ll_ost_io00_007 4.18.0-240.1.1.el8_lustre.x86_64 #1 SMP Tue Mar 23 05:51:49 UTC 2021
      [66664.340289] Call Trace TBD:
      [66664.341777] [<0>] libcfs_call_trace+0x6f/0x90 [libcfs]
      [66664.343472] [<0>] osd_trans_start+0x50c/0x530 [osd_ldiskfs]
      [66664.345227] [<0>] ofd_commitrw_write+0x5bf/0x1990 [ofd]
      [66664.346282] [<0>] ofd_commitrw+0x30e/0x970 [ofd]
      [66664.348215] [<0>] tgt_brw_write+0x11f6/0x21b0 [ptlrpc]
      [66664.349686] [<0>] tgt_request_handle+0xc78/0x1910 [ptlrpc]
      [66664.351150] [<0>] ptlrpc_server_handle_request+0x31a/0xba0 [ptlrpc]
      [66664.353326] [<0>] ptlrpc_main+0xba2/0x14a0 [ptlrpc]
      [66664.354606] [<0>] kthread+0x112/0x130
      [66664.355857] [<0>] ret_from_fork+0x35/0x40
      

      We see something similar for interop testing:
      2.14.51.42 clients/2.13.0 servers at https://testing.whamcloud.com/test_sets/b709fcd0-f46b-4347-b768-dfc01efd3131
      MDS console

      [  959.417633] format at mdt_io.c:215:mdt_rw_hpreq_check doesn't end in newline
      [ 1026.377566] Lustre: mdt00_030: service thread pid 14229 was inactive for 64.034 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
      [ 1026.377589] Lustre: mdt00_018: service thread pid 14206 was inactive for 64.037 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 1026.377609] Pid: 14206, comm: mdt00_018 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
      [ 1026.377610] Call Trace:
      [ 1026.379313]  [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      [ 1026.379445]  [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
      [ 1026.379939]  [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
      [ 1026.379991]  [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      [ 1026.380005]  [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
      [ 1026.380017]  [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [ 1026.380050]  [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
      [ 1026.380063]  [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
      [ 1026.380116]  [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
      [ 1026.380190]  [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      [ 1026.380676]  [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [ 1026.380773]  [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
      [ 1026.380822]  [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [ 1026.380885]  [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [ 1026.381081]  [<ffffffffa32c50d1>] kthread+0xd1/0xe0
      [ 1026.381185]  [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
      [ 1026.382196]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 1026.382222] Pid: 14204, comm: mdt00_016 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
      [ 1026.382222] Call Trace:
      [ 1026.382265]  [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      [ 1026.382297]  [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
      [ 1026.382311]  [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
      [ 1026.382323]  [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      [ 1026.382336]  [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
      [ 1026.382348]  [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [ 1026.382360]  [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
      [ 1026.382372]  [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
      [ 1026.382403]  [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
      [ 1026.382437]  [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      [ 1026.382483]  [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [ 1026.382533]  [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
      [ 1026.382573]  [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [ 1026.382610]  [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [ 1026.382614]  [<ffffffffa32c50d1>] kthread+0xd1/0xe0
      [ 1026.382616]  [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
      [ 1026.382624]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 1026.382631] Pid: 14198, comm: mdt00_010 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Thu Dec 5 10:35:21 UTC 2019
      [ 1026.382631] Call Trace:
      [ 1026.382671]  [<ffffffffc0e20bd0>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
      [ 1026.382704]  [<ffffffffc0e21e0c>] ldlm_cli_enqueue_local+0x25c/0x850 [ptlrpc]
      [ 1026.382717]  [<ffffffffc124e833>] mdt_object_local_lock+0x523/0xb50 [mdt]
      [ 1026.382730]  [<ffffffffc124eed0>] mdt_object_lock_internal+0x70/0x360 [mdt]
      [ 1026.382742]  [<ffffffffc125084a>] mdt_getattr_name_lock+0x92a/0x1c90 [mdt]
      [ 1026.382754]  [<ffffffffc1257fe5>] mdt_intent_getattr+0x2b5/0x480 [mdt]
      [ 1026.382767]  [<ffffffffc124ccfa>] mdt_intent_opc+0x1ba/0xb40 [mdt]
      [ 1026.382778]  [<ffffffffc12554b4>] mdt_intent_policy+0x1a4/0x360 [mdt]
      [ 1026.382810]  [<ffffffffc0e07e16>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
      [ 1026.382852]  [<ffffffffc0e30476>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
      [ 1026.382893]  [<ffffffffc0eba032>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [ 1026.382935]  [<ffffffffc0ec282a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
      [ 1026.382972]  [<ffffffffc0e64a86>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [ 1026.383031]  [<ffffffffc0e685bc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [ 1026.383036]  [<ffffffffa32c50d1>] kthread+0xd1/0xe0
      [ 1026.383039]  [<ffffffffa398cd37>] ret_from_fork_nospec_end+0x0/0x39
      [ 1026.383045]  [<ffffffffffffffff>] 0xffffffffffffffff
      

      Attachments

        Issue Links

          Activity

            People

              simmonsja James A Simmons
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated: