Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.11.0, Lustre 2.12.0, Lustre 2.12.6
-
3
-
9223372036854775807
Description
racer test_1 hangs. Looking at the logs at https://testing.hpdd.intel.com/test_sets/49a29a5e-2f2f-11e8-9e0e-52540065bddc, the last lines in the suite_log are
Waited 950, rc=3 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND ./file_exec.sh: fork: retry: No child processes ./file_exec.sh: fork: retry: No child processes ./file_exec.sh: fork: retry: No child processes ./file_exec.sh: fork: retry: No child processes ./file_exec.sh: fork: retry: No child processes ./file_exec.sh: fork: retry: No child processes ./
On the MDS, for MDT1 and MDT3, console logs, we see
[37477.133999] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-8vm1,trevis-8vm2 DURATION=900 ============================== 22:31:21 (1521783081) [37538.184449] Lustre: lustre-MDT0001: Client ea0330a2-74e9-672c-10d5-7c42a043d466 (at 10.9.4.82@tcp) reconnecting [37538.184457] Lustre: Skipped 1 previous similar message [37538.184519] Lustre: lustre-MDT0001: Connection restored to db24b367-e4c2-8934-3f7e-5f07540bffb1 (at 10.9.4.82@tcp) [37538.184520] Lustre: Skipped 48 previous similar messages [37568.843505] Lustre: lustre-MDT0003: Client 6439aead-19f9-dcaa-af11-fb13e38eca3a (at 10.9.4.81@tcp) reconnecting [37589.728065] LNet: Service thread pid 14633 was inactive for 62.10s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [37589.728073] Pid: 14633, comm: mdt00_004 [37589.728073] [37589.728073] Call Trace: [37589.757797] [<ffffffff81612a95>] schedule+0x35/0x80 [37589.757802] [<ffffffff81615851>] schedule_timeout+0x161/0x2d0 [37589.772170] [<ffffffffa0c76e3e>] ldlm_completion_ast+0x51e/0x860 [ptlrpc] [37589.772254] [<ffffffffa0c78e15>] ldlm_cli_enqueue_fini+0x905/0xd80 [ptlrpc] [37589.772290] [<ffffffffa0c7b119>] ldlm_cli_enqueue+0x3e9/0x8c0 [ptlrpc] [37589.772343] [<ffffffffa12694a2>] osp_md_object_lock+0x142/0x290 [osp] [37589.772411] [<ffffffffa0ea58fa>] lod_object_lock+0xda/0x910 [lod] [37589.772498] [<ffffffffa1125d3d>] mdt_remote_object_lock_try+0x1dd/0x6e0 [mdt] [37589.772519] [<ffffffffa112750b>] mdt_object_lock_internal+0x15b/0x2f0 [mdt] [37589.772534] [<ffffffffa11276bb>] mdt_object_lock+0x1b/0x20 [mdt] [37589.772549] [<ffffffffa1172a91>] mdt_lock_objects_in_linkea+0x711/0x9f6 [mdt] [37589.772568] [<ffffffffa113be6c>] mdt_reint_migrate_internal.isra.39+0x85c/0x1670 [mdt] [37589.772583] [<ffffffffa113ce85>] mdt_reint_rename_or_migrate.isra.40+0x205/0x7c0 [mdt] [37589.772597] [<ffffffffa1141351>] mdt_reint_rec+0x71/0x1e0 [mdt] [37589.772610] [<ffffffffa112259b>] mdt_reint_internal+0x59b/0x960 [mdt] [37589.772629] [<ffffffffa112cd31>] mdt_reint+0x61/0x120 [mdt] [37589.772673] [<ffffffffa0d0a44a>] tgt_request_handle+0x8ca/0x1330 [ptlrpc] [37589.772722] [<ffffffffa0cb38f8>] ptlrpc_server_handle_request+0x258/0xa60 [ptlrpc] [37589.772771] [<ffffffffa0cb7780>] ptlrpc_main+0xa40/0x1da0 [ptlrpc] [37589.773283] [<ffffffff8109e8b9>] kthread+0xc9/0xe0 [37589.777689] [<ffffffff81616f45>] ret_from_fork+0x55/0x80 [37589.806205] (null)DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 [37589.806205] [37589.806206] (null)Leftover inexact backtrace: [37589.806206] [37589.806215] [<ffffffff8109e7f0>] ? kthread+0x0/0xe0 [37589.806215] [37589.806221] LustreError: dumping log to /tmp/lustre-log.1521783194.14633 [37590.051214] Pid: 14662, comm: mdt00_027 [37590.051221] [37590.051221] Call Trace: [37590.051237] [<ffffffff81612a95>] schedule+0x35/0x80 [37590.051242] [<ffffffff81615851>] schedule_timeout+0x161/0x2d0 [37590.051464] [<ffffffffa0c76e3e>] ldlm_completion_ast+0x51e/0x860 [ptlrpc] [37590.051541] [<ffffffffa0c78e15>] ldlm_cli_enqueue_fini+0x905/0xd80 [ptlrpc] [37590.051597] [<ffffffffa0c7b119>] ldlm_cli_enqueue+0x3e9/0x8c0 [ptlrpc] [37590.051621] [<ffffffffa12694a2>] osp_md_object_lock+0x142/0x290 [osp] [37590.051654] [<ffffffffa0ea58fa>] lod_object_lock+0xda/0x910 [lod] [37590.051691] [<ffffffffa1125d3d>] mdt_remote_object_lock_try+0x1dd/0x6e0 [mdt] [37590.051714] [<ffffffffa1126269>] mdt_remote_object_lock+0x29/0x30 [mdt] [37590.051736] [<ffffffffa113cf75>] mdt_reint_rename_or_migrate.isra.40+0x2f5/0x7c0 [mdt] [37590.051766] [<ffffffffa1141351>] mdt_reint_rec+0x71/0x1e0 [mdt] [37590.051787] [<ffffffffa112259b>] mdt_reint_internal+0x59b/0x960 [mdt] [37590.051817] [<ffffffffa112cd31>] mdt_reint+0x61/0x120 [mdt] [37590.051887] [<ffffffffa0d0a44a>] tgt_request_handle+0x8ca/0x1330 [ptlrpc] [37590.051954] [<ffffffffa0cb38f8>] ptlrpc_server_handle_request+0x258/0xa60 [ptlrpc] [37590.052044] [<ffffffffa0cb7780>] ptlrpc_main+0xa40/0x1da0 [ptlrpc] [37590.052054] [<ffffffff8109e8b9>] kthread+0xc9/0xe0 [37590.052060] [<ffffffff81616f45>] ret_from_fork+0x55/0x80 [37590.055121] (null)DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 [37590.055121]
On the console lgs for the other MDS, with MDT0 and MDT2, we see
[37506.637291] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-8vm1,trevis-8vm2 DURATION=900 ============================== 22:31:21 (1521783081) [37514.969720] Lustre: lustre-MDT0002: Client 92f575f3-55a8-8c4a-7087-ea3eadf4f6ae (at 10.9.4.81@tcp) reconnecting [37518.255252] Lustre: lustre-MDT0002: Client b6648c6a-1d39-4917-499d-0676dd06ca4a (at 10.9.4.82@tcp) reconnecting [37518.255258] Lustre: Skipped 1 previous similar message [37519.663174] Lustre: lustre-MDT0000: Client b6648c6a-1d39-4917-499d-0676dd06ca4a (at 10.9.4.82@tcp) reconnecting [37519.957482] LustreError: 20792:0:(mdt_open.c:1204:mdt_cross_open()) lustre-MDT0002: [0x280000405:0x74:0x0] doesn't exist!: rc = -14 [37562.012928] Lustre: lustre-MDT0000: Client 92f575f3-55a8-8c4a-7087-ea3eadf4f6ae (at 10.9.4.81@tcp) reconnecting [37573.001921] Lustre: lustre-MDT0000: Client ea0330a2-74e9-672c-10d5-7c42a043d466 (at 10.9.4.82@tcp) reconnecting [37573.001924] Lustre: Skipped 3 previous similar messages [37619.168062] LNet: Service thread pid 20790 was inactive for 62.01s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: [37619.168072] Pid: 20790, comm: mdt00_024 [37619.168072] [37619.168072] Call Trace: [37619.190751] [<ffffffff81612a95>] schedule+0x35/0x80 [37619.190757] [<ffffffff81615851>] schedule_timeout+0x161/0x2d0 [37619.191048] [<ffffffffa0977e3e>] ldlm_completion_ast+0x51e/0x860 [ptlrpc] [37619.191133] [<ffffffffa0978f31>] ldlm_cli_enqueue_local+0x251/0x830 [ptlrpc] [37619.191228] [<ffffffffa111a26e>] mdt_reint_rename_or_migrate.isra.40+0x5ee/0x7c0 [mdt] [37619.191261] [<ffffffffa111e351>] mdt_reint_rec+0x71/0x1e0 [mdt] [37619.191274] [<ffffffffa10ff59b>] mdt_reint_internal+0x59b/0x960 [mdt] [37619.191299] [<ffffffffa1109d31>] mdt_reint+0x61/0x120 [mdt] [37619.191344] [<ffffffffa0a0b44a>] tgt_request_handle+0x8ca/0x1330 [ptlrpc] [37619.191393] [<ffffffffa09b48f8>] ptlrpc_server_handle_request+0x258/0xa60 [ptlrpc] [37619.191436] [<ffffffffa09b8780>] ptlrpc_main+0xa40/0x1da0 [ptlrpc] [37619.191453] [<ffffffff8109e8b9>] kthread+0xc9/0xe0 [37619.191468] [<ffffffff81616f45>] ret_from_fork+0x55/0x80 [37619.194162] (null)DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 [37619.194163] [37619.194163] (null)Leftover inexact backtrace: [37619.194163] [37619.194169] [<ffffffff8109e7f0>] ? kthread+0x0/0xe0 [37619.194169] [37619.194176] LustreError: dumping log to /tmp/lustre-log.1521783194.20790 [37619.251541] Pid: 19230, comm: mdt00_001 [37619.251547] [37619.251547] Call Trace: [37619.251562] [<ffffffff81612a95>] schedule+0x35/0x80 [37619.251566] [<ffffffff81615851>] schedule_timeout+0x161/0x2d0 [37619.251643] [<ffffffffa0977e3e>] ldlm_completion_ast+0x51e/0x860 [ptlrpc] [37619.251691] [<ffffffffa0979e15>] ldlm_cli_enqueue_fini+0x905/0xd80 [ptlrpc] [37619.251727] [<ffffffffa097c119>] ldlm_cli_enqueue+0x3e9/0x8c0 [ptlrpc] [37619.251767] [<ffffffffa12f34a2>] osp_md_object_lock+0x142/0x290 [osp] [37619.251835] [<ffffffffa120a8fa>] lod_object_lock+0xda/0x910 [lod] [37619.251871] [<ffffffffa1102d3d>] mdt_remote_object_lock_try+0x1dd/0x6e0 [mdt] [37619.251885] [<ffffffffa1103269>] mdt_remote_object_lock+0x29/0x30 [mdt] [37619.251898] [<ffffffffa1119f75>] mdt_reint_rename_or_migrate.isra.40+0x2f5/0x7c0 [mdt] [37619.251913] [<ffffffffa111e351>] mdt_reint_rec+0x71/0x1e0 [mdt] [37619.251926] [<ffffffffa10ff59b>] mdt_reint_internal+0x59b/0x960 [mdt] [37619.251945] [<ffffffffa1109d31>] mdt_reint+0x61/0x120 [mdt] [37619.251990] [<ffffffffa0a0b44a>] tgt_request_handle+0x8ca/0x1330 [ptlrpc] [37619.252048] [<ffffffffa09b48f8>] ptlrpc_server_handle_request+0x258/0xa60 [ptlrpc] [37619.252088] [<ffffffffa09b8780>] ptlrpc_main+0xa40/0x1da0 [ptlrpc] [37619.252095] [<ffffffff8109e8b9>] kthread+0xc9/0xe0 [37619.252099] [<ffffffff81616f45>] ret_from_fork+0x55/0x80 [37619.254330] (null)DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 [37619.254331] [37619.254331] (null)Leftover inexact backtrace: [37619.254331] [37619.254334] [<ffffffff8109e7f0>] ? kthread+0x0/0xe0
We have several ls, rm and mv processes in D state on client1:
[41311.155160] ls D ffff880037d7a800 0 16699 1 0x00000004 [41311.155162] ffff8800488afc08 ffff88007a30c0d0 ffff88003f584ec0 ffff8800488b0000 [41311.155163] ffff88007a30c17c ffff88003f584ec0 00000000ffffffff ffff88007a30c180 [41311.155176] ffff8800488afc20 ffffffff81610a95 ffff88007a30c178 ffff8800488afc30 [41311.155177] Call Trace: [41311.155181] [<ffffffff81610a95>] schedule+0x35/0x80 [41311.155184] [<ffffffff81610d7e>] schedule_preempt_disabled+0xe/0x10 [41311.155186] [<ffffffff816125b5>] __mutex_lock_slowpath+0x95/0x110 [41311.155189] [<ffffffff81612643>] mutex_lock+0x13/0x22 [41311.155193] [<ffffffff81216760>] walk_component+0x170/0x240 [41311.155197] [<ffffffff81216c41>] link_path_walk+0x171/0x570 [41311.155199] [<ffffffff812174e2>] path_openat+0x92/0x12e0 [41311.155202] [<ffffffff8121a25e>] do_filp_open+0x7e/0xd0 [41311.155205] [<ffffffff81209f99>] do_sys_open+0x129/0x200 [41311.155209] [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6 [41311.157343] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb6 [41311.157343] … [41311.191802] Leftover inexact backtrace: [41311.191802] [41311.191806] rm D ffff88007fd16440 0 19246 1 0x00000004 [41311.191808] ffff880063a5bc08 ffff88007a30c0d0 ffff880062f74e40 ffff880063a5c000 [41311.191809] ffff88007a30c17c ffff880062f74e40 00000000ffffffff ffff88007a30c180 [41311.191810] ffff880063a5bc20 ffffffff81610a95 ffff88007a30c178 ffff880063a5bc30 [41311.191811] Call Trace: [41311.191814] [<ffffffff81610a95>] schedule+0x35/0x80 [41311.191817] [<ffffffff81610d7e>] schedule_preempt_disabled+0xe/0x10 [41311.191819] [<ffffffff816125b5>] __mutex_lock_slowpath+0x95/0x110 [41311.191821] [<ffffffff81612643>] mutex_lock+0x13/0x22 [41311.191823] [<ffffffff81216760>] walk_component+0x170/0x240 [41311.191826] [<ffffffff81216c41>] link_path_walk+0x171/0x570 [41311.191829] [<ffffffff812174e2>] path_openat+0x92/0x12e0 [41311.191831] [<ffffffff8121a25e>] do_filp_open+0x7e/0xd0 [41311.191833] [<ffffffff81209f99>] do_sys_open+0x129/0x200 [41311.191836] [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6 [41311.193957] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb6 [41311.193957] [41311.193958] Leftover inexact backtrace: [41311.193958] [41311.193962] mv D ffff880063531040 0 19385 1 0x00000004 [41311.193964] ffff88005f1a3c28 ffff88007a30c0d0 ffff8800631091c0 ffff88005f1a4000 [41311.193965] ffff88007a30c17c ffff8800631091c0 00000000ffffffff ffff88007a30c180 [41311.193966] ffff88005f1a3c40 ffffffff81610a95 ffff88007a30c178 ffff88005f1a3c50 [41311.193967] Call Trace: [41311.193970] [<ffffffff81610a95>] schedule+0x35/0x80 [41311.193973] [<ffffffff81610d7e>] schedule_preempt_disabled+0xe/0x10 [41311.193975] [<ffffffff816125b5>] __mutex_lock_slowpath+0x95/0x110 [41311.193977] [<ffffffff81612643>] mutex_lock+0x13/0x22 [41311.193980] [<ffffffff81216760>] walk_component+0x170/0x240 [41311.193983] [<ffffffff81216c41>] link_path_walk+0x171/0x570 [41311.193985] [<ffffffff8121713f>] path_lookupat+0x7f/0x110 [41311.193987] [<ffffffff81219816>] filename_lookup+0x116/0x150 [41311.193991] [<ffffffff8120f7b9>] vfs_fstatat+0x49/0x90 [41311.193994] [<ffffffff8120fc7d>] SYSC_newlstat+0x1d/0x40 [41311.193996] [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6 [41311.196113] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb6 [41311.196114]
On the other client ls, ln, getfattr in D state:
[41317.513265] Leftover inexact backtrace: [41317.513265] [41317.513269] ls D ffff88005ae9c800 0 6830 1 0x00000004 [41317.513270] ffff880054e5bc08 ffff880052367bd0 ffff88006426c940 ffff880054e5c000 [41317.513272] ffff880052367c7c ffff88006426c940 00000000ffffffff ffff880052367c80 [41317.513273] ffff880054e5bc20 ffffffff81610a95 ffff880052367c78 ffff880054e5bc30 [41317.513273] Call Trace: [41317.513277] [<ffffffff81610a95>] schedule+0x35/0x80 [41317.513279] [<ffffffff81610d7e>] schedule_preempt_disabled+0xe/0x10 [41317.513281] [<ffffffff816125b5>] __mutex_lock_slowpath+0x95/0x110 [41317.513284] [<ffffffff81612643>] mutex_lock+0x13/0x22 [41317.513286] [<ffffffff81216760>] walk_component+0x170/0x240 [41317.513289] [<ffffffff81216c41>] link_path_walk+0x171/0x570 [41317.513291] [<ffffffff812174e2>] path_openat+0x92/0x12e0 [41317.513294] [<ffffffff8121a2a8>] do_filp_open+0xc8/0xd0 [41317.513296] [<ffffffff81209f99>] do_sys_open+0x129/0x200 [41317.513299] [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6 [41317.515417] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb6 [41317.515417] [41317.515417] Leftover inexact backtrace: [41317.515417] [41317.515421] getfattr D ffff880060adb800 0 7898 1 0x00000004 [41317.515422] ffff88005bc17c28 ffff880052367bd0 ffff88005fb45200 ffff88005bc18000 [41317.515424] ffff880052367c7c ffff88005fb45200 00000000ffffffff ffff880052367c80 [41317.515425] ffff88005bc17c40 ffffffff81610a95 ffff880052367c78 ffff88005bc17c50 [41317.515425] Call Trace: [41317.515429] [<ffffffff81610a95>] schedule+0x35/0x80 [41317.515431] [<ffffffff81610d7e>] schedule_preempt_disabled+0xe/0x10 [41317.515433] [<ffffffff816125b5>] __mutex_lock_slowpath+0x95/0x110 [41317.515436] [<ffffffff81612643>] mutex_lock+0x13/0x22 [41317.515438] [<ffffffff81216760>] walk_component+0x170/0x240 [41317.515441] [<ffffffff81216c41>] link_path_walk+0x171/0x570 [41317.515443] [<ffffffff8121713f>] path_lookupat+0x7f/0x110 [41317.515446] [<ffffffff81219816>] filename_lookup+0x116/0x150 [41317.515449] [<ffffffff8120f7b9>] vfs_fstatat+0x49/0x90 [41317.515452] [<ffffffff8120fc3a>] SYSC_newstat+0x1a/0x40 [41317.515454] [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6 [41317.517582] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb6 [41317.517582] [41317.517582] Leftover inexact backtrace: [41317.517582] [41317.517586] ln D 0000000000000001 0 8542 1 0x00000004 [41317.517587] ffff88005b603c68 ffff880052367bd0 ffff88005f270040 ffff88005b604000 [41317.517588] ffff880052367c7c ffff88005f270040 00000000ffffffff ffff880052367c80 [41317.517589] ffff88005b603c80 ffffffff81610a95 ffff880052367c78 ffff88005b603c90 [41317.517590] Call Trace: [41317.517593] [<ffffffff81610a95>] schedule+0x35/0x80 [41317.517596] [<ffffffff81610d7e>] schedule_preempt_disabled+0xe/0x10 [41317.517598] [<ffffffff816125b5>] __mutex_lock_slowpath+0x95/0x110 [41317.517600] [<ffffffff81612643>] mutex_lock+0x13/0x22 [41317.517602] [<ffffffff81216760>] walk_component+0x170/0x240 [41317.517606] [<ffffffff81216c41>] link_path_walk+0x171/0x570 [41317.517610] [<ffffffff8121713f>] path_lookupat+0x7f/0x110 [41317.517612] [<ffffffff8121979c>] filename_lookup+0x9c/0x150 [41317.517615] [<ffffffff8121ac38>] SyS_linkat+0x68/0x2e0 [41317.517617] [<ffffffff81614b0a>] entry_SYSCALL_64_fastpath+0x1e/0xb6 [41317.519736] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xb6 [41317.519737] [41317.519737] Leftover inexact backtrace: [41317.519737] [41317.519740] ldlm_bl_05 S 0000000000000001 0 8829 2 0x00000000 [41317.519742] ffff880058353e18 ffff88005a8b9418 ffff8800576758c0 ffff880058354000 [41317.519743] ffff88005731d6a4 0000000000000000 ffff8800576758c0 ffff88005731d668 [41317.519744] ffff880058353e30 ffffffff81610a95 ffff88005731d640 ffff8800576758c0 [41317.519744] Call Trace: [41317.519748] [<ffffffff81610a95>] schedule+0x35/0x80 [41317.519770] [<ffffffffa0b4a510>] ldlm_bl_thread_main+0x3f0/0x690 [ptlrpc] [41317.519774] [<ffffffff8109e8f9>] kthread+0xc9/0xe0 [41317.519777] [<ffffffff81614f45>] ret_from_fork+0x55/0x80 [41317.521908] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80 [41317.521908]
Logs for this hang are at:
https://testing.hpdd.intel.com/test_sets/49a29a5e-2f2f-11e8-9e0e-52540065bddc
https://testing.hpdd.intel.com/test_sets/97cda774-274a-11e8-9e0e-52540065bddc
https://testing.hpdd.intel.com/test_sets/9efd8706-276a-11e8-9e0e-52540065bddc
https://testing.hpdd.intel.com/test_sets/444c303c-2e36-11e8-9e0e-52540065bddc