Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Lustre 2.12.4
- None
- Environment: clients running l2.12.3
- Severity: 2
Description
MDS threads are hanging or running slowly, and we are getting stack trace dumps:
Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377038] Lustre: MGS: Connection restored to 09397a7f-3fe2-2dc8-d25a-74d157cb2008 (at 10.151.50.72@o2ib)
Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377044] Lustre: Skipped 277 previous similar messages
Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply
Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] req@ffff8979f9e16300 x1668974534442592/t1531807577764(0) o36->67852282-a3ed-5acb-a9e2-3cae43fe0406@10.151.0.201@o2ib:5/0 lens 488/3152 e 0 to 0 dl 1592342570 ref 2 fl Interpret:/0/0 rc 0/0
Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.952277] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 134 previous similar messages
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.333617] LNet: Service thread pid 14118 was inactive for 551.86s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.389992] LNet: Skipped 3 previous similar messages
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407185] Pid: 14118, comm: mdt00_115 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407190] Call Trace:
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407202] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412473] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412496] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412506] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412516] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412530] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412541] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412552] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412562] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412571] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412623] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412659] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412691] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412696] [<ffffffff858c61f1>] kthread+0xd1/0xe0
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412699] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412722] [<ffffffffffffffff>] 0xffffffffffffffff
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412725] LustreError: dumping log to /tmp/lustre-log.1592342774.14118
Jun 16 14:26:14 nbp8-mds1 sec[2849]: SEC_EVENT |msg lustre hung thread
Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.178343] LNet: Service thread pid 8575 was inactive for 552.75s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234458] Pid: 8575, comm: mdt00_043 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234459] Call Trace:
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.234472] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239700] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239749] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239763] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239773] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239785] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239797] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239809] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239819] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239829] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239874] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239909] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239945] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239952] [<ffffffff858c61f1>] kthread+0xd1/0xe0
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239958] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239982] [<ffffffffffffffff>] 0xffffffffffffffff
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Pid: 14083, comm: mdt00_088 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Call Trace:
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240020] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240050] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240060] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240071] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240082] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240093] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240104] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240116] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240127] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240137] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240174] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240207] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240240] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240244] [<ffffffff858c61f1>] kthread+0xd1/0xe0
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240247] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240251] [<ffffffffffffffff>] 0xffffffffffffffff
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Pid: 10527, comm: mdt00_069 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Call Trace:
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240285] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240313] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240325] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240335] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240344] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240357] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240369] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240380] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240389] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240400] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240439] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240478] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240510] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240514] [<ffffffff858c61f1>] kthread+0xd1/0xe0
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240516] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240520] [<ffffffffffffffff>] 0xffffffffffffffff
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Pid: 14084, comm: mdt00_089 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Call Trace:
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240554] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240585] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240598] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240610] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240620] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240631] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240647] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240657] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240666] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240677] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240716] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240749] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240781] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240784] [<ffffffff858c61f1>] kthread+0xd1/0xe0
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240788] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240792] [<ffffffffffffffff>] 0xffffffffffffffff
Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240796] LNet: Service thread pid 8568 was inactive for 552.81s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
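For reference, a minimal sketch of how the binary debug dumps referenced above can be decoded on the MDS (standard lctl usage, not something already attached to this ticket; the file path and pid are the ones from the watchdog message, and the dump must still be present in /tmp):

# convert the binary Lustre debug buffer dump to text
lctl debug_file /tmp/lustre-log.1592342774.14118 /tmp/lustre-log.1592342774.14118.txt
# pull out the entries for the hung mdt00_115 service thread (pid 14118)
grep -w 14118 /tmp/lustre-log.1592342774.14118.txt | less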
I checked for the first eviction, but that entry had already aged out of the client log by the time the log was dumped, so it is not possible to see what was going on on the client at the time.
I see there are still a number of stuck threads after that, and I need to work out why - probably another lost AST, but it's not a given that the relevant client log (whichever client it was) survived either, I guess. See the sketch below for the server-side checks.
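A hedged sketch of those checks (my own commands, nothing specific to this ticket): a lost blocking AST normally ends in an eviction logged on the MDS as "lock callback timer expired ... evicting client", and the per-namespace lock counts show whether locks are piling up behind the stuck setattr requests:

# look for evictions and callback timeouts around the hang window on the MDS
grep -iE 'evict|lock callback timer expired' /var/log/messages
# locks currently held per LDLM namespace on the server
lctl get_param ldlm.namespaces.*.lock_count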