Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.12.4
-
None
-
clients running l2.12.3
-
2
-
9223372036854775807
Description
MDS threads hang/slow getting stack trace dumps
Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377038] Lustre: MGS: Connection restored to 09397a7f-3fe2-2dc8-d25a-74d157cb2008 (at 10.151.50.72@o2ib) Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377044] Lustre: Skipped 277 previous similar messages Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] req@ffff8979f9e16300 x1668974534442592/t1531807577764(0) o36->67852282-a3ed-5acb-a9e2-3cae43fe0406@10.151.0.201@o2ib:5/0 lens 488/3152 e 0 to 0 dl 1592342570 ref 2 fl Interpret:/0/0 rc 0/0 Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.952277] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 134 previous similar messages Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.333617] LNet: Service thread pid 14118 was inactive for 551.86s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.389992] LNet: Skipped 3 previous similar messages Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407185] Pid: 14118, comm: mdt00_115 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020 Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407190] Call Trace: Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407202] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412473] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412496] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412506] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412516] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412530] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412541] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412552] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412562] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412571] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412623] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412659] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412691] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412696] [<ffffffff858c61f1>] kthread+0xd1/0xe0 Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412699] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39 Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412722] [<ffffffffffffffff>] 0xffffffffffffffff Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412725] LustreError: dumping log to /tmp/lustre-log.1592342774.14118 Jun 16 14:26:14 nbp8-mds1 sec[2849]: SEC_EVENT |msg lustre hung thread Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.178343] LNet: Service thread pid 8575 was inactive for 552.75s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234458] Pid: 8575, comm: mdt00_043 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020 Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.234459] Call Trace: Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.234472] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239700] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239749] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239763] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239773] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239785] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239797] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239809] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239819] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239829] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239874] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239909] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239945] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239952] [<ffffffff858c61f1>] kthread+0xd1/0xe0 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239958] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239982] [<ffffffffffffffff>] 0xffffffffffffffff Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Pid: 14083, comm: mdt00_088 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.239988] Call Trace: Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240020] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240050] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240060] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240071] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240082] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240093] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240104] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240116] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240127] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240137] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240174] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240207] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240240] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240244] [<ffffffff858c61f1>] kthread+0xd1/0xe0 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240247] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240251] [<ffffffffffffffff>] 0xffffffffffffffff Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Pid: 10527, comm: mdt00_069 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240254] Call Trace: Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240285] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240313] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240325] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240335] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240344] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240357] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240369] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240380] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240389] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240400] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240439] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240478] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240510] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240514] [<ffffffff858c61f1>] kthread+0xd1/0xe0 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240516] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240520] [<ffffffffffffffff>] 0xffffffffffffffff Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Pid: 14084, comm: mdt00_089 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240523] Call Trace: Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240554] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240585] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240598] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240610] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240620] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240631] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240647] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240657] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240666] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240677] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240716] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240749] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240781] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240784] [<ffffffff858c61f1>] kthread+0xd1/0xe0 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240788] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39 Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240792] [<ffffffffffffffff>] 0xffffffffffffffff Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240796] LNet: Service thread pid 8568 was inactive for 552.81s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.