Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: Lustre 2.12.4
- Labels: None
- Environment: clients running Lustre 2.12.3
- Severity: 2
Description
MDS threads hang or are slow, and the watchdog is dumping stack traces:
Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377038] Lustre: MGS: Connection restored to 09397a7f-3fe2-2dc8-d25a-74d157cb2008 (at 10.151.50.72@o2ib)
Jun 16 14:20:20 nbp8-mds1 kernel: [1660354.377044] Lustre: Skipped 277 previous similar messages
Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (30/30), not sending early reply
Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.852981]   req@ffff8979f9e16300 x1668974534442592/t1531807577764(0) o36->67852282-a3ed-5acb-a9e2-3cae43fe0406@10.151.0.201@o2ib:5/0 lens 488/3152 e 0 to 0 dl 1592342570 ref 2 fl Interpret:/0/0 rc 0/0
Jun 16 14:22:20 nbp8-mds1 kernel: [1660474.952277] Lustre: 8579:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 134 previous similar messages
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.333617] LNet: Service thread pid 14118 was inactive for 551.86s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.389992] LNet: Skipped 3 previous similar messages
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407185] Pid: 14118, comm: mdt00_115 3.10.0-1062.12.1.el7_lustre2124.x86_64 #1 SMP Tue Mar 17 13:32:19 PDT 2020
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407190] Call Trace:
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.407202] [<ffffffffc0f17c90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412473] [<ffffffffc0f187b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412496] [<ffffffffc16205cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412506] [<ffffffffc1620c50>] mdt_object_lock_internal+0x70/0x360 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412516] [<ffffffffc1620f9c>] mdt_reint_object_lock+0x2c/0x60 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412530] [<ffffffffc1639efc>] mdt_reint_striped_lock+0x8c/0x510 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412541] [<ffffffffc163d866>] mdt_reint_setattr+0x676/0x1290 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412552] [<ffffffffc163f963>] mdt_reint_rec+0x83/0x210 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412562] [<ffffffffc161c273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412571] [<ffffffffc16276e7>] mdt_reint+0x67/0x140 [mdt]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412623] [<ffffffffc0fb73ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412659] [<ffffffffc0f5b47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412691] [<ffffffffc0f5ede4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412696] [<ffffffff858c61f1>] kthread+0xd1/0xe0
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412699] [<ffffffff85f8dd37>] ret_from_fork_nospec_end+0x0/0x39
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412722] [<ffffffffffffffff>] 0xffffffffffffffff
Jun 16 14:26:14 nbp8-mds1 kernel: [1660708.412725] LustreError: dumping log to /tmp/lustre-log.1592342774.14118
Jun 16 14:26:14 nbp8-mds1 sec[2849]: SEC_EVENT |msg lustre hung thread
Jun 16 14:26:15 nbp8-mds1 kernel: [1660709.178343] LNet: Service thread pid 8575 was inactive for 552.75s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

The threads pid 8575 (mdt00_043), pid 14083 (mdt00_088), pid 10527 (mdt00_069), and pid 14084 (mdt00_089) all dumped the same call trace as pid 14118 above (ldlm_completion_ast -> ldlm_cli_enqueue_local -> mdt_reint_setattr), followed by:

Jun 16 14:26:16 nbp8-mds1 kernel: [1660709.240796] LNet: Service thread pid 8568 was inactive for 552.81s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
I discussed this with Amir, and he thinks your situation resembles a case where there are not enough peer credits, so messages get dropped.
I checked the logs and I do see this:
But the date is strange: it's from July. There are no newer messages, but these are only emitted at the "net" trace level, which you have not been including in the more recent captures (I guess you were not removing the older server.dk files?).
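
If it helps, a minimal sketch of how the "net"-level messages could be captured going forward (the dump path is illustrative):

    lctl set_param debug=+net      # add "net" to the debug mask
    # ... reproduce the problem ...
    lctl dk /tmp/server-net.dk     # dump the kernel debug buffer to a file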
He says you need to check the peer_credits_hiw and peer_credits values; peer_credits_hiw should be set to half of peer_credits to ensure proper reclaim of credits for certain types of messages. (Growing peer credits might also not be a bad idea, depending on what your settings are now.)
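
A minimal sketch of how to check and set these on an o2ib fabric, where they are ko2iblnd module parameters (the values shown are illustrative, not a recommendation):

    # Read the current values from sysfs:
    cat /sys/module/ko2iblnd/parameters/peer_credits
    cat /sys/module/ko2iblnd/parameters/peer_credits_hiw

    # Example /etc/modprobe.d/lustre.conf line keeping the high-water mark
    # at half of peer_credits; takes effect when the module is reloaded:
    options ko2iblnd peer_credits=32 peer_credits_hiw=16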
"lnetctl net show -v 4", "lnetctl peer show -v" should show some related stats and min credits.