Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.3, Lustre 2.10.4
- Labels: None
- Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
- Severity: 3
Description
We keep having issues with the MGS since the upgrade from 2.9 to 2.10 LTS. As soon as we fail over or fail back a target, the MGS gets stuck. Additionally, stopping the MGS always triggers a crash (reported in LU-10390). This is concerning for a stable version.
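For context, the failover/failback that triggers the hang is an ordinary planned target migration between the two nodes of an HA pair. A minimal sketch, assuming a shared OST block device (the device path and mount point below are illustrative, not our actual configuration):

# On the active server: stop the target to begin the planned failover
umount /mnt/lustre/ost0000

# On the backup server: mount the same shared device to take over the target.
# The target re-registers with the MGS on mount (mgs_target_reg in the trace
# below), which is where the MGS service threads end up stuck.
mount -t lustre /dev/mapper/ost0000 /mnt/lustre/ost0000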
The MGS got stuck this morning when we tried to add a new OST:
[669739.991439] LNet: Service thread pid 136320 was inactive for 200.27s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[669740.010557] Pid: 136320, comm: ll_mgs_0011
[669740.015223] Call Trace:
[669740.019798] [<ffffffff816a94e9>] schedule+0x29/0x70
[669740.025437] [<ffffffff816a6f34>] schedule_timeout+0x174/0x2c0
[669740.032077] [<ffffffffc0b6bef1>] ? ldlm_run_ast_work+0x1d1/0x3a0 [ptlrpc]
[669740.039848] [<ffffffff81098b20>] ? process_timeout+0x0/0x10
[669740.046276] [<ffffffffc0b85020>] ? ldlm_expired_completion_wait+0x0/0x240 [ptlrpc]
[669740.054934] [<ffffffffc0b85811>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[669740.062704] [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
[669740.069704] [<ffffffffc138e75c>] mgs_completion_ast_generic+0x5c/0x200 [mgs]
[669740.077777] [<ffffffffc0b6a6bc>] ? ldlm_lock_create+0x1fc/0xa30 [ptlrpc]
[669740.085451] [<ffffffffc138e973>] mgs_completion_ast_config+0x13/0x20 [mgs]
[669740.093331] [<ffffffffc0b87730>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
[669740.101394] [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
[669740.109372] [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[669740.116950] [<ffffffffc139335c>] mgs_revoke_lock+0xfc/0x370 [mgs]
[669740.123956] [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[669740.131534] [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
[669740.139498] [<ffffffffc1393ae5>] mgs_target_reg+0x515/0x1370 [mgs]
[669740.146608] [<ffffffffc0bbb0b1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[669740.154208] [<ffffffffc0c1dda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[669740.161997] [<ffffffffc0bc6b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
[669740.170655] [<ffffffffc0bc3148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[669740.178328] [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
[669740.185420] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[669740.192041] [<ffffffffc0bca252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[669740.199133] [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[669740.205350] [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
[669740.211583] [<ffffffffc0bc97c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[669740.218675] [<ffffffff810b098f>] kthread+0xcf/0xe0
[669740.224215] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[669740.229851] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[669740.235970] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[669740.243363] LustreError: dumping log to /tmp/lustre-log.1518720988.136320
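The binary debug dump referenced on the last line can be converted to text with lctl for offline inspection; a small sketch (the output path is arbitrary):

# Decode the binary Lustre debug log dumped by the stuck service thread
lctl debug_file /tmp/lustre-log.1518720988.136320 /tmp/lustre-log.1518720988.136320.txt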
Clients output something like this:
[1466043.295178] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5) [1466043.295179] Lustre: Skipped 1 previous similar message [1466043.767551] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message [1466351.198284] LustreError: 368700:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8823eebbb2c0) refcount = 2 [1466351.242084] LustreError: 368700:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks: [1466657.253528] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail [1466657.299037] LustreError: Skipped 1 previous similar message [1466657.317969] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1518719899, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2i [1466657.318229] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount nonzero (2) after lock cleanup; fo [1466657.318230] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message [1466657.318232] LustreError: 372154:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount = 3 [1466657.318233] LustreError: 372154:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks: [1466657.318238] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.51@o2ib5 lock: ffff883193225800/0xe5ac076a284d2d lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 4 [1466657.318239] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) Skipped 1 previous similar message [1466657.318244] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
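While the MGC is wedged like this, the lock and resource counts of its LDLM namespace can be inspected from the client; a sketch, assuming the usual 2.10 parameter names (the wildcard matches the MGC namespace shown in the log above):

# Dump lock/resource counts for the MGC namespace on an affected client
lctl get_param ldlm.namespaces.MGC*.lock_count
lctl get_param ldlm.namespaces.MGC*.resource_count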
Rebooting the MGS fixes the issue, until the next target failover/failback.
Stephane