[LU-10674] MGS very unstable in 2.10.x Created: 15/Feb/18 Updated: 02/Oct/20 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.3, Lustre 2.10.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stephane Thiell | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Environment: | 3.10.0-693.2.2.el7_lustre.pl1.x86_64 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We keep having issues with the MGS since the upgrade from 2.9 to 2.10 LTS. As soon as we failover/failback some target, the MGS seems to be stuck. Additionally, stopping the MGS always triggers a crash (reported in LU-10390). This is concerning for a stable version.

MGS stuck this morning when trying to add a new OST:

[669739.991439] LNet: Service thread pid 136320 was inactive for 200.27s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[669740.010557] Pid: 136320, comm: ll_mgs_0011
[669740.015223] Call Trace:
[669740.019798] [<ffffffff816a94e9>] schedule+0x29/0x70
[669740.025437] [<ffffffff816a6f34>] schedule_timeout+0x174/0x2c0
[669740.032077] [<ffffffffc0b6bef1>] ? ldlm_run_ast_work+0x1d1/0x3a0 [ptlrpc]
[669740.039848] [<ffffffff81098b20>] ? process_timeout+0x0/0x10
[669740.046276] [<ffffffffc0b85020>] ? ldlm_expired_completion_wait+0x0/0x240 [ptlrpc]
[669740.054934] [<ffffffffc0b85811>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[669740.062704] [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
[669740.069704] [<ffffffffc138e75c>] mgs_completion_ast_generic+0x5c/0x200 [mgs]
[669740.077777] [<ffffffffc0b6a6bc>] ? ldlm_lock_create+0x1fc/0xa30 [ptlrpc]
[669740.085451] [<ffffffffc138e973>] mgs_completion_ast_config+0x13/0x20 [mgs]
[669740.093331] [<ffffffffc0b87730>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
[669740.101394] [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
[669740.109372] [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[669740.116950] [<ffffffffc139335c>] mgs_revoke_lock+0xfc/0x370 [mgs]
[669740.123956] [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[669740.131534] [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
[669740.139498] [<ffffffffc1393ae5>] mgs_target_reg+0x515/0x1370 [mgs]
[669740.146608] [<ffffffffc0bbb0b1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[669740.154208] [<ffffffffc0c1dda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[669740.161997] [<ffffffffc0bc6b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
[669740.170655] [<ffffffffc0bc3148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[669740.178328] [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
[669740.185420] [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[669740.192041] [<ffffffffc0bca252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[669740.199133] [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[669740.205350] [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
[669740.211583] [<ffffffffc0bc97c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[669740.218675] [<ffffffff810b098f>] kthread+0xcf/0xe0
[669740.224215] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[669740.229851] [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[669740.235970] [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[669740.243363] LustreError: dumping log to /tmp/lustre-log.1518720988.136320

Clients output something like that:

[1466043.295178] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
[1466043.295179] Lustre: Skipped 1 previous similar message
[1466043.767551] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[1466351.198284] LustreError: 368700:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8823eebbb2c0) refcount = 2
[1466351.242084] LustreError: 368700:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
[1466657.253528] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
[1466657.299037] LustreError: Skipped 1 previous similar message
[1466657.317969] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1518719899, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2i
[1466657.318229] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount nonzero (2) after lock cleanup; fo
[1466657.318230] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message
[1466657.318232] LustreError: 372154:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount = 3
[1466657.318233] LustreError: 372154:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
[1466657.318238] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.51@o2ib5 lock: ffff883193225800/0xe5ac076a284d2d lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 4
[1466657.318239] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) Skipped 1 previous similar message
[1466657.318244] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)

Rebooting the MGS fixes the issue, until the next target failover/failback.

Stephane |
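For readers parsing the server-side trace: the ll_mgs_0011 thread is handling a target registration (mgs_target_reg) and blocks in mgs_revoke_lock -> ldlm_cli_enqueue_local -> ldlm_completion_ast, i.e., roughly, it waits for every node holding the configuration lock to cancel it before the new OST can be registered. The toy sketch below, in plain pthreads with hypothetical names (cfg_revoke, live_client) and no relation to the actual Lustre implementation, models that wait-for-all-holders pattern and shows how a single unresponsive holder keeps the revoking thread blocked until a timeout, consistent with the "inactive for 200.27s" and "lock timed out ... 300s ago" messages above.

/*
 * Toy model only, NOT Lustre code: all names below are hypothetical.
 * It mimics the pattern in the stack trace: a "revoke" that must wait
 * for every current holder of the config lock to cancel, and a single
 * dead holder that keeps the waiter blocked until a timeout fires.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int holders = 2;            /* clients still holding the config lock */

/* A live client cancels its lock shortly after the revoke is broadcast. */
static void *live_client(void *arg)
{
    (void)arg;
    sleep(1);
    pthread_mutex_lock(&mtx);
    holders--;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/*
 * Wait until every holder has cancelled, or give up after timeout_sec.
 * A dead client never decrements holders, so this thread just sits
 * here -- the analogue of ll_mgs_0011 stuck in ldlm_completion_ast().
 */
static int cfg_revoke(int timeout_sec)
{
    struct timespec deadline;
    int rc = 0, remaining;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&mtx);
    while (holders > 0 && rc == 0)
        rc = pthread_cond_timedwait(&cv, &mtx, &deadline);
    remaining = holders;
    pthread_mutex_unlock(&mtx);

    return remaining == 0 ? 0 : -rc;   /* -ETIMEDOUT when a holder is dead */
}

int main(void)
{
    pthread_t t;
    int rc;

    /* Two holders but only one live client thread: the second holder
     * models a dead node that never answers the revoke. */
    pthread_create(&t, NULL, live_client, NULL);

    rc = cfg_revoke(5);
    printf("revoke finished: rc = %d (%s)\n", rc, rc ? strerror(-rc) : "ok");

    pthread_join(t, NULL);
    return 0;
}

If that model matches what the MGS is doing, the practical question becomes which node still holds a lock on the config resource and why it never sees or honors the cancel request.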
| Comments |
| Comment by Bruno Faccini (Inactive) [ 16/Feb/18 ] |
|
Hello Stephane! Also, about "stopping the MGS always triggers a crash (reported in LU-10390)", again we can try to reproduce in-house, but maybe you already have a crash-dump available that could be analyzed as a first step? |
| Comment by John Hammond [ 16/Feb/18 ] |
|
Hi Stephane, Could you describe the interop situation here? What version is the MGS and what versions are the targets? |
| Comment by Stephane Thiell [ 16/Feb/18 ] |
|
Hi Bruno and John, For this case, all targets are running Lustre 2.10.3 RC1. Because this system running 2.10 has been quite unstable lately, we don't want to trigger new problems right now, but next time I have to failover/failback, I'll try to gather more Lustre debug logs! Thanks! |
| Comment by Gerrit Updater [ 26/May/19 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34958 |
| Comment by Hongchao Zhang [ 07/Aug/20 ] |
|
This issue could be caused by the same reason as in |
| Comment by Aurelien Degremont (Inactive) [ 01/Oct/20 ] |
|
Hello Hongchao Zhang, I'm not sure the patch you pointed to will really fix this problem. If I understand correctly, the problem should occur when one client is dead and not yet evicted by the MGS while an Imperative Recovery (IR) is happening. I tested that multiple times without being able to reproduce it. In practice, I either saw the MGS evict the client, or the revoke return an error without anything staying stale:

6895:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can't take cfg lock for 0x736d61726170/0x3 : rc = -11 |
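A side note that may help interpret that log line: rc = -11 is -EAGAIN on Linux, so in these runs the revoke path appears to fail fast on the contended config lock rather than blocking the way the ll_mgs thread in the original trace does. The sketch below is purely illustrative of that fail-fast behaviour; cfg_revoke_try and holders are made-up names, not Lustre code.

/*
 * Purely illustrative, not Lustre code.  rc = -11 is -EAGAIN, so the
 * behaviour Aurelien observed looks like a path that reports the
 * contended config lock and returns immediately instead of sleeping
 * in a completion wait.
 */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static int holders = 1;            /* some client still holds the config lock */

/* Non-blocking flavour of the revoke: fail fast with -EAGAIN (-11). */
static int cfg_revoke_try(void)
{
    int rc = 0;

    pthread_mutex_lock(&mtx);
    if (holders > 0)
        rc = -EAGAIN;              /* mirrors "can't take cfg lock ... rc = -11" */
    pthread_mutex_unlock(&mtx);

    return rc;
}

int main(void)
{
    printf("cfg_revoke_try: rc = %d\n", cfg_revoke_try());
    return 0;
}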