[LU-10674] MGS very unstable in 2.10.x Created: 15/Feb/18  Updated: 02/Oct/20

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3, Lustre 2.10.4
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Stephane Thiell Assignee: Hongchao Zhang
Resolution: Unresolved Votes: 0
Labels: None
Environment:

3.10.0-693.2.2.el7_lustre.pl1.x86_64


Issue Links:
Related
is related to LU-13356 lctl conf_param hung on the MGS node Resolved
is related to LU-13719 lov tgt 36 not cleaned! deathrow=0, l... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We keep having issues with the MGS since the upgrade from 2.9 to the 2.10 LTS release. As soon as we fail over or fail back a target, the MGS gets stuck. Additionally, stopping the MGS always triggers a crash (reported in LU-10390). This is concerning for a stable version.

MGS stuck this morning when trying to add a new OST:

[669739.991439] LNet: Service thread pid 136320 was inactive for 200.27s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[669740.010557] Pid: 136320, comm: ll_mgs_0011
[669740.015223] 
Call Trace:
[669740.019798]  [<ffffffff816a94e9>] schedule+0x29/0x70
[669740.025437]  [<ffffffff816a6f34>] schedule_timeout+0x174/0x2c0
[669740.032077]  [<ffffffffc0b6bef1>] ? ldlm_run_ast_work+0x1d1/0x3a0 [ptlrpc]
[669740.039848]  [<ffffffff81098b20>] ? process_timeout+0x0/0x10
[669740.046276]  [<ffffffffc0b85020>] ? ldlm_expired_completion_wait+0x0/0x240 [ptlrpc]
[669740.054934]  [<ffffffffc0b85811>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
[669740.062704]  [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
[669740.069704]  [<ffffffffc138e75c>] mgs_completion_ast_generic+0x5c/0x200 [mgs]
[669740.077777]  [<ffffffffc0b6a6bc>] ? ldlm_lock_create+0x1fc/0xa30 [ptlrpc]
[669740.085451]  [<ffffffffc138e973>] mgs_completion_ast_config+0x13/0x20 [mgs]
[669740.093331]  [<ffffffffc0b87730>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
[669740.101394]  [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
[669740.109372]  [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[669740.116950]  [<ffffffffc139335c>] mgs_revoke_lock+0xfc/0x370 [mgs]
[669740.123956]  [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
[669740.131534]  [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
[669740.139498]  [<ffffffffc1393ae5>] mgs_target_reg+0x515/0x1370 [mgs]
[669740.146608]  [<ffffffffc0bbb0b1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
[669740.154208]  [<ffffffffc0c1dda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[669740.161997]  [<ffffffffc0bc6b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
[669740.170655]  [<ffffffffc0bc3148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[669740.178328]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
[669740.185420]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
[669740.192041]  [<ffffffffc0bca252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[669740.199133]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[669740.205350]  [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
[669740.211583]  [<ffffffffc0bc97c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[669740.218675]  [<ffffffff810b098f>] kthread+0xcf/0xe0
[669740.224215]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
[669740.229851]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
[669740.235970]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0

[669740.243363] LustreError: dumping log to /tmp/lustre-log.1518720988.136320
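
(For reference, the binary dump mentioned on the last line can be converted to readable text with lctl's debug_file command; the output filename below is just an example.)

lctl debug_file /tmp/lustre-log.1518720988.136320 /tmp/lustre-log.1518720988.136320.txt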

Clients output something like this:

[1466043.295178] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
[1466043.295179] Lustre: Skipped 1 previous similar message
[1466043.767551] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[1466351.198284] LustreError: 368700:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8823eebbb2c0) refcount = 2
[1466351.242084] LustreError: 368700:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
[1466657.253528] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
[1466657.299037] LustreError: Skipped 1 previous similar message
[1466657.317969] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1518719899, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2i
[1466657.318229] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount nonzero (2) after lock cleanup; fo
[1466657.318230] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message
[1466657.318232] LustreError: 372154:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount = 3
[1466657.318233] LustreError: 372154:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
[1466657.318238] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.51@o2ib5 lock: ffff883193225800/0xe5ac076a284d2d lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 4
[1466657.318239] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) Skipped 1 previous similar message
[1466657.318244] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)

Rebooting the MGS fixes the issue, until the next target failover/failback.

Stephane
 



 Comments   
Comment by Bruno Faccini (Inactive) [ 16/Feb/18 ]

Hello Stephane!
We can try to reproduce this in-house, but since you appear to be able to reproduce it easily, maybe you could provide us with a full Lustre debug log from all sides when causing a target failover/failback? (A sample capture sequence is sketched at the end of this comment.)

Also, regarding "stopping the MGS always triggers a crash (reported in LU-10390)": again, we can try to reproduce this in-house, but maybe you already have a crash dump available that could be analyzed as a first step?
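
For reference, something along these lines on the MGS, the targets involved and one client, run around the failover, should capture what we need (the debug mask, buffer size and output path are only examples):

# enlarge the kernel debug buffer and enable full debugging (example values)
lctl set_param debug=-1
lctl set_param debug_mb=1024

# clear old messages and mark the start of the test
lctl clear
lctl mark "LU-10674: starting target failover/failback"

# ... perform the target failover/failback ...

# dump the debug buffer to a file that can be attached to this ticket
lctl dk /tmp/lustre-debug-$(hostname).log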

Comment by John Hammond [ 16/Feb/18 ]

Hi Stephane,

Could you describe the interop situation here? What version is the MGS and what versions are the targets?

Comment by Stephane Thiell [ 16/Feb/18 ]

Hi Bruno and John,

For this case, all targets are running Lustre 2.10.3 RC1.

Because this system running 2.10 has been quite unstable lately, we don't want to trigger new problems right now, but the next time I have to fail over/fail back, I'll try to gather more Lustre debug logs! Thanks!

Comment by Gerrit Updater [ 26/May/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34958
Subject: LU-10674 ldlm: only check granted plain locks
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bd8a05bde10b4e8a22f941522e06b033c1c88a68

Comment by Hongchao Zhang [ 07/Aug/20 ]

This issue could have the same root cause as LU-13356 (the patch is https://review.whamcloud.com/37880).

Comment by Aurelien Degremont (Inactive) [ 01/Oct/20 ]

Hello Hongchao Zhang,

I'm not sure the patch you pointed to will really fix this problem.

If I understand correctly, the problem should occur when a client is dead but not yet evicted by the MGS while an Imperative Recovery (IR) is happening. I tested that multiple times without being able to reproduce it. What I actually saw was the MGS evicting the client, or the revoke returning an error, but nothing stayed stale:

6895:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can't take cfg lock for 0x736d61726170/0x3 : rc = -11
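
A sketch of the kind of test I mean, with placeholder device and mount paths (the exact way the client is made "dead" can vary, e.g. powering the node off abruptly instead):

# on one client: make it unresponsive without a clean unmount
lctl network down

# on the OSS pair: fail one OST over to its partner while that client is dead
umount /mnt/lustre/ost0                            # on the primary OSS
mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0  # on the failover OSS

# on the MGS: watch whether ll_mgs_* threads block in mgs_revoke_lock
dmesg | grep -E 'll_mgs|mgs_revoke_lock'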