Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.3, Lustre 2.10.4
    • Labels: None
    • Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3
    • 9223372036854775807

    Description

      We keep having issues with the MGS since the upgrade from 2.9 to 2.10 LTS. As soon as we fail over or fail back a target, the MGS appears to get stuck. Additionally, stopping the MGS always triggers a crash (reported in LU-10390). This is concerning for a stable release.

      The MGS got stuck this morning when we tried to add a new OST:

      [669739.991439] LNet: Service thread pid 136320 was inactive for 200.27s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [669740.010557] Pid: 136320, comm: ll_mgs_0011
      [669740.015223] 
      Call Trace:
      [669740.019798]  [<ffffffff816a94e9>] schedule+0x29/0x70
      [669740.025437]  [<ffffffff816a6f34>] schedule_timeout+0x174/0x2c0
      [669740.032077]  [<ffffffffc0b6bef1>] ? ldlm_run_ast_work+0x1d1/0x3a0 [ptlrpc]
      [669740.039848]  [<ffffffff81098b20>] ? process_timeout+0x0/0x10
      [669740.046276]  [<ffffffffc0b85020>] ? ldlm_expired_completion_wait+0x0/0x240 [ptlrpc]
      [669740.054934]  [<ffffffffc0b85811>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [669740.062704]  [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
      [669740.069704]  [<ffffffffc138e75c>] mgs_completion_ast_generic+0x5c/0x200 [mgs]
      [669740.077777]  [<ffffffffc0b6a6bc>] ? ldlm_lock_create+0x1fc/0xa30 [ptlrpc]
      [669740.085451]  [<ffffffffc138e973>] mgs_completion_ast_config+0x13/0x20 [mgs]
      [669740.093331]  [<ffffffffc0b87730>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
      [669740.101394]  [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
      [669740.109372]  [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      [669740.116950]  [<ffffffffc139335c>] mgs_revoke_lock+0xfc/0x370 [mgs]
      [669740.123956]  [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      [669740.131534]  [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
      [669740.139498]  [<ffffffffc1393ae5>] mgs_target_reg+0x515/0x1370 [mgs]
      [669740.146608]  [<ffffffffc0bbb0b1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
      [669740.154208]  [<ffffffffc0c1dda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [669740.161997]  [<ffffffffc0bc6b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      [669740.170655]  [<ffffffffc0bc3148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [669740.178328]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      [669740.185420]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      [669740.192041]  [<ffffffffc0bca252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [669740.199133]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [669740.205350]  [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
      [669740.211583]  [<ffffffffc0bc97c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [669740.218675]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [669740.224215]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      [669740.229851]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      [669740.235970]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      
      [669740.243363] LustreError: dumping log to /tmp/lustre-log.1518720988.136320
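      The binary debug dump referenced in the last line can usually be converted to readable text with lctl debug_file; a minimal sketch (the output filename is just an example):

      # decode the auto-dumped binary Lustre debug log into plain text
      lctl debug_file /tmp/lustre-log.1518720988.136320 /tmp/lustre-log.1518720988.136320.txt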
      
      

      Clients output something like this:

      [1466043.295178] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
      [1466043.295179] Lustre: Skipped 1 previous similar message
      [1466043.767551] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
      [1466351.198284] LustreError: 368700:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8823eebbb2c0) refcount = 2
      [1466351.242084] LustreError: 368700:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      [1466657.253528] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
      [1466657.299037] LustreError: Skipped 1 previous similar message
      [1466657.317969] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1518719899, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2i
      [1466657.318229] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount nonzero (2) after lock cleanup; fo
      [1466657.318230] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message
      [1466657.318232] LustreError: 372154:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount = 3
      [1466657.318233] LustreError: 372154:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      [1466657.318238] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.51@o2ib5 lock: ffff883193225800/0xe5ac076a284d2d lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 4
      [1466657.318239] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) Skipped 1 previous similar message
      [1466657.318244] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
      
      

      Rebooting the MGS fixes the issue, until the next target failover/failback.
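      For context, the operations involved here are just the usual manual target failover/failback and an MGS restart; a rough sketch, with placeholder device paths and mount points:

      # fail one OST over to its partner node (umount on the primary OSS,
      # mount on the failover OSS); the paths below are placeholders
      umount /mnt/lustre/ost0
      mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0

      # restarting the MGS amounts to remounting the MGT (or rebooting the node)
      umount /mnt/lustre/mgt
      mount -t lustre /dev/mapper/mgt /mnt/lustre/mgt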

      Stephane
       

      Attachments

        Issue Links

          Activity

            [LU-10674] MGS very unstable in 2.10.x

            Hello Hongchao Zhang,

            I'm not sure the patch you pointed to will really fix this problem.

            If I understand correctly, the problem should occur when a client is dead but not yet evicted by the MGS while an IR (Imperative Recovery) event is in progress. I tested that scenario multiple times without being able to reproduce it. In practice, I either saw the MGS evict the client, or the revoke return an error, but nothing stayed stale.

            6895:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can't take cfg lock for 0x736d61726170/0x3 : rc = -11
            degremoa Aurelien Degremont (Inactive) added a comment - edited
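            A rough sketch of that kind of test, assuming a standard failover pair (device paths, mount points and the grep target are illustrative): make one client "dead" without a clean unmount, then force a target failover so the MGS has to revoke the config lock.

            # on one client: simulate a dead client (hard crash, no unmount)
            echo 1 > /proc/sys/kernel/sysrq
            echo c > /proc/sysrq-trigger

            # then fail one OST over/back between its OSS pair (umount on the
            # primary, mount -t lustre on the partner) and watch the MGS console
            # for the revoke either blocking or failing with -EAGAIN (-11):
            dmesg | grep mgs_revoke_lock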

            this issue could be caused by the same reason in LU-13356 (the patch is https://review.whamcloud.com/37880)

            hongchao.zhang Hongchao Zhang added a comment

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34958
            Subject: LU-10674 ldlm: only check granted plain locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bd8a05bde10b4e8a22f941522e06b033c1c88a68

            gerrit Gerrit Updater added a comment
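            For anyone wanting to test it, a change like this can normally be pulled straight from Gerrit, assuming the standard refs/changes layout (change 34958, patch set 1):

            # in an existing fs/lustre-release checkout
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/58/34958/1
            git cherry-pick FETCH_HEAD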

            Hi Bruno and John,

            For this case, all targets are running Lustre 2.10.3 RC1.

            Because this system running 2.10 has been quite unstable lately, we don't want to trigger new problems right now, but next time I have to fail over/fail back, I'll try to gather more Lustre debug logs! Thanks!

            sthiell Stephane Thiell added a comment
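            A minimal sketch of capturing such debug logs around a failover, assuming the standard lctl tooling (run on the MGS, the OSS pair involved and at least one affected client; file names are examples):

            lctl set_param debug=-1       # enable all debug flags
            lctl set_param debug_mb=1024  # enlarge the in-memory debug buffer (MB)
            lctl clear                    # drop anything already buffered
            # ... perform the target failover/failback, wait for the hang ...
            lctl dk /tmp/lustre-debug.$(hostname).$(date +%s).log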

            Hi Stephane,

            Could you describe the interop situation here? What version is the MGS and what versions are the targets?

            jhammond John Hammond added a comment

            Hello Stephane!
            We can try to reproduce in-house, but since you appear to be able to reproduce this easily, maybe you could provide a full Lustre debug log from all sides when causing a target failover/failback?

            Also, regarding "stopping the MGS always triggers a crash (reported in LU-10390)": again, we can try to reproduce in-house, but maybe you already have a crash dump available that could be analyzed as a first step?

            bfaccini Bruno Faccini (Inactive) added a comment
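            If a kdump vmcore is available, a first pass with the crash utility would look something like this (the vmcore path is a placeholder; the vmlinux comes from the kernel-debuginfo package matching the kernel listed in the ticket details):

            crash /usr/lib/debug/lib/modules/3.10.0-693.2.2.el7_lustre.pl1.x86_64/vmlinux \
                  /var/crash/<host-timestamp>/vmcore
            crash> log          # kernel messages leading up to the crash
            crash> bt           # backtrace of the panicking task
            crash> foreach bt   # all task backtraces (look for ll_mgs_* threads)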

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 14
