LU-10674: MGS very unstable in 2.10.x

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.3, Lustre 2.10.4
    • Labels: None
    • Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3

    Description

      We have been having issues with the MGS since we upgraded from 2.9 to 2.10 LTS. As soon as we fail over or fail back a target, the MGS gets stuck. In addition, stopping the MGS always triggers a crash (reported in LU-10390). This is concerning for a stable release.
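
      For context, the failover/failback here is just the usual manual migration of a target between its HA peers; a minimal sketch of the sequence (device and mount paths below are placeholders):

      # on the node currently serving the target
      umount /mnt/lustre/ost0
      # on its failover partner
      mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0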

      The MGS got stuck this morning while we were trying to add a new OST:

      [669739.991439] LNet: Service thread pid 136320 was inactive for 200.27s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [669740.010557] Pid: 136320, comm: ll_mgs_0011
      [669740.015223] 
      Call Trace:
      [669740.019798]  [<ffffffff816a94e9>] schedule+0x29/0x70
      [669740.025437]  [<ffffffff816a6f34>] schedule_timeout+0x174/0x2c0
      [669740.032077]  [<ffffffffc0b6bef1>] ? ldlm_run_ast_work+0x1d1/0x3a0 [ptlrpc]
      [669740.039848]  [<ffffffff81098b20>] ? process_timeout+0x0/0x10
      [669740.046276]  [<ffffffffc0b85020>] ? ldlm_expired_completion_wait+0x0/0x240 [ptlrpc]
      [669740.054934]  [<ffffffffc0b85811>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [669740.062704]  [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
      [669740.069704]  [<ffffffffc138e75c>] mgs_completion_ast_generic+0x5c/0x200 [mgs]
      [669740.077777]  [<ffffffffc0b6a6bc>] ? ldlm_lock_create+0x1fc/0xa30 [ptlrpc]
      [669740.085451]  [<ffffffffc138e973>] mgs_completion_ast_config+0x13/0x20 [mgs]
      [669740.093331]  [<ffffffffc0b87730>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
      [669740.101394]  [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
      [669740.109372]  [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      [669740.116950]  [<ffffffffc139335c>] mgs_revoke_lock+0xfc/0x370 [mgs]
      [669740.123956]  [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      [669740.131534]  [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
      [669740.139498]  [<ffffffffc1393ae5>] mgs_target_reg+0x515/0x1370 [mgs]
      [669740.146608]  [<ffffffffc0bbb0b1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
      [669740.154208]  [<ffffffffc0c1dda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [669740.161997]  [<ffffffffc0bc6b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      [669740.170655]  [<ffffffffc0bc3148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [669740.178328]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      [669740.185420]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      [669740.192041]  [<ffffffffc0bca252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [669740.199133]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [669740.205350]  [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
      [669740.211583]  [<ffffffffc0bc97c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [669740.218675]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [669740.224215]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      [669740.229851]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      [669740.235970]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      
      [669740.243363] LustreError: dumping log to /tmp/lustre-log.1518720988.136320
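
      If useful, the binary dump referenced in the last line can be converted to readable text with lctl (output path is arbitrary):

      lctl debug_file /tmp/lustre-log.1518720988.136320 /tmp/lustre-log.1518720988.136320.txt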
      
      

      Clients output something like this:

      [1466043.295178] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
      [1466043.295179] Lustre: Skipped 1 previous similar message
      [1466043.767551] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
      [1466351.198284] LustreError: 368700:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8823eebbb2c0) refcount = 2
      [1466351.242084] LustreError: 368700:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      [1466657.253528] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
      [1466657.299037] LustreError: Skipped 1 previous similar message
      [1466657.317969] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1518719899, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2i
      [1466657.318229] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount nonzero (2) after lock cleanup; fo
      [1466657.318230] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message
      [1466657.318232] LustreError: 372154:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount = 3
      [1466657.318233] LustreError: 372154:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      [1466657.318238] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.51@o2ib5 lock: ffff883193225800/0xe5ac076a284d2d lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 4
      [1466657.318239] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) Skipped 1 previous similar message
      [1466657.318244] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
      
      

      Rebooting the MGS fixes the issue, until the next target failover/failback.
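
      Next time it happens we can capture some state before rebooting; a sketch of what we would run on the MGS (namespace name and output paths may differ):

      # DLM lock/resource counts for the MGS namespace
      lctl get_param ldlm.namespaces.MGS.lock_count ldlm.namespaces.MGS.resource_count
      # dump the Lustre debug buffer to a file
      lctl dk /tmp/mgs-debug.txt
      # dump all task stacks to the kernel log
      echo t > /proc/sysrq-trigger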

      Stephane

Attachments

Issue Links

Activity

People

    Assignee: Hongchao Zhang (hongchao.zhang)
    Reporter: Stephane Thiell (sthiell)
    Votes: 0
    Watchers: 14
