Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.3, Lustre 2.10.4
    • Labels: None
    • Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3
    • 9223372036854775807

    Description

      We keep having issues with the MGS since the upgrade from 2.9 to 2.10 LTS. As soon as we fail over or fail back a target, the MGS appears to get stuck. Additionally, stopping the MGS always triggers a crash (reported in LU-10390). This is concerning for a stable release.

      The MGS got stuck this morning when we tried to add a new OST:

      [669739.991439] LNet: Service thread pid 136320 was inactive for 200.27s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [669740.010557] Pid: 136320, comm: ll_mgs_0011
      [669740.015223] 
      Call Trace:
      [669740.019798]  [<ffffffff816a94e9>] schedule+0x29/0x70
      [669740.025437]  [<ffffffff816a6f34>] schedule_timeout+0x174/0x2c0
      [669740.032077]  [<ffffffffc0b6bef1>] ? ldlm_run_ast_work+0x1d1/0x3a0 [ptlrpc]
      [669740.039848]  [<ffffffff81098b20>] ? process_timeout+0x0/0x10
      [669740.046276]  [<ffffffffc0b85020>] ? ldlm_expired_completion_wait+0x0/0x240 [ptlrpc]
      [669740.054934]  [<ffffffffc0b85811>] ldlm_completion_ast+0x5b1/0x920 [ptlrpc]
      [669740.062704]  [<ffffffff810c4810>] ? default_wake_function+0x0/0x20
      [669740.069704]  [<ffffffffc138e75c>] mgs_completion_ast_generic+0x5c/0x200 [mgs]
      [669740.077777]  [<ffffffffc0b6a6bc>] ? ldlm_lock_create+0x1fc/0xa30 [ptlrpc]
      [669740.085451]  [<ffffffffc138e973>] mgs_completion_ast_config+0x13/0x20 [mgs]
      [669740.093331]  [<ffffffffc0b87730>] ldlm_cli_enqueue_local+0x230/0x860 [ptlrpc]
      [669740.101394]  [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
      [669740.109372]  [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      [669740.116950]  [<ffffffffc139335c>] mgs_revoke_lock+0xfc/0x370 [mgs]
      [669740.123956]  [<ffffffffc0b8ae00>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
      [669740.131534]  [<ffffffffc138e960>] ? mgs_completion_ast_config+0x0/0x20 [mgs]
      [669740.139498]  [<ffffffffc1393ae5>] mgs_target_reg+0x515/0x1370 [mgs]
      [669740.146608]  [<ffffffffc0bbb0b1>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
      [669740.154208]  [<ffffffffc0c1dda5>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [669740.161997]  [<ffffffffc0bc6b16>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      [669740.170655]  [<ffffffffc0bc3148>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [669740.178328]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      [669740.185420]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      [669740.192041]  [<ffffffffc0bca252>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [669740.199133]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [669740.205350]  [<ffffffff816a8f00>] ? __schedule+0x2f0/0x8b0
      [669740.211583]  [<ffffffffc0bc97c0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [669740.218675]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      [669740.224215]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      [669740.229851]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      [669740.235970]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      
      [669740.243363] LustreError: dumping log to /tmp/lustre-log.1518720988.136320
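      The binary debug dump referenced in the last line can usually be converted to readable text with lctl debug_file; a minimal sketch (the output filename is just an example):

      # decode the auto-dumped binary Lustre debug log into plain text
      lctl debug_file /tmp/lustre-log.1518720988.136320 /tmp/lustre-log.1518720988.136320.txt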
      
      

      Clients output something like this:

      [1466043.295178] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
      [1466043.295179] Lustre: Skipped 1 previous similar message
      [1466043.767551] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) Skipped 1 previous similar message
      [1466351.198284] LustreError: 368700:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff8823eebbb2c0) refcount = 2
      [1466351.242084] LustreError: 368700:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      [1466657.253528] LustreError: 166-1: MGC10.0.2.51@o2ib5: Connection to MGS (at 10.0.2.51@o2ib5) was lost; in progress operations using this service will fail
      [1466657.299037] LustreError: Skipped 1 previous similar message
      [1466657.317969] LustreError: 5993:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1518719899, 300s ago), entering recovery for MGS@MGC10.0.2.51@o2ib5_0 ns: MGC10.0.2.51@o2i
      [1466657.318229] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) MGC10.0.2.51@o2ib5: namespace resource [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount nonzero (2) after lock cleanup; fo
      [1466657.318230] LustreError: 372154:0:(ldlm_resource.c:1100:ldlm_resource_complain()) Skipped 1 previous similar message
      [1466657.318232] LustreError: 372154:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource: [0x6b616f:0x2:0x0].0x0 (ffff883aca373200) refcount = 3
      [1466657.318233] LustreError: 372154:0:(ldlm_resource.c:1703:ldlm_resource_dump()) Waiting locks:
      [1466657.318238] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) ### ### ns: MGC10.0.2.51@o2ib5 lock: ffff883193225800/0xe5ac076a284d2d lrc: 4/1,0 mode: --/CR res: [0x6b616f:0x2:0x0].0x0 rrc: 4
      [1466657.318239] LustreError: 372154:0:(ldlm_resource.c:1705:ldlm_resource_dump()) Skipped 1 previous similar message
      [1466657.318244] Lustre: MGC10.0.2.51@o2ib5: Connection restored to MGC10.0.2.51@o2ib5_0 (at 10.0.2.51@o2ib5)
      
      

      Rebooting the MGS fixes the issue, until the next target failover/failback.
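      For context, the operations involved here are just the usual manual target failover/failback and an MGS restart; a rough sketch, with placeholder device paths and mount points:

      # fail one OST over to its partner node (umount on the primary OSS,
      # mount on the failover OSS); the paths below are placeholders
      umount /mnt/lustre/ost0
      mount -t lustre /dev/mapper/ost0 /mnt/lustre/ost0

      # restarting the MGS amounts to remounting the MGT (or rebooting the node)
      umount /mnt/lustre/mgt
      mount -t lustre /dev/mapper/mgt /mnt/lustre/mgt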

      Stephane
       

      Attachments

        Issue Links

          Activity

            [LU-10674] MGS very unstable in 2.10.x

            Hello Hongchao Zhang,

            I'm not sure the patch you pointed to will really fix this problem.

            If I understand correctly, the problem should occur when a client is dead but not yet evicted by the MGS while an IR (Imperative Recovery) event is in progress. I tested that scenario multiple times without being able to reproduce it. In practice, I either saw the MGS evict the client, or the revoke return an error, but nothing stayed stale.

            6895:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can't take cfg lock for 0x736d61726170/0x3 : rc = -11
            degremoa Aurelien Degremont (Inactive) added a comment - edited
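            A rough sketch of that kind of test, assuming a standard failover pair (device paths, mount points and the grep target are illustrative): make one client "dead" without a clean unmount, then force a target failover so the MGS has to revoke the config lock.

            # on one client: simulate a dead client (hard crash, no unmount)
            echo 1 > /proc/sys/kernel/sysrq
            echo c > /proc/sysrq-trigger

            # then fail one OST over/back between its OSS pair (umount on the
            # primary, mount -t lustre on the partner) and watch the MGS console
            # for the revoke either blocking or failing with -EAGAIN (-11):
            dmesg | grep mgs_revoke_lock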

            this issue could be caused by the same reason in LU-13356 (the patch is https://review.whamcloud.com/37880)

            hongchao.zhang Hongchao Zhang added a comment

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34958
            Subject: LU-10674 ldlm: only check granted plain locks
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bd8a05bde10b4e8a22f941522e06b033c1c88a68

            gerrit Gerrit Updater added a comment
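            For anyone wanting to test it, a change like this can normally be pulled straight from Gerrit, assuming the standard refs/changes layout (change 34958, patch set 1):

            # in an existing fs/lustre-release checkout
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/58/34958/1
            git cherry-pick FETCH_HEAD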

            Hi Bruno and John,

            For this case, all targets are running Lustre 2.10.3 RC1.

            Because this system running 2.10 has been quite unstable lately, we don't want to trigger new problems right now, but next time I have to fail over/fail back, I'll try to gather more Lustre debug logs! Thanks!

            sthiell Stephane Thiell added a comment
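            A minimal sketch of capturing such debug logs around a failover, assuming the standard lctl tooling (run on the MGS, the OSS pair involved and at least one affected client; file names are examples):

            lctl set_param debug=-1       # enable all debug flags
            lctl set_param debug_mb=1024  # enlarge the in-memory debug buffer (MB)
            lctl clear                    # drop anything already buffered
            # ... perform the target failover/failback, wait for the hang ...
            lctl dk /tmp/lustre-debug.$(hostname).$(date +%s).log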

            Hi Stephane,

            Could you describe the interop situation here? What version is the MGS and what versions are the targets?

            jhammond John Hammond added a comment

            Hello Stephane!
            We can try to reproduce in-house, but since you appear to be able to reproduce this easily, maybe you could provide a full Lustre debug log from all sides when causing a target failover/failback?

            Also, regarding "stopping the MGS always triggers a crash (reported in LU-10390)": again, we can try to reproduce in-house, but maybe you already have a crash dump available that could be analyzed as a first step?

            bfaccini Bruno Faccini (Inactive) added a comment
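            If a kdump vmcore is available, a first pass with the crash utility would look something like this (the vmcore path is a placeholder; the vmlinux comes from the kernel-debuginfo package matching the kernel listed in the ticket details):

            crash /usr/lib/debug/lib/modules/3.10.0-693.2.2.el7_lustre.pl1.x86_64/vmlinux \
                  /var/crash/<host-timestamp>/vmcore
            crash> log          # kernel messages leading up to the crash
            crash> bt           # backtrace of the panicking task
            crash> foreach bt   # all task backtraces (look for ll_mgs_* threads)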

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 14
