Lustre / LU-3458

OST not able to register at MGS with predefined index.

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.0.0, Lustre 2.1.0, Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 1.8.x (1.8.0 - 1.8.5), Lustre 2.5.0
    • Labels: None
    • Environment: Any Lustre from 1.6.0 with mountconf and an OST prepared with the --index option.
    • Severity: 3
    • 8645

    Description

      The client may lose the reply to a target registration request while the MGS believes the reply was delivered and marks the target index as used. Since the client never received the reply, it restarts registration from the beginning after reconnecting, and the MGS then responds that the index is already in use.

      [ 2619.730706] Lustre: MGC172.18.1.2@tcp: Reactivating import
      [ 2626.816706] Lustre: 56551:0:(client.c:1819:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1370729827/real 1370729827]  req@ffff8806014d5c00 x1437314393833474/t0(0) o253->MGC172.18.1.2@tcp@172.18.1.2@tcp:26/25 lens 4736/4736 e 0 to 1 dl 1370729834 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [ 2626.844904] LustreError: 166-1: MGC172.18.1.2@tcp: Connection to MGS (at 172.18.1.2@tcp) was lost; in progress operations using this service will fail
      [ 2626.896533] Lustre: MGC172.18.1.2@tcp: Reactivating import
      [ 2626.902111] Lustre: MGC172.18.1.2@tcp: Connection restored to MGS (at 172.18.1.2@tcp)
      [ 2629.380926] Lustre: MGC172.18.1.2@tcp: Reactivating import
      [ 2632.367220] LustreError: 15f-b: Communication to the MGS return error -98. Is the MGS running?
      [ 2632.376077] LustreError: 58337:0:(obd_mount.c:1834:server_fill_super()) Unable to start targets: -98
      

      The attached logs describe the bug in detail (log1 is from the MGS side, log2 from the OSS side; the initial registration xid is x1437620344193026).

      The bug hits because the MGS does not save a reply for the target register command and assumes the client always receives it. It was originally hit on the Xyratex b_neo_stable branch (mostly a 2.1 codebase), but a quick look shows the bug exists in 2.4 as well.
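The failure mode can be sketched as a toy simulation (this is not Lustre code; all names here are invented for illustration — the real path involves the MGC resend logic on the client and the target-registration handler on the MGS):

```python
# Toy model of the race described above: the MGS marks the index as used
# and replies, but never saves the reply for resend reconstruction.
EADDRINUSE = 98  # reported as -98 in the OSS log

class ToyMgs:
    def __init__(self):
        self.used_indices = set()

    def handle_target_reg(self, index):
        if index in self.used_indices:
            return -EADDRINUSE       # "index already used"
        self.used_indices.add(index)
        return 0                      # this reply may be lost on the wire

mgs = ToyMgs()
first = mgs.handle_target_reg(2)      # reply lost -> client never sees 0
retry = mgs.handle_target_reg(2)      # client restarts registration
print(first, retry)                   # 0 -98: the OST is now wedged
```

The second call returning -98 with no way forward is exactly the `server_fill_super()) Unable to start targets: -98` seen in the log above.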

      Attachments

        1. log1
          6.34 MB
        2. log2
          5.78 MB

        Issue Links

          Activity


            adilger Andreas Dilger added a comment - I filed LU-17716, which could address this issue. I propose "tunefs.lustre --replace" to allow updating the OST config/label so that it does not think it needs to register with the MGS again.

            shadow Alexey Lyashkov added a comment - One more problem in this area: we have a single last_rcvd per disk storage, so the MGS and MDT either need to share a single last_rcvd file, or we need to rename the file to be able to run several recovery services on a single disk partition.
            pjones Peter Jones added a comment - ok. Thanks Alexey!

            shadow Alexey Lyashkov added a comment - Not currently; I am more interested in fixing LNet issues right now, and a workaround exists. But I would be happy to discuss generic recovery for the MGS, as it is a long-standing task.

            adilger Andreas Dilger added a comment - So, to clarify, I'm not against fixing the MGS/MGC recovery code, but this needs to be done carefully to avoid the problem of complex recovery on the MGS. It should only be done for OSTs connecting during initial registration.

            Shadow, are you planning on working on this problem?

            adilger Andreas Dilger added a comment - There is no longer dynamic index mapping in Lustre 2.4+ because this was never used by real systems where the administrator wants to know which OST index is on a particular node.

            I agree with Alex that last_rcvd would handle the reply reconstruction, but it also means the MGS would need to "recover" if all clients are in the last_rcvd file (which would be bad). The clients should NOT be added to the last_rcvd, only the new servers. Ideally, there should be some way for the OST to remove itself from the last_rcvd file after it has registered and gotten a reply?

            One danger (and the reason this is an error in the first place) is that you don't want multiple OSTs accidentally trying to claim that they are the same OST index. That would cause serious filesystem corruption. That means there should be some way to determine the OST is the right one (e.g. RPC XID) before sending the reconstructed reply. This wouldn't work if both the MGS and OSS rebooted, but is probably better than today.

            I was also wondering if there should also be some way to add a newly-formatted OST to replace an old OST with a "--replace" option to mkfs.lustre, which removes the "LDD_F_VIRGIN" flag and allows it to take over the old OST slot. This isn't directly related, but a similar problem. That would mean at least the administrator has to understand what is happening before adding the OST in an existing index.
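The fix direction in the comment above — remember who claimed each index and reconstruct the reply on a matching resend — can be sketched as follows. This is a hedged illustration, not Lustre code: the identity key (client NID plus RPC XID) and all names are assumptions for the sketch.

```python
# Sketch: on a resend carrying the same (nid, xid) that originally claimed
# the index, reconstruct the successful reply instead of returning -EADDRINUSE.
EADDRINUSE = 98

class ToyMgs:
    def __init__(self):
        self.index_owner = {}            # index -> (nid, xid)

    def handle_target_reg(self, nid, xid, index):
        owner = self.index_owner.get(index)
        if owner is None:
            self.index_owner[index] = (nid, xid)
            return 0
        if owner == (nid, xid):
            return 0                      # resend of the committed request:
                                          # reconstruct the original reply
        return -EADDRINUSE                # a different OST claims the index

mgs = ToyMgs()
assert mgs.handle_target_reg("oss1@tcp", 0x1001, 2) == 0   # reply lost
assert mgs.handle_target_reg("oss1@tcp", 0x1001, 2) == 0   # resend succeeds
assert mgs.handle_target_reg("oss2@tcp", 0x2002, 2) == -EADDRINUSE
```

As the comment notes, an in-memory XID check like this does not survive both the MGS and OSS rebooting; making it robust would require persisting the record, e.g. in a last_rcvd-style file.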

            shadow Alexey Lyashkov added a comment - Looks like we need to implement the long-standing task of simplifying recovery for the MGS. add_target would send a transaction number to the target, and the request would be resent to the MGS if the MGS crashes. If the request is resent but was already committed on the MGS, the reply is reconstructed from the on-disk state. But this may be too hard with dynamic index allocation from the same NID, since the MGS doesn't know about the mapping, and it produces a bitmap leak (though it looks like that bug also exists today).

            bzzz Alex Zhuravlev added a comment - I think the MGS could be able to reconstruct the reply using last_rcvd, but that doesn't cover the case when the MGS crashes and the reply was lost. I guess in the latter case the only way is to remove the new profile with writeconf.
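A rough sketch of the last_rcvd idea from the comment above: per-client slots persisted to disk so a restarted server can replay the committed reply. The field names loosely mimic the spirit of a last_rcvd client record but are invented here, and JSON stands in for the real binary, transactional on-disk format.

```python
# Sketch: persist (last_xid, last_result) per client so a lost reply can be
# reconstructed even across a server restart.
import json, os, tempfile

class ToyLastRcvd:
    def __init__(self, path):
        self.path = path
        self.slots = {}
        if os.path.exists(path):
            with open(path) as f:
                self.slots = json.load(f)

    def record(self, uuid, xid, result):
        self.slots[uuid] = {"last_xid": xid, "last_result": result}
        with open(self.path, "w") as f:
            json.dump(self.slots, f)      # real code writes transactionally

    def reconstruct(self, uuid, xid):
        slot = self.slots.get(uuid)
        if slot and slot["last_xid"] == xid:
            return slot["last_result"]    # replay the committed reply
        return None                       # unseen request: process normally

path = os.path.join(tempfile.mkdtemp(), "last_rcvd")
before = ToyLastRcvd(path)
before.record("OST0002-uuid", 0x1001, 0)  # reply committed, then lost
after = ToyLastRcvd(path)                  # server restarts, reloads the file
assert after.reconstruct("OST0002-uuid", 0x1001) == 0
```

As the comment says, this only helps when the commit made it to disk; if the MGS crashes before that, the reply is simply gone, and writeconf remains the fallback.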

            shadow Alexey Lyashkov added a comment - Xyratex-bug MRP-1111

            People

              Assignee: wc-triage WC Triage
              Reporter: shadow Alexey Lyashkov
              Votes: 1
              Watchers: 9

              Dates

                Created:
                Updated: