Lustre / LU-3458

OST unable to register with the MGS using a predefined index.

Details

    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.0.0, Lustre 2.1.0, Lustre 2.2.0, Lustre 2.3.0, Lustre 2.4.0, Lustre 1.8.x (1.8.0 - 1.8.5), Lustre 2.5.0
    • None
    • Any Lustre version from 1.6.0 onward with mountconf and an OST prepared with the --index option.
    • 3
    • 8645

    Description

      The client may lose the reply to a target registration request. The MGS nevertheless assumes the reply was delivered and marks the target index as used. Because the client never received the reply, it restarts registration from the beginning after reconnecting.
      The MGS then responds that the index is already in use.
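
The sequence above can be reproduced with a toy model. This is only an illustrative sketch, not Lustre code: the `Mgs` class, `register_over_lossy_link`, and the index value 3 are hypothetical names chosen for the example, and the error value is assumed to be -EADDRINUSE based on the -98 seen in the console log.

```python
import errno


class Mgs:
    """Toy MGS that marks an index as used as soon as it handles a request."""

    def __init__(self):
        self.used_indices = set()

    def register_target(self, index):
        # Return 0 on success or a negative errno, in the style of kernel code.
        if index in self.used_indices:
            return -errno.EADDRINUSE
        # The index is committed here, before the reply is known to be delivered.
        self.used_indices.add(index)
        return 0


def register_over_lossy_link(mgs, index, drop_first_reply):
    """Client side: send a register request; restart it if the reply is lost."""
    rc = mgs.register_target(index)
    if drop_first_reply:
        # The reply never arrives, so after reconnecting the client
        # restarts registration from the beginning with the same index.
        rc = mgs.register_target(index)
    return rc


if __name__ == "__main__":
    print(register_over_lossy_link(Mgs(), 3, drop_first_reply=False))
    print(register_over_lossy_link(Mgs(), 3, drop_first_reply=True))
```

With a reliable link the call returns 0; with the first reply dropped, the retry is rejected because the toy MGS has already recorded the index as used.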

      [ 2619.730706] Lustre: MGC172.18.1.2@tcp: Reactivating import
      [ 2626.816706] Lustre: 56551:0:(client.c:1819:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1370729827/real 1370729827]  req@ffff8806014d5c00 x1437314393833474/t0(0) o253->MGC172.18.1.2@tcp@172.18.1.2@tcp:26/25 lens 4736/4736 e 0 to 1 dl 1370729834 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [ 2626.844904] LustreError: 166-1: MGC172.18.1.2@tcp: Connection to MGS (at 172.18.1.2@tcp) was lost; in progress operations using this service will fail
      [ 2626.896533] Lustre: MGC172.18.1.2@tcp: Reactivating import
      [ 2626.902111] Lustre: MGC172.18.1.2@tcp: Connection restored to MGS (at 172.18.1.2@tcp)
      [ 2629.380926] Lustre: MGC172.18.1.2@tcp: Reactivating import
      [ 2632.367220] LustreError: 15f-b: Communication to the MGS return error -98. Is the MGS running?
      [ 2632.376077] LustreError: 58337:0:(obd_mount.c:1834:server_fill_super()) Unable to start targets: -98
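
The -98 in the last two messages is a negated errno value. As a small aid for reading such logs, the helper below (a hypothetical name, `describe_rc`) maps a kernel-style return code to its symbolic name using the standard `errno` module. It assumes the script runs on Linux, where 98 is EADDRINUSE; errno numbering differs on other platforms.

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel-style return code to its errno name and message."""
    code = abs(rc)
    # errno.errorcode maps numeric values to names on the current platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # On Linux this reports EADDRINUSE, matching "index already used".
    print(describe_rc(-98))
```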
      

      The attached logs describe the bug in detail (log1 is from the MGS side, log2 is from the OSS side; the initial registration xid is x1437620344193026).
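
To follow a single request through both logs, it can help to pull the xid and opcode out of each request descriptor. The sketch below is an assumption-laden example: it assumes the attached logs contain descriptors in the same `x<xid>/t<n>(<n>) o<opcode>->` layout as the console message quoted above, and the function name `parse_request` is hypothetical. Opcode 253 in that message is presumably the target registration request, but that mapping is not confirmed here.

```python
import re

# Matches the "x<xid>/t<n>(<n>) o<opcode>->" part of a request descriptor.
REQ_RE = re.compile(r"x(?P<xid>\d+)/t\d+\(\d+\)\s+o(?P<opc>\d+)->")

# Descriptor fragment copied from the console log in this ticket.
SAMPLE = (
    "req@ffff8806014d5c00 x1437314393833474/t0(0) "
    "o253->MGC172.18.1.2@tcp@172.18.1.2@tcp:26/25 lens 4736/4736"
)


def parse_request(line):
    """Return the xid and opcode found in a log line, or None if absent."""
    match = REQ_RE.search(line)
    if match is None:
        return None
    return {"xid": int(match.group("xid")), "opcode": int(match.group("opc"))}


if __name__ == "__main__":
    print(parse_request(SAMPLE))
```

Filtering both logs for the same xid shows whether the MGS handled the request that the OSS later reported as timed out.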

      The bug occurs because the MGS does not schedule a reply for the target registration command and assumes the client always receives the reply. It was originally hit on the Xyratex b_neo_stable branch (mostly the 2.1 codebase), but a quick look suggests it also exists in 2.4.
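
One way to make registration tolerant of a lost reply is to make it idempotent: remember which target claimed each index and answer a repeated request from the same target with success. The sketch below only illustrates that idea. It is not the fix proposed in this ticket or the way the MGS is implemented; `IdempotentMgs` and the target names are hypothetical.

```python
import errno


class IdempotentMgs:
    """Toy MGS that remembers which target owns each index."""

    def __init__(self):
        self.index_owner = {}  # index -> name of the target that claimed it

    def register_target(self, target_name, index):
        owner = self.index_owner.get(index)
        if owner is None:
            # First registration for this index: record the owner.
            self.index_owner[index] = target_name
            return 0
        if owner == target_name:
            # Same target asking again, e.g. after a lost reply: answer again.
            return 0
        # A different target is trying to claim an index that is taken.
        return -errno.EADDRINUSE


if __name__ == "__main__":
    mgs = IdempotentMgs()
    print(mgs.register_target("target-a", 3))  # first attempt, reply lost
    print(mgs.register_target("target-a", 3))  # retry after reconnect
    print(mgs.register_target("target-b", 3))  # genuine conflict
```

The trade-off is that the server must be able to identify the requester reliably; tracking reply delivery on the server side, as the description suggests, avoids that requirement.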

      Attachments

        1. log1
          6.34 MB
          Alexey Lyashkov
        2. log2
          5.78 MB
          Alexey Lyashkov

            People

              wc-triage WC Triage
              shadow Alexey Lyashkov