
Directory restripe breaking lmv stripe settings

Details


    Description

      While doing several directory restripes at the same time (parent directory only, with "lfs migrate -m -d"), the Lustre client lost its connection, and the MDS logged the typical messages asking to resume the migration process.

      Resuming doesn't work: restarting the migration on the affected directories returns "File descriptor in bad state (77)".

      The current stripe settings of the affected directories show hash types of "none" and "bad_type", e.g.

      lmv_stripe_count: 5 lmv_stripe_offset: 2 lmv_hash_type: none,bad_type
      mdtidx           FID[seq:oid:ver]
           2           [0x280000bd1:0x6fb:0x0]
           0           [0x200000400:0xd16:0x0]
           2           [0x280000402:0xd16:0x0]
           1           [0x240000401:0xd16:0x0]
           3           [0x2c0000402:0xd16:0x0]
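
      To find affected directories in bulk, output like the listing above can be parsed and checked for the "bad_type" flag. This is a minimal sketch that assumes the text layout shown in this report (real `lfs getdirstripe` output may differ between Lustre versions); the function names `parse_getdirstripe` and `looks_broken` are hypothetical helpers, not part of Lustre.

```python
import re

# Sample copied from the report; treat the exact layout as an assumption.
SAMPLE = """\
lmv_stripe_count: 5 lmv_stripe_offset: 2 lmv_hash_type: none,bad_type
mdtidx           FID[seq:oid:ver]
     2           [0x280000bd1:0x6fb:0x0]
     0           [0x200000400:0xd16:0x0]
     2           [0x280000402:0xd16:0x0]
     1           [0x240000401:0xd16:0x0]
     3           [0x2c0000402:0xd16:0x0]
"""

HEADER_RE = re.compile(r"(lmv_\w+):\s*(\S+)")
STRIPE_RE = re.compile(
    r"^\s*(\d+)\s+\[(0x[0-9a-f]+):(0x[0-9a-f]+):(0x[0-9a-f]+)\]\s*$"
)


def parse_getdirstripe(text):
    """Return (header dict, list of (mdt index, FID string)) for one directory."""
    header, stripes = {}, []
    for line in text.splitlines():
        m = STRIPE_RE.match(line)
        if m:
            # Stripe line: MDT index followed by the FID triple.
            stripes.append((int(m.group(1)), "[%s:%s:%s]" % m.groups()[1:]))
        elif line.lstrip().startswith("lmv_"):
            # Header line: several "key: value" pairs on one line.
            header.update(HEADER_RE.findall(line))
    return header, stripes


def looks_broken(header):
    """Hypothetical heuristic: the hash type list contains bad_type."""
    return "bad_type" in header.get("lmv_hash_type", "").split(",")


if __name__ == "__main__":
    header, stripes = parse_getdirstripe(SAMPLE)
    print(header["lmv_hash_type"], len(stripes), looks_broken(header))
```

      The heuristic only covers the state described in this ticket; a directory stuck with other flags would need a different check.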
       

      The directories can be traversed, but they can't be modified (files can be neither added nor removed).

      Several directory trees are affected through multiple levels. Trying to migrate the directories from the top level results in "Invalid argument (22)" instead of -EBADFD.
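
      The two error numbers quoted above can be mapped back to their symbolic names with the Python standard library. This is a small sketch that relies on the local platform's errno tables; on Linux, 77 is expected to resolve to EBADFD and 22 to EINVAL, but the numbering for 77 differs on other operating systems. The helper name `describe_errno` is made up for illustration.

```python
import errno
import os


def describe_errno(code):
    """Return 'NAME (code): message' using the local platform's errno tables."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return "%s (%d): %s" % (name, code, os.strerror(code))


if __name__ == "__main__":
    # 77 and 22 are the values reported by "lfs migrate" in this ticket.
    for code in (77, 22):
        print(describe_errno(code))
```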

      Debug messages on the MDS show that lmv_is_sane() is failing when trying to access the broken directories:

      [Thu Apr  6 09:36:53 2023] LustreError: 1952126:0:(lustre_lmv.h:468:lmv_is_sane2()) insane LMV: magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 migrate offset=1 migrate hash=fnv_1a_64:33554434.
      [Thu Apr  6 09:36:53 2023] LustreError: 1952126:0:(lustre_lmv.h:446:lmv_is_sane()) insane LMV: magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 migrate offset=1 migrate hash=fnv_1a_64:33554434.
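
      The numeric hash values in these messages combine a hash type in the low 16 bits with flag bits above it. The sketch below decodes the two values from the log. Only the pairs that can be read off this ticket are certain (0 = none, 2 = fnv_1a_64, 0x20000000 = bad_type); the mask and the values for "fixed" and "migrating" are assumptions believed to match lustre_user.h and should be checked against the source tree in use. `decode_hash` is a hypothetical helper.

```python
# Assumed constants; verify against lustre_user.h before relying on them.
LMV_HASH_TYPE_MASK = 0x0000FFFF

# Hash types visible in this ticket's log output.
HASH_TYPES = {0: "none", 2: "fnv_1a_64"}

HASH_FLAGS = {
    0x02000000: "fixed",      # assumption, not shown by name in the ticket
    0x20000000: "bad_type",   # matches "hash=none:0x20000000" in the log
    0x80000000: "migrating",  # assumption for the flag name seen in comments
}


def decode_hash(value):
    """Split an LMV hash value into its type name and flag names."""
    parts = [HASH_TYPES.get(value & LMV_HASH_TYPE_MASK, "unknown")]
    flags = value & ~LMV_HASH_TYPE_MASK
    for bit, label in sorted(HASH_FLAGS.items()):
        if flags & bit:
            parts.append(label)
            flags &= ~bit
    if flags:
        # Report any bits this sketch does not know about as raw hex.
        parts.append(hex(flags))
    return ",".join(parts)


if __name__ == "__main__":
    print(decode_hash(0x20000000))  # "hash" field from the log
    print(decode_hash(33554434))    # "migrate hash" field from the log
```

      Under these assumptions the directory's own hash has type "none" with only bad_type set, while the migrate hash (33554434 = 0x02000002) still names fnv_1a_64.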
       

      Removing the directories is also impossible ("cannot remove <path>: invalid argument").

      LFSCK only reports "striped_shards_skipped", but doesn't repair anything.

      The migration might have been started in parallel on several levels within a directory tree, which may have caused races or corruption.

      Affected directories always form a complete path, from the first level below the root down to a directory with no further subdirectories, e.g.

      /root/dir1/dir2/dir3, with directories dir1, dir2 and dir3 being affected.

      Is there any way to repair or remove the affected files? Data loss is not an issue for us, as this is a pre-production test setup. I want to report the issue anyway, because the problem occurred immediately on starting the migration.


          Activity

            [LU-16717] Directory restripe breaking lmv stripe settings
            zam Alexander Zarochentsev added a comment - - edited

            I see a 2.15.3-based system in a similar, but not exactly the same, state: an interrupted MDT-to-MDT migration left a directory in an inaccessible state:

            # lfs getdirstripe .
            lmv_stripe_count: 3 lmv_stripe_offset: 0 lmv_hash_type: none,migrating
            mdtidx FID[seq:oid:ver]
            0 [0x20002f9d0:0x6a:0x0]
            1 [0x240025042:0x62:0x0]
            1 [0x240025064:0x8e32:0x0]
            # touch test
            touch: cannot touch 'test': Bad file descriptor
             

            Unlike the case in this ticket, the directory does not have the "bad_type" flag, and the flag did not appear after an LFSCK run. So I think the patches from LU-16717 would not help, as the checks inside them need the "bad_type" flag to be set.


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51243/
            Subject: LU-16717 mdt: resume dir migration with bad_type
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 1c882aebeaac4970c78a3616f1dd96d0920d133f


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51235/
            Subject: LU-16717 mdt: treat unknown hash type as sane type
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: e4208468b65a34c84c20d5d932f35b29f9025722


            "Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51243
            Subject: LU-16717 mdt: resume dir migration with bad_type
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 1b9b0c6f9c50218ab8549fa9795e5dd40f243b2d


            "Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51235
            Subject: LU-16717 mdt: treat unknown hash type as sane type
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: ed44665dcfe40b5ae15d5733b177f434396854de

            pjones Peter Jones added a comment -

            Landed for 2.16


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50797/
            Subject: LU-16717 mdt: resume dir migration with bad_type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 151650e468ab423e831c30d635ea380e0434a122


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50796/
            Subject: LU-16717 mdt: treat unknown hash type as sane type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 05cdb71ba6813570123613993f3cfcf74fc83561

            hxing Xing Huang added a comment -

            2023-05-29: Both patches being worked on are ready to land (on the master-next branch).

            hxing Xing Huang added a comment -

            2023-05-20: Two patches are being worked on. The first is under review; the other has passed code review and depends on the first.


            People

              laisiyao Lai Siyao
              keller Patrick Keller