  1. Lustre
  2. LU-16717

Directory restripe breaking lmv stripe settings



    Description

      While running several directory restripes at the same time (parent directory only, with "lfs migrate -m -d"), the Lustre client lost its connection, and the MDS logged the typical messages asking to resume the migration process.

      Resuming doesn't work: starting the migration process again on the affected directories returns "File descriptor in bad state (77)".

      The current stripe settings of the affected directories show hash types of "none" and "bad_type", e.g.

      lmv_stripe_count: 5 lmv_stripe_offset: 2 lmv_hash_type: none,bad_type
      mdtidx           FID[seq:oid:ver]
           2           [0x280000bd1:0x6fb:0x0]
           0           [0x200000400:0xd16:0x0]
           2           [0x280000402:0xd16:0x0]
           1           [0x240000401:0xd16:0x0]
           3           [0x2c0000402:0xd16:0x0]
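
To check many directories for this state, a listing like the one above can be parsed mechanically. The following is a minimal sketch, not a supported tool: it assumes the text layout shown in this report, and the function names `parse_getdirstripe` and `looks_broken` are hypothetical. Treating "none" on a multi-stripe directory as suspicious is a heuristic drawn from this report, not a documented rule.

```python
import re

# Sample listing copied from this report.
SAMPLE = """\
lmv_stripe_count: 5 lmv_stripe_offset: 2 lmv_hash_type: none,bad_type
mdtidx           FID[seq:oid:ver]
     2           [0x280000bd1:0x6fb:0x0]
     0           [0x200000400:0xd16:0x0]
     2           [0x280000402:0xd16:0x0]
     1           [0x240000401:0xd16:0x0]
     3           [0x2c0000402:0xd16:0x0]
"""

HEADER_RE = re.compile(
    r"lmv_stripe_count:\s*(\d+)\s+lmv_stripe_offset:\s*(-?\d+)\s+lmv_hash_type:\s*(\S+)"
)
STRIPE_RE = re.compile(
    r"^[ \t]*(\d+)[ \t]+(\[0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:0x[0-9a-fA-F]+\])", re.M
)


def parse_getdirstripe(text):
    """Parse one directory's stripe listing into a dict (hypothetical helper)."""
    header = HEADER_RE.search(text)
    if header is None:
        raise ValueError("no LMV header found")
    # The hash field is "<type>[,<flag>...]", e.g. "none,bad_type".
    hash_parts = header.group(3).split(",")
    return {
        "stripe_count": int(header.group(1)),
        "stripe_offset": int(header.group(2)),
        "hash_type": hash_parts[0],
        "hash_flags": hash_parts[1:],
        "stripes": [(int(idx), fid) for idx, fid in STRIPE_RE.findall(text)],
    }


def looks_broken(layout):
    """Heuristic from this report: bad_type flag, or no hash on a striped dir."""
    if "bad_type" in layout["hash_flags"]:
        return True
    return layout["hash_type"] == "none" and layout["stripe_count"] > 1


layout = parse_getdirstripe(SAMPLE)
print(layout["hash_type"], layout["hash_flags"], looks_broken(layout))
```

Feeding each directory's listing through `looks_broken` would give a list of directories to inspect further.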
       

      The directories can be traversed, but they cannot be modified (adding or removing files fails).

      Several directory trees are affected across multiple levels. Trying to migrate the directories from the top level results in "Invalid argument (22)" instead of -EBADFD.
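
For reference, the two error numbers quoted in this report correspond to the following symbolic names on Linux. This is a small lookup sketch with the values hard-coded from the Linux errno headers. The helper `describe` is hypothetical and only covers the two codes seen here.

```python
# Linux errno values for the two errors seen in this report.
LINUX_ERRNO = {
    22: ("EINVAL", "Invalid argument"),
    77: ("EBADFD", "File descriptor in bad state"),
}


def describe(code):
    """Accept a positive errno or a negative kernel-style return value."""
    number = abs(code)
    name, message = LINUX_ERRNO[number]
    return f"{name} ({number}): {message}"


print(describe(77))
print(describe(-22))
```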

      Debug messages on the MDS show that lmv_is_sane() is failing when trying to access the broken directories:

      [Thu Apr  6 09:36:53 2023] LustreError: 1952126:0:(lustre_lmv.h:468:lmv_is_sane2()) insane LMV: magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 migrate offset=1 migrate hash=fnv_1a_64:33554434.
      [Thu Apr  6 09:36:53 2023] LustreError: 1952126:0:(lustre_lmv.h:446:lmv_is_sane()) insane LMV: magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 migrate offset=1 migrate hash=fnv_1a_64:33554434.
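
The numeric hash fields in these messages can be decoded into a hash type plus flag bits. The sketch below is illustrative only: the mask, type names, and flag values are assumptions recalled from Lustre's `lustre_user.h` and may differ between releases, so verify them against the source tree in use. The names `decode_hash` and `parse_insane_lmv` are hypothetical.

```python
import re

# ASSUMED constants (recalled from lustre_user.h; verify for your release).
LMV_HASH_TYPE_MASK = 0x0000FFFF
HASH_TYPES = {0: "none", 1: "all_char", 2: "fnv_1a_64", 3: "crush"}
HASH_FLAGS = {
    0x02000000: "fixed",
    0x04000000: "merge",
    0x08000000: "split",
    0x10000000: "lost_lmv",
    0x20000000: "bad_type",
    0x80000000: "migration",
}

LOG_RE = re.compile(
    r"insane LMV: magic=(?P<magic>0x[0-9a-f]+) count=(?P<count>\d+) "
    r"index=(?P<index>\d+) hash=[^:]+:(?P<hash>0x[0-9a-f]+) "
    r"version=(?P<version>\d+) migrate offset=(?P<migrate_offset>\d+) "
    r"migrate hash=[^:]+:(?P<migrate_hash>\d+)"
)


def decode_hash(value):
    """Split a raw hash field into (type name, list of flag names)."""
    name = HASH_TYPES.get(value & LMV_HASH_TYPE_MASK, "unknown")
    flags = [label for bit, label in sorted(HASH_FLAGS.items()) if value & bit]
    return name, flags


def parse_insane_lmv(line):
    """Extract the numeric fields of an 'insane LMV' message as integers."""
    match = LOG_RE.search(line)
    if match is None:
        raise ValueError("not an 'insane LMV' message")
    # int(x, 0) accepts both the hex ("0x...") and decimal fields.
    return {key: int(val, 0) for key, val in match.groupdict().items()}


LINE = (
    "LustreError: 1952126:0:(lustre_lmv.h:446:lmv_is_sane()) insane LMV: "
    "magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 "
    "migrate offset=1 migrate hash=fnv_1a_64:33554434."
)
fields = parse_insane_lmv(LINE)
print(decode_hash(fields["hash"]))
print(decode_hash(fields["migrate_hash"]))
```

Under these assumed constants, the `hash` field decodes to type "none" with the "bad_type" flag, which agrees with the "none,bad_type" shown in the stripe listing. The migrate hash 33554434 is 0x02000002, which decodes to "fnv_1a_64" with the flag bit 0x02000000 set.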
       

      Removing the directories is also impossible ("cannot remove <path>: invalid argument").

      LFSCK only reports "striped_shards_skipped" and doesn't repair anything.

      The migration might have been started in parallel on several levels within a directory tree, possibly causing races or corruption.

      Where directories are affected, they form a complete affected path, from the first level below the root down to a directory with no further subdirectories, e.g.

      /root/dir1/dir2/dir3 with directories dir1 dir2 dir3 being affected.

      Is there any way to repair or remove the affected files? Data loss is not an issue for us, as this is a pre-production test setup. I want to report the issue anyway, because the problem occurred immediately on starting the migration.

      Attachments

        1. debug_migrate_rmdir_leafdir.txt
          1.10 MB
          Patrick Keller
        2. dmesg_dommigrate.out
          5.44 MB
          Patrick Keller


            People

              laisiyao Lai Siyao
              keller Patrick Keller
