LU-16717

Directory restripe breaking lmv stripe settings

Details


    Description

While doing several directory restripes (parent dir only, with "lfs migrate -m -d") at the same time, the Lustre client lost its connection, and the MDS logged the typical messages asking to resume the migration process.
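(For context, a parent-only restripe as described here is started with a command along these lines; the MDT index and path below are placeholders, and per the description the -d option is what limits the migration to the parent directory itself. The second command shows the resulting layout, presumably how the listing further down was obtained:)

# lfs migrate -m 0 -d /mnt/lustre/project/dir1
# lfs getdirstripe /mnt/lustre/project/dir1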

Resuming doesn't work; restarting the migration on the affected directories returns "File descriptor in bad state (77)".

The current stripe settings of the affected directories show hash types of "none" and "bad_type", e.g.

lmv_stripe_count: 5
lmv_stripe_offset: 2
lmv_hash_type: none,bad_type
      mdtidx           FID[seq:oid:ver]
           2           [0x280000bd1:0x6fb:0x0]
           0           [0x200000400:0xd16:0x0]
           2           [0x280000402:0xd16:0x0]
           1           [0x240000401:0xd16:0x0]
           3           [0x2c0000402:0xd16:0x0]
       

The directories can be traversed, but changes to them (adding or removing files) can't be made.

Several directory trees are affected across multiple levels. Trying to migrate the directories from the top level results in "Invalid argument (22)" instead of -EBADFD.

      Debug messages on the MDS show that lmv_is_sane() is failing when trying to access the broken directories:

[Thu Apr  6 09:36:53 2023] LustreError: 1952126:0:(lustre_lmv.h:468:lmv_is_sane2()) insane LMV: magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 migrate offset=1 migrate hash=fnv_1a_64:33554434.
      [Thu Apr  6 09:36:53 2023] LustreError: 1952126:0:(lustre_lmv.h:446:lmv_is_sane()) insane LMV: magic=0xcd20cd0 count=5 index=1 hash=none:0x20000000 version=3 migrate offset=1 migrate hash=fnv_1a_64:33554434.
       

Removing the directories is also impossible ("cannot remove <path>: invalid argument").

LFSCK only reports "striped_shards_skipped" counts, but doesn't repair anything.
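(For reference, a namespace LFSCK of the kind referred to above would typically be started and its counters checked on the MDS roughly as follows; the filesystem name "testfs" and the MDT index are placeholders, not the actual setup. The lfsck_namespace output contains the striped_shards_* counters mentioned here:)

# lctl lfsck_start -M testfs-MDT0000 -t namespace
# lctl get_param mdd.testfs-MDT0000.lfsck_namespace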

The migration might have been started in parallel on several levels within a directory tree, maybe causing races or corruption.

If directories are affected, the whole path is affected, from root+1 down to a leaf directory with no further subdirectories, e.g.

      /root/dir1/dir2/dir3 with directories dir1 dir2 dir3 being affected.
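(One rough way to enumerate the affected directories from a client, using the /root path from the example above, is to check each directory's layout for "bad_type"; this is only a sketch:)

# find /root -type d -exec sh -c 'lfs getdirstripe "$1" 2>/dev/null | grep -q bad_type && echo "$1"' sh {} \;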

Is there any way to repair or remove the affected files? Data loss is not an issue for us, as this is a pre-production test setup, but I want to report the issue anyway because the problem occurred immediately on starting the migration.


          Activity

            [LU-16717] Directory restripe breaking lmv stripe settings
            hxing Xing Huang added a comment -

            2023-05-08: Two patches being worked on.

            laisiyao Lai Siyao added a comment -

            Patrick, I just pushed two patches to address the issues listed above, will you try the latter one https://review.whamcloud.com/c/fs/lustre-release/+/50797 to see if it can fix your issue?
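(For anyone wanting to build a test version, the usual Gerrit workflow for fetching patch set 1 of change 50797 looks roughly like this; it assumes an existing clone of fs/lustre-release:)

# git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/97/50797/1
# git checkout FETCH_HEAD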


            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50797
            Subject: LU-16717 mdt: resume dir migration with bad_type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4054056c6268ee47850f54a57019b819f108738f


            "Lai Siyao <lai.siyao@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50796
            Subject: LU-16717 mdt: treat unknown hash type as sane type
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: be2d62766e7a0d75846c9482f8105fd5ae7bf461


adilger Andreas Dilger added a comment -

Lai, I've always wondered about cases like this (e.g. bad hash type). It should be possible for the MDS to "complete" the migration in any case. At worst it could pick a new hash type and move more entries around, but at least it would get the directory out of the "MIGRATION" state.
            laisiyao Lai Siyao added a comment -

Resuming directory migration is not supported when the hash type is bad. Under this condition, the user can still access this directory and its sub-files, but can't unlink them, because the server code doesn't support name lookup with a bad hash (it should try each stripe, as is done on the client).

            I will look into these two issues:

1. Upon a bad hash type, directory migration can try to resume anyway; there is a good chance of success.
2. Support unlinking of such directories and files.

keller Patrick Keller added a comment -

I've attached logs (debug_migrate_rmdir_leafdir) that show the debug output when trying to migrate and remove an affected directory. I ran the command on the MDS with MDT0002 because I think that's where the issue occurs.

I created the directory list that I used for directory migration after the first "rc = -22" errors appeared, so the issue might have been present before restriping of the directories even started. The FIDs affected at that time can't be resolved by lfs fid2path right now due to -EBADFD.

I also did DoM restriping earlier on the same day, so that might actually be the root cause of the issue. I attached the kernel messages for Lustre from the client doing the DoM migration in dmesg_dommigrate. I left the dump_lsm() messages in for completeness, although they seem to be redundant.
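(For reference, the fid2path lookup mentioned above is run from a client roughly like this; the mount point /mnt/lustre is a placeholder, and the FID is the first stripe FID from the layout listing in the description:)

# lfs fid2path /mnt/lustre "[0x280000bd1:0x6fb:0x0]"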


adilger Andreas Dilger added a comment -

keller, thanks for the background, and I appreciate the bug report. I just wanted to make sure that you weren't intentionally trying to restripe the whole filesystem to fully-striped directories, since we've had a number of significant performance problems when users think that will somehow improve performance, and I wanted to steer you away from that. As I wrote, the 2.15 MDT space balancing and round-robin directory creation have proven very useful.

            In case we aren't able to reproduce the issue here, it would definitely be useful to collect full debug logs from the MDS (assuming you have a client mounted there also):

            # lctl set_param debug=all debug_mb=1024
            # lctl clear
            # lctl mark "migrate"
            # lfs migrate -m -d ... <dir>
            # lctl mark "rmdir"
            # rmdir <dir>
            # lctl dk /tmp/debug.migrate.txt
            

            along with any console errors from the MDS from the time of original migrate error.


keller Patrick Keller added a comment -

The intent was to test the behavior of the file system when doing a lot of directory restriping, to see if that's something that could be done in a production environment for rebalancing MDTs, as well as specifically trying to test the impact on single-directory metadata performance.

            For production, we will set the round-robin depth to go into the first level within project-specific directories for initial load distribution and will let the load balancing do the rest.

The ticket is only supposed to point at a possible issue that came up during restriping and/or LFSCK, and is not supposed to lead to any new feature requests.


adilger Andreas Dilger added a comment -

keller,
            I haven't looked into the details of this issue yet, but wanted to take a step back to ask what your goal for the directory restripe was, so that the resolution can be geared toward addressing the issue in the right way. Were you restriping the whole directory tree from 1-stripe to 4-stripe directories? IMHO, that would be counter-productive in terms of performance and reliability vs. using 1-stripe directories for all but the very largest directories (1M entries or more).

            For Lustre 2.15 there is automatic MDT space balancing (should keep MDT usage within about 5% of each other). By default the top 3 levels of the directory tree have round-robin MDT directory creation to better utilize all MDTs in the system. This will create 1-stripe directories on remote MDTs to ensure all of the MDTs are being used. The round-robin MDT allocation can be enabled deeper into the directory tree with "lfs setdirstripe -D -c 1 -i -1 --max-inherit-rr=N <dir>", if needed, but the defaults should provide reasonable MDT usage out of the box. With the MDT space balancing there is very little need to use striped directories except for huge single directories.
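(As an illustration of the defaults described above, the current default directory layout can be inspected, and the inherited round-robin policy from the command above applied, roughly as follows; the mount point, project directory, and the depth of 3 are placeholders:)

# lfs getdirstripe -D /mnt/lustre
# lfs setdirstripe -D -c 1 -i -1 --max-inherit-rr=3 /mnt/lustre/projects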


            The "lfsck_striped_dir.c" documentation suggests that the "LMV_HASH_FLAG_BAD_TYPE" flag has been introduced to distinguish between cases with a valid hash on master LMV EA and the ones where anyone needs to be trusted because there is no valid hash found yet.

            However, in my case there seems to be no valid hash to be found anywhere. There should be a way to ignore LMV_HASH_FLAG_BAD_TYPE in order to remove the file. As of right now, it seems the directory is marked read-only as it is supposed to for "LMV_HASH_TYPE_UNKNOWN" but there is no exception to still be able to delete it.

             

            I have a few debug outputs that show lfsck trying to fix directories with bad name hashes (flag = 4), which seem to show the behavior described above. I assume that lfsck set the bad_type flag but can't remove it.
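(For completeness, one way to look at the raw on-disk LMV EA, rather than the client-side view, is to dump the trusted.lmv xattr of the directory's master object on the MDT backend; the mount point below is hypothetical and assumes the ldiskfs MDT is mounted for inspection. Per the console message earlier in the ticket, hash=none:0x20000000 appears to correspond to the "none,bad_type" value shown by lfs getdirstripe:)

# getfattr -n trusted.lmv -e hex /mnt/mdt2-ldiskfs/ROOT/dir1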


            People

              laisiyao Lai Siyao
              keller Patrick Keller
  Votes: 0
  Watchers: 9
