[LU-13417] DNE3: mkdir() automatically create remote directory on MDS which has more space

Details

    • Improvement
    • Resolution: Fixed
    • Major
    • Lustre 2.15.0
    • Lustre 2.13.0

    Description

      Since the patches from LU-11213 landed for 2.13.0, I thought "lfs setdirstripe -i -1 /mnt/lustre" on the root or other existing directory would allow creation of remote directories on other MDTs using plain "mkdir" commands. This is different than the case of "lfs setdirstripe -i -1 -c N /mnt/lustre" selecting stripes on less-full MDTs that was landed via patch https://review.whamcloud.com/35825 "LU-12624 lod: alloc dir stripes by QoS", but this patch also removed the "space" hash, so I thought that regular mkdir of a directory could be allowed to balance across MDTs?

      However, I can't seem to get this to work. On current master (2.13.52-259, just before 2.13.53) I'm not able to use "lfs setdirstripe -i -1 /path/to/dir" on an existing directory. It seems to select the less-full MDT if I explicitly run "lfs mkdir -i -1" for a new directory, but that was also true in 2.12 using patch https://review.whamcloud.com/30598 "LU-10277 utils: 'lfs mkdir -i -1' pick the less full MDTs", so it isn't clear how to enable the LU-11213 functionality to balance directories across MDTs?

      There should be a way for "mkdir(2)" from a normal application (not "lfs mkdir -i -1") to create remote (1-stripe) directories in the filesystem, and it should be possible to set this by default on the root directory (per LU-11213). This is critical for being able to use multiple MDTs effectively without users knowing the details of how to configure striped/remote directories manually, or being forced to set all directories as striped (unwelcome due to performance overhead).

      The default mdt_qos_threshold_rr value should be reduced significantly (e.g. 1-2%) and/or modified so that some amount of MDT balancing is active when the filesystem is balanced, at least in the root directory by default. Otherwise, without users understanding the details of DNE, MDT0000 will hold all of the inodes, when it would be better if the top 1 or 2 levels of directories were distributed across MDTs.

      Maybe this is mostly a documentation issue, and the "lfs-setdirstripe.1" man page needs to be updated to be more clear so I can understand what needs to be done to enable this? (also the usage message for setdirstripe/mkdir should remove the "This can only be done on MDT0 with the right of administrator" message.)

          Activity

            laisiyao Lai Siyao added a comment -

            It's strange that replay-dual test 22d always fails, even though it passes in autotest for other patches.

            When I ran replay-dual alone on master, it also failed in autotest.


            gerrit Gerrit Updater added a comment -

            Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40925
            Subject: LU-13417 test: dump replay-dual 22d debug log
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: fedcbc7741ad5928b3c2f40a910c2126f37f060c

            adilger Andreas Dilger added a comment -

            I pushed the above patch as a starting point for getting this working, but it needs some additional work to finish it off.  Hongchao or Lai, can you please finish off that patch so that we can get it included into 2.14.

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38553
            Subject: LU-13417 mdd: default DNE MDT balance on new filesystems
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 46feb12b9d64b366ba8cc0b5b842824add5a23c2
            adilger Andreas Dilger added a comment - - edited

            I was looking at the LMV code and tested to verify that the LU-11213 implementation of "lmv_create()" will already do round-robin allocation across MDTs in a directory with "-D -c 1 -i -1" set, and will only use QOS weight-balanced MDT selection if the MDT space imbalance is over the qos_threshold_rr limit. This is enabled only when a default directory layout is set on the root directory. It will also only be inherited one level down, because lum_stripe_count=1 default layouts are changed to lum_stripe_count=0 in lod_ah_init(), which is no longer considered to be inherited.

            I think that means there is a (hopefully simple) change that can be done to make this functionality more useful for filesystems:

            • explicitly set "trusted.dmv = .lum_magic=LMV_USER_MAGIC, .lum_stripe_count=1, .lum_stripe_index=-1" xattr in mdd_prepare() for newly formatted filesystems
            • one disadvantage is that this will only be inherited for the top-level directory, and would need LU-13440 to be inherited for 2-3 levels
            • another disadvantage is that this needs to be enabled manually by the user on "ROOT/" for existing filesystems

            While I think this will not be perfect, it will be a lot better than defaulting to not using all of the other MDTs unless the user knows to explicitly use "lfs mkdir" and/or "lfs setdirstripe -D" on the filesystem to start using other MDTs.
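The one-level inheritance described in this comment can be illustrated with a small model. This is only a sketch of the behavior as stated above, not Lustre code: `DefaultLayout`, `inherit_default()` and `mkdir_mdt()` are hypothetical names, and a plain round-robin counter stands in for the real MDT selection done in lmv_create().

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, Optional


@dataclass(frozen=True)
class DefaultLayout:
    """Hypothetical stand-in for the default directory layout (trusted.dmv)."""
    stripe_count: int  # models lum_stripe_count
    stripe_index: int  # models lum_stripe_index; -1 means "let the MDS choose"


def inherit_default(parent_default: Optional[DefaultLayout]) -> Optional[DefaultLayout]:
    """Default layout that a newly created subdirectory carries forward."""
    if parent_default is None:
        return None
    if parent_default.stripe_count == 1:
        # As described above: a 1-stripe default becomes stripe_count=0,
        # which is no longer treated as an inheritable default.
        return None
    # Striped defaults (count > 1) are outside the scope of this sketch.
    return parent_default


def mkdir_mdt(parent_mdt: int, parent_default: Optional[DefaultLayout],
              num_mdts: int, rr: Iterator[int]) -> int:
    """MDT index for a new plain directory under the given parent."""
    if parent_default is None:
        return parent_mdt  # no default layout: create on the parent's MDT
    if parent_default.stripe_index == -1:
        # Simplification: plain round-robin stands in for the real selection.
        return next(rr) % num_mdts
    return parent_default.stripe_index  # explicit MDT index requested


rr = count()
root_default = DefaultLayout(stripe_count=1, stripe_index=-1)
top_level = [mkdir_mdt(0, root_default, 4, rr) for _ in range(4)]
print(top_level)  # top-level directories spread over the 4 MDTs
```

In this model a directory created under "ROOT/" is spread across MDTs, but its own subdirectories get no default layout and stay on their parent's MDT, which is why LU-13440 would be needed to extend the effect to 2-3 levels.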


            adilger Andreas Dilger added a comment -

            Peter, this still needs some work to make the remote MDT selection heuristics a bit better.

            For top-level directories, it makes sense to round-robin them when the MDTs are relatively empty, and only pick a specific MDT when they are imbalanced.

            Also, I think the free space threshold needs to be smaller for MDT imbalance than for OST imbalance.
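The heuristic suggested in this comment (round-robin while the MDTs are balanced, a targeted choice once they drift apart) might look roughly like the sketch below. It rests on assumptions and is not the LMV/LOD implementation: `imbalance_pct()` and `choose_mdt()` are hypothetical names, imbalance is taken as the relative gap between the most-free and least-free MDT, and the imbalanced path simply picks the MDT with the most free space instead of doing weighted random selection.

```python
from typing import Sequence, Tuple


def imbalance_pct(free: Sequence[int]) -> float:
    """Relative gap between the most-free and least-free MDT, in percent."""
    most = max(free)
    if most == 0:
        return 0.0  # every MDT is full; treat as balanced to avoid dividing by zero
    return 100.0 * (most - min(free)) / most


def choose_mdt(free: Sequence[int], threshold_pct: float,
               rr_next: int) -> Tuple[int, int]:
    """Return (chosen MDT index, updated round-robin cursor)."""
    if imbalance_pct(free) <= threshold_pct:
        # Balanced enough: hand out MDTs in turn.
        return rr_next % len(free), rr_next + 1
    # Imbalanced: prefer the MDT with the most free space (assumed policy).
    best = max(range(len(free)), key=lambda i: free[i])
    return best, rr_next
```

With free space of `[900, 1000, 1000, 1000]` the gap in this model is 10%, so a 17% threshold keeps round-robin going while a 5% threshold already steers new directories away from the fuller MDT. That is the sense in which a smaller threshold reacts earlier.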
            pjones Peter Jones added a comment -

            Landed for 2.14


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/38160/
            Subject: LU-13417 utils: lfs setdirstripe -D -i -1 should work
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: eeab6942a8dc65dab789c7ca85cc31ba5cee74f3

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38160
            Subject: LU-13417 utils: lfs setstripe -D -i -1 should work
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8da6905fb197ef226ca091f53335189f243bbcbe
            adilger Andreas Dilger added a comment - - edited

            It's because 'lfs setdirstripe -D -i -1 <dir>' is used to delete default stripe... You need to use 'lfs setdirstripe -D -i -1 -c 1' to enable balanced subdirectories creation for plain directories.

            I think this is a user interface bug then, or a bug in how LOD is interpreting the layout. In my testing, "lfs getdirstripe" showed "lmv_stripe_count:0 lmv_stripe_offset:-1 lmv_hash_type:none" was set on the directory, but it didn't affect the creation of directories with "mkdir()".

            I would expect "lfs setdirstripe -d -D $dir" to delete the default layout for a directory, which seems to work, with "-d" already implying "-D" internally, but it is non-obvious because "lmv_stripe_offset:-1" is actually the default value, so "deleting" this layout didn't help.

            I would also expect "lfs setdirstripe -D -i -1" to set the default layout to create remote directories, matching how "lfs setstripe" works. There were other users confused by this as well. The missing part is that specifying only "-i -1" is internally the same as "-c 0", which actually results in the existing layout being reset to the default (local directory creation). I'll push a small patch that makes "-D -i -1" set "-c 1" internally if the stripe count is not specified, so that it doesn't result in unexpected behavior for the user.
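The option handling discussed here can be sketched as follows. This is a hypothetical model, not the actual lfs source: `resolve_default_layout()` and its return tuple are invented for illustration, an option that was not given on the command line is represented as `None`, and the `fixed` flag switches between the behavior reported in this ticket and the proposed one.

```python
from typing import Optional, Tuple


def resolve_default_layout(stripe_index: Optional[int],
                           stripe_count: Optional[int],
                           fixed: bool = True) -> Tuple[str, int, int]:
    """Return (action, stripe_count, stripe_index) for 'lfs setdirstripe -D'."""
    index = -1 if stripe_index is None else stripe_index
    if stripe_count is None:
        # Proposed fix: an explicit "-i -1" without "-c" implies "-c 1".
        count = 1 if (fixed and stripe_index == -1) else 0
    else:
        count = stripe_count
    if count == 0 and index == -1:
        # Both values at their defaults: treated as removing the default layout.
        return ("delete", 0, -1)
    return ("set", count, index)


# "-D -i -1" as reported in this ticket vs. with the proposed change.
print(resolve_default_layout(-1, None, fixed=False))
print(resolve_default_layout(-1, None, fixed=True))
```

In this model the reported behavior falls out of the fact that count 0 with index -1 cannot be told apart from "no default layout", so the request is read as a removal. Passing "-c 1" explicitly sets the layout in both modes.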

            Another issue is that the default "qos_threshold_rr=17%" is too high to start balancing directory creation across MDTs. This might mean that MDT0000 is used for many millions of files and top-level directories before any balancing is even started. At that point it will be very difficult to restore the balance of the MDTs because so many top-level directories and subdirectories have been created on MDT0000. I think it would be better to start space balancing and/or round-robin MDT selection for root directory entries right away if "lmv_stripe_count:1 lmv_stripe_offset:-1" is set on "ROOT/" (which I think we should make the default for 2.14). If there is only a single MDT then this is no change to behavior, but for multiple MDTs it will start using all MDTs right away at the root level and prevent the MDTs from becoming unbalanced in the first place. If we special-case MDT0000 to be RR/balanced immediately, then a smaller qos_threshold_rr=5% may still be useful to avoid the MDTs becoming too imbalanced, but will be less likely to be needed.

            laisiyao Lai Siyao added a comment -

            It's because 'lfs setdirstripe -D -i -1 <dir>' is used to delete default stripe, because when both "mdt_index" and "mdt_count" are unset, it's treated as removal. You need to use 'lfs setdirstripe -D -i -1 -c 1' to enable balanced subdirectories creation for plain directories.


            People

              laisiyao Lai Siyao
              adilger Andreas Dilger