[LU-11213] DNE3: remote mkdir() in ROOT/ by default Created: 04/Aug/18  Updated: 07/Feb/22  Resolved: 20/Sep/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0

Type: Improvement
Priority: Major
Reporter: Andreas Dilger
Assignee: Lai Siyao
Resolution: Fixed
Votes: 0
Labels: dne3

Attachments: Microsoft Word Lustre Feature Test Plan for DNE3-update.docx    
Issue Links:
Blocker
is blocked by LUDOC-439 Documentation update for DNE remote m... Resolved
Gantt End to End
has to be finished together with LUDOC-442 Documentation needed for LU-11213 DNE... Resolved
Related
is related to LU-11211 Performance degradation in mdtest Open
is related to LU-7827 DNE3: automatically select MDT for lf... Resolved
is related to LU-12434 use filesystem default dir layout for... Resolved
is related to LU-9309 Add ldiskfs 64-bit inode number support Open
is related to LU-12624 DNE3: striped directory allocate stri... Resolved
is related to LU-13417 DNE3: mkdir() automatically create re... Resolved
is related to LU-9435 DNE2 - object placement QoS policy Resolved
is related to LU-10784 DNE3: mkdir() automatically create re... Resolved

 Description   

In order to start using multiple MDTs on a filesystem automatically, I wonder if it makes sense to start creating any directories under the filesystem ROOT/ directory remotely by default? This could be a client parameter that can be disabled if desired, but as can be seen in LU-11211 it is not very clear to users how to get multiple MDTs active by default. With OST object allocation we already do this by default for all objects.

While there is some overhead to creating remote directories, these top-level directories are best placed to allow maximal distribution across all of the MDTs. This would also work as new MDTs are added to the filesystem.

Using remote directories at the top level is a lot easier than trying to create a striped directory for ROOT/, since that would also need to change as new MDTs are added to the filesystem. It also allows more dynamic load balancing across MDTs.

As for implementation, the client tunable might be something like "llite.*.mdt_distribution_enabled" to turn it on/off, and "llite.*.mdt_distribution_seconds", which is how often the client will do obd_statfs() to the MDTs to get space/inode info, so that it doesn't need to do this for every create. Once per 5s would be the minimum, maybe as long as 60s?
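
A minimal sketch of the refresh-interval idea, written in Python for brevity rather than as client code: per-MDT space information is cached and refetched only once it is older than the configured interval, so that a create does not trigger a statfs every time. The class name, the `fetch` callback (a stand-in for the per-MDT obd_statfs() call) and the injectable clock are hypothetical; only the 5s to 60s range comes from the proposal above.

```python
import time
from typing import Callable, Dict, Optional


class MdtStatfsCache:
    """Cache per-MDT free-space info, refreshing at most once per interval."""

    def __init__(self, fetch: Callable[[], Dict[int, int]],
                 interval_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Clamp to the 5s..60s range suggested in the description.
        self.interval_s = min(max(interval_s, 5.0), 60.0)
        self._fetch = fetch   # hypothetical stand-in for obd_statfs() per MDT
        self._clock = clock   # injectable so the behaviour can be tested
        self._stamp: Optional[float] = None
        self._free: Dict[int, int] = {}

    def free_space(self) -> Dict[int, int]:
        """Return MDT index -> free space, refetching only when stale."""
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self.interval_s:
            self._free = dict(self._fetch())
            self._stamp = now
        return self._free
```

With this shape, every mkdir can call `free_space()` while the actual statfs traffic stays bounded by the interval.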

The ROOT/ directory would get a flag like LUSTRE_TOPDIR_FL/LMAI_TOPDIR set when it is created that indicates the directory is a top-level directory and should have automatic remote directories created in it. This can be set manually by users on other directories with the "chattr +T" command, but it would not be inherited by subdirectories.

Using the same algorithms and tunables as LOV QOS for MDT selection would help leverage the same code, and allow us to improve the code together in the future.



 Comments   
Comment by Andreas Dilger [ 05/Aug/18 ]

Another way to do this that might be easier for Lustre users to handle is to allow setting "lfs setdirstripe -D -c1 -i -1 /mnt/lustre/remotedir" that will cause all new subdirectories to be created on the "best" MDT index, as we do with regular OST selection. Typically this should be round-robin if the MDT available space is relatively close, but will select a less full MDT if the space gets out of balance.
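
The selection policy described above can be sketched roughly as follows. This is an illustration only, not the LOV QOS code: the function name, the `threshold` imbalance ratio and the weighting by raw free space are all assumptions made for the example.

```python
import random
from typing import Dict, Optional


def pick_mdt(free: Dict[int, int], last: int, threshold: float = 0.2,
             rng: Optional[random.Random] = None) -> int:
    """Return the MDT index for a new subdirectory.

    free maps MDT index -> free space (any consistent unit).
    last is the index chosen by the previous round-robin pick.
    threshold is a hypothetical imbalance ratio, not a Lustre default.
    """
    if not free:
        raise ValueError("no MDTs available")
    indices = sorted(free)
    lo, hi = min(free.values()), max(free.values())
    if hi == 0 or (hi - lo) / hi <= threshold:
        # Space is roughly balanced: plain round-robin over the indices.
        later = [i for i in indices if i > last]
        return later[0] if later else indices[0]
    # Space is out of balance: favour MDTs with more free space.
    rng = rng or random.Random()
    return rng.choices(indices, weights=[free[i] for i in indices], k=1)[0]
```

When the MDTs are within the threshold of each other the result cycles through the indices; once one MDT falls behind, emptier MDTs are chosen proportionally more often.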

Comment by Lai Siyao [ 06/Aug/18 ]

This could be done more simply: in ll_mkdir(), if the parent is ROOT, fake an lmv_user_md that specifies a less full MDT on which to create the directory. This could even be applied to general mkdir calls that are not under ROOT, but users may then find mkdir slow, so for directories not under ROOT it is better to let directory restriping balance MDT usage.

Comment by Andreas Dilger [ 08/Aug/18 ]

The one issue with checking for ROOT/ is that this will potentially have problems with subdirectory mounts with or without nodemaps, where a client is mounting only a subdirectory and doesn't see ROOT/ at all. Having an explicit flag or a layout on the ROOT/ or other directory (that can be set/removed by the user) is much more flexible and doesn't really add more complexity.

In theory, we could always set the equivalent of lfs setstripe -i 1 -c -1 on the ROOT/ directory at format time, so that top-level directories are distributed across MDTs by default. Even if there was only a single MDT to start with, this would allow load distribution to other MDTs when they are added.

We might want to consider adding a "no inherit" mode for this kind of layout, so that it doesn't propagate to new subdirectories.

Comment by Lai Siyao [ 29/Dec/18 ]

Hi Andreas, the issue with setting this in the default directory stripe is that we can't have both a global default dir stripe and this feature at the same time. And IMHO this is not needed for non-ROOT directories, because a non-ROOT directory can be a striped directory, so its subdirectories are distributed by default. So fundamentally this feature is being introduced because ROOT in a Lustre DNE system is not striped.

Comment by Andreas Dilger [ 29/Dec/18 ]

I think there are two separate issues here:

  • being able to set the dirstripe index for a directory to be -1, which allows balancing new subdirectories across MDTs. This may be useful for directories other than the root directory. While it is similar to a striped directory, because it is not a strict function of the filename it has the benefit that it can balance usage across MDTs better, and it also is not tied to a "stripe count" for a specific number of MDTs.
  • wanting to have the "-1" stripe index on the root directory, but setting a different "default" dirstripe for the rest of the filesystem.

I don't think these two behaviors are incompatible. The "-1" index behavior is a property of the directory itself, essentially like a hashed layout that can be added on an existing plain directory that affects new subdirectory entries in the directory itself. That is not the same as the default directory layout that affects how the subdirectory will be created (eg. stripe count/hash function of the subdirectory).

In a regular striped directory, the location of the subdirectory is based on the name and the hash function stored in the LMV of the directory itself. In the proposed change, we would add a new hash function that means "ignore the filename and use the 'best' (usually least full) MDT for the new subdirectory". We would want to be able to set such a layout on an existing directory (in particular the ROOT directory), but it would still be possible for those directories to inherit the default directory layout from the parent as well (eg. stripe_count).
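
To make the contrast concrete, here is a small sketch of the two placement rules. The FNV-1a routine is the standard 64-bit algorithm; whether it matches Lustre's directory hash bit for bit is not claimed here, and the function names and the "largest free space wins" rule are assumptions for illustration.

```python
from typing import Dict

FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(name: bytes) -> int:
    """Standard 64-bit FNV-1a hash of a byte string."""
    h = FNV64_OFFSET
    for b in name:
        h ^= b
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def mdt_by_name(name: str, stripe_count: int) -> int:
    """Striped directory: the stripe is a strict function of the name."""
    return fnv1a_64(name.encode()) % stripe_count


def mdt_by_space(name: str, free: Dict[int, int]) -> int:
    """Proposed behaviour: ignore the name, pick the least-full MDT.

    free maps MDT index -> free space; ties go to the lowest index.
    """
    return max(sorted(free), key=lambda i: free[i])
```

The first rule always maps the same name to the same stripe and is tied to a stripe count; the second depends only on current usage, so it keeps working unchanged as MDTs are added.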

Comment by Lai Siyao [ 01/Jan/19 ]

The problem is that the global default stripe is now stored as the default stripe of ROOT, and we can't store two default stripes there. One option is to move the global default stripe somewhere else, but that requires care with backward compatibility; or we use the original proposal to store this as LUSTRE_TOPDIR_FL/LMAI_TOPDIR, and alter it via chattr.

Comment by Andreas Dilger [ 02/Jan/19 ]

Is that true for directory default stripe or only file default stripe? Also, the directory default stripe should be inherited from the LMV dirstripe default and not the actual LMV layout of the directory.

Comment by Lai Siyao [ 02/Jan/19 ]

It's true for both file and directory default stripe. And it's also true for directory layout, so by default if the parent doesn't have a default LMV dirstripe, subdirectories are created as plain directories, and don't have a default LMV either.

Comment by Andreas Dilger [ 02/Jan/19 ]

I'm still confused. There is the LMV default layout stored on a directory, which is inherited by newly created directories, which can be set on the root directory and is the default for the whole filesystem. I'm not necessarily suggesting to always change the default layout of the directory, though this seems natural if lmv_stripe_index = -1 in the default layout.

Separately, there is the LMV layout of the directory itself, which describes how new filenames created in that directory are mapped to MDTs (normally via a filename hash) and should not be inherited. One option is to create a new hash type/flag (eg. LMV_HASH_FLAG_REMOTE or LMV_HASH_FLAG_BALANCED) that can be set on an existing single-stripe directory, similar to LOV_PATTERN_F_HOLE and LOV_PATTERN_F_RELEASED on regular files. This will not affect where the names of new directories are created, but the directories themselves may be located on remote MDTs if the space is imbalanced.
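
A sketch of how such a flag could sit alongside the hash type in a single field, in the same spirit as pattern flags on regular files. Every numeric value here is hypothetical and chosen only for illustration; LMV_HASH_FLAG_BALANCED is one of the names floated above, and the other constant and helper names are invented for the example.

```python
# All numeric values below are hypothetical, chosen only for illustration.
HASH_TYPE_MASK = 0x0000FFFF          # low bits: which hash function is used
LMV_HASH_FLAG_BALANCED = 0x00010000  # high bit: place new subdirs by free space

HASH_TYPE_PLAIN = 0   # hypothetical id for a single-stripe (plain) directory
HASH_TYPE_NAME = 2    # hypothetical id for a name-hashed striped directory


def hash_type(lmv_hash: int) -> int:
    """Return the hash function id with any flag bits masked off."""
    return lmv_hash & HASH_TYPE_MASK


def is_balanced(lmv_hash: int) -> bool:
    """True if new subdirectories may be placed on a remote MDT by space."""
    return bool(lmv_hash & LMV_HASH_FLAG_BALANCED)


def set_balanced(lmv_hash: int) -> int:
    """Add the flag to an existing directory without changing its hash type."""
    return lmv_hash | LMV_HASH_FLAG_BALANCED
```

Because the flag occupies separate bits, it can be set on an existing plain directory without altering how names in that directory are looked up.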

Comment by Lai Siyao [ 03/Jan/19 ]

Okay, I see what you mean. I thought it was not allowed to set LMV on an existing directory, because currently setxattr(LMV) is used to create a remote/striped directory, but by checking these special flags we can store LMV on plain directories.

Comment by Gerrit Updater [ 01/Mar/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34357
Subject: LU-11213 uapi: add plain layout connect flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 04aad90949db53d4005af58b35d6b06bafb42536

Comment by Gerrit Updater [ 01/Mar/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34358
Subject: LU-11213 layout: add plain layout command and RPC
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f4f8a2fd13deb0ecc7b2b3aa79a090e5ae3c48d6

Comment by Gerrit Updater [ 01/Mar/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34359
Subject: LU-11213 lmv: add async statfs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a4a191ce52209acf21894117b2643363de4284f4

Comment by Gerrit Updater [ 01/Mar/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34360
Subject: LU-11213 lmv: distribute subdir by plain layout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 76ede2ca7ba77c5a8ca0b3127e36c5dd07e24830

Comment by Gerrit Updater [ 14/Apr/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34656
Subject: LU-11213 uapi: reserve connect flag for plain layout
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 28c29ca9f1f2f2b027010da4b90fc498fbcfe63b

Comment by Gerrit Updater [ 14/Apr/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34657
Subject: LU-11213 lmv: reuse object alloc QoS code from LOD
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 07cf35c82f3f9511fd49b6553f9c8041e94d2e5e

Comment by Gerrit Updater [ 18/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34656/
Subject: LU-11213 uapi: reserve connect flag for plain layout
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 14ee65e77bdcf390568f81fad7505048f69bebe9

Comment by Gerrit Updater [ 04/May/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34801
Subject: LU-11213 uapi: revert plain layout connect flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 29ba9891419282688bd11458288017a7c6595926

Comment by Gerrit Updater [ 04/May/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34802
Subject: LU-11213 ptlrpc: intent_getattr fetches default LMV
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ae4babe49c13ce868ad7f25823c61d83df9fbd26

Comment by Gerrit Updater [ 04/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34358/
Subject: LU-11213 dne: add new dir hash type "space"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a24f6153292753bf6e40f5638930d6cffa78e1ac

Comment by Gerrit Updater [ 07/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34802/
Subject: LU-11213 ptlrpc: intent_getattr fetches default LMV
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 55ca00c3d1cd8635258ccbda27ee3f0f9b2966a8

Comment by Gerrit Updater [ 07/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34359/
Subject: LU-11213 mdc: add async statfs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7f412954ad38b540ccbbf241b25c594faf5ca79d

Comment by Gerrit Updater [ 07/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34360/
Subject: LU-11213 lmv: mkdir with balanced space usage
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6d296587441d80588340200903027ac4231922cd

Comment by Gerrit Updater [ 12/Jun/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35207
Subject: LU-11213 lod: default LMV can't be deleted
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f97f099fa0569e2dc0da3910334dbcda18497b3e

Comment by Gerrit Updater [ 12/Jun/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35208
Subject: LU-11213 doc: update lfs-setdirstripe man 'space' hash
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c68307895183090ea7ab89208d192c0276cfe2b5

Comment by Andreas Dilger [ 12/Jun/19 ]

I was thinking about how we might best use this feature in the field. It definitely makes sense to set it on the root directory (at some point by default, but not yet). However, one concern that I have is that (AFAIK) remote directories created by this feature will have fnv-1a hash type set on them (ie. they will be striped directories)? I don't think that is something that we want, but rather to make the new directories be unstriped.

It may also be useful to allow specifying an "inherit_depth=N", so that if set on the root directory it will create the next N subdirectories with space-hash (eg. inherit_depth=1 on root sets space-hash for the top-level user directories, but lower-level directories revert to plain dirs). Using this avoids the situation where one user has a lot of files on their initial MDT, but their subdirectories don't get distributed further across MDTs, but we don't want the overhead of remote directories at every level. If we want all directories to be remote, then set "inherit_depth=-1" (probably an 8-bit field would be enough, given PATH_MAX limits), and maybe special-case the "-1" value so it is never decremented. This would also be useful for striped directories I think.
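
One possible reading of the inherit_depth idea, as a sketch: a directory carrying the space policy with depth N passes it to a new subdirectory with depth N-1, a depth of 0 means the subdirectory reverts to a plain directory, and -1 is special-cased so it is never decremented. The function name and the use of `None` for "no policy inherited" are assumptions; the comment above does not fix these details.

```python
from typing import Optional

INHERIT_UNLIMITED = -1  # special value from the comment: never decremented


def child_inherit_depth(parent_depth: int) -> Optional[int]:
    """Return the inherit_depth for a new subdirectory.

    None means the subdirectory does not inherit the policy (plain dir).
    """
    if parent_depth == INHERIT_UNLIMITED:
        return INHERIT_UNLIMITED
    if parent_depth > 0:
        return parent_depth - 1
    return None
```

Under this reading, inherit_depth=1 on the root gives top-level user directories the policy with depth 0, and directories below them revert to plain.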

Comment by Gerrit Updater [ 13/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34657/
Subject: LU-11213 lmv: reuse object alloc QoS code from LOD
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b601eb35e97ae74d3de448e07a28ce41afb4adef

Comment by Gerrit Updater [ 13/Jun/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35218
Subject: LU-11213 lmv: use lu_tgt_descs to manage tgts
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 57227c73a9f891ac6f44e3bf5eb578208c323040

Comment by Gerrit Updater [ 13/Jun/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35219
Subject: LU-11213 lod: share object alloc QoS code with LMV
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2b42209a3abb44180938eafd6502e430bbba775b

Comment by Gerrit Updater [ 20/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35208/
Subject: LU-11213 doc: update lfs-setdirstripe man 'space' hash
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8c914986bf415a535153f77e53e630707f8e1a1c

Comment by Gerrit Updater [ 20/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35207/
Subject: LU-11213 lod: default LMV can't be deleted
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 36243c75bd55c9e4d24f4e592fe03c92b56baedd

Comment by Gerrit Updater [ 25/Jun/19 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35318
Subject: LU-11213 uapi: change "space" hash type to hash flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a8bb3df4d894340e58859270bfd516cc3aabd73c

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35318/
Subject: LU-11213 uapi: change "space" hash type to hash flag
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c605ef1dbeb401bdff5ab3cfa1c407ea87a7b95d

Comment by Gerrit Updater [ 30/Aug/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36008
Subject: LU-11213 uapi: Remove unused CONNECT flag
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e06df46d294b87e28071708f2f21e7bca26d0e8c

Comment by Gerrit Updater [ 07/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36008/
Subject: LU-11213 uapi: Remove unused CONNECT flag
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 11eba11fe0457b735194e5924e7bb1882a5b31b8

Comment by Gerrit Updater [ 20/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35218/
Subject: LU-11213 lmv: use lu_tgt_descs to manage tgts
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 59fc1218fccf1a826182ff7cd52321e3efbb1eab

Comment by Gerrit Updater [ 20/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35219/
Subject: LU-11213 lod: share object alloc QoS code with LMV
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d3090bb2b4860e997730e90426e11fc51ee27c0c

Comment by Peter Jones [ 20/Sep/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 07/Feb/22 ]

"Oleg Drokin <green@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46476
Subject: LU-11213 llite: Remove old compat code
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f73bd88bf81f6ab92648716217b638cc793fbb53
