[LU-7660] FS default striping settings only honored on MDT 0 Created: 13/Jan/16  Updated: 13/Oct/21  Resolved: 13/Jul/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.7.0, Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: John Hammond Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: dne

Issue Links:
Duplicate
is duplicated by LU-8264 lfs setstripe without -p pool_name do... Resolved
Gantt End to Start
has to be done before LU-8159 cache xattr in ldiskfs OSD Resolved
Related
is related to LU-8092 racy striping & default striping cach... Open
is related to LU-7661 MGS_SET_INFO handler is too permissive Resolved
is related to LU-8454 non-root user is able to change strip... Resolved
is related to LU-5676 DNE 2: cache LMV EA in LOD Resolved
is related to LU-7813 default pool not inherited when speci... Resolved
is related to LU-8653 broken inheritance of default striping Resolved
is related to LU-7335 store default filesystem layout direc... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

From the manual: "Setting the striping specification on the root directory determines the striping for all new files created in the file system unless an overriding striping specification takes precedence (such as a striping layout specified by the application, or set using lfs setstripe, or specified for the parent directory)." In a DNE setup, however, setting the default striping on the root directory only affects files created on MDT 0, because the parameter is only ever sent for the hard-coded -MDT0000 device. See ll_dir_setstripe():

                ...
                buf = param;
                /* Get fsname and assume devname to be -MDT0000. */
                ll_get_fsname(inode->i_sb, buf, MTI_NAME_MAXLEN);
                strcat(buf, "-MDT0000.lov");
                buf += strlen(buf);

                /* Set root stripesize */
                sprintf(buf, ".stripesize=%u",
                        lump ? le32_to_cpu(lump->lmm_stripe_size) : 0);
                rc = ll_send_mgc_param(mgc->u.cli.cl_mgc_mgsexp, param);
                if (rc)
                        GOTO(end, rc);
                ...


 Comments   
Comment by Andreas Dilger [ 14/Jan/16 ]

For the PFL project we were looking to change how default layouts worked anyway, because the existing mechanism of storing the stripe_count, stripe_size, and stripe_offset into lov_desc doesn't work for more complex layouts, such as layouts with OST pools or PFL. My thought is to actually store the filesystem default layout xattr on the root directory inode. This will allow arbitrary layout template xattrs to be stored for the whole filesystem. There needs to be a check that the layout template xattr doesn't physically get inherited by all new subdirectories and files in the root directory.

This default layout should also be inherited by other MDTs, but where to store it? We don't really want to physically store the layout on each remote directory, since an update to the filesystem-wide defaults should be inherited everywhere that doesn't have its own explicit default layout. It could be stored on the REMOTE_PARENT_DIR directory as a proxy for the filesystem root, and updated whenever the default is changed? That is at least O(num_mdts) rather than O(num remote directories). Unfortunately, REMOTE_PARENT_DIR doesn't exist for osd-zfs.

Comment by Joseph Gmitter (Inactive) [ 14/Jan/16 ]

Hi Lai,
Can you have a look at this one?
Thanks,
Joe

Comment by John Hammond [ 14/Jan/16 ]

Andreas, I think this is a good excuse to decouple the file striping on the root from the default for the FS. Can we do away with this botch and ask users to either set the default at format time or to use conf params for each MDT?

Comment by Andreas Dilger [ 15/Jan/16 ]

Lai, before you start any implementation, let's first discuss what the best way to fix this problem is.

John, it is already possible to set the default stripe_count, stripe_size, stripe_offset at format time (see mkfs_opts() and --param=lov.stripe_size and friends), and store them into the lov_desc, but lov_desc is not very flexible. It is already not possible to store a default OST pool name in lov_desc, and this approach will be completely unusable for storing a default PFL layout template in the future. Also, it would require that e.g. mkfs.lustre or friends be able to generate a composite layout.

In this regard, I think it is pretty natural for the administrator to use "lfs setstripe" to specify the layout for the whole filesystem on the root directory. Using "lctl conf_param" would mean that lctl needs to be able to generate a binary composite PFL layout in some manner (i.e. gain all of the setstripe options and layout parsing code) or store all of the options for a complex PFL layout and pass them to the MDTs as named parameters to change /proc settings for the lod at startup time.

Also, what happens with sub-tree mounts? Using the current scheme to set the filesystem default on the filesystem root would be implemented in the same manner, but storing it via conf_param/proc/mkfs wouldn't allow different defaults to be stored within the filesystem.

Extending the existing mechanism to DNE seems to me an issue of passing the default layout from the root MDT to the other MDTs. If the root MDT can determine a remote directory FID on each MDT then it could easily pass on the setstripe info. One option would be to define a special FID for this that each MDT understands to mean "use to store default layout for the filesystem" that MDT0 can use to communicate to the remote MDTs.

Alternately, it would be possible for the remote MDTs to fetch (and cache under XATTR lock) the default layout from the filesystem root directory. This would be practical if using [FID_SEQ_ROOT:FID_OID_ROOT:0], but that isn't used for old filesystems, and would need to be special-cased in the OSP->OUT->OSD path (maybe by MDT0 having an alias in the OI for the root?) since there are no "MDT-level" RPCs between MDSes anymore AFAIK, so a simple MDS_GETATTR(root FID) wouldn't be possible.

Di, Alex, any thoughts on how best to implement this?

Comment by Di Wang [ 15/Jan/16 ]

As you suggested, there are two options, IMHO

1. either spreading the default stripe information by config log, i.e. in ll_dir_setstripe(), the client will add (or modify) these config parameters in every MDT's config log, instead of only adding them to MDT0's.

2. or spreading the default stripe information by root inode on MDT0, i.e. each MDT will hold the xattr lock of the root inode; once we change the default striping, the other MDTs will be notified by lock callback. So when each MDT gets the connection request from MDT0, it will enqueue the xattr lock and get these attributes by XATTR_GET(), and release the lock during cleanup.

Personally, I prefer 1, because all of these mechanisms are already there, and it also makes sense to me to store the default stripe information in the config log.
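
For illustration only, option 1 could be sketched roughly as below. This is not the Lustre implementation: format_mdt_stripesize_param(), set_stripesize_all_mdts() and the param_sender_t callback are hypothetical names, the callback merely stands in for something like ll_send_mgc_param(), and the "-MDT%04x" naming is assumed from the "-MDT0000" string in the excerpt above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not Lustre code): build the stripe-size parameter
 * for one MDT, e.g. "lustre-MDT0001.lov.stripesize=1048576".
 * Returns 0 on success, -1 if the buffer is too small. */
int format_mdt_stripesize_param(char *buf, size_t len, const char *fsname,
                                unsigned int mdt_index,
                                unsigned int stripe_size)
{
        int n;

        assert(buf != NULL && fsname != NULL);
        n = snprintf(buf, len, "%s-MDT%04x.lov.stripesize=%u",
                     fsname, mdt_index, stripe_size);
        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical callback type standing in for the call that sends one
 * parameter string to the MGS. */
typedef int (*param_sender_t)(const char *param, void *ctx);

/* Send the parameter once per MDT instead of only for MDT0000.
 * Stops at the first error and returns it. */
int set_stripesize_all_mdts(const char *fsname, unsigned int mdt_count,
                            unsigned int stripe_size,
                            param_sender_t send, void *ctx)
{
        char param[128];
        unsigned int i;
        int rc;

        for (i = 0; i < mdt_count; i++) {
                rc = format_mdt_stripesize_param(param, sizeof(param),
                                                 fsname, i, stripe_size);
                if (rc)
                        return rc;
                rc = send(param, ctx);
                if (rc)
                        return rc;
        }
        return 0;
}
```

The sketch only covers stripesize; the other lov parameters would presumably follow the same per-MDT loop.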

Comment by Alex Zhuravlev [ 16/Jan/16 ]

I remember we were planning to get rid of configuration in the format of llog commands. plus storing per-fs defaults isn't really different from per-directory one?

Comment by Di Wang [ 16/Jan/16 ]

"plus storing per-fs defaults isn't really different from per-directory one?"

Not sure I follow here. I thought spreading per-fs defaults is definitely different from inheriting a directory default? Probably I am missing your point?

Comment by Alex Zhuravlev [ 16/Jan/16 ]

why is it different? it's just an additional optional storage. kind of similar would be to inherit layout from the parent's parent..

Comment by Di Wang [ 16/Jan/16 ]

you mean inherit fs-default layout on each directory creation? why would we do that?

Comment by Alex Zhuravlev [ 17/Jan/16 ]

I mean that having two different mechanisms to store essentially the same information doesn't really help. we check one FID to take defaults, just add another FID (predefined) ? the approach with LDLM and regular storage looks great to me. the same way we could store/cache other things (like FLDB).
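
A minimal sketch of the "check one FID, then another" idea, assuming the defaults are consulted in a fixed order. struct stripe_default and pick_default() are hypothetical and do not correspond to Lustre structures:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified default-striping record. */
struct stripe_default {
        int          is_set;        /* non-zero if this source has a default */
        unsigned int stripe_count;
        unsigned int stripe_size;
};

/* Return the first source that carries a default: the parent directory,
 * then the filesystem-wide default (stored under a predefined FID in the
 * proposal above), then a built-in fallback that must always exist. */
const struct stripe_default *
pick_default(const struct stripe_default *parent_dir,
             const struct stripe_default *fs_default,
             const struct stripe_default *fallback)
{
        assert(fallback != NULL);
        if (parent_dir != NULL && parent_dir->is_set)
                return parent_dir;
        if (fs_default != NULL && fs_default->is_set)
                return fs_default;
        return fallback;
}
```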

Comment by Di Wang [ 18/Jan/16 ]

I am not sure what you mean by two different mechanisms here? TBH, I do not see any problem with regarding default FS striping as configuration of the filesystem; that said, I do not see a problem with spreading it by config log. Plus this will only need a few lines of changes, i.e. add config records for all MDTs in ll_dir_setstripe().

The concern I have with spreading such changes via the ldlm lock of a special FID on MDT0 is that it would make MDT0 more unique, which might be a problem when we want the FS to be functional without MDT0. Though I do not hold a strong opinion here.

Comment by Niu Yawei (Inactive) [ 03/Feb/16 ]

I vote for the second method (set default striping in root inode, and let slave MDTs fetch it on demand), since I think the config log isn't a perfect place to store a composite layout template, and treating the root inode setting as the fs-wide setting looks natural to me.

Comment by Andreas Dilger [ 03/Feb/16 ]

Ideally, the fetching of the default layout xattr from the root directory can be handled (mostly?) transparently via LOD, and the only real change that is needed is for the other MDTs to know the root FID of the filesystem. Ideally, this would be handled by replicating the root directory (and the default layout xattr) to all MDTs so that it can be load balanced and MDT0000 is not a single point of failure, essentially a form of DNE2 striped directory that is instead (or in addition) mirrored. However, I expect that may be too much work to fix this problem. In the short term, it seems that the only information needed is the root directory FID, and sufficient handling in the LOD/OSP layers (DLM locking and xattr caching in particular) that will allow the MDT to fetch this xattr from MDT0000.
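
As a rough illustration of the fetch-and-cache behaviour discussed here (not the LOD/OSP code), a remote MDT could keep the root default layout in a small cache that is refilled on demand and dropped when the lock is called back. All names below are hypothetical; fake_root_fetch() is only a stand-in for the remote xattr read from MDT0000.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LAYOUT_MAX 256

/* Hypothetical cache of the filesystem default layout on a remote MDT. */
struct default_layout_cache {
        int           valid;            /* 0 until fetched, or after callback */
        size_t        len;
        unsigned char data[LAYOUT_MAX];
};

/* Fetch callback: fills buf, returns bytes written or a negative error. */
typedef int (*layout_fetch_t)(void *buf, size_t buflen, void *ctx);

/* Return the cached layout, fetching it first if the cache is not valid. */
int cache_get(struct default_layout_cache *c, layout_fetch_t fetch,
              void *ctx, const void **out, size_t *outlen)
{
        assert(c != NULL && fetch != NULL);
        if (!c->valid) {
                int n = fetch(c->data, sizeof(c->data), ctx);

                if (n < 0)
                        return n;
                c->len = (size_t)n;
                c->valid = 1;
        }
        *out = c->data;
        *outlen = c->len;
        return 0;
}

/* Called when the lock protecting the root xattr is revoked. */
void cache_invalidate(struct default_layout_cache *c)
{
        c->valid = 0;
}

/* Demonstration stand-in for the remote read: copies a string "layout"
 * and counts how many times it was actually fetched. */
struct fake_root {
        const char  *layout;
        unsigned int fetches;
};

int fake_root_fetch(void *buf, size_t buflen, void *ctx)
{
        struct fake_root *r = ctx;
        size_t n = strlen(r->layout) + 1;

        if (n > buflen)
                return -1;
        memcpy(buf, r->layout, n);
        r->fetches++;
        return (int)n;
}
```

With this shape, repeated lookups between two invalidations cost one fetch, which is the property the DLM locking and xattr caching are meant to provide.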

Comment by Gerrit Updater [ 21/Mar/16 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/19041
Subject: LU-7660 dne: support fs default stripe
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aeda66065b36048147c969aac4b53cb28dfaf1f4

Comment by Andreas Dilger [ 17/May/16 ]

Lai, can you please also file a new LU ticket about caching the xattrs in osd-ldiskfs. It would be useful to do a before/after mdtest to see if this is needed or not.

Comment by Lai Siyao [ 18/May/16 ]

LU-8159 has been created for it.

Comment by Gerrit Updater [ 11/Jul/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/19041/
Subject: LU-7660 dne: support fs default stripe
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 226fd401f9d8bfcd1a71bf264d9baef1e0842441

Comment by Joseph Gmitter (Inactive) [ 13/Jul/16 ]

Patch has landed to master for 2.9.0
