[LU-2809] fix ll_setxattr() to always ignore ll_stripe_offset Created: 13/Feb/13  Updated: 13/Mar/13  Resolved: 13/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.5
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Andreas Dilger Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: MB

Issue Links:
Related
is related to LU-2871 Data can't be striped across all the ... Closed
is related to LU-2182 Add llapi_file_get_layout() function ... Closed
Severity: 3
Rank (Obsolete): 6799

 Description   

With the introduction of the lum_layout_version field in lov_user_md, using "tar --xattrs" to back up and restore the Lustre layout (LOV EA) will incorrectly pass the lum_layout_version value back to ll_setxattr() as lum_stripe_offset. This incorrectly turns the layout version into the starting OST index whenever the version falls within [1, max_ost_idx].

This may cause, for example, all files that have been migrated once to be restored by tar starting on OST index 1.
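
For illustration, a minimal standalone sketch of the aliasing (the demo_lum struct below is a simplified stand-in for lov_user_md, not the real header):

/* Simplified stand-in for lov_user_md: the starting stripe offset and the
 * layout generation share storage, so a value that getxattr reported as a
 * generation is read back by setxattr as a starting OST index. */
#include <stdio.h>

struct demo_lum {
        unsigned short lmm_stripe_count;
        union {
                unsigned short lmm_stripe_offset; /* interpreted on setxattr */
                unsigned short lmm_layout_gen;    /* filled in on getxattr */
        };
};

int main(void)
{
        struct demo_lum lum = { .lmm_stripe_count = 4 };

        /* getxattr path: the kernel reports a non-zero layout generation */
        lum.lmm_layout_gen = 2;

        /* setxattr path (e.g. a "tar --xattrs" restore): the same bytes are
         * now interpreted as the starting OST index */
        printf("restored lmm_stripe_offset = %u\n", lum.lmm_stripe_offset);
        return 0;
}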



 Comments   
Comment by Ned Bass [ 15/Feb/13 ]

With this fix we lose the ability to set the offset of stripe 0 through the fsetxattr() interface. Andreas, in LU-2182 you mentioned the possibility of adding support for setting the OST indexes of all stripes. Is there an issue to track that effort? What would the interface be, e.g. letting the user pass in a lov_user_ost_data_v1[] array with the lum data?

Comment by Andreas Dilger [ 15/Feb/13 ]

Actually, I have an option that would avoid the present conflict: change the HSM lum_layout_gen code to keep the version larger than LOV_MAX_STRIPE_COUNT instead of incrementing from 0, so that it does not conflict with setting lum_stripe_offset. That way, for applications like tar that save and restore the same xattr, the large lum_layout_gen value will be treated as "-1" (i.e. unspecified striping offset), while small values (generated from userspace and not returned from getxattr()) will be treated as actual OST index values. A bit ugly, but workable. So far, the only code I can see that manipulates the layout generation is in mdd_swap_layouts():

static int mdd_swap_layouts(const struct lu_env *env, struct md_object *obj1,
                            struct md_object *obj2, __u64 flags)
{
        /* excerpt: declarations, setup, locking and error handling omitted */
        if (fst_buf) {
                fst_lmm = fst_buf->lb_buf;
                fst_gen = le16_to_cpu(fst_lmm->lmm_layout_gen);
        }        
        if (snd_buf) {
                snd_lmm = snd_buf->lb_buf;
                snd_gen = le16_to_cpu(snd_lmm->lmm_layout_gen);
        }

        /* increase the generation layout numbers */
        snd_gen = max(snd_gen + 1, LOV_MAX_STRIPE_COUNT);
        fst_gen = max(fst_gen + 1, LOV_MAX_STRIPE_COUNT);

        /* set the file specific informations in lmm */
        if (fst_lmm) {
                fst_lmm->lmm_layout_gen = cpu_to_le16(snd_gen);
        }

        if (snd_lmm) {
                snd_lmm->lmm_layout_gen = cpu_to_le16(fst_gen);
        }

This would work for the short term, but would become problematic if we wanted to have the full range of 64k OSTs, and is more of a "hack" solution.

Another alternative, as Ned suggests, is to use lmm_objects[0].l_ost_idx to specify the striping layout, but this would cause "tar" to restore all objects to their original OST indexes and not take current OST counts into account (i.e. no rebalancing would happen during restore, which could cause OST imbalance). ll_setxattr() could distinguish these cases by using a different lum_magic value for the new API and converting it internally, so there is no network protocol incompatibility.

Thoughts?

Comment by Ned Bass [ 15/Feb/13 ]

The first option does make me cringe a bit, though it seems like the simpler solution to implement. But if we're going to put in the effort I'd rather implement something we can live with long term, so I'd vote for the second option.

Using lum_magic to distinguish the cases is a little icky too, but I don't see a better alternative. We could leverage the fact that the existing lum_magic numbers have their most significant bits clear to embed a flag signaling that the lmm_objects OST data should be honored. This would provide forward compatibility with future LOV_USER format versions, e.g.:

/* lustre_user.h */
#define LOV_USER_MAGIC_V1 0x0BD10BD0
#define LOV_USER_MAGIC    LOV_USER_MAGIC_V1
#define LOV_USER_MAGIC_JOIN_V1 0x0BD20BD0
#define LOV_USER_MAGIC_V3 0x0BD30BD0
#define LOV_USER_MAGIC_V4 0x0BD40BD0 /* someday */

#define LOV_USER_OST_INDEX_FL 0x10000000 /* Honor OST indexes supplied with lum */
/* in ll_setxattr(): */
if (lump->lum_magic & LOV_USER_OST_INDEX_FL) {
    lump->lum_magic &= ~LOV_USER_OST_INDEX_FL;
    /* ... */
} else {
    /* normal processing */
}
Comment by jacques-charles lafoucriere [ 16/Feb/13 ]

The version field in lum (lum_magic) exists exactly to manage this kind of structure change, so the second option is better. With this change we can introduce an explicit conversion from lum to the Lustre internal/wire structures (replacing the type cast). The conversion lum->lsm->lmm is more complex than lum->lmm, but I find it better for the future.
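
To illustrate the "explicit conversion instead of a type cast" idea, a rough sketch with simplified placeholder structs (demo_lum and demo_lsm are not the real lov_user_md / lov_stripe_md types):

/* Rough sketch of an explicit lum -> internal conversion, instead of casting
 * the user buffer directly.  demo_lum and demo_lsm are simplified placeholder
 * types for illustration only. */
struct demo_lum {                    /* user-visible layout request */
        unsigned int   lmm_magic;
        unsigned int   lmm_stripe_size;
        unsigned short lmm_stripe_count;
        unsigned short lmm_stripe_offset;
};

struct demo_lsm {                    /* internal representation */
        unsigned int   lsm_stripe_size;
        unsigned short lsm_stripe_count;
        unsigned short lsm_stripe_offset;
        unsigned short lsm_layout_gen;
};

static int demo_lum_to_lsm(const struct demo_lum *lum, struct demo_lsm *lsm)
{
        /* per-version validation would hang off lmm_magic here */
        lsm->lsm_stripe_size   = lum->lmm_stripe_size;
        lsm->lsm_stripe_count  = lum->lmm_stripe_count;
        lsm->lsm_stripe_offset = lum->lmm_stripe_offset;
        lsm->lsm_layout_gen    = 0;  /* never taken from the user buffer */
        return 0;
}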

Comment by Jodi Levi (Inactive) [ 21/Feb/13 ]

Emoly,
Could you please have a look at this one?

Comment by Emoly Liu [ 22/Feb/13 ]

OK, I will look into this one.

Comment by Emoly Liu [ 25/Feb/13 ]

During testing, I found another problem.

With OSTCOUNT=4:

[root@centos6-1 ~]# cd /mnt/lustre
[root@centos6-1 lustre]# mkdir test;cd test
[root@centos6-1 test]# for i in 0 1 2 3; do lfs setstripe -i $i -c -1 testfile$i;dd if=/dev/zero of=testfile$i bs=2M count=5;done
5+0 records in
5+0 records out
10485760 bytes (10 MB) copied, 0.0220632 s, 475 MB/s
5+0 records in
5+0 records out
10485760 bytes (10 MB) copied, 0.0193019 s, 543 MB/s
5+0 records in
5+0 records out
10485760 bytes (10 MB) copied, 0.0200823 s, 522 MB/s
5+0 records in
5+0 records out
10485760 bytes (10 MB) copied, 0.0190184 s, 551 MB/s
[root@centos6-1 test]# lfs getstripe *
testfile0
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		 objid		objid		 group
	     0	             1	          0x1	             0
	     1	             1	          0x1	             0
	     2	             1	          0x1	             0
	     3	             1	          0x1	             0

testfile1
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  1
	obdidx		 objid		objid		 group
	     1	             2	          0x2	             0
	     2	             2	          0x2	             0
	     3	             2	          0x2	             0
	     1	             3	          0x3	             0

testfile2
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  2
	obdidx		 objid		objid		 group
	     2	             3	          0x3	             0
	     3	             3	          0x3	             0
	     1	             4	          0x4	             0
	     2	             4	          0x4	             0

testfile3
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  3
	obdidx		 objid		objid		 group
	     3	             4	          0x4	             0
	     1	             5	          0x5	             0
	     2	             5	          0x5	             0
	     3	             5	          0x5	             0

When stripe_offset is not 0, even with stripe_count = -1, data can't be striped across all of the OSTs; OST0 is always excluded.

I reported this bug in LU-2871.

Comment by Emoly Liu [ 25/Feb/13 ]

When I ran "tar --xattrs" on ./test and untarred it into a new directory, the data could be striped across all the OSTs.

[root@centos6-1 lustre]# tar --xattrs -czvf /tmp/test.tar.gz ./test
./test/
./test/testfile1
./test/testfile0
./test/testfile3
./test/testfile2
[root@centos6-1 lustre]# mkdir unzip; cd unzip
[root@centos6-1 unzip]# tar -zxvf /tmp/test.tar.gz
./test/
./test/testfile1
./test/testfile0
./test/testfile3
./test/testfile2
[root@centos6-1 unzip]# lfs getstripe ./test
./test
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1 
./test/testfile1
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  2
	obdidx		 objid		objid		 group
	     2	             6	          0x6	             0
	     3	             6	          0x6	             0
	     0	             2	          0x2	             0
	     1	             6	          0x6	             0

./test/testfile0
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  3
	obdidx		 objid		objid		 group
	     3	             7	          0x7	             0
	     0	             3	          0x3	             0
	     1	             7	          0x7	             0
	     2	             7	          0x7	             0

./test/testfile3
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		 objid		objid		 group
	     0	             4	          0x4	             0
	     1	             8	          0x8	             0
	     2	             8	          0x8	             0
	     3	             8	          0x8	             0

./test/testfile2
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  1
	obdidx		 objid		objid		 group
	     1	             9	          0x9	             0
	     2	             9	          0x9	             0
	     3	             9	          0x9	             0
	     0	             5	          0x5	             0
Comment by Andreas Dilger [ 27/Feb/13 ]

Emoly, this is probably a bug in the new lustre/lod/lod_qos.c code. I suspect either OST0 didn't have any objects or was inactive and was skipped during object allocation, and then lod_qos_prep_create() is adding the same OST twice. The old LOV behaviour allowed files to be created with >= 3/4 of the requested stripes if OSTs were unavailable, but would normally have blocked to wait for an OST without objects if that OST was just in the process of creating them.

Comment by Emoly Liu [ 05/Mar/13 ]

I did a test to reproduce the problem in the description by setting lum_layout_gen explicitly.

diff --git a/lustre/utils/liblustreapi.c b/lustre/utils/liblustreapi.c
index 4067bd4..d0991c0 100644
--- a/lustre/utils/liblustreapi.c
+++ b/lustre/utils/liblustreapi.c
@@ -582,6 +582,7 @@ int llapi_file_open_pool(const char *name, int flags, int mode,
         lum.lmm_stripe_size = stripe_size;
         lum.lmm_stripe_count = stripe_count;
         lum.lmm_stripe_offset = stripe_offset;
+       lum.u.lum_layout_gen = 2;
         if (pool_name != NULL) {
                 strncpy(lum.lmm_pool_name, pool_name, LOV_MAXPOOLNAME);
         } else {

When running "lfs setstripe -c -1 -i $i /mnt/lustre/testfile$i" on 4 OSTs, the output is

testfile0
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  2
	obdidx		 objid		 objid		 group
	     2	            37	         0x25	             0
	     3	            37	         0x25	             0
	     0	            37	         0x25	             0
	     1	            38	         0x26	             0

testfile1
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  2
	obdidx		 objid		 objid		 group
	     2	            38	         0x26	             0
	     3	            38	         0x26	             0
	     0	            38	         0x26	             0
	     1	            39	         0x27	             0

testfile2
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  2
	obdidx		 objid		 objid		 group
	     2	            39	         0x27	             0
	     3	            39	         0x27	             0
	     0	            39	         0x27	             0
	     1	            40	         0x28	             0

testfile3
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  2
	obdidx		 objid		 objid		 group
	     2	            40	         0x28	             0
	     3	            40	         0x28	             0
	     0	            40	         0x28	             0
	     1	            41	         0x29	             0

It shows that stripe_offset is overwritten by lum_layout_gen.

Comment by Emoly Liu [ 05/Mar/13 ]

It's improper to set lum_layout_gen explicitly, because stripe_offset and layout_gen share a union: one is used when writing the layout and the other when reading it.

I'm just wondering how to reproduce this problem. I tried the "tar --xattrs" approach mentioned in the description, but even when I set "lum.u.lum_layout_gen = 2", after untarring the layout_gen went back to zero and was not misread as stripe_offset.

Comment by Andreas Dilger [ 05/Mar/13 ]

Emoly, I think your reproducer exactly demonstrates the problem - if lum_layout_gen is non-zero it will be confused with lum_stripe_offset.

Ned, I think your solution is interesting, and it is also related to the LOV_MAGIC_V[13]_DEF magic values that already exist in the code. Should this new flag be implemented in the context of LU-2182?

Comment by Ned Bass [ 05/Mar/13 ]

Emoly, you can reproduce the problem without modifying liblustreapi by swapping the layouts of two files with the llapi_swap_layouts() function, which will force their generation numbers to increment. Tarring and untarring them with --xattrs will then force their stripe offset to index 1.
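
A rough reproducer sketch along those lines (the llapi_swap_layouts() prototype, header path, and file paths below are assumptions for illustration):

/* Rough reproducer sketch: bump the layout generation of two existing files
 * by swapping their layouts, then back them up and restore them with
 * "tar --xattrs" and inspect lmm_stripe_offset with "lfs getstripe".
 * The llapi_swap_layouts() prototype, header path, and file paths are
 * assumptions for illustration only. */
#include <stdio.h>
#include <lustre/lustreapi.h>       /* assumed to declare llapi_swap_layouts() */

int main(void)
{
        const char *f1 = "/mnt/lustre/test/swap_a";
        const char *f2 = "/mnt/lustre/test/swap_b";
        int rc;

        /* both files are assumed to already exist with striped layouts */
        rc = llapi_swap_layouts(f1, f2);   /* increments lmm_layout_gen on both */
        if (rc != 0) {
                fprintf(stderr, "llapi_swap_layouts failed: %d\n", rc);
                return 1;
        }

        /* Now run:
         *   tar --xattrs -cf /tmp/swap.tar -C /mnt/lustre/test swap_a swap_b
         *   tar --xattrs -xf /tmp/swap.tar -C /mnt/lustre/restore
         * and "lfs getstripe" on the restored files should show the old
         * layout generation reinterpreted as lmm_stripe_offset. */
        return 0;
}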

Comment by Ned Bass [ 05/Mar/13 ]

Andreas, yes, we could do this in the context of LU-2182. I was thinking we could handle the simple case first, which is to set the offset of stripe 0. llapi_layout_ost_index_set() would store the OST index of stripe N in lum.lmm_objects[N].l_ost_idx. For now, ll_setxattr() would copy lmm_objects[0].l_ost_idx to lum.lum_stripe_offset if lmm_magic has the flag set. This approach would let us match the functionality that exists today without changing any server code. Eventually we could extend it to support stripes > 0 without changing the API, but that would be more invasive (i.e. requiring changes to lod). Does this approach sound reasonable?
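
A minimal sketch of the client-side handling being proposed here, using the LOV_USER_OST_INDEX_FL idea from above (the flag and the simplified struct are proposals/assumptions from this discussion, not existing code):

/* Sketch of the proposed handling: if the (proposed) LOV_USER_OST_INDEX_FL
 * flag is set in the magic, honor the OST index that userspace supplied for
 * stripe 0; otherwise treat the offset/generation field as unspecified.
 * The struct below is simplified for illustration. */
#define LOV_USER_OST_INDEX_FL 0x10000000

struct demo_ost_data {
        unsigned int l_ost_idx;              /* OST index for this stripe */
};

struct demo_lum {
        unsigned int   lmm_magic;
        unsigned short lmm_stripe_count;
        unsigned short lmm_stripe_offset;    /* shares storage with layout_gen */
        struct demo_ost_data lmm_objects[];  /* optional per-stripe data */
};

static void demo_fixup_stripe_offset(struct demo_lum *lump, int has_objects)
{
        if (lump->lmm_magic & LOV_USER_OST_INDEX_FL) {
                lump->lmm_magic &= ~LOV_USER_OST_INDEX_FL;
                if (has_objects)
                        /* honor the caller-supplied index of stripe 0 */
                        lump->lmm_stripe_offset =
                                (unsigned short)lump->lmm_objects[0].l_ost_idx;
        } else {
                /* may really be a layout generation: treat as "any OST" */
                lump->lmm_stripe_offset = (unsigned short)-1;
        }
}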

Comment by Emoly Liu [ 06/Mar/13 ]

Ned, thanks for the reminder! We have "lfs swap_layouts".

Comment by Andreas Dilger [ 06/Mar/13 ]

Ned, yes this does seem like a good path forward. There have been requests to allow specifying the exact OST index for all of the stripes.

Comment by Jinshan Xiong (Inactive) [ 07/Mar/13 ]

Hmm... do you think it's a good idea to pack layout_gen into the low 12 bits of lmm_stripe_size? Usually we don't need to know the exact value of layout_gen, because it's only used to detect layout changes. That being said, we don't need to keep layout_gen when restoring a file.
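
For what it's worth, a small sketch of what that packing could look like (the 12-bit width and the assumption that valid stripe sizes are always multiples of 4096, so those bits are otherwise zero, are part of the idea being floated, not existing code):

/* Illustrative sketch of packing a truncated layout_gen into the low 12 bits
 * of lmm_stripe_size.  Assumes real stripe sizes are always 4 KiB-aligned,
 * so the low 12 bits are otherwise zero. */
#define DEMO_GEN_BITS 12
#define DEMO_GEN_MASK ((1U << DEMO_GEN_BITS) - 1)    /* 0x0fff */

static inline unsigned int demo_pack(unsigned int stripe_size, unsigned int gen)
{
        return (stripe_size & ~DEMO_GEN_MASK) | (gen & DEMO_GEN_MASK);
}

static inline unsigned int demo_stripe_size(unsigned int packed)
{
        return packed & ~DEMO_GEN_MASK;
}

static inline unsigned int demo_layout_gen(unsigned int packed)
{
        /* only a truncated generation survives, which is still enough to
         * detect that the layout has changed */
        return packed & DEMO_GEN_MASK;
}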

Comment by Jinshan Xiong (Inactive) [ 08/Mar/13 ]

The patch is at: http://review.whamcloud.com/5664

In the above patch, layout_gen is cleared on the getxattr path.
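
Roughly, the shape of that change on the getxattr path (a sketch only; the placeholder names below are not from the actual patch, see the Gerrit link for the real code):

/* Sketch only: zero the generation in the layout blob before it is returned
 * by getxattr, so that feeding the same blob back through setxattr cannot be
 * misread as a starting OST index.  demo_lum and demo_getxattr_filter() are
 * placeholders, not names from the actual patch. */
struct demo_lum {
        unsigned short lmm_stripe_count;
        union {
                unsigned short lmm_stripe_offset;
                unsigned short lmm_layout_gen;
        };
};

static void demo_getxattr_filter(struct demo_lum *lum)
{
        /* the generation is not meaningful to backup/restore tools */
        lum->lmm_layout_gen = 0;
}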

Comment by Ned Bass [ 11/Mar/13 ]

Jinshan, my reservation about that approach is that you are removing functionality from the getxattr interface. On platforms that use function shipping, [gs]etxattr may be the only option because the needed ioctls are not supported. Also, having inconsistent semantics between ioctl and getxattr is confusing.

Having said all that, I don't know whether applications on function-shipping clients would ever need to know the layout generation number. But in general it seems better to leave the possibility open in case we need it later.

Comment by Jinshan Xiong (Inactive) [ 11/Mar/13 ]

getxattr callers should treat the returned data as opaque unless the extended attributes are system-defined. Ioctl and getxattr are already two separate interfaces, so I tend to think it's okay to return different data, as long as we define the return value clearly.

I worked out this patch because it's the simplest way to keep compatibility. Also, this solution won't need to last long, because the v4 layout format is on its way; at that point, layout_gen will no longer share a field with anything else.

Comment by Ned Bass [ 11/Mar/13 ]

Also this solution won't last long because v4 version of layout is on its way, at that time, layout gen will no longer share field with others.

In that case, I agree with your solution. Thanks.
