[LU-2871] Data can't be striped across all the OSTs correctly by running "lfs setstripe -c -1 -i n" (n>0) Created: 26/Feb/13  Updated: 15/Mar/13  Resolved: 15/Mar/13

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Blocker
Reporter: Emoly Liu Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: MB

Issue Links:
Duplicate
is duplicated by LU-2893 OST index 0 never used when offset is... Closed
Related
is related to LU-2809 fix ll_setxattr() to always ignore ll... Resolved
Severity: 3
Rank (Obsolete): 6940

 Description   

I found this problem while investigating LU-2809. When running "lfs setstripe -c -1 -i n testfile" with a starting OST index n other than 0, the file is not striped across all the OSTs: OST0 is always skipped.



 Comments   
Comment by Emoly Liu [ 26/Feb/13 ]

Reproduced with OSTCOUNT=4:

[root@centos6-1 ~]# cd /mnt/lustre
[root@centos6-1 lustre]# mkdir test;cd test
[root@centos6-1 test]# for i in 0 1 2 3; do lfs setstripe -i $i -c -1 testfile$i;dd if=/dev/zero of=testfile$i bs=2M count=5;done
5+0 records in
5+0 records out
10485760 bytes (10 MB) copied, 0.0220632 s, 475 MB/s
5+0 records in
5+0 records out
10485760 bytes (10 MB) copied, 0.0193019 s, 543 MB/s
5+0 records in
5+0 records out
10485760 bytes (10 MB) copied, 0.0200823 s, 522 MB/s
5+0 records in
5+0 records out
10485760 bytes (10 MB) copied, 0.0190184 s, 551 MB/s
[root@centos6-1 test]# lfs getstripe *
testfile0
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  0
	obdidx		 objid		objid		 group
	     0	             1	          0x1	             0
	     1	             1	          0x1	             0
	     2	             1	          0x1	             0
	     3	             1	          0x1	             0

testfile1
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  1
	obdidx		 objid		objid		 group
	     1	             2	          0x2	             0
	     2	             2	          0x2	             0
	     3	             2	          0x2	             0
	     1	             3	          0x3	             0

testfile2
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  2
	obdidx		 objid		objid		 group
	     2	             3	          0x3	             0
	     3	             3	          0x3	             0
	     1	             4	          0x4	             0
	     2	             4	          0x4	             0

testfile3
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  3
	obdidx		 objid		objid		 group
	     3	             4	          0x4	             0
	     1	             5	          0x5	             0
	     2	             5	          0x5	             0
	     3	             5	          0x5	             0
Comment by Emoly Liu [ 27/Feb/13 ]

I added some debug messages:

diff --git a/lustre/lod/lod_dev.c b/lustre/lod/lod_dev.c
index 17dca0c..851caee 100644
--- a/lustre/lod/lod_dev.c
+++ b/lustre/lod/lod_dev.c
@@ -68,7 +68,10 @@ int lod_fld_lookup(const struct lu_env *env, struct lod_device *lod,
 
        LASSERTF(fid_is_sane(fid), "Invalid FID "DFID"\n", PFID(fid));
        if (fid_is_idif(fid)) {
+               printk("before ostindex:%d, FID "DFID"\n",
+                      cpu_to_le32(*tgt), PFID(fid));
                *tgt = fid_idif_ost_idx(fid);
+               printk("after ostindex:%d\n", cpu_to_le32(*tgt));
                RETURN(rc);
        }
diff --git a/lustre/osd-ldiskfs/osd_handler.c b/lustre/osd-ldiskfs/osd_handler.c
index f6dad39..39decd5 100644
--- a/lustre/osd-ldiskfs/osd_handler.c
+++ b/lustre/osd-ldiskfs/osd_handler.c
@@ -2206,6 +2206,20 @@ static inline int __osd_xattr_set(struct osd_thread_info *info,
 
        ll_vfs_dq_init(inode);
        dentry->d_inode = inode;
+       if (strcmp(name, XATTR_NAME_LOV) == 0) {
+               struct lov_mds_md_v1 *lmm = (struct lov_mds_md_v1 *)buf;
+               int stripe_count = lmm->lmm_stripe_count;
+               struct lov_ost_data *objects = lmm->lmm_objects;
+               int i;
+
+               printk("stripecount=%d, stripesize=%d\n",
+                      stripe_count, lmm->lmm_stripe_size);
+               for (i = 0; i < stripe_count; i++) {
+                      int idx = objects[i].l_ost_idx;
+                      printk("here ostindex:%d\n", idx);
+               }
+       }
        return inode->i_op->setxattr(dentry, name, buf, buflen, fl);
 }

and dmesg showed:

Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
before ostindex:-30720, FID [0x100000000:0x1:0x0]
after ostindex:0
before ostindex:0, FID [0x100010000:0x1:0x0]
after ostindex:1
before ostindex:1, FID [0x100020000:0x1:0x0]
after ostindex:2
before ostindex:2, FID [0x100030000:0x1:0x0]
after ostindex:3
stripecount=4, stripesize=1048576
here ostindex:0
here ostindex:1
here ostindex:2
here ostindex:3
before ostindex:-30720, FID [0x100010000:0x2:0x0]
after ostindex:1
before ostindex:1, FID [0x100020000:0x2:0x0]
after ostindex:2
before ostindex:2, FID [0x100030000:0x2:0x0]
after ostindex:3
before ostindex:3, FID [0x100010000:0x3:0x0]
after ostindex:1
stripecount=4, stripesize=1048576
here ostindex:1
here ostindex:2
here ostindex:3
here ostindex:1
before ostindex:-30720, FID [0x100020000:0x3:0x0]
after ostindex:2
before ostindex:2, FID [0x100030000:0x3:0x0]
after ostindex:3
before ostindex:3, FID [0x100010000:0x4:0x0]
after ostindex:1
before ostindex:1, FID [0x100020000:0x4:0x0]
after ostindex:2
stripecount=4, stripesize=1048576
here ostindex:2
here ostindex:3
here ostindex:1
here ostindex:2
before ostindex:-30720, FID [0x100030000:0x4:0x0]
after ostindex:3
before ostindex:3, FID [0x100010000:0x5:0x0]
after ostindex:1
before ostindex:1, FID [0x100020000:0x5:0x0]
after ostindex:2
before ostindex:2, FID [0x100030000:0x5:0x0]
after ostindex:3
stripecount=4, stripesize=1048576
here ostindex:3
here ostindex:1
here ostindex:2
here ostindex:3

Something looks wrong with the FID sequence.
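
To help read the log above, here is a minimal sketch of how the OST index relates to the FID sequence. It assumes that bits 16..31 of an IDIF sequence carry the OST index, which is consistent with every before/after pair in the dmesg output (for example 0x100010000 gives 1). The helper name idif_seq_to_ost_idx is hypothetical and is not the Lustre function itself.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not Lustre code): recover the OST index from an
 * IDIF FID sequence, ASSUMING the index is stored in bits 16..31.
 */
static uint32_t idif_seq_to_ost_idx(uint64_t seq)
{
	return (uint32_t)((seq >> 16) & 0xffff);
}
```

Under that assumption the sequences in the log decode to OST 1, 2, 3 and then 1 again for testfile1, so the repeated index is already present in the allocated objects rather than introduced by the lookup.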

Comment by Zhenyu Xu [ 27/Feb/13 ]

I found the root cause.

In lod_qos_ost_in_use_clear(), the ost_in_use array is initialised to 0, and in lod_qos_prep_create()->lod_alloc_specific(), ost_idx is taken from the OST array:

        for (i = 0; i < ost_count;
                        i++, array_idx = (array_idx + 1) % ost_count) {
                ost_idx = osts->op_array[array_idx];

and ost_idx is then checked against the ost_in_use array:

                if (lod_qos_is_ost_used(env, ost_idx, stripe_num))
                        continue;

If stripe_offset is 0, then in the first iteration stripe_num is also 0, so lod_qos_is_ost_used() returns false and an object is allocated on the first OST device.

If the file's stripes start from an index other than 0, however, stripe_num is already greater than 0 when the loop reaches ost_idx 0. lod_qos_is_ost_used(env, 0, stripe_num) then returns true, because the zero-filled array looks as if OST 0 were already in use, and the first OST device is skipped.

The fix should be in lod_qos_ost_in_use_clear(). With the following patch, object stripe allocation is correct:

diff --git a/lustre/lod/lod_qos.c b/lustre/lod/lod_qos.c
index 2b81ad8..2f46e7c 100644
--- a/lustre/lod/lod_qos.c
+++ b/lustre/lod/lod_qos.c
@@ -629,7 +629,7 @@ static inline int lod_qos_ost_in_use_clear(const struct lu_env *env, int stripes
                CERROR("can't allocate memory for ost-in-use array\n");
                return -ENOMEM;
        }
-       memset(info->lti_ea_store, 0, sizeof(int) * stripes);
+       memset(info->lti_ea_store, -1, sizeof(int) * stripes);
        return 0;
 }
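
The analysis above can be reproduced in user space with a small simulation. This is only a sketch under stated assumptions, not the Lustre implementation: in_use[], is_ost_used() and alloc_specific() are hypothetical stand-ins, OST indices are assumed to be 0..3, and the second pass is an assumed model of the retry that fills the remaining stripe.

```c
#include <string.h>

#define OST_COUNT 4

/* Hypothetical stand-in for the per-thread ost-in-use array. */
static int in_use[OST_COUNT];

/* Scan the first `stripes` slots, as the check is described above. */
static int is_ost_used(int ost, int stripes)
{
	for (int j = 0; j < stripes; j++)
		if (in_use[j] == ost)
			return 1;
	return 0;
}

/*
 * Choose OST_COUNT stripes starting at `offset`.
 * fill:   byte passed to memset() (0 = old behaviour, -1 = patched).
 *         memset() works on bytes; 0xFF in every byte makes each int
 *         equal to -1 on two's-complement machines.
 * record: whether a chosen OST is stored in in_use[].
 */
static void alloc_specific(int offset, int fill, int record, int *out)
{
	int stripe_num = 0;

	memset(in_use, fill, sizeof(in_use));

	/* Second pass is an ASSUMED model of the retry for a missing stripe. */
	for (int pass = 0; pass < 2 && stripe_num < OST_COUNT; pass++) {
		int array_idx = offset;

		for (int i = 0; i < OST_COUNT && stripe_num < OST_COUNT;
		     i++, array_idx = (array_idx + 1) % OST_COUNT) {
			int ost_idx = array_idx;

			if (is_ost_used(ost_idx, stripe_num))
				continue;
			if (record)
				in_use[stripe_num] = ost_idx;
			out[stripe_num++] = ost_idx;
		}
	}
}
```

With fill 0 and no recording, alloc_specific(1, 0, 0, out) yields 1, 2, 3, 1, the same layout lfs getstripe reported for testfile1. With fill -1, the same call yields 1, 2, 3, 0.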
Comment by Alex Zhuravlev [ 27/Feb/13 ]

pretty much correct. please put a patch into gerrit, thanks.

Comment by Emoly Liu [ 27/Feb/13 ]

Another way is to call lod_qos_ost_in_use() right after the lod_qos_is_ost_used() check, right?

diff --git a/lustre/lod/lod_qos.c b/lustre/lod/lod_qos.c
index 2b81ad8..92b3b36 100644
--- a/lustre/lod/lod_qos.c
+++ b/lustre/lod/lod_qos.c
@@ -887,6 +887,7 @@ repeat_find:
                 */
                if (lod_qos_is_ost_used(env, ost_idx, stripe_num))
                        continue;
+               lod_qos_ost_in_use(env, stripe_num, ost_idx);
 
                /* Drop slow OSCs if we can, but not for requested start idx.
                 *
Comment by Emoly Liu [ 27/Feb/13 ]

I will add a sanity test for this case.

Comment by Emoly Liu [ 28/Feb/13 ]

Patch is at http://review.whamcloud.com/5554

Comment by Alex Zhuravlev [ 01/Mar/13 ]

liuying, it's better to mark the index as used after a successful lod_qos_declare_object_on(). Also, I don't think this is an alternative to the change by Zhenyu Xu; I think both changes should be applied.

Comment by Emoly Liu [ 01/Mar/13 ]

Sure, I made both changes in the patch and will update it per Ned Bass' advice later. Thanks!

Comment by Emoly Liu [ 15/Mar/13 ]

Landed for 2.4
