Data can't be striped across all the OSTs correctly by running "lfs setstripe -c -1 -i n" (n>0)

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • 6940

    Description

      I found this problem while investigating LU-2809. When running "lfs setstripe -c -1 -i n testfile" with a starting OST index n other than 0, the file is not striped across all the OSTs: OST0 is always skipped.

          Activity

            [LU-2871] Data can't be striped across all the OSTs correctly by running "lfs setstripe -c -1 -i n" (n>0)
            emoly.liu Emoly Liu added a comment -

            Landed for 2.4

            emoly.liu Emoly Liu added a comment -

            Sure, I made both changes in the patch and will update it per Ned Bass's advice later. Thanks!

            bzzz Alex Zhuravlev added a comment -

            liuying, it's better to mark the index as used only after lod_qos_declare_object_on() succeeds. Also, I don't think this is an alternative to Zhenyu Xu's change; I think both changes should be applied.
            emoly.liu Emoly Liu added a comment - Patch is at http://review.whamcloud.com/5554
            emoly.liu Emoly Liu added a comment -

            I will add a sanity test for this case.

            emoly.liu Emoly Liu added a comment - - edited

            Another way would be to call lod_qos_ost_in_use() right after the lod_qos_is_ost_used() check, right?

            diff --git a/lustre/lod/lod_qos.c b/lustre/lod/lod_qos.c
            index 2b81ad8..92b3b36 100644
            --- a/lustre/lod/lod_qos.c
            +++ b/lustre/lod/lod_qos.c
            @@ -887,6 +887,7 @@ repeat_find:
                             */
                            if (lod_qos_is_ost_used(env, ost_idx, stripe_num))
                                    continue;
            +               lod_qos_ost_in_use(env, stripe_num, ost_idx);
             
                            /* Drop slow OSCs if we can, but not for requested start idx.
                             *
            

            bzzz Alex Zhuravlev added a comment -

            Pretty much correct. Please put a patch into Gerrit, thanks.
            bobijam Zhenyu Xu added a comment - - edited

            I found the root cause.

            In lod_qos_ost_in_use_clear(), the ost_in_use array is initialised to 0, and in lod_qos_prep_create()->lod_alloc_specific(), ost_idx is assigned as follows:

                    for (i = 0; i < ost_count;
                                    i++, array_idx = (array_idx + 1) % ost_count) {
                            ost_idx = osts->op_array[array_idx];
            

            and ost_idx is then checked against the ost_in_use array:

                            if (lod_qos_is_ost_used(env, ost_idx, stripe_num))
                                    continue;
            

            If stripe_offset is 0, then in the first iteration stripe_num is also 0, so lod_qos_is_ost_used() returns false and an object is allocated on the first OST device.

            However, if the file's stripes start at an index other than 0, then by the time the loop reaches ost_idx 0, lod_qos_is_ost_used(env, 0, stripe_num) returns true, because the zero-initialised entries are indistinguishable from OST index 0, and the first OST device is skipped.

            The fix should be in lod_qos_ost_in_use_clear(). With the following patch, object stripe allocation is correct.

            diff --git a/lustre/lod/lod_qos.c b/lustre/lod/lod_qos.c
            index 2b81ad8..2f46e7c 100644
            --- a/lustre/lod/lod_qos.c
            +++ b/lustre/lod/lod_qos.c
            @@ -629,7 +629,7 @@ static inline int lod_qos_ost_in_use_clear(const struct lu_env *env, int stripes
                            CERROR("can't allocate memory for ost-in-use array\n");
                            return -ENOMEM;
                    }
            -       memset(info->lti_ea_store, 0, sizeof(int) * stripes);
            +       memset(info->lti_ea_store, -1, sizeof(int) * stripes);
                    return 0;
             }
            
            emoly.liu Emoly Liu added a comment -

            I added some debug messages:

            diff --git a/lustre/lod/lod_dev.c b/lustre/lod/lod_dev.c
            index 17dca0c..851caee 100644
            --- a/lustre/lod/lod_dev.c
            +++ b/lustre/lod/lod_dev.c
            @@ -68,7 +68,10 @@ int lod_fld_lookup(const struct lu_env *env, struct lod_device *lod,
             
                    LASSERTF(fid_is_sane(fid), "Invalid FID "DFID"\n", PFID(fid));
                    if (fid_is_idif(fid)) {
            +               printk("before ostindex:%d, FID "DFID"\n",
            +                      cpu_to_le32(*tgt), PFID(fid));
                            *tgt = fid_idif_ost_idx(fid);
            +               printk("after ostindex:%d\n", cpu_to_le32(*tgt));
                            RETURN(rc);
                    }
            diff --git a/lustre/osd-ldiskfs/osd_handler.c b/lustre/osd-ldiskfs/osd_handler.c
            index f6dad39..39decd5 100644
            --- a/lustre/osd-ldiskfs/osd_handler.c
            +++ b/lustre/osd-ldiskfs/osd_handler.c
            @@ -2206,6 +2206,20 @@ static inline int __osd_xattr_set(struct osd_thread_info *info,
             
                    ll_vfs_dq_init(inode);
                    dentry->d_inode = inode;
            +       if (strcmp(name, XATTR_NAME_LOV) == 0) {
            +               struct lov_mds_md_v1 *lmm = (struct lov_mds_md_v1 *)buf;
            +               int stripe_count = lmm->lmm_stripe_count;
            +               struct lov_ost_data *objects = lmm->lmm_objects;
            +               int i;
            +
            +               printk("stripecount=%d, stripesize=%d\n",
            +                      stripe_count, lmm->lmm_stripe_size);
            +               for (i = 0; i < stripe_count; i++) {
            +                      int idx = objects[i].l_ost_idx;
            +                      printk("here ostindex:%d\n", idx);
            +               }
            +       }
                    return inode->i_op->setxattr(dentry, name, buf, buflen, fl);
             }
            

            and dmesg showed:

            Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
            before ostindex:-30720, FID [0x100000000:0x1:0x0]
            after ostindex:0
            before ostindex:0, FID [0x100010000:0x1:0x0]
            after ostindex:1
            before ostindex:1, FID [0x100020000:0x1:0x0]
            after ostindex:2
            before ostindex:2, FID [0x100030000:0x1:0x0]
            after ostindex:3
            stripecount=4, stripesize=1048576
            here ostindex:0
            here ostindex:1
            here ostindex:2
            here ostindex:3
            before ostindex:-30720, FID [0x100010000:0x2:0x0]
            after ostindex:1
            before ostindex:1, FID [0x100020000:0x2:0x0]
            after ostindex:2
            before ostindex:2, FID [0x100030000:0x2:0x0]
            after ostindex:3
            before ostindex:3, FID [0x100010000:0x3:0x0]
            after ostindex:1
            stripecount=4, stripesize=1048576
            here ostindex:1
            here ostindex:2
            here ostindex:3
            here ostindex:1
            before ostindex:-30720, FID [0x100020000:0x3:0x0]
            after ostindex:2
            before ostindex:2, FID [0x100030000:0x3:0x0]
            after ostindex:3
            before ostindex:3, FID [0x100010000:0x4:0x0]
            after ostindex:1
            before ostindex:1, FID [0x100020000:0x4:0x0]
            after ostindex:2
            stripecount=4, stripesize=1048576
            here ostindex:2
            here ostindex:3
            here ostindex:1
            here ostindex:2
            before ostindex:-30720, FID [0x100030000:0x4:0x0]
            after ostindex:3
            before ostindex:3, FID [0x100010000:0x5:0x0]
            after ostindex:1
            before ostindex:1, FID [0x100020000:0x5:0x0]
            after ostindex:2
            before ostindex:2, FID [0x100030000:0x5:0x0]
            after ostindex:3
            stripecount=4, stripesize=1048576
            here ostindex:3
            here ostindex:1
            here ostindex:2
            here ostindex:3
            

            There is something wrong with the FID sequence.

            emoly.liu Emoly Liu added a comment -

            With OSTCOUNT=4:

            [root@centos6-1 ~]# cd /mnt/lustre
            [root@centos6-1 lustre]# mkdir test;cd test
            [root@centos6-1 test]# for i in 0 1 2 3; do lfs setstripe -i $i -c -1 testfile$i;dd if=/dev/zero of=testfile$i bs=2M count=5;done
            5+0 records in
            5+0 records out
            10485760 bytes (10 MB) copied, 0.0220632 s, 475 MB/s
            5+0 records in
            5+0 records out
            10485760 bytes (10 MB) copied, 0.0193019 s, 543 MB/s
            5+0 records in
            5+0 records out
            10485760 bytes (10 MB) copied, 0.0200823 s, 522 MB/s
            5+0 records in
            5+0 records out
            10485760 bytes (10 MB) copied, 0.0190184 s, 551 MB/s
            [root@centos6-1 test]# lfs getstripe *
            testfile0
            lmm_stripe_count:   4
            lmm_stripe_size:    1048576
            lmm_layout_gen:     0
            lmm_stripe_offset:  0
            	obdidx		 objid		objid		 group
            	     0	             1	          0x1	             0
            	     1	             1	          0x1	             0
            	     2	             1	          0x1	             0
            	     3	             1	          0x1	             0
            
            testfile1
            lmm_stripe_count:   4
            lmm_stripe_size:    1048576
            lmm_layout_gen:     0
            lmm_stripe_offset:  1
            	obdidx		 objid		objid		 group
            	     1	             2	          0x2	             0
            	     2	             2	          0x2	             0
            	     3	             2	          0x2	             0
            	     1	             3	          0x3	             0
            
            testfile2
            lmm_stripe_count:   4
            lmm_stripe_size:    1048576
            lmm_layout_gen:     0
            lmm_stripe_offset:  2
            	obdidx		 objid		objid		 group
            	     2	             3	          0x3	             0
            	     3	             3	          0x3	             0
            	     1	             4	          0x4	             0
            	     2	             4	          0x4	             0
            
            testfile3
            lmm_stripe_count:   4
            lmm_stripe_size:    1048576
            lmm_layout_gen:     0
            lmm_stripe_offset:  3
            	obdidx		 objid		objid		 group
            	     3	             4	          0x4	             0
            	     1	             5	          0x5	             0
            	     2	             5	          0x5	             0
            	     3	             5	          0x5	             0
            

            People

              emoly.liu Emoly Liu
              emoly.liu Emoly Liu