
Data can't be striped across all the OSTs correctly by running "lfs setstripe -c -1 -i n" (n>0)

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • 6940

    Description

      I found this problem while investigating LU-2809. When running "lfs setstripe -c -1 -i n testfile" with a starting OST index n greater than 0, the data is not striped across all the OSTs: OST0 is always skipped.

          Activity

            [LU-2871] Data can't be striped across all the OSTs correctly by running "lfs setstripe -c -1 -i n" (n>0)
            emoly.liu Emoly Liu added a comment -

            Landed for 2.4

            emoly.liu Emoly Liu added a comment -

            Sure, I made both changes in the patch and will update it per Ned Bass' advice later. Thanks!


            bzzz Alex Zhuravlev added a comment -

            liuying, it's better to mark the index as used after a successful lod_qos_declare_object_on(). Also, I don't think this is an alternative to Zhenyu Xu's change; I think both changes should be applied.
            emoly.liu Emoly Liu added a comment - Patch is at http://review.whamcloud.com/5554
            emoly.liu Emoly Liu added a comment -

            I will add a sanity test for this case.

            emoly.liu Emoly Liu added a comment - - edited

            Another way would be to call lod_qos_ost_in_use() after the lod_qos_is_ost_used() check, right?

            diff --git a/lustre/lod/lod_qos.c b/lustre/lod/lod_qos.c
            index 2b81ad8..92b3b36 100644
            --- a/lustre/lod/lod_qos.c
            +++ b/lustre/lod/lod_qos.c
            @@ -887,6 +887,7 @@ repeat_find:
                             */
                            if (lod_qos_is_ost_used(env, ost_idx, stripe_num))
                                    continue;
            +               lod_qos_ost_in_use(env, stripe_num, ost_idx);
             
                            /* Drop slow OSCs if we can, but not for requested start idx.
                             *
            

            bzzz Alex Zhuravlev added a comment -

            Pretty much correct. Please put a patch into Gerrit, thanks.
            bobijam Zhenyu Xu added a comment - - edited

            I found the root cause.

            In lod_qos_ost_in_use_clear(), the ost_in_use array is initialised to 0, and in lod_qos_prep_create()->lod_alloc_specific(), ost_idx is taken from the OST pool array:

                    for (i = 0; i < ost_count;
                                    i++, array_idx = (array_idx + 1) % ost_count) {
                            ost_idx = osts->op_array[array_idx];
            

            and ost_idx is then checked against the ost_in_use array:

                            if (lod_qos_is_ost_used(env, ost_idx, stripe_num))
                                    continue;
            

            If stripe_offset is 0, then in the first iteration stripe_num is also 0, lod_qos_is_ost_used() returns false, and the object is allocated on the first OST device.

            But if the file's striping starts from an index other than 0, then by the time the loop reaches ost_idx 0, lod_qos_is_ost_used(env, 0, stripe_num) returns true, because the zero-initialised array entries look like OST index 0, and the first OST device is skipped.

            The fix should be in lod_qos_ost_in_use_clear(). With the following patch, the object stripe allocation is correct.

            diff --git a/lustre/lod/lod_qos.c b/lustre/lod/lod_qos.c
            index 2b81ad8..2f46e7c 100644
            --- a/lustre/lod/lod_qos.c
            +++ b/lustre/lod/lod_qos.c
            @@ -629,7 +629,7 @@ static inline int lod_qos_ost_in_use_clear(const struct lu_env *env, int stripes
                            CERROR("can't allocate memory for ost-in-use array\n");
                            return -ENOMEM;
                    }
            -       memset(info->lti_ea_store, 0, sizeof(int) * stripes);
            +       memset(info->lti_ea_store, -1, sizeof(int) * stripes);
                    return 0;
             }
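
The root-cause analysis above can be modelled outside the kernel. The following is a minimal user-space sketch, not the Lustre code: is_ost_used() and alloc_specific() are hypothetical stand-ins for lod_qos_is_ost_used() and the allocation loop, the OST pool is assumed to be the identity mapping over four OSTs, and the second pass is only a simplified guess at what the repeat_find retry does. The fill argument selects the memset() value from the patch above, and mark selects the "mark the OST as in use" change proposed elsewhere in this thread.

```c
#include <string.h>

#define OST_COUNT 4	/* assumed: four OSTs, as in the dmesg output */

/* Hypothetical stand-in for lod_qos_is_ost_used(): report whether `ost`
 * appears in the first `stripes` slots of the in-use array. */
static int is_ost_used(const int *in_use, int ost, int stripes)
{
	int j;

	for (j = 0; j < stripes; j++)
		if (in_use[j] == ost)
			return 1;
	return 0;
}

/*
 * Simplified model of allocating OST_COUNT stripes starting at OST `start`.
 * fill: byte value passed to memset() for the in-use array (0 or -1).
 * mark: non-zero to record each chosen OST in the in-use array.
 * Writes the chosen OST indices to out[] and returns the stripe count.
 */
static int alloc_specific(int start, int fill, int mark, int *out)
{
	int in_use[OST_COUNT];
	int stripe_num = 0;
	int array_idx = start % OST_COUNT;
	int pass, i;

	memset(in_use, fill, sizeof(in_use));

	/* Assumption: at most one retry pass, loosely modelling repeat_find. */
	for (pass = 0; pass < 2 && stripe_num < OST_COUNT; pass++) {
		for (i = 0; i < OST_COUNT && stripe_num < OST_COUNT;
		     i++, array_idx = (array_idx + 1) % OST_COUNT) {
			int ost_idx = array_idx;	/* identity pool assumed */

			if (is_ost_used(in_use, ost_idx, stripe_num))
				continue;
			if (mark)
				in_use[stripe_num] = ost_idx;
			out[stripe_num++] = ost_idx;
		}
	}
	return stripe_num;
}
```

In this model, alloc_specific(1, 0, 0, out) yields the layout 1, 2, 3, 1, the same as the second layout in the dmesg output in this thread, while either fill = -1 or mark = 1 yields 1, 2, 3, 0. The model says nothing about failure paths in the real allocator, which is where applying both changes together may matter.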
            
            emoly.liu Emoly Liu added a comment -

            I added some debug messages:

            diff --git a/lustre/lod/lod_dev.c b/lustre/lod/lod_dev.c
            index 17dca0c..851caee 100644
            --- a/lustre/lod/lod_dev.c
            +++ b/lustre/lod/lod_dev.c
            @@ -68,7 +68,10 @@ int lod_fld_lookup(const struct lu_env *env, struct lod_device *lod,
             
                    LASSERTF(fid_is_sane(fid), "Invalid FID "DFID"\n", PFID(fid));
                    if (fid_is_idif(fid)) {
            +               printk("before ostindex:%d, FID "DFID"\n",
            +                      cpu_to_le32(*tgt), PFID(fid));
                            *tgt = fid_idif_ost_idx(fid);
            +               printk("after ostindex:%d\n", cpu_to_le32(*tgt));
                            RETURN(rc);
                    }
            diff --git a/lustre/osd-ldiskfs/osd_handler.c b/lustre/osd-ldiskfs/osd_handler.c
            index f6dad39..39decd5 100644
            --- a/lustre/osd-ldiskfs/osd_handler.c
            +++ b/lustre/osd-ldiskfs/osd_handler.c
            @@ -2206,6 +2206,20 @@ static inline int __osd_xattr_set(struct osd_thread_info *info,
             
                    ll_vfs_dq_init(inode);
                    dentry->d_inode = inode;
            +       if (strcmp(name, XATTR_NAME_LOV) == 0) {
            +               struct lov_mds_md_v1 *lmm = (struct lov_mds_md_v1 *)buf;
            +               int stripe_count = lmm->lmm_stripe_count;
            +               struct lov_ost_data *objects = lmm->lmm_objects;
            +               int i;
            +
            +               printk("stripecount=%d, stripesize=%d\n",
            +                      stripe_count, lmm->lmm_stripe_size);
            +               for (i = 0; i < stripe_count; i++) {
            +                      int idx = objects[i].l_ost_idx;
            +                      printk("here ostindex:%d\n", idx);
            +               }
            +       }
                    return inode->i_op->setxattr(dentry, name, buf, buflen, fl);
             }
            

            and dmesg showed

            Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
            before ostindex:-30720, FID [0x100000000:0x1:0x0]
            after ostindex:0
            before ostindex:0, FID [0x100010000:0x1:0x0]
            after ostindex:1
            before ostindex:1, FID [0x100020000:0x1:0x0]
            after ostindex:2
            before ostindex:2, FID [0x100030000:0x1:0x0]
            after ostindex:3
            stripecount=4, stripesize=1048576
            here ostindex:0
            here ostindex:1
            here ostindex:2
            here ostindex:3
            before ostindex:-30720, FID [0x100010000:0x2:0x0]
            after ostindex:1
            before ostindex:1, FID [0x100020000:0x2:0x0]
            after ostindex:2
            before ostindex:2, FID [0x100030000:0x2:0x0]
            after ostindex:3
            before ostindex:3, FID [0x100010000:0x3:0x0]
            after ostindex:1
            stripecount=4, stripesize=1048576
            here ostindex:1
            here ostindex:2
            here ostindex:3
            here ostindex:1
            before ostindex:-30720, FID [0x100020000:0x3:0x0]
            after ostindex:2
            before ostindex:2, FID [0x100030000:0x3:0x0]
            after ostindex:3
            before ostindex:3, FID [0x100010000:0x4:0x0]
            after ostindex:1
            before ostindex:1, FID [0x100020000:0x4:0x0]
            after ostindex:2
            stripecount=4, stripesize=1048576
            here ostindex:2
            here ostindex:3
            here ostindex:1
            here ostindex:2
            before ostindex:-30720, FID [0x100030000:0x4:0x0]
            after ostindex:3
            before ostindex:3, FID [0x100010000:0x5:0x0]
            after ostindex:1
            before ostindex:1, FID [0x100020000:0x5:0x0]
            after ostindex:2
            before ostindex:2, FID [0x100030000:0x5:0x0]
            after ostindex:3
            stripecount=4, stripesize=1048576
            here ostindex:3
            here ostindex:1
            here ostindex:2
            here ostindex:3
            

            There is something wrong with the FID sequence.
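
For reading the log above, it helps to see how each "after ostindex" value relates to the FID printed on the preceding line. The sketch below assumes the OST index occupies bits 16-31 of the IDIF sequence number, which is consistent with every before/after pair in the dmesg output; idif_seq_to_ost_idx() is a hypothetical helper written for illustration, not the kernel's fid_idif_ost_idx().

```c
#include <stdint.h>

/* Hypothetical helper: extract the OST index from an IDIF sequence,
 * assuming the index is stored in bits 16-31 of the sequence number. */
static uint32_t idif_seq_to_ost_idx(uint64_t seq)
{
	return (uint32_t)((seq >> 16) & 0xffff);
}
```

Under that assumption, sequence 0x100010000 decodes to OST 1 and 0x100030000 to OST 3. The logged sequences never decode to OST 0 once the starting index is non-zero, so the missing OST shows up in the FIDs as well as in the stripe list.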


            People

              emoly.liu Emoly Liu
              emoly.liu Emoly Liu