LU-5778: MDS not creating files on OSTs properly

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.7.0
    • Affects Version: Lustre 2.5.2
    • Environment: CentOS 6.5, kernel 2.6.32-431.17.1.el6_lustre.x86_64
    • Severity: 3
    • 16216

Description

One of our Stampede filesystems running Lustre 2.5.2 has an OST offline due to a different problem described in another ticket. Since the OST went offline, the MDS server crashed with an LBUG and was restarted last Friday. After the restart, the MDS no longer automatically creates files on any OSTs beyond the offline OST. In our case, OST0010 is offline, so the MDS will only create files on the first 16 OSTs unless we manually specify the stripe offset with lfs setstripe. This is overloading the servers hosting these OSTs while the others are doing nothing. If we deactivate the first 16 OSTs on the MDS, then all files are created with the first stripe on the lowest-numbered active OST.

Can you suggest any way to force the MDS to use all the other OSTs through any lctl set_param options? Getting the offline OST back online is not currently an option: due to corruption and an ongoing e2fsck, it cannot be mounted. Manually setting the stripe is also not an option; we need it to work automatically as it should. Could we set some qos options to try to have it balance file creation across the OSTs?

Attachments

  1. lctl_state.out (44 kB)
  2. lctl_target_obd.out (11 kB)
  3. LU-5778_file_create_getstripe.out.gz (12 kB)
  4. LU-5778.debug_filtered.bz2 (30 kB)
  5. mds5_prealloc.out (128 kB)


Activity
Aurelien Degremont (Inactive) added a comment - Patch for b2_5: http://review.whamcloud.com/12685
Aurelien Degremont (Inactive) added a comment - I've pushed: http://review.whamcloud.com/12617

Aurelien Degremont (Inactive) added a comment - I will try to do that

Niu Yawei (Inactive) added a comment -

> Because in this case lod_statfs_and_check() puts the result in sfs, which is &lod_env_info(env)->lti_osfs, and not in &OST_TGT(lod,idx)->ltd_statfs

Indeed, I think you are right.

> The question could be why we have 2 structures to hold the same OST counters?

I think it is because we don't want to take lq_rw_sem in lod_alloc_qos(); that would introduce lots of contention.

Hi Aurelien,
Would you mind posting a patch to fix this (replace the 'break' with 'continue' in lod_qos_statfs_update())? Thanks.
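To make the distinction concrete, here is a simplified contrast of the two destinations, based only on the fragments quoted in this ticket (a sketch, not the verbatim Lustre source):

        /* lod_qos_statfs_update(): refresh the cached per-target copy,
         * which per the discussion above is protected by lq_rw_sem. */
        rc = lod_statfs_and_check(env, lod, idx,
                                  &OST_TGT(lod, idx)->ltd_statfs);

        /* lod_alloc_qos(): re-query into the per-thread scratch buffer,
         * so the allocation path never needs to take lq_rw_sem. */
        struct obd_statfs *sfs = &lod_env_info(env)->lti_osfs;
        rc = lod_statfs_and_check(env, m, osts->op_array[i], sfs);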

Antoine Percher added a comment -

Because in this case lod_statfs_and_check() puts the result in sfs, which is &lod_env_info(env)->lti_osfs, and not in &OST_TGT(lod,idx)->ltd_statfs.
The question could be why we have 2 structures to hold the same OST counters?

Niu Yawei (Inactive) added a comment -

> Niu, can you take a look and confirm?

I agree that the 'break' here is inappropriate, but I don't see why this can lead to always allocating objects on the OSTs before the deactivated one.

In lod_alloc_qos(), statfs will be performed on all OSPs again:

        /* Find all the OSTs that are valid stripe candidates */
        for (i = 0; i < osts->op_count; i++) {
                if (!cfs_bitmap_check(m->lod_ost_bitmap, osts->op_array[i]))
                        continue;

                rc = lod_statfs_and_check(env, m, osts->op_array[i], sfs);
                if (rc) {
                        /* this OSP doesn't feel well */
                        continue;
                }

Aurelien Degremont (Inactive) added a comment - As far as I can see in the current master branch, the problem is still there.

Alex Zhuravlev added a comment - IIRC, originally osp_statfs() wasn't returning an error; instead it claimed an "empty" OST if the connection was down.
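A rough, hypothetical illustration of that earlier behaviour (this is not the actual osp_statfs() source; osp_is_connected() and osp_do_statfs() are made-up helper names): on a down connection, report zeroed statistics instead of an error, so the caller's loop keeps going and the OST simply gets no allocation weight.

        static int osp_statfs_sketch(const struct lu_env *env,
                                     struct osp_device *osp,
                                     struct obd_statfs *sfs)
        {
                if (!osp_is_connected(osp)) {
                        /* claim an "empty" OST rather than failing */
                        memset(sfs, 0, sizeof(*sfs));
                        return 0;
                }
                /* connection is up: do the real statfs query */
                return osp_do_statfs(env, osp, sfs);
        }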

Tommy Minyard added a comment -

Thanks for the workaround, Aurelien; we will keep it in mind if we encounter this situation again. Do you know if this problem is still in the main branch of the source tree, or is it just in the 2.5.x versions like we are running? If so, it should be a simple fix, as you noted.

Niu, can you take a look and confirm?

Aurelien Degremont (Inactive) added a comment -

We faced the same issue at CEA. We analyzed the problem and tracked it down to lod_qos_statfs_update(). Indeed, in this function each OST is checked successively. If an OST has active=0, lod_statfs_and_check() will return ENOTCONN and the for-loop will break. The following OSTs won't be checked, their statfs data will not be updated (it stays zero), and they will not be used when allocating file objects.

        for (i = 0; i < osts->op_count; i++) {
                idx = osts->op_array[i];
                avail = OST_TGT(lod,idx)->ltd_statfs.os_bavail;
                rc = lod_statfs_and_check(env, lod, idx,
                                          &OST_TGT(lod,idx)->ltd_statfs);
                if (rc)
                        break;
                if (OST_TGT(lod,idx)->ltd_statfs.os_bavail != avail)
                        /* recalculate weigths */
                        lod->lod_qos.lq_dirty = 1;
        }

A simple workaround for this bug is to disable QOS allocation by setting qos_threshold_rr to 100, falling back to simple round-robin.

A fix seems to be to simply replace the break with a continue.
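For reference, a minimal sketch of that suggested change, mirroring the fragment quoted above with the break replaced by continue (an illustration only, not the verbatim patch; see http://review.whamcloud.com/12617): an unreachable OST is skipped and the remaining targets still get their cached statfs refreshed.

        for (i = 0; i < osts->op_count; i++) {
                idx = osts->op_array[i];
                avail = OST_TGT(lod,idx)->ltd_statfs.os_bavail;
                rc = lod_statfs_and_check(env, lod, idx,
                                          &OST_TGT(lod,idx)->ltd_statfs);
                if (rc)
                        /* skip this OST but keep refreshing the rest */
                        continue;
                if (OST_TGT(lod,idx)->ltd_statfs.os_bavail != avail)
                        /* recalculate weights */
                        lod->lod_qos.lq_dirty = 1;
        }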

Tommy Minyard added a comment -

Oleg,
In regard to the file creation, we did have some directories set with a manual stripe offset (not -1), so stripe creation was being forced onto specific OSTs. I ran lfs getstripe on the 500 files created and they all landed on obdidx values less than 16. I've attached the output from that file create. As you noted, a large number of OSTs were inactive due to us trying to distribute files across the other OSTs, but in this case all files still had their stripes on the first 16 OSTs.

This ticket is no longer critical for us since we were able to recreate OST0010 over the weekend, and once it was mounted and had checked in with the MDS, file creation started across all the active OSTs again as it should. This specific issue must have been due to the OST being offline and unable to check in with the MDS when the MDS had to be restarted after an LBUG crash. We certainly do not get into this state often, as usually all OSTs are available and not in a state where they cannot be mounted.

People

  Assignee: Niu Yawei (Inactive)
  Reporter: Tommy Minyard
