
[LU-5778] MDS not creating files on OSTs properly

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.5.2
    • Environment: CentOS 6.5, kernel 2.6.32-431.17.1.el6_lustre.x86_64
    • 3
    • 16216

    Description

      One of our Stampede filesystems running Lustre 2.5.2 has an OST offline due to a different problem described in another ticket. Since the OST went offline, the MDS server crashed with an LBUG and was restarted last Friday. After the restart, the MDS no longer automatically creates files on any OST numbered after the offline one. In our case OST0010 is offline, so the MDS will only create files on the first 16 OSTs unless we manually specify the stripe offset with lfs setstripe. This is overloading the servers hosting those OSTs while the others are doing nothing. If we deactivate the first 16 OSTs on the MDS, then all files are created with the first stripe on the lowest-numbered active OST.
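      For reference, the manual workaround mentioned above looks roughly like this; the path and OST index are only examples, and -i (--stripe-index) selects the OST that receives the first stripe:

        # Create the file with its first stripe on a specific OST past the offline one
        lfs setstripe -c 1 -i 17 /scratch/example_file
        # Confirm which OST object was actually allocated
        lfs getstripe /scratch/example_file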

      Can you suggest any way to force the MDS to use all the other OSTs through any lctl set_param options? Getting the offline OST back online is not currently an option: due to corruption and an ongoing e2fsck, it cannot be mounted. Manually setting the stripe is not an option either; we need file creation to work automatically, as it should. Could we set some qos options to have the MDS balance file creation across the OSTs?
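      For reference, the QOS tunables in question can be inspected and adjusted on the MDS roughly as follows; the parameter names assume the lov layer on the MDS and the values shown are only illustrative:

        # Free-space imbalance (in percent) above which the allocator switches
        # from round-robin to space-weighted (QOS) allocation
        lctl get_param lov.*.qos_threshold_rr
        # How strongly free space is weighted once QOS allocation is in effect
        lctl get_param lov.*.qos_prio_free
        # Example: weight free space more heavily and lower the round-robin threshold
        lctl set_param lov.*.qos_prio_free=100
        lctl set_param lov.*.qos_threshold_rr=10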

      Attachments

        1. lctl_state.out
          44 kB
        2. lctl_target_obd.out
          11 kB
        3. LU-5778_file_create_getstripe.out.gz
          12 kB
        4. LU-5778.debug_filtered.bz2
          30 kB
        5. mds5_prealloc.out
          128 kB


          Activity

            [LU-5778] MDS not creating files on OSTs properly
            pjones Peter Jones added a comment -

            Fix landed for 2.5.4 and 2.7


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12685/
            Subject: LU-5778 lod: Fix lod_qos_statfs_update()
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: cebb8fd03635f2f4e8f17c3a902eeba8008b07c4
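            To check whether this fix is present in a local lustre-release checkout, something like the following should work; the checkout path is an example and the commit hash is the one quoted above:

              cd ~/src/lustre-release
              git fetch origin
              # List remote branches that already contain the fix
              git branch -r --contains cebb8fd03635f2f4e8f17c3a902eeba8008b07c4
              # Or search the b2_5 history for the ticket number
              git log --oneline --grep='LU-5778' origin/b2_5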


            adegremont Aurelien Degremont (Inactive) added a comment -

            I've finally found time for a reproducer.

            cd /usr/lib64/lustre/tests
            OSTCOUNT=5 MGSDEV=/tmp/lustre-mgs ./llmount.sh
            
            # Fill the first OST to unbalance disk space and activate qos algorithm
            lfs setstripe -c1 -i0 /mnt/lustre/fill.0
            dd if=/dev/zero of=/mnt/lustre/fill.0 bs=1M count=100
            
            sync
            lfs df -h /mnt/lustre
            
            # Stop OST0002 (mounted at /mnt/ost3) so the MDS cannot get statfs from it
            umount /mnt/ost3
            # Deactivate this OST
            lctl conf_param lustre-OST0002.osc.active=0
            
            # Re-start MDT to clear cached data
            umount /mnt/mds1
            mount -t lustre /tmp/lustre-mdt1 /mnt/mds1 -o loop
            
            # Now create a lot of files and check how they are striped
            for i in {1..50}; do lfs setstripe -c1 /mnt/lustre/file.$i; done
            lfs getstripe /mnt/lustre/file.* | awk '/0x/ {print $1}' | sort | uniq -c
            

            Files are striped only on OST #0 and #1, and nothing lands on #3 and #4, even though #0 is almost full and #1, #3 and #4 are empty. (#2 is deactivated and stopped.)

            Maybe I will find time to create a test case after SC14.
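            For reference, the MDS-side view of each OST in the reproducer above can be checked with something like this; the osp parameter names assume the usual <fsname>-OSTxxxx-osc-MDT0000 device naming:

              # List the MDS-side connections to the OSTs and their state
              lctl dl | grep osc
              # Per-OST object preallocation status as seen by the MDS
              lctl get_param osp.*.prealloc_status
              lctl get_param osp.*.prealloc_last_id osp.*.prealloc_next_id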

            adegremont Aurelien Degremont (Inactive) added a comment - Patch for b2_5: http://review.whamcloud.com/12685
            adegremont Aurelien Degremont (Inactive) added a comment - I've pushed: http://review.whamcloud.com/12617

            adegremont Aurelien Degremont (Inactive) added a comment -

            I will try to do that


            niu Niu Yawei (Inactive) added a comment -

            Because in this case lod_statfs_and_check() puts the result into sfs, which is &lod_env_info(env)->lti_osfs,
            and not into &OST_TGT(lod,idx)->ltd_statfs.

            Indeed, I think you are right.

            The question could be why we have two structures holding the same OST counters?

            I think it is because we don't want to take lq_rw_sem in lod_alloc_qos(); that would introduce a lot of contention.

            Hi Aurelien,
            Would you mind posting a patch to fix this (replace the 'break' with 'continue' in lod_qos_statfs_update())? Thanks.
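            For reference, such a one-line change (replacing the 'break' with 'continue' in lustre/lod/lod_qos.c) would typically be pushed for review roughly like this; the branch name and checkout path are examples, and the Change-Id line is added by the gerrit commit-msg hook:

              cd ~/src/lustre-release
              git checkout -b lu-5778-qos-statfs origin/master
              # ... change the 'break' to 'continue' in lod_qos_statfs_update() ...
              git commit -as -m "LU-5778 lod: Fix lod_qos_statfs_update()"
              git push ssh://review.whamcloud.com:29418/fs/lustre-release HEAD:refs/for/master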


            apercher Antoine Percher added a comment -

            Because in this case lod_statfs_and_check() puts the result into sfs, which is &lod_env_info(env)->lti_osfs,
            and not into &OST_TGT(lod,idx)->ltd_statfs.
            The question could be why we have two structures holding the same OST counters?


            niu Niu Yawei (Inactive) added a comment -

            Niu, can you take a look and confirm?

            I agree that the 'break' here is inappropriate, but I don't see why this can lead to always allocating objects on the OSTs before the deactivated one.

            In lod_alloc_qos(), statfs will be performed on all OSPs again:

                    /* Find all the OSTs that are valid stripe candidates */
                    for (i = 0; i < osts->op_count; i++) {
                            if (!cfs_bitmap_check(m->lod_ost_bitmap, osts->op_array[i]))
                                    continue;
            
                            rc = lod_statfs_and_check(env, m, osts->op_array[i], sfs);
                            if (rc) {
                                    /* this OSP doesn't feel well */
                                    continue;
                            }
            

            adegremont Aurelien Degremont (Inactive) added a comment -

            As far as I can see in the current master branch, the problem is still there.


            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: minyard Tommy Minyard
              Votes: 0
              Watchers: 15
