
OST selection algorithm broken with max_create_count=0 or empty OSTs

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.5
    • Affects Version/s: Lustre 2.7.0, Lustre 2.10.0
    • None
    • Server running:
      CentOS-6 lcentos-release-6-9.el6.12.3.x86_64
      lustre-2.7.3-1nasS_mofed33v3g_2.6.32_642.15.1.el6.20170609.x86_64.lustre273.x86_64
    • 3

    Description

      We have blocked new object creation to some of our OSTs with commands like:

      lctl set_param osp.$OSTNAME.max_create_count=0

      This is to drain data off of storage to be repurposed as spares. Three targets are already at 0%, and confirmed to have no remaining objects with e2scan and lester. 11 other targets are blocked and data is being migrated off.

      We noticed that a few of the other targets were filling up while others had plenty of space. After watching for a few days, the imbalance is getting worse.

      Confirmed that we are using default allocation settings:

      nbp7-mds1 ~ # lctl get_param lov.*.qos_* 
      lov.nbp7-MDT0000-mdtlov.qos_maxage=5 Sec
      lov.nbp7-MDT0000-mdtlov.qos_prio_free=91%
      lov.nbp7-MDT0000-mdtlov.qos_threshold_rr=17%

      Tests creating 100k new files of stripe count 1 showed that the more full OSTs are indeed getting allocated objects more often.

      This looks like it might be similar to LU-10823.

       


          Activity

            [LU-11115] OST selection algorithm broken with max_create_count=0 or empty OSTs

            jaylan Jay Lan (Inactive) added a comment -

            Peter, we need a nasa label on this ticket. Thanks.

            pjones Peter Jones added a comment -

            Landed for 2.12


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32823/
            Subject: LU-11115 lod: skip max_create_count=0 OST in QoS and RR algorithms
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 5b147e47de651f1c140f69314a2d6b56ff6b14d7

            gerrit Gerrit Updater added a comment -

            Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32859
            Subject: LU-11115 lod: skip max_create_count=0 OST in QoS and RR algorithms
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 7889706668c671cdadb8febfe819f7e475bdf257

            yujian Jian Yu added a comment -

            Sure, Jay.


            jaylan Jay Lan (Inactive) added a comment -

            Hi Jian Yu,

            Could you back port your patch to b2_10?

            There were conflicts. Thanks!

            Changes to be committed:

            modified: lustre/lod/lod_qos.c
            modified: lustre/osp/osp_precreate.c

            Unmerged paths:
            (use "git add/rm <file>..." as appropriate to mark resolution)

            deleted by us: lustre/include/uapi/linux/lustre/lustre_user.h
            both modified: lustre/lod/lod_object.c

            gerrit Gerrit Updater added a comment -

            Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32823
            Subject: LU-11115 lod: skip max_create_count=0 OST in QoS and RR algorithms
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 9c390821eec3eea991b820544dce52fba3a73494

            yujian Jian Yu added a comment -

            The issue can be reproduced.

            The Lustre debug log on the MDS shows that, when choosing an OST on which to create an object, lod_qos_prep_create() tries the QoS algorithm first by calling lod_alloc_qos(). If free space is distributed evenly among the OSTs, lod_alloc_qos() returns -EAGAIN, and lod_qos_prep_create() then calls lod_alloc_rr() to use the RR algorithm.

            Both lod_alloc_qos() and lod_alloc_rr() use lod_statfs_and_check() to decide whether an OST is available for new objects. However, that function does not check for max_create_count=0, so such an OST is still returned as available.

            This issue affects lod_alloc_qos() but not lod_alloc_rr(), because the RR path runs the following extra code in lod_check_and_reserve_ost(), which checks for and skips an OST with max_create_count=0:

            lod_check_and_reserve_ost()
                    /*
                     * We expect number of precreated objects in f_ffree at
                     * the first iteration, skip OSPs with no objects ready
                     */
                    if (sfs->os_fprecreated == 0 && speed == 0) {
                            QOS_DEBUG("#%d: precreation is empty\n", ost_idx);
                            goto out_return;
                    }
            

            I'm creating a patch to fix lod_alloc_qos().
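
The difference between the two paths described in this comment can be illustrated with a small model. This is only a sketch, not Lustre code: the `Ost` class, the `free_gb` figures, and the helper names are hypothetical, and it assumes that an OST with max_create_count=0 ends up with zero precreated objects. The stand-in for lod_statfs_and_check() is deliberately simplified to a free-space test.

```python
from dataclasses import dataclass


@dataclass
class Ost:
    idx: int
    free_gb: int     # hypothetical free-space figure
    precreated: int  # objects ready for use; assumed 0 when max_create_count=0


def statfs_and_check(ost):
    # Simplified stand-in for lod_statfs_and_check(): it never looks at
    # the number of precreated objects, so a blocked OST still passes.
    return ost.free_gb > 0


def qos_candidates(osts):
    # Model of the QoS path before the fix: only the statfs check applies.
    return [o.idx for o in osts if statfs_and_check(o)]


def rr_candidates(osts, speed=0):
    # Model of the RR path: the extra test quoted above skips OSTs that
    # have no precreated objects on the first iteration (speed == 0).
    picked = []
    for o in osts:
        if not statfs_and_check(o):
            continue
        if o.precreated == 0 and speed == 0:
            continue
        picked.append(o.idx)
    return picked


# OST 0 is drained and blocked: plenty of free space, nothing precreated.
osts = [Ost(0, 900, 0), Ost(1, 300, 32), Ost(2, 100, 32)]
print("QoS candidates:", qos_candidates(osts))
print("RR candidates: ", rr_candidates(osts))
```

In this model the blocked OST 0 remains a QoS candidate but is dropped from the RR candidates, which matches the asymmetry described in the comment.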

            yujian Jian Yu added a comment -

            Sure, Nathan. Let me reproduce and investigate further. Have a nice vacation!


            ndauchy Nathan Dauchy (Inactive) added a comment -

            Jian,

            I'm scheduled for vacation through Monday, checking in occasionally, so I won't be able to try the reproducer on our system. Besides that, I would rather not remove the workaround we have in place that forces round-robin allocation. In fact, if you have suggestions for a better one, please let us know!

            So, if you can take the test case provided and reproduce on one of your test systems or in a VM setup for debugging, that would be preferred. Thanks!

            yujian Jian Yu added a comment -

            Yes, Andreas. The exact function is lod_alloc_qos(), which uses the weighted random allocation algorithm. According to its comments, it first finds the available OSTs and calculates their weights (free space), then selects OSTs using those weights as the selection probabilities. An OST with a higher weight is proportionately more likely to be selected than one with a lower weight.

            We can use the QOS_DEBUG() statements in the function to debug the allocation algorithm.

            Hi Nathan,
            Could you please enable QOS debug by running lctl set_param debug='+other', reproduce the issue and gather Lustre debug logs for investigation? Thank you.
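
The weighted random selection described in this comment can be sketched as follows. This is an illustration of the general technique only, not the lod_alloc_qos() implementation: the free-space values are made up, the weight is taken to be free space alone, and the `simulate` helper is hypothetical.

```python
import random
from collections import Counter


def simulate(free_space, creates, seed=0):
    """Pick an OST for each create, with probability proportional to weight.

    free_space maps a (hypothetical) OST index to its weight, taken here
    to be free space only.
    """
    rng = random.Random(seed)
    osts = sorted(free_space)
    weights = [free_space[o] for o in osts]
    # random.Random.choices() draws with replacement using relative weights.
    picks = rng.choices(osts, weights=weights, k=creates)
    return Counter(picks)


# Hypothetical free space in GB for three OSTs.
free_gb = {0: 900, 1: 300, 2: 100}
counts = simulate(free_gb, 100_000)
for ost_idx in sorted(counts):
    print(f"OST {ost_idx}: {counts[ost_idx]} objects")
```

With weights like these, the emptiest OST receives the most objects. If that OST cannot actually accept new objects, it needs to be removed from the candidate set before the weights are computed.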


            People

              Assignee: yujian Jian Yu
              Reporter: ndauchy Nathan Dauchy (Inactive)
              Votes: 0
              Watchers: 8
