[LU-977] incorrect round robin object allocation

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Labels: None
    • Affects Version/s: any Lustre from 1.6.0
    • Severity: 3
    • Bugzilla ID: 24,194
    • 7276

    Description

      https://bugzilla.lustre.org/show_bug.cgi?id=24194

      The bug is caused by incorrect locking in the lov_qos code and can be easily reproduced with the test below:

      diff --git a/lustre/lov/lov_qos.c b/lustre/lov/lov_qos.c 
      index a101e9c..64ccefb 100644 
      --- a/lustre/lov/lov_qos.c 
      +++ b/lustre/lov/lov_qos.c 
      @@ -627,6 +627,8 @@ static int alloc_rr(struct lov_obd *lov, int *idx_arr, int *stripe_cnt, 
      
       repeat_find: 
               array_idx = (lqr->lqr_start_idx + lqr->lqr_offset_idx) % osts->op_count; 
      +        CFS_FAIL_TIMEOUT_MS(OBD_FAIL_MDS_LOV_CREATE_RACE, 100);
      +
               idx_pos = idx_arr; 
       #ifdef QOS_DEBUG 
               CDEBUG(D_QOS, "pool '%s' want %d startidx %d startcnt %d offset %d "
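
      For context, the race that this injected delay widens looks roughly like
      the following standalone sketch. This is illustration only, not Lustre
      code: the thread bodies, the OST count, and the output are made up. Two
      threads perform an unlocked read-modify-write on a shared round-robin
      cursor (analogous to lqr_start_idx), so both can compute the same
      array_idx and allocate on the same OST:

      #include <pthread.h>
      #include <stdio.h>

      /* Shared RR cursor, updated without locking -- the same pattern as
       * lqr_start_idx in alloc_rr(). */
      static unsigned int start_idx;
      static const unsigned int ost_count = 8;

      static void *create_object(void *arg)
      {
              /* Both threads may read the same start_idx here... */
              unsigned int array_idx = start_idx % ost_count;
              /* ...especially with a delay at this point, which is what the
               * injected CFS_FAIL_TIMEOUT_MS() above simulates... */
              start_idx++;    /* ...and the increment is not atomic either. */
              printf("%s: allocated on OST%04u\n", (char *)arg, array_idx);
              return NULL;
      }

      int main(void)
      {
              pthread_t t1, t2;

              pthread_create(&t1, NULL, create_object, "create1");
              pthread_create(&t2, NULL, create_object, "create2");
              pthread_join(t1, NULL);
              pthread_join(t2, NULL);
              return 0;
      }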
      
      test_51() {
              local obj1
              local obj2
              local old_rr
      
              mkdir -p $DIR1/$tfile-1/
              mkdir -p $DIR2/$tfile-2/
        old_rr=$(do_facet $SINGLEMDS lctl get_param -n 'lov.lustre-MDT*/qos_threshold_rr' | sed -e 's/%//')
              do_facet $SINGLEMDS lctl set_param -n 'lov.lustre-MDT*/qos_threshold_rr' 100
      #define OBD_FAIL_MDS_LOV_CREATE_RACE     0x148
              do_facet $SINGLEMDS "lctl set_param fail_loc=0x80000148"
              touch $DIR1/$tfile-1/file1 &
              PID1=$!
              touch $DIR2/$tfile-2/file2 &
              PID2=$!
              wait $PID2
              wait $PID1
              do_facet $SINGLEMDS "lctl set_param fail_loc=0x0"
              do_facet $SINGLEMDS "lctl set_param -n 'lov.lustre-MDT*/qos_threshold_rr' $old_rr"
      
              obj1=$($GETSTRIPE -o $DIR1/$tfile-1/file1)
              obj2=$($GETSTRIPE -o $DIR1/$tfile-2/file2)
        [ $obj1 -eq $obj2 ] && error "must use different OSTs"
      }
run_test 51 "alloc_rr should allocate objects in the correct order"
      

      The bug was found in 2.x but should exist in 1.8 as well.

      CFS_FAIL_TIMEOUT_MS() can be replaced with CFS_RACE().
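      For reference, the substitution in the same hunk would look like the
      sketch below. CFS_RACE() blocks the first thread to reach the fail point
      until a second thread hits the same point, triggering the race
      deterministically instead of relying on a fixed 100 ms sleep:

      repeat_find:
              array_idx = (lqr->lqr_start_idx + lqr->lqr_offset_idx) % osts->op_count;
              /* First caller waits here until a second caller hits the
               * same fail point, then both proceed together. */
              CFS_RACE(OBD_FAIL_MDS_LOV_CREATE_RACE);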

Attachments

Issue Links

Activity

            [LU-977] incorrect round robin object allocation

            520557 Rahul Deshmukh (Inactive) added a comment - Posted the new patch to address this issue. Please review.

            gerrit Gerrit Updater added a comment -
            Rahul Deshmukh (rahul.deshmukh@seagate.com) uploaded a new patch: http://review.whamcloud.com/14636
            Subject: LU-977 lod: Patch to protect lqr_start_idx
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 86aa10e5b8c6b944c10ce224078ba4f6aafbe6eb
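            The change itself is in the Gerrit link above; as a rough
            illustration of what the subject line describes, serializing the
            read-modify-write on lqr_start_idx could look like the sketch
            below. This is hypothetical: the lock name is illustrative, and
            this is not the actual content of change 14636.

            /* Sketch only -- not the actual patch. Take a lock around the
             * cursor so two concurrent creates cannot compute the same
             * array_idx and both advance the cursor past each other. */
            spin_lock(&lqr->lqr_alloc);
            array_idx = (lqr->lqr_start_idx + lqr->lqr_offset_idx) %
                        osts->op_count;
            lqr->lqr_start_idx++;   /* advance while still holding the lock */
            spin_unlock(&lqr->lqr_alloc);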

            bzzz Alex Zhuravlev added a comment - Cory, note the word "totally": that would be possible only with very strong locking around allocation, IMHO, which is virtually impossible with DNE and not very good on multicore either.

            shadow Alexey Lyashkov added a comment - If I understand Alex correctly, he means that different MDTs may allocate on the same OST, so OSTs may end up with different numbers of allocated objects. But that is not the case if each MDT has its own OST pool assigned.
            spitzcor Cory Spitz added a comment - I don't see how 'precise' RR is not possible with DNE. If an application wants evenly balanced stripe allocation, that should still be possible, since the allocators aren't linked in DNE. So if the one MDS allocator hasn't switched to the space-based allocator, then round-robin should still be (mostly) 'precise', correct?

            bzzz Alex Zhuravlev added a comment - totally precise RR is not possible with DNE, for example.
            shadow Alexey Lyashkov added a comment - Quoting Eugene Birkine (06/Dec/11 9:05 PM):
            Debug log file from MDS with qos_threshold_rr=100 during 16 file writes. The file distribution was:

            testfs-OST0000: 2
            testfs-OST0001: 3
            testfs-OST0002: 2
            testfs-OST0003: 1
            testfs-OST0004: 2
            testfs-OST0005: 3
            testfs-OST0006: 1
            testfs-OST0007: 2
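            For scale: 16 writes over 8 OSTs would be exactly 2 objects per
            OST under perfect RR, so the 1s and 3s above are the imbalance in
            question. A throwaway check of the numbers quoted above (the
            counts are taken from the comment; everything else is made up):

            #include <stdio.h>

            int main(void)
            {
                    /* Observed per-OST object counts from the comment. */
                    int observed[8] = { 2, 3, 2, 1, 2, 3, 1, 2 };
                    int total = 0, i;

                    for (i = 0; i < 8; i++)
                            total += observed[i];
                    /* 16 files over 8 OSTs: perfect RR puts 16/8 = 2 on each. */
                    printf("total %d, ideal per OST %d\n", total, total / 8);
                    return 0;
            }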

            keith Keith Mannthey (Inactive) added a comment - Alexey, what is the worst-case allocation that you have seen? It still sounds like you want a "totally precise" client/OST allocation mapping.

            shadow Alexey Lyashkov added a comment - Do you have plans to fix it?

            shadow Alexey Lyashkov added a comment - Alex, about the second point: I mean that if we have 20 allocations and 5 OSTs, we need 4 allocations on each OST; otherwise it isn't round-robin allocation, and we put more load on one or more of the OSTs under the same workload pattern.
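            The expectation above is just the arithmetic of a serialized RR
            cursor: N allocations over K OSTs give N/K objects each (20/5 = 4
            here). A minimal sketch, with the OST and allocation counts taken
            from the comment above and everything else made up:

            #include <stdio.h>

            int main(void)
            {
                    const int ost_count = 5, allocations = 20;
                    int per_ost[5] = { 0 };
                    unsigned int start_idx = 0;     /* serialized RR cursor */
                    int i;

                    for (i = 0; i < allocations; i++)
                            per_ost[start_idx++ % ost_count]++;

                    /* Prints 4 for every OST; any other distribution means
                     * the cursor was raced or skipped, i.e. not round-robin. */
                    for (i = 0; i < ost_count; i++)
                            printf("OST%04d: %d\n", i, per_ost[i]);
                    return 0;
            }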

            bzzz Alex Zhuravlev added a comment - The 2nd requirement can't be achieved, simply because an object doesn't imply the same amount of data or the same I/O pattern, so I don't think some variation will be that bad.

People

    Assignee: bogl Bob Glossman (Inactive)
    Reporter: shadow Alexey Lyashkov
    Votes: 0
    Watchers: 18
