Lustre / LU-977

incorrect round robin object allocation

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Labels: None
    • Affects: any Lustre from 1.6.0
    • Severity: 3
    • Bugzilla ID: 24,194

    Description

      https://bugzilla.lustre.org/show_bug.cgi?id=24194

      The bug is caused by incorrect locking in the lov_qos code: alloc_rr() samples lqr_start_idx without holding a lock across the whole allocation, so two concurrent creates can start from the same index. It is easily reproduced with the following fault injection and test:

      diff --git a/lustre/lov/lov_qos.c b/lustre/lov/lov_qos.c
      index a101e9c..64ccefb 100644
      --- a/lustre/lov/lov_qos.c
      +++ b/lustre/lov/lov_qos.c
      @@ -627,6 +627,8 @@ static int alloc_rr(struct lov_obd *lov, int *idx_arr, int *stripe_cnt,

       repeat_find:
              array_idx = (lqr->lqr_start_idx + lqr->lqr_offset_idx) % osts->op_count;
      +       CFS_FAIL_TIMEOUT_MS(OBD_FAIL_MDS_LOV_CREATE_RACE, 100);
      +
              idx_pos = idx_arr;
       #ifdef QOS_DEBUG
              CDEBUG(D_QOS, "pool '%s' want %d startidx %d startcnt %d offset %d "
      
      test_51() {
              local obj1
              local obj2
              local old_rr
      
              mkdir -p $DIR1/$tfile-1/
              mkdir -p $DIR2/$tfile-2/
        old_rr=$(do_facet $SINGLEMDS lctl get_param -n 'lov.lustre-MDT*/qos_threshold_rr' | sed -e 's/%//')
              do_facet $SINGLEMDS lctl set_param -n 'lov.lustre-MDT*/qos_threshold_rr' 100
      #define OBD_FAIL_MDS_LOV_CREATE_RACE     0x148
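        # (0x80000148 should be OBD_FAIL_MDS_LOV_CREATE_RACE with the usual
        # 0x80000000 OBD_FAIL_ONCE flag set, so the injected delay fires only once)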
              do_facet $SINGLEMDS "lctl set_param fail_loc=0x80000148"
              touch $DIR1/$tfile-1/file1 &
              PID1=$!
              touch $DIR2/$tfile-2/file2 &
              PID2=$!
              wait $PID2
              wait $PID1
              do_facet $SINGLEMDS "lctl set_param fail_loc=0x0"
              do_facet $SINGLEMDS "lctl set_param -n 'lov.lustre-MDT*/qos_threshold_rr' $old_rr"
      
              obj1=$($GETSTRIPE -o $DIR1/$tfile-1/file1)
              obj2=$($GETSTRIPE -o $DIR1/$tfile-2/file2)
        [ $obj1 -eq $obj2 ] && error "must use different OSTs"
      }
      run_test 51 "alloc_rr should allocate in correct order"
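
      For illustration, here is a minimal standalone sketch (hypothetical userspace code, not from the Lustre tree) of the race the injected delay exposes: both "create" threads sample the shared round-robin cursor before either one publishes its increment, so both land on the same OST index, which is exactly the collision the test above checks for.

      /*
       * Hypothetical sketch of the alloc_rr() race in plain pthreads.
       * The usleep() models the CFS_FAIL_TIMEOUT_MS(..., 100) injected above.
       * Build with: cc -pthread race_demo.c
       */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      #define OST_COUNT 8

      static unsigned int lqr_start_idx;      /* shared RR cursor, no locking */
      static unsigned int picked[2];

      static void *create_object(void *arg)
      {
              int thread = *(int *)arg;

              /* read the cursor, as alloc_rr() does when computing array_idx */
              unsigned int idx = lqr_start_idx % OST_COUNT;

              usleep(100 * 1000);             /* widen the race window */
              lqr_start_idx++;                /* increment is published too late */
              picked[thread] = idx;
              return NULL;
      }

      int main(void)
      {
              pthread_t t[2];
              int id[2] = { 0, 1 };
              int i;

              for (i = 0; i < 2; i++)
                      pthread_create(&t[i], NULL, create_object, &id[i]);
              for (i = 0; i < 2; i++)
                      pthread_join(t[i], NULL);

              printf("thread0 -> OST%04u, thread1 -> OST%04u%s\n",
                     picked[0], picked[1],
                     picked[0] == picked[1] ? " (collision)" : "");
              return 0;
      }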
      

      The bug was found in 2.x but should exist in 1.8 as well.

      CFS_FAIL_TIMEOUT_MS() can be replaced with CFS_RACE().
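
      A sketch of that substitution in the reproducer diff above; the point of the libcfs CFS_RACE() helper is that it parks the first thread at the fail point until a second thread hits the same point, making the collision deterministic instead of relying on a fixed 100 ms window:

      -       CFS_FAIL_TIMEOUT_MS(OBD_FAIL_MDS_LOV_CREATE_RACE, 100);
      +       CFS_RACE(OBD_FAIL_MDS_LOV_CREATE_RACE);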


          Activity


            bzzz Alex Zhuravlev added a comment - totally precise RR is not possible with DNE, for example.
            shadow Alexey Lyashkov added a comment - quoting Eugene Birkine (06/Dec/11 9:05 PM):

            Debug log file from MDS with qos_threshold_rr=100 during 16 file writes. The file distribution was:

                testfs-OST0000   2
                testfs-OST0001   3
                testfs-OST0002   2
                testfs-OST0003   1
                testfs-OST0004   2
                testfs-OST0005   3
                testfs-OST0006   1
                testfs-OST0007   2

            (With strict round-robin each of the 8 OSTs would have received exactly 16/8 = 2 objects.)

            keith Keith Mannthey (Inactive) added a comment - Alexey, What is the worst case allocation that you have seen? It still sounds like you want a "totally precise" client / ost allocation mapping.

            shadow Alexey Lyashkov added a comment - Do you have plans to fix it?

            shadow Alexey Lyashkov added a comment - Alex,

            about the second point: if we have 20 allocations and 5 OSTs, we need 4 allocations on each OST; otherwise it isn't round-robin allocation, and the same workload pattern puts more load on one or more OSTs than on the rest.

            bzzz Alex Zhuravlev added a comment - the 2nd requirement can't be achieved, simply because equal object counts don't imply equal amounts of data or the same IO pattern. so, I don't think some variation will be that bad.

            shadow Alexey Lyashkov added a comment - Alex,

            we have two requirements:
            1) an MD object should allocate its objects from different OSTs, avoiding the situation where two objects from the same OST are assigned to one MD object (which reduces speed);
            2) the allocation as a whole should distribute OST objects evenly over all OSTs, so that all OSTs are loaded evenly.

            bzzz Alex Zhuravlev added a comment - again, I think there is no requirement for the algorithm to be totally precise. and if for some reason you want serialization, just do not shift - take and increment the current lqr_start_idx on every iteration.
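
            A sketch of what that could look like around the array_idx computation in alloc_rr(); the lqr_alloc spinlock name is an assumption here, not taken from the code above:

                    /* hypothetical sketch: claim the cursor slot atomically on
                     * each iteration instead of computing array_idx from a
                     * shared snapshot another thread may be reading */
                    spin_lock(&lqr->lqr_alloc);
                    array_idx = (lqr->lqr_start_idx++ + lqr->lqr_offset_idx) % osts->op_count;
                    spin_unlock(&lqr->lqr_alloc);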

            bzzz Alex Zhuravlev added a comment - well, statfs() is basically a memcpy() in this case.
            shadow Alexey Lyashkov added a comment - edited

            Maybe that is a solution, since the original problem is that allocation is not even across all OSTs in the cluster.
            Another note from my side: we could remove the statfs() call from that loop, since it is too slow an operation for the fast path.

            PS. I was wrong, that is not a solution: we may shift through the whole loop while the spinlock is released, and so allocate two objects on the same OST for one file.

            bzzz Alex Zhuravlev added a comment - this can not be done easily because lod_alloc_rr() is doing allocation within that loop, so we can't put the whole loop under a spinlock.

            but probably we can shift lqr_start_idx to the next OST when another OST is used in the striping:

                    /* We've successfully declared (reserved) an object */
                    lod_qos_ost_in_use(env, stripe_idx, ost_idx);
                    lo->ldo_stripe[stripe_idx] = o;
                    stripe_idx++;
            +       spin_lock(...);
            +       lqr->lqr_start_idx = next(ost_idx);
            +       spin_unlock(...);

            I don't think QoS is supposed to be absolutely reliable in terms of "X is used, move to Y". some small "mistakes" and variation should be OK, IMHO.

            as for the second problem I'd like to see a bit better description, if possible.

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: shadow Alexey Lyashkov
              Votes: 0
              Watchers: 18
