Lustre / LU-977

incorrect round robin object allocation

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Labels: None
    • Affects: any Lustre from 1.6.0
    • Severity: 3
    • Bugzilla ID: 24,194

    Description

      https://bugzilla.lustre.org/show_bug.cgi?id=24194

      The bug is caused by incorrect locking in the lov_qos code: alloc_rr() samples lqr_start_idx without holding a lock across the whole allocation, so two concurrent creates can start from the same index. It is easily reproduced with the following fault injection and test:

      diff --git a/lustre/lov/lov_qos.c b/lustre/lov/lov_qos.c
      index a101e9c..64ccefb 100644
      --- a/lustre/lov/lov_qos.c
      +++ b/lustre/lov/lov_qos.c
      @@ -627,6 +627,8 @@ static int alloc_rr(struct lov_obd *lov, int *idx_arr, int *stripe_cnt,

       repeat_find:
              array_idx = (lqr->lqr_start_idx + lqr->lqr_offset_idx) % osts->op_count;
      +       CFS_FAIL_TIMEOUT_MS(OBD_FAIL_MDS_LOV_CREATE_RACE, 100);
      +
              idx_pos = idx_arr;
       #ifdef QOS_DEBUG
              CDEBUG(D_QOS, "pool '%s' want %d startidx %d startcnt %d offset %d "
      
      test_51() {
              local obj1
              local obj2
              local old_rr
      
              mkdir -p $DIR1/$tfile-1/
              mkdir -p $DIR2/$tfile-2/
        old_rr=$(do_facet $SINGLEMDS lctl get_param -n 'lov.lustre-MDT*/qos_threshold_rr' | sed -e 's/%//')
              do_facet $SINGLEMDS lctl set_param -n 'lov.lustre-MDT*/qos_threshold_rr' 100
      #define OBD_FAIL_MDS_LOV_CREATE_RACE     0x148
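        # (0x80000148 should be OBD_FAIL_MDS_LOV_CREATE_RACE with the usual
        # 0x80000000 OBD_FAIL_ONCE flag set, so the injected delay fires only once)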
              do_facet $SINGLEMDS "lctl set_param fail_loc=0x80000148"
              touch $DIR1/$tfile-1/file1 &
              PID1=$!
              touch $DIR2/$tfile-2/file2 &
              PID2=$!
              wait $PID2
              wait $PID1
              do_facet $SINGLEMDS "lctl set_param fail_loc=0x0"
              do_facet $SINGLEMDS "lctl set_param -n 'lov.lustre-MDT*/qos_threshold_rr' $old_rr"
      
              obj1=$($GETSTRIPE -o $DIR1/$tfile-1/file1)
              obj2=$($GETSTRIPE -o $DIR1/$tfile-2/file2)
        [ $obj1 -eq $obj2 ] && error "must use different OSTs"
      }
      run_test 51 "alloc_rr should allocate in correct order"
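
      For illustration, here is a minimal standalone sketch (hypothetical userspace code, not from the Lustre tree) of the race the injected delay exposes: both "create" threads sample the shared round-robin cursor before either one publishes its increment, so both land on the same OST index, which is exactly the collision the test above checks for.

      /*
       * Hypothetical sketch of the alloc_rr() race in plain pthreads.
       * The usleep() models the CFS_FAIL_TIMEOUT_MS(..., 100) injected above.
       * Build with: cc -pthread race_demo.c
       */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      #define OST_COUNT 8

      static unsigned int lqr_start_idx;      /* shared RR cursor, no locking */
      static unsigned int picked[2];

      static void *create_object(void *arg)
      {
              int thread = *(int *)arg;

              /* read the cursor, as alloc_rr() does when computing array_idx */
              unsigned int idx = lqr_start_idx % OST_COUNT;

              usleep(100 * 1000);             /* widen the race window */
              lqr_start_idx++;                /* increment is published too late */
              picked[thread] = idx;
              return NULL;
      }

      int main(void)
      {
              pthread_t t[2];
              int id[2] = { 0, 1 };
              int i;

              for (i = 0; i < 2; i++)
                      pthread_create(&t[i], NULL, create_object, &id[i]);
              for (i = 0; i < 2; i++)
                      pthread_join(t[i], NULL);

              printf("thread0 -> OST%04u, thread1 -> OST%04u%s\n",
                     picked[0], picked[1],
                     picked[0] == picked[1] ? " (collision)" : "");
              return 0;
      }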
      

      The bug was found in 2.x but should exist in 1.8 as well.

      CFS_FAIL_TIMEOUT_MS() can be replaced with CFS_RACE().
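
      A sketch of that substitution in the reproducer diff above; the point of the libcfs CFS_RACE() helper is that it parks the first thread at the fail point until a second thread hits the same point, making the collision deterministic instead of relying on a fixed 100 ms window:

      -       CFS_FAIL_TIMEOUT_MS(OBD_FAIL_MDS_LOV_CREATE_RACE, 100);
      +       CFS_RACE(OBD_FAIL_MDS_LOV_CREATE_RACE);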


          Activity


            bzzz Alex Zhuravlev added a comment - totally precise RR is not possible with DNE, for example.
            shadow Alexey Lyashkov added a comment - quoting Eugene Birkine (06/Dec/11 9:05 PM):

            Debug log file from MDS with qos_threshold_rr=100 during 16 file writes. The file distribution was:

                testfs-OST0000   2
                testfs-OST0001   3
                testfs-OST0002   2
                testfs-OST0003   1
                testfs-OST0004   2
                testfs-OST0005   3
                testfs-OST0006   1
                testfs-OST0007   2

            (With strict round-robin each of the 8 OSTs would have received exactly 16/8 = 2 objects.)

            keith Keith Mannthey (Inactive) added a comment - Alexey, What is the worst case allocation that you have seen? It still sounds like you want a "totally precise" client / ost allocation mapping.

            shadow Alexey Lyashkov added a comment - Do you have plans to fix it?

            shadow Alexey Lyashkov added a comment - Alex,

            about the second point: if we have 20 allocations and 5 OSTs, we need 4 allocations on each OST; otherwise it isn't round-robin allocation, and the same workload pattern puts more load on one or more OSTs than on the rest.

            bzzz Alex Zhuravlev added a comment - the 2nd requirement can't be achieved, simply because equal object counts don't imply equal amounts of data or the same IO pattern. so, I don't think some variation will be that bad.

            shadow Alexey Lyashkov added a comment - Alex,

            we have two requirements:
            1) an MD object should allocate its objects from different OSTs, avoiding the situation where two objects from the same OST are assigned to one MD object (which reduces speed);
            2) the allocation as a whole should distribute OST objects evenly over all OSTs, so that all OSTs are loaded evenly.

            bzzz Alex Zhuravlev added a comment - again, I think there is no requirement for the algorithm to be totally precise. and if for some reason you want serialization, just do not shift - take and increment the current lqr_start_idx on every iteration.
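
            A sketch of what that could look like around the array_idx computation in alloc_rr(); the lqr_alloc spinlock name is an assumption here, not taken from the code above:

                    /* hypothetical sketch: claim the cursor slot atomically on
                     * each iteration instead of computing array_idx from a
                     * shared snapshot another thread may be reading */
                    spin_lock(&lqr->lqr_alloc);
                    array_idx = (lqr->lqr_start_idx++ + lqr->lqr_offset_idx) % osts->op_count;
                    spin_unlock(&lqr->lqr_alloc);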

            bzzz Alex Zhuravlev added a comment - well, statfs() is basically a memcpy() in this case.
            shadow Alexey Lyashkov added a comment - edited

            Maybe that is a solution, since the original problem is that allocation is not even across all OSTs in the cluster.
            Another note from my side: we could remove the statfs() call from that loop, since it is too slow an operation for the fast path.

            PS. I was wrong, that is not a solution: we may shift through the whole loop while the spinlock is released, and so allocate two objects on the same OST for one file.

            bzzz Alex Zhuravlev added a comment - this can not be done easily because lod_alloc_rr() is doing allocation within that loop, so we can't put the whole loop under a spinlock.

            but probably we can shift lqr_start_idx to the next OST when another OST is used in the striping:

                    /* We've successfully declared (reserved) an object */
                    lod_qos_ost_in_use(env, stripe_idx, ost_idx);
                    lo->ldo_stripe[stripe_idx] = o;
                    stripe_idx++;
            +       spin_lock(...);
            +       lqr->lqr_start_idx = next(ost_idx);
            +       spin_unlock(...);

            I don't think QoS is supposed to be absolutely reliable in terms of "X is used, move to Y". some small "mistakes" and variation should be OK, IMHO.

            as for the second problem I'd like to see a bit better description, if possible.

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: shadow Alexey Lyashkov
              Votes: 0
              Watchers: 18
