Details
-
Bug
-
Resolution: Fixed
-
Major
-
None
-
any lustre from a 1.6.0
-
3
-
24,194
-
7276
Description
https://bugzilla.lustre.org/show_bug.cgi?id=24194
The bug is caused by incorrect locking in the lov_qos code and can be easily reproduced with the following fault-injection patch and test:
diff --git a/lustre/lov/lov_qos.c b/lustre/lov/lov_qos.c
index a101e9c..64ccefb 100644
--- a/lustre/lov/lov_qos.c
+++ b/lustre/lov/lov_qos.c
@@ -627,6 +627,8 @@ static int alloc_rr(struct lov_obd *lov, int *idx_arr, int *stripe_cnt,
 repeat_find:
         array_idx = (lqr->lqr_start_idx + lqr->lqr_offset_idx) % osts->op_count;
+        CFS_FAIL_TIMEOUT_MS(OBD_FAIL_MDS_LOV_CREATE_RACE, 100);
+
         idx_pos = idx_arr;
 #ifdef QOS_DEBUG
         CDEBUG(D_QOS, "pool '%s' want %d startidx %d startcnt %d offset %d "

test_51() {
        local obj1
        local obj2
        local old_rr

        mkdir -p $DIR1/$tfile-1/
        mkdir -p $DIR2/$tfile-2/
        old_rr=$(do_facet $SINGLEMDS lctl get_param -n 'lov.lustre-MDT*/qos_threshold_rr' | sed -e 's/%//')
        do_facet $SINGLEMDS lctl set_param -n 'lov.lustre-MDT*/qos_threshold_rr' 100
#define OBD_FAIL_MDS_LOV_CREATE_RACE 0x148
        do_facet $SINGLEMDS "lctl set_param fail_loc=0x80000148"

        touch $DIR1/$tfile-1/file1 &
        PID1=$!
        touch $DIR2/$tfile-2/file2 &
        PID2=$!
        wait $PID2
        wait $PID1
        do_facet $SINGLEMDS "lctl set_param fail_loc=0x0"
        do_facet $SINGLEMDS "lctl set_param -n 'lov.lustre-MDT*/qos_threshold_rr' $old_rr"

        obj1=$($GETSTRIPE -o $DIR1/$tfile-1/file1)
        obj2=$($GETSTRIPE -o $DIR1/$tfile-2/file2)
        [ $obj1 -eq $obj2 ] && error "must different ost used"
}
run_test 51 "alloc_rr should be allocate on correct order"
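To illustrate why the injected delay exposes the problem, here is a minimal single-threaded sketch of the suspected interleaving. It is not the lov_qos code: `struct rr_state`, `rr_pick()`, `rr_advance()` and `rr_two_creates_collide()` are hypothetical names, and the assumption is that the round-robin cursor is read and advanced without one lock covering both steps. Under that assumption, two creates that both read the cursor before either advances it land on the same OST, which is the condition test_51 reports as an error.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-pool round-robin state. */
struct rr_state {
        unsigned int start_idx; /* shared cursor: next OST to hand out */
        unsigned int ost_count; /* number of OSTs in the pool */
};

/* Step 1 of an allocation: derive the OST index from the shared cursor. */
static unsigned int rr_pick(const struct rr_state *rr)
{
        assert(rr->ost_count > 0);
        return rr->start_idx % rr->ost_count;
}

/* Step 2 of an allocation: move the cursor past the OST just used. */
static void rr_advance(struct rr_state *rr)
{
        rr->start_idx = (rr->start_idx + 1) % rr->ost_count;
}

/*
 * Two creators, A and B, each allocate one stripe.
 *
 * serialized == false models the race window: both creators pick
 * before either advances, as the injected delay makes likely.
 * serialized == true models pick + advance done under a single lock.
 *
 * Returns true when both creators were given the same OST.
 */
static bool rr_two_creates_collide(unsigned int ost_count, bool serialized)
{
        struct rr_state rr = { .start_idx = 0, .ost_count = ost_count };
        unsigned int a, b;

        if (serialized) {
                a = rr_pick(&rr);
                rr_advance(&rr);
                b = rr_pick(&rr);
                rr_advance(&rr);
        } else {
                a = rr_pick(&rr);
                b = rr_pick(&rr);
                rr_advance(&rr);
                rr_advance(&rr);
        }
        return a == b;
}
```

With four OSTs, `rr_two_creates_collide(4, false)` reports a collision while `rr_two_creates_collide(4, true)` does not. With a single OST, both variants necessarily collide.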
The bug was found in 2.x but should also exist in 1.8.
CFS_FAIL_TIMEOUT_MS can be replaced with CFS_RACE()
Issue Links
- is related to
  - LU-9780 Add test for fix added in LU-977 (Resolved)
  - LU-14377 parallel-scale test rr_alloc fails with "Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2" (Resolved)
  - LU-9 Optimize weighted QOS Round-Robin allocator (Open)
Trackbacks
Lustre 1.8.x known issues tracker While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA
I don't see how 'precise' RR is not possible with DNE. If an application wants evenly balanced stripe allocation, that should still be possible as the allocators aren't linked in DNE. So then if the one MDS allocator hasn't switched to the space-based allocator, then round-robin should still be (mostly) 'precise', correct?