Adding test output demonstrating the fix:
NOTE: The tests were run on real hardware with Lustre 2.5.1, using the variant of the fix in which the function calls inside the loop were not factored out into a separate function. I hope that's fine.
WITHOUT Fix
=============
To reproduce the issue, a cluster with 12 OSTs and 12 client machines was used. Each client is a machine with 24 CPU cores.
The IOR load is like the first example from CEAP-82, with smaller I/O transfer and block parameters.
The number of threads is 12 * 24, i.e. 24 threads per client.
To increase concurrency, each thread operated through its own Lustre mount point, so each client had 24 Lustre mounts.
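Setting up the per-client mount points could look roughly like the sketch below. The fsname "orangefs" is taken from the MDS log later in this comment; the MGS NID is a placeholder, and the mount commands are echoed rather than executed:

```shell
#!/bin/bash
# Illustrative sketch only: create 24 separate Lustre mount points
# on one client. MGSNID is an assumed placeholder; "orangefs" is the
# fsname seen in the MDS log. Commands are echoed, not run.
MGSNID="mgs@tcp0"
FSNAME="orangefs"
for i in $(seq 0 23); do
    echo "mkdir -p /mnt/lustre$i"
    echo "mount -t lustre ${MGSNID}:/${FSNAME} /mnt/lustre$i"
done
```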
The main script:
[CEAP-82]$ cat ceap-82.sh
#! /bin/bash
MPIEXEC=/home/bloewe/libs/mpich-3.1/install/bin/mpiexec
# EXE=/home/bloewe/benchmarks/mdtest-1.9.3/mdtest
EXE=/home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR
NPROCS=$((12 * 24))
TARGET=/mnt/lustre/ceap-82
MPISCRIPT=/home/bloewe/CEAP-82/mpi-script.sh
HOSTS=$PWD/hostfile
# OPTS="-a POSIX -B -C -E -F -e -g -k -b 4g -t 32m -vvv -o /lustre/crayadm/tmp/testdir.12403/IOR_POSIX"
# OPTS="-a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -vvv -o /mnt/lustre/ceap-82/IOR_POSIX"
OPTS="-a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -v"
# mkdir -p distr
#
# for n in {0..100}; do
# rm -f /mnt/lustre/ceap-82/IOR_POSIX*
# $MPIEXEC -f $HOSTS -np $NPROCS $EXE $OPTS
# for x in /mnt/lustre/ceap-82/IOR_POSIX*; do
# lfs getstripe -i $x
# done | sort -n | uniq -c > distr/d-$n
# done
rm -f /mnt/lustre/ceap-82/IOR_POSIX*
$MPIEXEC -f $HOSTS -np $NPROCS $MPISCRIPT $OPTS
for x in /mnt/lustre/ceap-82/IOR_POSIX*; do
lfs getstripe -i $x
done | sort -n | uniq -c
[CEAP-82]$
and an IOR wrapper, needed to map the 24 IOR instances on each client onto its 24 Lustre mount points:
[CEAP-82]$ cat mpi-script.sh
#! /bin/bash
LMOUNT=/mnt/lustre$(($PMI_RANK % 24))
exec /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR "$@" -o $LMOUNT/ceap-82/IOR_POSIX
[CEAP-82]$
This way we simulate client load from 24 * 12 = 288 clients.
If the RR algorithm works correctly, the result should be an equal distribution of the IOR working files across all OSTs: 24 files per OST.
However, the result of the test run shows an uneven file distribution:
Max Write: 5988.51 MiB/sec (6279.41 MB/sec)
Max Read: 10008.34 MiB/sec (10494.51 MB/sec)
Run finished: Sat Mar 21 14:10:12 2015
26 0
25 1
24 2
29 3
20 4
23 5
24 6
25 7
25 8
24 9
23 10
20 11
[CEAP-82]$
The number of files per OST varies from 20 to 29. This cannot be explained by "reseeds" in lod_rr_alloc().
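For comparison, an ideal round-robin allocator would spread the 288 files perfectly evenly. A minimal Python simulation of that expected behavior (illustrative only; this is not the lod_rr_alloc() logic):

```python
from collections import Counter

NUM_OSTS = 12
NUM_FILES = 288  # 12 clients * 24 threads, one file per thread

# Ideal round robin: file i goes to OST (i % NUM_OSTS).
counts = Counter(i % NUM_OSTS for i in range(NUM_FILES))

for ost in range(NUM_OSTS):
    print(counts[ost], ost)  # every OST gets exactly 24 files
```

Any deviation from 24 per OST, like the 20-29 spread above, means allocations are escaping the round-robin path.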
The log from the MDS ssh session where qos_threshold_rr was set:
[root@orange00 ~]# pdsh -g mds lctl set_param lod.*.qos_threshold_rr=100
orange03: error: set_param: /proc/{fs,sys}/{lnet,lustre}/lod/*/qos_threshold_rr: Found no match
pdsh@orange00: orange03: ssh exited with exit code 3
orange10: lod.orangefs-MDT0001-mdtlov.qos_threshold_rr=100
orange12: lod.orangefs-MDT0003-mdtlov.qos_threshold_rr=100
orange11: lod.orangefs-MDT0002-mdtlov.qos_threshold_rr=100
orange13: lod.orangefs-MDT0004-mdtlov.qos_threshold_rr=100
orange02: lod.orangefs-MDT0000-mdtlov.qos_threshold_rr=100
[root@orange00 ~]# slogin orange02
Last login: Sat Mar 21 14:07:41 PDT 2015 from 172.16.2.3 on ssh
[root@orange02 ~]# lctl set_param lctl set_param debug=-1 subsystem_debug=lov debug_mb=1200
error: set_param: /proc/{fs,sys}/{lnet,lustre}/lctl: Found no match
debug=-1
subsystem_debug=lov
debug_mb=1200
[root@orange02 ~]# lctl set_param debug=-1 subsystem_debug=lov debug_mb=1200
debug=-1
subsystem_debug=lov
debug_mb=1200
[root@orange02 ~]# lctl dk > /dev/null
[root@orange02 ~]# lctl dk /tmp/ceap-82.txt
Debug log: 5776 lines, 5776 kept, 0 dropped, 0 bad.
[root@orange02 ~]# lctl get_param lod.*.qos_threshold_rr
lod.orangefs-MDT0000-mdtlov.qos_threshold_rr=100%
[root@orange02 ~]# less /tmp/ceap-82.txt
[root@orange02 ~]# logout
WITH Fix
========
The following 5 runs showed an equal file distribution across all OSTs:
[CEAP-82]$ for n in {1..5} ; do sh ceap-82.sh; done
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Mon Mar 23 02:27:28 2015
Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
Machine: Linux sjsc-321
Summary:
api = POSIX
test filename = /mnt/lustre0/ceap-82/IOR_POSIX
access = file-per-process
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 288 (24 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 200 MiB
aggregate filesize = 56.25 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 4310.41 4310.41 4310.41 0.00 2155.21 2155.21 2155.21 0.00 13.36299 EXCEL
read 10076.77 10076.77 10076.77 0.00 5038.38 5038.38 5038.38 0.00 5.71612 EXCEL
Max Write: 4310.41 MiB/sec (4519.80 MB/sec)
Max Read: 10076.77 MiB/sec (10566.26 MB/sec)
Run finished: Mon Mar 23 02:27:48 2015
24 0
24 1
24 2
24 3
24 4
24 5
24 6
24 7
24 8
24 9
24 10
24 11
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Mon Mar 23 02:27:50 2015
Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
Machine: Linux sjsc-321
Summary:
api = POSIX
test filename = /mnt/lustre0/ceap-82/IOR_POSIX
access = file-per-process
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 288 (24 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 200 MiB
aggregate filesize = 56.25 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 4269.62 4269.62 4269.62 0.00 2134.81 2134.81 2134.81 0.00 13.49066 EXCEL
read 10078.78 10078.78 10078.78 0.00 5039.39 5039.39 5039.39 0.00 5.71498 EXCEL
Max Write: 4269.62 MiB/sec (4477.02 MB/sec)
Max Read: 10078.78 MiB/sec (10568.37 MB/sec)
Run finished: Mon Mar 23 02:28:10 2015
24 0
24 1
24 2
24 3
24 4
24 5
24 6
24 7
24 8
24 9
24 10
24 11
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Mon Mar 23 02:28:11 2015
Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
Machine: Linux sjsc-321
Summary:
api = POSIX
test filename = /mnt/lustre0/ceap-82/IOR_POSIX
access = file-per-process
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 288 (24 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 200 MiB
aggregate filesize = 56.25 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 4343.86 4343.86 4343.86 0.00 2171.93 2171.93 2171.93 0.00 13.26011 EXCEL
read 10091.59 10091.59 10091.59 0.00 5045.80 5045.80 5045.80 0.00 5.70772 EXCEL
Max Write: 4343.86 MiB/sec (4554.86 MB/sec)
Max Read: 10091.59 MiB/sec (10581.80 MB/sec)
Run finished: Mon Mar 23 02:28:31 2015
24 0
24 1
24 2
24 3
24 4
24 5
24 6
24 7
24 8
24 9
24 10
24 11
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Mon Mar 23 02:28:33 2015
Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
Machine: Linux sjsc-321
Summary:
api = POSIX
test filename = /mnt/lustre0/ceap-82/IOR_POSIX
access = file-per-process
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 288 (24 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 200 MiB
aggregate filesize = 56.25 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 4169.55 4169.55 4169.55 0.00 2084.78 2084.78 2084.78 0.00 13.81443 EXCEL
read 9997.98 9997.98 9997.98 0.00 4998.99 4998.99 4998.99 0.00 5.76116 EXCEL
Max Write: 4169.55 MiB/sec (4372.09 MB/sec)
Max Read: 9997.98 MiB/sec (10483.64 MB/sec)
Run finished: Mon Mar 23 02:28:53 2015
24 0
24 1
24 2
24 3
24 4
24 5
24 6
24 7
24 8
24 9
24 10
24 11
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Mon Mar 23 02:28:56 2015
Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
Machine: Linux sjsc-321
Summary:
api = POSIX
test filename = /mnt/lustre0/ceap-82/IOR_POSIX
access = file-per-process
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 288 (24 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 200 MiB
aggregate filesize = 56.25 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 4427.20 4427.20 4427.20 0.00 2213.60 2213.60 2213.60 0.00 13.01049 EXCEL
read 10026.62 10026.62 10026.62 0.00 5013.31 5013.31 5013.31 0.00 5.74471 EXCEL
Max Write: 4427.20 MiB/sec (4642.25 MB/sec)
Max Read: 10026.62 MiB/sec (10513.67 MB/sec)
Run finished: Mon Mar 23 02:29:15 2015
24 0
24 1
24 2
24 3
24 4
24 5
24 6
24 7
24 8
24 9
24 10
24 11
[CEAP-82]$
Landed for 2.8