Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.14.0, Lustre 2.12.4, Lustre 2.15.1, Lustre 2.15.3
- None
- ZFS
- 3
- 9223372036854775807
Description
parallel-scale test_rr_alloc fails with "Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2". This test appears to have been failing since at least 22 Nov 2019, with:
Lustre 2.12.3.31 - https://testing.whamcloud.com/test_sets/b8fa5b22-0d76-11ea-98f1-52540065bddc
Lustre 2.13.51.72 - https://testing.whamcloud.com/test_sets/32e5a306-44f4-11ea-8072-52540065bddc
Since 2020-06-11, this test has been failing occasionally.
Looking at the test_suite log for the failure at https://testing.whamcloud.com/test_sets/875e0375-cc23-4f0f-8291-f4f9034e340c, we see:
+ su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe -np 22 /usr/lib64/openmpi/bin/rr_alloc /tmp/rr_alloc_mntpt/lustre/drr_alloc.parallel-scale/ash 555 2 "
CMD: trevis-63vm4 /usr/sbin/lctl set_param -n lod.lustre-MDT0000-mdtlov.qos_threshold_rr=17%
CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0000-osc-MDT0000.create_count=1024
CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0001-osc-MDT0000.create_count=1024
CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0002-osc-MDT0000.create_count=1024
CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0003-osc-MDT0000.create_count=2048
CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0004-osc-MDT0000.create_count=2048
CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0005-osc-MDT0000.create_count=2048
CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0006-osc-MDT0000.create_count=1024
CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0007-osc-MDT0000.create_count=2048
parallel-scale test_rr_alloc: @@@@@@ FAIL: Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:6273:error()
= /usr/lib64/lustre/tests/functions.sh:1120:run_rr_alloc()
= /usr/lib64/lustre/tests/parallel-scale.sh:163:test_rr_alloc()
which, except for the error message, shows the same command and the same create_count values seen when this test passes.
From run_rr_alloc() in functions.sh, here is how the difference between the number of stripes created per OST is computed:
1095     if [[ $total_MNTPTS -ne 0 ]]; then
1096         # Now start the actual file creation app.
1097         mpi_run "-np $total_MNTPTS" $cmd || return
1098     else
1099         error "No mount point"
1100     fi
1101
1102     restore_lustre_params < $qos_prec_objs
1103     rm -f $qos_prec_objs
1104
1105     diff_max_min_arr=($($LFS getstripe -r $DIR/$tdir/ |
1106         grep "lmm_stripe_offset:" | awk '{print $2}' | sort -n |
1107         uniq -c | awk 'NR==1 {min=max=$1} \
1108         { $1<min ? min=$1 : min; $1>max ? max=$1 : max} \
1109         END {print max-min, max, min}'))
1110
1111     rm -rf $DIR/$tdir
1112
1113     # In-case of fairly large number of file creation using RR (round-robin)
1114     # there can be two cases in which deviation will occur than the regular
1115     # RR algo behaviour-
1116     # 1- When rr_alloc does not start right with 'lqr_start_count' reseeded,
1117     # 2- When rr_alloc does not finish with 'lqr_start_count == 0'.
1118     # So the difference of files b/w any 2 OST should not be more than 2.
1119     [[ ${diff_max_min_arr[0]} -le 2 ]] ||
1120         error "Uneven distribution detected: difference between" \
1121             "maximum files per OST (${diff_max_min_arr[1]}) and" \
1122             "minimum files per OST (${diff_max_min_arr[2]}) must not be" \
1123             "greater than 2"
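To make the check easier to reason about outside a test run, here is a minimal sketch of the same counting pipeline applied to canned input. The function name ost_spread and the sample OST indexes are hypothetical and not part of functions.sh; the sketch assumes the input is one starting OST index per line, which is what the grep/awk stage on lmm_stripe_offset produces. In the failing run the reported values were max 1528 and min 1525, a spread of 3, which exceeds the allowed 2.

```shell
#!/bin/bash
# Hypothetical helper (not in functions.sh): reads one OST index per line
# and prints "max-min max min" of the per-OST file counts, mirroring the
# sort | uniq -c | awk stage used by run_rr_alloc().
ost_spread() {
    sort -n | uniq -c | awk '
        NR == 1 { min = max = $1 }
        { if ($1 < min) min = $1; if ($1 > max) max = $1 }
        END { print max - min, max, min }'
}

# Made-up sample: OST 0 holds 3 files, OST 1 holds 1, OST 2 holds 2.
spread=($(printf '%s\n' 0 0 0 1 2 2 | ost_spread))
echo "${spread[@]}"

# Same threshold as the test: a spread greater than 2 is reported as uneven.
if [[ ${spread[0]} -le 2 ]]; then
    echo "distribution within tolerance"
else
    echo "uneven distribution"
fi
```

Note that an OST which received no files at all never appears in the uniq -c output, so it is not counted as a minimum of 0 by this pipeline.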
Issue Links
- is related to
  - LU-977 incorrect round robin object allocation (Resolved)
  - LU-17251 parallel-scale test_rr_alloc: max/min OST objects (2800 : 923) too different (Resolved)
- is related to
  - LU-9780 Add test for fix added in LU-977 (Resolved)