[LU-14377] parallel-scale test rr_alloc fails with "Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2" Created: 28/Jan/21 Updated: 03/Nov/23 Resolved: 03/Nov/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.12.4, Lustre 2.15.1, Lustre 2.15.3 |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | ZFS |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
parallel-scale test_rr_alloc fails with "Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2". This test appears to have been failing since at least 22 NOV 2019; since 2020-06-11 it has been failing occasionally.

Looking at the test_suite log for the failure at https://testing.whamcloud.com/test_sets/875e0375-cc23-4f0f-8291-f4f9034e340c, we see

 + su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe -np 22 /usr/lib64/openmpi/bin/rr_alloc /tmp/rr_alloc_mntpt/lustre/drr_alloc.parallel-scale/ash 555 2 "
 CMD: trevis-63vm4 /usr/sbin/lctl set_param -n lod.lustre-MDT0000-mdtlov.qos_threshold_rr=17%
 CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0000-osc-MDT0000.create_count=1024
 CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0001-osc-MDT0000.create_count=1024
 CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0002-osc-MDT0000.create_count=1024
 CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0003-osc-MDT0000.create_count=2048
 CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0004-osc-MDT0000.create_count=2048
 CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0005-osc-MDT0000.create_count=2048
 CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0006-osc-MDT0000.create_count=1024
 CMD: trevis-63vm4 /usr/sbin/lctl set_param -n osp.lustre-OST0007-osc-MDT0000.create_count=2048
 parallel-scale test_rr_alloc: @@@@@@ FAIL: Uneven distribution detected: difference between maximum files per OST (1528) and minimum files per OST (1525) must not be greater than 2
   Trace dump:
   = /usr/lib64/lustre/tests/test-framework.sh:6273:error()
   = /usr/lib64/lustre/tests/functions.sh:1120:run_rr_alloc()
   = /usr/lib64/lustre/tests/parallel-scale.sh:163:test_rr_alloc()

which, except for the error message, is the same command and the same create_count values seen when this test passes.

From run_rr_alloc() in functions.sh, here is how the difference in the number of stripes created per OST is computed:

 1095         if [[ $total_MNTPTS -ne 0 ]]; then
 1096                 # Now start the actual file creation app.
 1097                 mpi_run "-np $total_MNTPTS" $cmd || return
 1098         else
 1099                 error "No mount point"
 1100         fi
 1101
 1102         restore_lustre_params < $qos_prec_objs
 1103         rm -f $qos_prec_objs
 1104
 1105         diff_max_min_arr=($($LFS getstripe -r $DIR/$tdir/ |
 1106                 grep "lmm_stripe_offset:" | awk '{print $2}' | sort -n |
 1107                 uniq -c | awk 'NR==1 {min=max=$1} \
 1108                 { $1<min ? min=$1 : min; $1>max ? max=$1 : max} \
 1109                 END {print max-min, max, min}'))
 1110
 1111         rm -rf $DIR/$tdir
 1112
 1113         # In-case of fairly large number of file creation using RR (round-robin)
 1114         # there can be two cases in which deviation will occur than the regular
 1115         # RR algo behaviour-
 1116         # 1- When rr_alloc does not start right with 'lqr_start_count' reseeded,
 1117         # 2- When rr_alloc does not finish with 'lqr_start_count == 0'.
 1118         # So the difference of files b/w any 2 OST should not be more than 2.
 1119         [[ ${diff_max_min_arr[0]} -le 2 ]] ||
 1120                 error "Uneven distribution detected: difference between" \
 1121                         "maximum files per OST (${diff_max_min_arr[1]}) and" \
 1122                         "minimum files per OST (${diff_max_min_arr[2]}) must not be" \
 1123                         "greater than 2"
|
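For reference, the distribution check above can be reproduced by hand on any striped directory: for each OST index, count how many files place their first stripe (lmm_stripe_offset) on that OST, then compare the largest and smallest counts. A minimal sketch, assuming the placeholder path /mnt/lustre/testdir (the directory rr_alloc populated, inspected before the test removes it):

 # Count files per OST index from each file's lmm_stripe_offset and report
 # the spread; equivalent to the test's sort | uniq -c pipeline.
 lfs getstripe -r /mnt/lustre/testdir |
         awk '/lmm_stripe_offset:/ { count[$2]++ }
              END {
                  for (ost in count) {
                      if (min == "" || count[ost] < min) min = count[ost]
                      if (count[ost] > max) max = count[ost]
                      print "OST index", ost, "files:", count[ost]
                  }
                  print "max - min =", max - min
              }'

A reported difference greater than 2 would trip the same check that run_rr_alloc() performs.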
| Comments |
| Comment by Nikitas Angelinas [ 16/May/22 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/d09310fb-2944-46a6-84e3-67634a12f39d |
| Comment by Sarah Liu [ 28/Jul/22 ] |
|
+1 in 2.15.1 https://testing.whamcloud.com/test_sets/59410f79-0bed-4fd3-97fc-e80941e5a00c |
| Comment by Gerrit Updater [ 19/Oct/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48914 |
| Comment by Gerrit Updater [ 08/Nov/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48914/ |
| Comment by Peter Jones [ 08/Nov/22 ] |
|
Landed for 2.16 |
| Comment by Gerrit Updater [ 25/May/23 ] |
|
"Minh Diep <mdiep@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51142 |
| Comment by Gerrit Updater [ 02/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51142/ |
| Comment by Andreas Dilger [ 28/Aug/23 ] |
|
This is still failing regularly, often with very large differences between the most- and least-used OSTs (e.g. a 90% difference instead of just 2-3%). So either the test as written is unreliable (creating too many objects, running on an already-imbalanced system, etc.), or there is a new bug in the code. |
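One way to check the "already-imbalanced system" hypothesis before the test runs is to look at per-OST object usage and the allocator tuning. This is only a sketch: the parameter names are the ones that appear in the log above, while /mnt/lustre and MDT0000 are example names, not taken from this report.

 # On a client: objects (inodes) already used per OST; a large skew here can
 # bias allocation away from pure round-robin.
 lfs df -i /mnt/lustre
 # On the MDS: the QOS/round-robin threshold and per-OST precreate batch
 # sizes, i.e. the same parameters the test adjusts via set_param above.
 lctl get_param lod.*.qos_threshold_rr
 lctl get_param osp.*-osc-MDT0000.create_count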
| Comment by Andreas Dilger [ 03/Nov/23 ] |
|
Patches have already landed for this ticket; new ticket LU-17251 is tracking further patches. |