Adding test output demonstrating the fix:
NOTE: The tests were run on real hardware with Lustre 2.5.1, using the variant of the fix in which the function calls inside the loop were not factored out into a separate function. I hope that's fine.
WITHOUT Fix
=============
To reproduce the issue, a cluster with 12 OSTs and 12 client machines was used. Each client is a machine with 24 CPU cores.
The IOR load is like the first example from CEAP-82, with smaller I/O transfer and block parameters.
The number of threads is 12 * 24, i.e. 24 threads per client.
To increase concurrency, each thread operated through its own Lustre mount point, so each client had 24 Lustre mounts.
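Setting up the per-client mount points could look roughly like the sketch below. The fsname "orangefs" is taken from the MDS log later in this comment; the MGS NID is a placeholder, and the mount commands are echoed rather than executed:

```shell
#!/bin/bash
# Illustrative sketch only: create 24 separate Lustre mount points
# on one client. MGSNID is an assumed placeholder; "orangefs" is the
# fsname seen in the MDS log. Commands are echoed, not run.
MGSNID="mgs@tcp0"
FSNAME="orangefs"
for i in $(seq 0 23); do
    echo "mkdir -p /mnt/lustre$i"
    echo "mount -t lustre ${MGSNID}:/${FSNAME} /mnt/lustre$i"
done
```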
The main script:
[CEAP-82]$ cat ceap-82.sh
#! /bin/bash
MPIEXEC=/home/bloewe/libs/mpich-3.1/install/bin/mpiexec
# EXE=/home/bloewe/benchmarks/mdtest-1.9.3/mdtest
EXE=/home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR
NPROCS=$((12 * 24))
TARGET=/mnt/lustre/ceap-82
MPISCRIPT=/home/bloewe/CEAP-82/mpi-script.sh
HOSTS=$PWD/hostfile
# OPTS="-a POSIX -B -C -E -F -e -g -k -b 4g -t 32m -vvv -o /lustre/crayadm/tmp/testdir.12403/IOR_POSIX"
# OPTS="-a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -vvv -o /mnt/lustre/ceap-82/IOR_POSIX"
OPTS="-a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -v"
# mkdir -p distr
#
# for n in {0..100}; do
# rm -f /mnt/lustre/ceap-82/IOR_POSIX*
# $MPIEXEC -f $HOSTS -np $NPROCS $EXE $OPTS
# for x in /mnt/lustre/ceap-82/IOR_POSIX*; do
# lfs getstripe -i $x
# done | sort -n | uniq -c > distr/d-$n
# done
rm -f /mnt/lustre/ceap-82/IOR_POSIX*
$MPIEXEC -f $HOSTS -np $NPROCS $MPISCRIPT $OPTS
for x in /mnt/lustre/ceap-82/IOR_POSIX*; do
lfs getstripe -i $x
done | sort -n | uniq -c
[CEAP-82]$
and an IOR wrapper, needed to map the 24 IOR instances on each client onto its 24 Lustre mount points:
[CEAP-82]$ cat mpi-script.sh
#! /bin/bash
LMOUNT=/mnt/lustre$(($PMI_RANK % 24))
exec /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR "$@" -o $LMOUNT/ceap-82/IOR_POSIX
[CEAP-82]$
This way we simulate client load from 24 * 12 = 288 clients.
If the RR algorithm works correctly, the result should be an equal distribution of the IOR working files across all OSTs: 24 files per OST.
However, the result of the test run shows an uneven file distribution:
Max Write: 5988.51 MiB/sec (6279.41 MB/sec)
Max Read: 10008.34 MiB/sec (10494.51 MB/sec)
Run finished: Sat Mar 21 14:10:12 2015
26 0
25 1
24 2
29 3
20 4
23 5
24 6
25 7
25 8
24 9
23 10
20 11
[CEAP-82]$
The number of files per OST varies from 20 to 29. This cannot be explained by "reseeds" in lod_rr_alloc().
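For comparison, an ideal round-robin allocator would spread the 288 files perfectly evenly. A minimal Python simulation of that expected behavior (illustrative only; this is not the lod_rr_alloc() logic):

```python
from collections import Counter

NUM_OSTS = 12
NUM_FILES = 288  # 12 clients * 24 threads, one file per thread

# Ideal round robin: file i goes to OST (i % NUM_OSTS).
counts = Counter(i % NUM_OSTS for i in range(NUM_FILES))

for ost in range(NUM_OSTS):
    print(counts[ost], ost)  # every OST gets exactly 24 files
```

Any deviation from 24 per OST, like the 20-29 spread above, means allocations are escaping the round-robin path.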
The log from the MDS ssh session where qos_threshold_rr was set:
[root@orange00 ~]# pdsh -g mds lctl set_param lod.*.qos_threshold_rr=100
orange03: error: set_param: /proc/{fs,sys}/{lnet,lustre}/lod/*/qos_threshold_rr: Found no match
pdsh@orange00: orange03: ssh exited with exit code 3
orange10: lod.orangefs-MDT0001-mdtlov.qos_threshold_rr=100
orange12: lod.orangefs-MDT0003-mdtlov.qos_threshold_rr=100
orange11: lod.orangefs-MDT0002-mdtlov.qos_threshold_rr=100
orange13: lod.orangefs-MDT0004-mdtlov.qos_threshold_rr=100
orange02: lod.orangefs-MDT0000-mdtlov.qos_threshold_rr=100
[root@orange00 ~]# slogin orange02
Last login: Sat Mar 21 14:07:41 PDT 2015 from 172.16.2.3 on ssh
[root@orange02 ~]# lctl set_param lctl set_param debug=-1 subsystem_debug=lov debug_mb=1200
error: set_param: /proc/{fs,sys}/{lnet,lustre}/lctl: Found no match
debug=-1
subsystem_debug=lov
debug_mb=1200
[root@orange02 ~]# lctl set_param debug=-1 subsystem_debug=lov debug_mb=1200
debug=-1
subsystem_debug=lov
debug_mb=1200
[root@orange02 ~]# lctl dk > /dev/null
[root@orange02 ~]# lctl dk /tmp/ceap-82.txt
Debug log: 5776 lines, 5776 kept, 0 dropped, 0 bad.
[root@orange02 ~]# lctl get_param lod.*.qos_threshold_rr
lod.orangefs-MDT0000-mdtlov.qos_threshold_rr=100%
[root@orange02 ~]# less /tmp/ceap-82.txt
[root@orange02 ~]# logout
WITH Fix
========
The following 5 runs showed an equal file distribution across all OSTs:
[CEAP-82]$ for n in {1..5} ; do sh ceap-82.sh; done
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Mon Mar 23 02:27:28 2015
Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
Machine: Linux sjsc-321
Summary:
api = POSIX
test filename = /mnt/lustre0/ceap-82/IOR_POSIX
access = file-per-process
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 288 (24 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 200 MiB
aggregate filesize = 56.25 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 4310.41 4310.41 4310.41 0.00 2155.21 2155.21 2155.21 0.00 13.36299 EXCEL
read 10076.77 10076.77 10076.77 0.00 5038.38 5038.38 5038.38 0.00 5.71612 EXCEL
Max Write: 4310.41 MiB/sec (4519.80 MB/sec)
Max Read: 10076.77 MiB/sec (10566.26 MB/sec)
Run finished: Mon Mar 23 02:27:48 2015
24 0
24 1
24 2
24 3
24 4
24 5
24 6
24 7
24 8
24 9
24 10
24 11
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Mon Mar 23 02:27:50 2015
Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
Machine: Linux sjsc-321
Summary:
api = POSIX
test filename = /mnt/lustre0/ceap-82/IOR_POSIX
access = file-per-process
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 288 (24 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 200 MiB
aggregate filesize = 56.25 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 4269.62 4269.62 4269.62 0.00 2134.81 2134.81 2134.81 0.00 13.49066 EXCEL
read 10078.78 10078.78 10078.78 0.00 5039.39 5039.39 5039.39 0.00 5.71498 EXCEL
Max Write: 4269.62 MiB/sec (4477.02 MB/sec)
Max Read: 10078.78 MiB/sec (10568.37 MB/sec)
Run finished: Mon Mar 23 02:28:10 2015
24 0
24 1
24 2
24 3
24 4
24 5
24 6
24 7
24 8
24 9
24 10
24 11
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Mon Mar 23 02:28:11 2015
Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
Machine: Linux sjsc-321
Summary:
api = POSIX
test filename = /mnt/lustre0/ceap-82/IOR_POSIX
access = file-per-process
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 288 (24 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 200 MiB
aggregate filesize = 56.25 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 4343.86 4343.86 4343.86 0.00 2171.93 2171.93 2171.93 0.00 13.26011 EXCEL
read 10091.59 10091.59 10091.59 0.00 5045.80 5045.80 5045.80 0.00 5.70772 EXCEL
Max Write: 4343.86 MiB/sec (4554.86 MB/sec)
Max Read: 10091.59 MiB/sec (10581.80 MB/sec)
Run finished: Mon Mar 23 02:28:31 2015
24 0
24 1
24 2
24 3
24 4
24 5
24 6
24 7
24 8
24 9
24 10
24 11
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Mon Mar 23 02:28:33 2015
Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
Machine: Linux sjsc-321
Summary:
api = POSIX
test filename = /mnt/lustre0/ceap-82/IOR_POSIX
access = file-per-process
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 288 (24 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 200 MiB
aggregate filesize = 56.25 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 4169.55 4169.55 4169.55 0.00 2084.78 2084.78 2084.78 0.00 13.81443 EXCEL
read 9997.98 9997.98 9997.98 0.00 4998.99 4998.99 4998.99 0.00 5.76116 EXCEL
Max Write: 4169.55 MiB/sec (4372.09 MB/sec)
Max Read: 9997.98 MiB/sec (10483.64 MB/sec)
Run finished: Mon Mar 23 02:28:53 2015
24 0
24 1
24 2
24 3
24 4
24 5
24 6
24 7
24 8
24 9
24 10
24 11
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Mon Mar 23 02:28:56 2015
Command line used: /home/bloewe/benchmarks/IOR-2.10.3/src/C/IOR -a POSIX -B -C -E -F -e -g -k -b 200m -t 2m -o /mnt/lustre0/ceap-82/IOR_POSIX
Machine: Linux sjsc-321
Summary:
api = POSIX
test filename = /mnt/lustre0/ceap-82/IOR_POSIX
access = file-per-process
ordering in a file = sequential offsets
ordering inter file=constant task offsets = 1
clients = 288 (24 per node)
repetitions = 1
xfersize = 2 MiB
blocksize = 200 MiB
aggregate filesize = 56.25 GiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- --------
write 4427.20 4427.20 4427.20 0.00 2213.60 2213.60 2213.60 0.00 13.01049 EXCEL
read 10026.62 10026.62 10026.62 0.00 5013.31 5013.31 5013.31 0.00 5.74471 EXCEL
Max Write: 4427.20 MiB/sec (4642.25 MB/sec)
Max Read: 10026.62 MiB/sec (10513.67 MB/sec)
Run finished: Mon Mar 23 02:29:15 2015
24 0
24 1
24 2
24 3
24 4
24 5
24 6
24 7
24 8
24 9
24 10
24 11
[CEAP-82]$
Landed for 2.8