[LU-2659] single client throughput for 10GigE Created: 21/Jan/13 Updated: 27/Aug/13 Resolved: 27/Aug/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Frederik Ferner (Inactive) | Assignee: | Minh Diep |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | ptr | ||
| Attachments: |
|
| Rank (Obsolete): | 6208 |
| Description |
|
(I'm not sure about the issue type for this ticket, please adjust as appropriate.)

As discussed with Peter Jones, we are trying to implement a file system where single clients can achieve >900MB/s write throughput over 10GigE connections. Ideally single 10GigE for the clients, but 2x10GigE LACP bonding might be an option. The OSSes will initially have 4x 10GigE LACP bonded links, though for some initial testing we might start with fewer links. The disk backend has now arrived and below is a sample obdfilter-survey result using all 41 OSTs across the 4 OSSes, without much tuning on the OSS nodes yet. The OSSes are all running Lustre 2.3.0 on RHEL6.

Sat Jan 19 15:49:23 GMT 2013 Obdfilter-survey for case=disk from cs04r-sc-oss05-03.diamond.ac.uk
ost 41 sz 687865856K rsz 1024K obj 41 thr 41 write 2975.14 [ 40.00, 105.99] rewrite 2944.84 [ 22.00, 118.99] read 8104.33 [ 40.99, 231.98]
ost 41 sz 687865856K rsz 1024K obj 41 thr 82 write 5231.39 [ 49.99, 167.98] rewrite 4984.58 [ 29.98, 171.89] read 13807.08 [ 161.99, 514.92]
ost 41 sz 687865856K rsz 1024K obj 41 thr 164 write 9445.93 [ 82.99, 293.98] rewrite 9722.32 [ 149.98, 324.96] read 17851.10 [ 191.97, 869.92]
ost 41 sz 687865856K rsz 1024K obj 41 thr 328 write 15872.41 [ 265.96, 533.94] rewrite 16682.58 [ 245.97, 526.97] read 19312.61 [ 184.98, 794.93]
ost 41 sz 687865856K rsz 1024K obj 41 thr 656 write 18704.47 [ 222.98, 651.94] rewrite 18733.29 [ 252.90, 634.83] read 21040.28 [ 260.98, 808.92]
ost 41 sz 687865856K rsz 1024K obj 41 thr 1312 write 18291.71 [ 161.99, 740.93] rewrite 18443.63 [ 47.00, 704.91] read 20683.56 [ 178.99, 908.91]
ost 41 sz 687865856K rsz 1024K obj 41 thr 2624 write 18704.50 [ 19.00, 684.92] rewrite 18583.81 [ 25.00, 729.92] read 20400.08 [ 110.99, 982.88]
ost 41 sz 687865856K rsz 1024K obj 82 thr 82 write 5634.08 [ 62.99, 176.98] rewrite 4640.45 [ 55.00, 162.98] read 9459.26 [ 114.98, 320.99]
ost 41 sz 687865856K rsz 1024K obj 82 thr 164 write 9615.85 [ 95.99, 308.98] rewrite 8329.19 [ 122.99, 275.99] read 13967.03 [ 150.99, 430.97]
ost 41 sz 687865856K rsz 1024K obj 82 thr 328 write 13846.63 [ 229.99, 461.97] rewrite 12576.55 [ 186.98, 390.97] read 18166.27 [ 130.99, 557.94]
ost 41 sz 687865856K rsz 1024K obj 82 thr 656 write 18558.35 [ 268.98, 624.93] rewrite 16821.93 [ 246.85, 542.95] read 19645.73 [ 235.85, 676.92]
ost 41 sz 687865856K rsz 1024K obj 82 thr 1312 write 18885.19 [ 117.99, 690.92] rewrite 16501.04 [ 115.99, 617.95] read 19255.26 [ 180.97, 832.89]
ost 41 sz 687865856K rsz 1024K obj 82 thr 2624 write 18991.31 [ 127.51, 784.92] rewrite 18111.05 [ 31.00, 763.88] read 20333.42 [ 124.48, 997.82]
ost 41 sz 687865856K rsz 1024K obj 164 thr 164 write 7513.17 [ 69.99, 236.95] rewrite 5611.77 [ 65.00, 198.96] read 12950.03 [ 80.99, 383.96]
ost 41 sz 687865856K rsz 1024K obj 164 thr 328 write 13191.77 [ 216.99, 361.98] rewrite 10104.73 [ 129.99, 313.98] read 18380.92 [ 149.98, 529.97]
ost 41 sz 687865856K rsz 1024K obj 164 thr 656 write 16442.83 [ 168.98, 494.91] rewrite 14155.27 [ 213.98, 452.97] read 19564.97 [ 238.85, 616.95]
ost 41 sz 687865856K rsz 1024K obj 164 thr 1312 write 18070.58 [ 152.96, 612.91] rewrite 15744.41 [ 62.99, 540.96] read 18846.31 [ 160.99, 660.84]
ost 41 sz 687865856K rsz 1024K obj 164 thr 2624 write 18664.83 [ 138.97, 767.93] rewrite 16648.63 [ 81.28, 603.93] read 19319.91 [ 79.97, 864.90]
ost 41 sz 687865856K rsz 1024K obj 328 thr 328 write 9028.81 [ 66.00, 277.97] rewrite 6807.19 [ 42.99, 228.98] read 14799.75 [ 123.98, 491.92]
ost 41 sz 687865856K rsz 1024K obj 328 thr 656 write 14471.67 [ 155.98, 427.97] rewrite 11632.72 [ 130.99, 375.98] read 19137.29 [ 127.79, 595.92]
ost 41 sz 687865856K rsz 1024K obj 328 thr 1312 write 17084.20 [ 179.98, 533.95] rewrite 13810.96 [ 64.00, 449.96] read 18405.80 [ 182.98, 616.95]
ost 41 sz 687865856K rsz 1024K obj 328 thr 2624 write 18583.14 [ 24.99, 684.92] rewrite 15588.87 [ 68.99, 579.93] read 18857.33 [ 160.98, 706.96]
ost 41 sz 687865856K rsz 1024K obj 656 thr 656 write 9861.09 [ 121.98, 312.96] rewrite 7540.60 [ 70.00, 258.96] read 15160.96 [ 193.96, 483.94]
ost 41 sz 687865856K rsz 1024K obj 656 thr 1312 write 15021.83 [ 175.97, 450.95] rewrite 11641.17 [ 97.99, 389.98] read 18470.04 [ 205.99, 597.91]
ost 41 sz 687865856K rsz 1024K obj 656 thr 2624 write 17202.58 [ 84.98, 589.90] rewrite 14483.38 [ 143.98, 491.91] read 18475.50 [ 179.98, 631.94]

We have not yet done any tests with clients (in fact the 10GigE network still needs to be configured), but I would like to ask if there is any reason why we should not achieve our goal with this storage hardware. I will also update the ticket once we've done some tests with clients.
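For reference, an obdfilter-survey run of this kind can be reproduced with something along the following lines (a sketch only; the size, object/thread ranges and target names below are illustrative, not the exact parameters used for the results above):

# run as root on an OSS; obdfilter-survey ships with lustre-iokit
size=1024 nobjlo=1 nobjhi=16 thrlo=1 thrhi=64 case=disk \
  targets="spfs1-OST0000 spfs1-OST0001" obdfilter-survey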
| Comments |
| Comment by Peter Jones [ 21/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Minh is helping with this initiative | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 22/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I've now done initial tests with a dual 10GigE client and I get about 440MB/s write throughput using both ior and dd as tests. The IOR test for a file with stripe count=2 is below; I get the same result with any stripe count >1 that I've tried.

[bnh65367@cs04r-sc-serv-66 ~]$ $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -k -w -t1m -b 20g -i 3 -e
IOR-3.0.0: MPI Coordinated Test of Parallel I/O
Began: Tue Jan 22 18:28:17 2013
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/stripe-2/ior_dat -k -w -t1m -b 20g -i 3 -e
Machine: Linux cs04r-sc-serv-66.diamond.ac.uk
Test 0 started: Tue Jan 22 18:28:17 2013
Summary:
api = POSIX
test filename = /mnt/lustre-test/frederik1/stripe-2/ior_dat
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 1 (1 per node)
repetitions = 3
xfersize = 1 MiB
blocksize = 20 GiB
aggregate filesize = 20 GiB
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 441.59 20971520 1024.00 0.002618 46.37 0.000736 46.38 0
write 475.01 20971520 1024.00 0.003628 43.11 0.000720 43.12 1
write 462.41 20971520 1024.00 0.003383 44.29 0.000516 44.29 2
Max Write: 475.01 MiB/sec (498.08 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write 475.01 441.59 459.67 13.78 44.59420 0 1 1 3 0 0 1 0 0 1 21474836480 1048576 21474836480 POSIX 0
Finished: Tue Jan 22 18:30:59 2013
[bnh65367@cs04r-sc-serv-66 ~]$
[bnh65367@cs04r-sc-serv-66 ~]$ lfs getstripe /mnt/lustre-test/frederik1/stripe-2/ior_dat
/mnt/lustre-test/frederik1/stripe-2/ior_dat
lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  5
        obdidx           objid          objid            group
             5             176           0xb0                0
            28             176           0xb0                0
[bnh65367@cs04r-sc-serv-66 ~]$

I have verified using netperf that I can send at least 1100MB/s over the network to each of the OSSes, and if I send to all OSSes at the same time I can send 590MB/s to each.
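For reference, a single-stream check like the one described can be done with netperf along these lines (a sketch; the target hostname and duration are illustrative):

# on the client, against one OSS data interface
netperf -H cs04r-sc-oss05-03-10g -t TCP_STREAM -l 30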
| Comment by Minh Diep [ 23/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, Could you provide a little bit about the OSS HW config? # cores, memory, type of disk? Thanks | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 23/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I believe $NSLOTS=1 above, could you try to run with 2, 4, 8? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 23/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Minh, thanks for looking into this. OSS HW: dual 6-core Intel Xeon E5-2630, 64GB RAM; OSTs are SRP LUNs (8+2 RAID6) on a DDN SFA12K. The client for this test is the same hardware. See below for a sample sgpdd-survey run, though I'm suspicious of the write performance at low crg counts; I suspect some write cache effects somewhere, possibly in the storage. Unfortunately I've not repeated this with larger sizes.

[bnh65367@cs04r-sc-oss05-01 ~]$ sudo rawdevs=/dev/raw/raw1 sgpdd-survey
Thu Jan 17 13:59:56 GMT 2013 sgpdd-survey on /dev/raw/raw1 from cs04r-sc-oss05-01.diamond.ac.uk
total_size 8388608K rsz 1024 crg 1 thr 1 write 1432.90 MB/s 1 x 1435.99 = 1435.99 MB/s read 259.10 MB/s 1 x 259.21 = 259.21 MB/s
total_size 8388608K rsz 1024 crg 1 thr 2 write 1376.59 MB/s 1 x 1378.40 = 1378.40 MB/s read 200.59 MB/s 1 x 200.63 = 200.63 MB/s
total_size 8388608K rsz 1024 crg 1 thr 4 write 1381.94 MB/s 1 x 1383.87 = 1383.87 MB/s read 272.67 MB/s 1 x 272.77 = 272.77 MB/s
total_size 8388608K rsz 1024 crg 1 thr 8 write 1361.22 MB/s 1 x 1363.05 = 1363.05 MB/s read 283.31 MB/s 1 x 283.38 = 283.38 MB/s
total_size 8388608K rsz 1024 crg 1 thr 16 write 1384.87 MB/s 1 x 1386.87 = 1386.87 MB/s read 374.51 MB/s 1 x 374.72 = 374.72 MB/s
total_size 8388608K rsz 1024 crg 2 thr 2 write 955.70 MB/s 2 x 478.29 = 956.57 MB/s read 168.72 MB/s 2 x 84.37 = 168.74 MB/s
total_size 8388608K rsz 1024 crg 2 thr 4 write 1021.93 MB/s 2 x 511.69 = 1023.39 MB/s read 198.71 MB/s 2 x 99.37 = 198.75 MB/s
total_size 8388608K rsz 1024 crg 2 thr 8 write 970.46 MB/s 2 x 485.71 = 971.41 MB/s read 201.85 MB/s 2 x 100.96 = 201.91 MB/s
total_size 8388608K rsz 1024 crg 2 thr 16 write 1057.11 MB/s 2 x 529.13 = 1058.25 MB/s read 234.28 MB/s 2 x 117.17 = 234.34 MB/s
total_size 8388608K rsz 1024 crg 2 thr 32 write 960.45 MB/s 2 x 480.69 = 961.38 MB/s read 211.48 MB/s 2 x 105.77 = 211.54 MB/s
total_size 8388608K rsz 1024 crg 4 thr 4 write 709.30 MB/s 4 x 177.45 = 709.80 MB/s read 326.63 MB/s 4 x 81.68 = 326.73 MB/s
total_size 8388608K rsz 1024 crg 4 thr 8 write 700.98 MB/s 4 x 175.37 = 701.48 MB/s read 282.53 MB/s 4 x 70.67 = 282.67 MB/s
total_size 8388608K rsz 1024 crg 4 thr 16 write 752.53 MB/s 4 x 188.28 = 753.14 MB/s read 308.87 MB/s 4 x 77.24 = 308.95 MB/s
total_size 8388608K rsz 1024 crg 4 thr 32 write 696.21 MB/s 4 x 174.18 = 696.72 MB/s read 280.55 MB/s 4 x 70.16 = 280.65 MB/s
total_size 8388608K rsz 1024 crg 4 thr 64 write 690.79 MB/s 4 x 172.82 = 691.30 MB/s read 263.21 MB/s 4 x 65.82 = 263.29 MB/s
total_size 8388608K rsz 1024 crg 8 thr 8 write 501.77 MB/s 8 x 62.75 = 502.01 MB/s read 325.21 MB/s 8 x 40.67 = 325.39 MB/s
total_size 8388608K rsz 1024 crg 8 thr 16 write 506.74 MB/s 8 x 63.37 = 506.97 MB/s read 320.07 MB/s 8 x 40.03 = 320.21 MB/s
total_size 8388608K rsz 1024 crg 8 thr 32 write 485.71 MB/s 8 x 60.75 = 485.99 MB/s read 353.40 MB/s 8 x 44.20 = 353.62 MB/s
total_size 8388608K rsz 1024 crg 8 thr 64 write 501.04 MB/s 8 x 62.67 = 501.33 MB/s read 255.57 MB/s 8 x 31.96 = 255.66 MB/s
total_size 8388608K rsz 1024 crg 8 thr 128 write 525.23 MB/s 8 x 65.71 = 525.67 MB/s read 378.57 MB/s 8 x 47.34 = 378.72 MB/s
total_size 8388608K rsz 1024 crg 16 thr 16 write 383.54 MB/s 16 x 23.98 = 383.76 MB/s read 381.47 MB/s 16 x 23.85 = 381.62 MB/s
total_size 8388608K rsz 1024 crg 16 thr 32 write 401.78 MB/s 16 x 25.12 = 401.92 MB/s read 392.46 MB/s 16 x 24.54 = 392.61 MB/s
total_size 8388608K rsz 1024 crg 16 thr 64 write 418.10 MB/s 16 x 26.15 = 418.40 MB/s read 304.52 MB/s 16 x 19.04 = 304.57 MB/s
total_size 8388608K rsz 1024 crg 16 thr 128 write 405.86 MB/s 16 x 25.38 = 406.04 MB/s read 325.64 MB/s 16 x 20.37 = 325.93 MB/s
total_size 8388608K rsz 1024 crg 16 thr 256 write 389.65 MB/s 16 x 24.37 = 389.86 MB/s read 318.94 MB/s 16 x 19.94 = 319.06 MB/s
total_size 8388608K rsz 1024 crg 32 thr 32 write 365.67 MB/s 32 x 11.43 = 365.91 MB/s read 184.33 MB/s 32 x 5.76 = 184.33 MB/s
total_size 8388608K rsz 1024 crg 32 thr 64 write 352.64 MB/s 32 x 11.02 = 352.78 MB/s read 192.22 MB/s 32 x 6.01 = 192.26 MB/s
total_size 8388608K rsz 1024 crg 32 thr 128 write 348.70 MB/s 32 x 10.90 = 348.82 MB/s read 239.66 MB/s 32 x 7.50 = 239.87 MB/s
total_size 8388608K rsz 1024 crg 32 thr 256 write 299.37 MB/s 32 x 9.36 = 299.38 MB/s read 248.02 MB/s 32 x 7.75 = 248.11 MB/s
total_size 8388608K rsz 1024 crg 32 thr 512 write 299.98 MB/s 32 x 9.37 = 299.99 MB/s read 229.41 MB/s 32 x 7.17 = 229.49 MB/s
total_size 8388608K rsz 1024 crg 64 thr 64 write 273.48 MB/s 64 x 4.27 = 273.44 MB/s read 157.11 MB/s 64 x 2.45 = 156.86 MB/s
total_size 8388608K rsz 1024 crg 64 thr 128 write 334.12 MB/s 64 x 5.23 = 334.47 MB/s read 184.61 MB/s 64 x 2.89 = 184.94 MB/s
total_size 8388608K rsz 1024 crg 64 thr 256 write 298.72 MB/s 64 x 4.67 = 299.07 MB/s read 192.36 MB/s 64 x 3.00 = 192.26 MB/s
total_size 8388608K rsz 1024 crg 64 thr 512 write 313.37 MB/s 64 x 4.90 = 313.72 MB/s read 193.55 MB/s 64 x 3.02 = 193.48 MB/s
total_size 8388608K rsz 1024 crg 64 thr 1024 write 317.25 MB/s 64 x 4.96 = 317.38 MB/s read 191.37 MB/s 64 x 2.99 = 191.65 MB/s
total_size 8388608K rsz 1024 crg 128 thr 128 write 297.69 MB/s 128 x 2.33 = 297.85 MB/s read 219.01 MB/s 128 x 1.71 = 218.51 MB/s
total_size 8388608K rsz 1024 crg 128 thr 256 write 305.70 MB/s 128 x 2.39 = 306.40 MB/s read 209.97 MB/s 128 x 1.64 = 209.96 MB/s
total_size 8388608K rsz 1024 crg 128 thr 512 write 276.41 MB/s 128 x 2.16 = 277.10 MB/s read 162.79 MB/s 128 x 1.27 = 162.35 MB/s
total_size 8388608K rsz 1024 crg 128 thr 1024 write 301.00 MB/s 128 x 2.36 = 301.51 MB/s read 216.29 MB/s 128 x 1.69 = 216.06 MB/s
total_size 8388608K rsz 1024 crg 128 thr 2048 write 258.84 MB/s 128 x 2.02 = 258.79 MB/s read 208.24 MB/s 128 x 1.63 = 208.74 MB/s
total_size 8388608K rsz 1024 crg 256 thr 256 write 257.61 MB/s 256 x 1.01 = 258.79 MB/s read 222.66 MB/s 256 x 0.87 = 222.17 MB/s
total_size 8388608K rsz 1024 crg 256 thr 512 write 254.39 MB/s 256 x 0.99 = 253.91 MB/s read 213.17 MB/s 256 x 0.83 = 212.40 MB/s
total_size 8388608K rsz 1024 crg 256 thr 1024 write 247.27 MB/s 256 x 0.97 = 249.02 MB/s read 217.71 MB/s 256 x 0.85 = 217.29 MB/s
total_size 8388608K rsz 1024 crg 256 thr 2048 write 257.63 MB/s 256 x 1.02 = 261.23 MB/s read 216.82 MB/s 256 x 0.85 = 217.29 MB/s
total_size 8388608K rsz 1024 crg 256 thr 4096 write 278.90 MB/s 256 x 1.09 = 278.32 MB/s read 217.55 MB/s 256 x 0.85 = 217.29 MB/s

You are correct, the ior test was with NSLOTS=1; I've also done tests with higher numbers without seeing any improvement. Here is a sample for NSLOTS=2 (which is also striped over 20 OSTs); I'll also run with 4 and 8 and update the call when I have the output. (I do get higher throughput if I select file-per-process, but unfortunately that won't help us with the particular problem.)

[bnh65367@cs04r-sc-serv-66 ~]$ $MPIRUN ${MPIRUN_OPTS} -np 2 -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -k -t1m -b 20g -i 2 -e
IOR-3.0.0: MPI Coordinated Test of Parallel I/O
Began: Tue Jan 22 17:44:20 2013
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/stripe-20/ior_dat -k -w -t1m -b 20g -i 2 -e
Machine: Linux cs04r-sc-serv-66.diamond.ac.uk
Test 0 started: Tue Jan 22 17:44:20 2013
Summary:
api = POSIX
test filename = /mnt/lustre-test/frederik1/stripe-20/ior_dat
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 2 (2 per node)
repetitions = 2
xfersize = 1 MiB
blocksize = 20 GiB
aggregate filesize = 40 GiB
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 407.56 20971520 1024.00 0.005184 100.50 0.001251 100.50 0
write 404.99 20971520 1024.00 0.004892 101.13 0.001267 101.14 1
Max Write: 407.56 MiB/sec (427.35 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write 407.56 404.99 406.27 1.28 100.81957 0 2 2 2 0 0 1 0 0 1 21474836480 1048576 42949672960 POSIX 0
Finished: Tue Jan 22 17:48:36 2013
[bnh65367@cs04r-sc-serv-66 ~]$
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 23/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
According to the obdfilter-survey above, you have 41 OSTs on that OSS. Is that true? Could you provide more info: how many OSTs per OSS, and how many OSSes in total? Are they sharing the same storage? ... - thanks
| Comment by Frederik Ferner (Inactive) [ 23/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We have 42 (8+2) LUNs in the SFA12K. For these initial tests, and as we don't expect to test metadata performance, one of these LUNs is used as the MDT; this will be separate storage in any final file system. We've currently got 41 OSTs. 4 OSSes are connected to the storage, each using direct-connected dual FDR 56Gbit/s IB connections (using dual-port cards, so these two connections have a total bandwidth not much higher than 56Gbit/s). Each OSS has access to 21 LUNs. Without fail-over each OSS serves 10 or 11 OSTs. Everything is sharing the same SFA12K. The obdfilter-survey test was using all OSTs spread out over all OSSes (so 10-11 OSTs per OSS).
| Comment by Minh Diep [ 23/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
could you set stripe=-1 (all), xfersize=4M and try ior again? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
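A minimal sketch of the suggested run, reusing the paths and variables from the earlier commands (the stripe-all directory name is illustrative):

# stripe new files in this directory over all OSTs, then rerun ior with a 4M transfer size
lfs setstripe -c -1 /mnt/lustre-test/frederik1/stripe-all
$MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile \
  /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/stripe-all/ior_dat -w -k -t 4m -b 20g -i 3 -e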
| Comment by Frederik Ferner (Inactive) [ 24/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Minh, I've now repeated the tests with xfersize=1M for NSLOTS=1,2,4,8, gathering brw_stats as well. The full output is attached along with a tar file of all brw_stats files. In the output you'll see lines like this: spfs1_brw_stats_20130124133533 - these are the directory names containing all brw_stats files by server and OST. Here's a summary of the results (each test with 2 iterations):
Cheers | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
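For reference, per-OST brw_stats of this kind can be captured on each OSS with something like the following (a sketch; the output directory name is illustrative and this may not be exactly how the attached files were gathered):

# on each OSS
outdir=spfs1_brw_stats_$(date +%Y%m%d%H%M%S)
mkdir $outdir
lctl get_param obdfilter.*.brw_stats > $outdir/brw_stats.txt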
| Comment by Minh Diep [ 25/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks for the results. I think you are aware of | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 25/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
could you also install master version (tag 2.3.59) on the client and try again? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
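In case it helps, a client-only build of a given tag from git usually follows these steps (a sketch, assuming the kernel-devel package for the running kernel is installed; the URL and tag below are as they were commonly used at the time):

git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout 2.3.59
sh autogen.sh
./configure --disable-server
make rpms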
| Comment by Frederik Ferner (Inactive) [ 25/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Yes, I'm aware of that. The bonding on the client is LACP with these parameters: "mode=802.3ad xmit_hash_policy=2 miimon=100". I have verified using netperf that I can send about 1GB/s to one OSS using one stream, and using two streams I can send ~2GB/s if I pick a suitable combination of OSSes. FWIW, I have also tested the file-per-process option in ior with multiple processes and I've seen 1.6GB/s write throughput for IIRC 10 or 12 processes. Today I've also tried the latest master from git (commit 57373a2, a client that I compiled myself for the kernel I'm using for these tests; should I try tag 2.3.59 specifically?) and the version of Lustre 1.8.8(.60) that we use on all our other clients. Summarised results are below. With the 1.8 client and checksums on I got about 520MiB/s, and with checksums off 620MiB/s (I haven't recorded the ior output though). With master on the client I got this:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Peter Jones [ 25/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Frederik A number of fixes from Peter | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 25/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Peter, thanks for letting me know. Should I upload the full ior output and/or brw_stats for my tests with master or is the summary enough? Cheers, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 25/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
just the ior is good. Please try both shared file and file per process. thanks | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
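For reference, in IOR the file-per-process mode is selected with the -F flag; a sketch based on the command line already used in this ticket:

# shared file (as before) vs. file per process (add -F)
$MPIRUN ${MPIRUN_OPTS} -np 4 -machinefile ${TMPDIR}/hostfile \
  /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -w -k -t1m -b 20g -i 3 -e -F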
| Comment by Frederik Ferner (Inactive) [ 25/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Minh, I've attached the full output for a new ior run with master on the client, each test was done with 3 iterations. The order of tests in this output was
As a summary these are the results: Single shared file:
File per process:
I've noticed that, at least in the first test for a single file, the difference between the individual test results is relatively big (308MiB/s to 510MiB/s); I've not investigated this any further yet.

Frederik
| Comment by Minh Diep [ 28/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I think it's worth a try to increase the xfersize to 8, 16, 32, 64M. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
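A sketch of how those transfer sizes could be swept in one go, reusing the ior command line from earlier in the ticket:

for xs in 8m 16m 32m 64m; do
  $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile \
    /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -w -k -t $xs -b 20g -i 2 -e
done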
| Comment by Frederik Ferner (Inactive) [ 28/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I'm currently repeating the tests with larger xfersize as suggested. It might be possible to add more OSS nodes, but I'm not too confident that this will help us much, especially as in my tests so far the number of OSTs didn't seem to make much difference, even when going down to 2 OSTs (so only using 2 OSSes at most). AFAICT the obdfilter-survey shows that the OSSes can push the data to the storage fast enough. We should have sufficient network bandwidth available and, according to our monitoring, the OSS nodes are not busy. On the subject of network bandwidth, do you have a good lnet_selftest script to verify what the LNet performance is for this system? Using netperf I was able to confirm that the basic TCP network performance is good.
| Comment by Frederik Ferner (Inactive) [ 28/Jan/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Minh, I've repeated the single-shared-file tests for larger xfersize as suggested.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 04/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is a sample script to run brw_test that I used before. You can edit it to fit your environment:

#!/bin/sh
PATH=$PATH:/usr/sbin
C=xxx@o2ib1
C_COUNT=`echo $C | wc -w`
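For reference, a minimal generic lnet_selftest brw write test looks roughly like this (a sketch based on the standard lst commands; the NIDs, group/batch names, concurrency and sleep time below are placeholders, not the values from the attached script):

#!/bin/sh
# load the self-test module and tag the session
modprobe lnet_selftest
export LST_SESSION=$$
lst new_session rw_test
# groups of client and server NIDs (placeholders)
lst add_group clients 192.168.1.10@tcp
lst add_group servers 192.168.1.20@tcp
# a batch with one bulk-write test at concurrency 32
lst add_batch bulk
lst add_test --batch bulk --concurrency 32 --from clients --to servers \
    brw write check=simple size=1M
lst run bulk
# print LNet rates/bandwidth for a while, then tear down
lst stat clients servers & STAT_PID=$!
sleep 60
kill $STAT_PID
lst end_session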
| Comment by Frederik Ferner (Inactive) [ 06/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks for the lnet_selftest script, though it was a bit hard to read with the formatting etc... I've now quickly run it on one client and two servers; output below:

[bnh65367@cs04r-sc-serv-68 bin]$ sudo ./lnet-selftest-wc.sh -k 111 -r start
SESSION: hh FEATURES: 0 TIMEOUT: 100000 FORCE: No
cs04r-sc-serv-68-10g are added to session
cs04r-sc-oss05-03-10g are added to session
cs04r-sc-oss05-04-10g are added to session
Test was added successfully
Test was added successfully
b is running now
Batch: b Tests: 2 State: 177
ACTIVE BUSY DOWN UNKNOWN TOTAL
client 1 0 0 0 1
server 2 0 0 0 2
Test 1(brw) (loop: 1800000, concurrency: 32)
ACTIVE BUSY DOWN UNKNOWN TOTAL
client 1 0 0 0 1
server 2 0 0 0 2
Test 2(brw) (loop: 1800000, concurrency: 32)
ACTIVE BUSY DOWN UNKNOWN TOTAL
client 1 0 0 0 1
server 2 0 0 0 2
[LNet Rates of c]
[R] Avg: 7704 RPC/s Min: 7704 RPC/s Max: 7704 RPC/s
[W] Avg: 5695 RPC/s Min: 5695 RPC/s Max: 5695 RPC/s
[LNet Bandwidth of c]
[R] Avg: 2011.80 MB/s Min: 2011.80 MB/s Max: 2011.80 MB/s
[W] Avg: 1841.79 MB/s Min: 1841.79 MB/s Max: 1841.79 MB/s
[LNet Rates of s]
[R] Avg: 2849 RPC/s Min: 2208 RPC/s Max: 3490 RPC/s
[W] Avg: 3853 RPC/s Min: 3045 RPC/s Max: 4661 RPC/s
[LNet Bandwidth of s]
[R] Avg: 921.26 MB/s Min: 683.45 MB/s Max: 1159.07 MB/s
[W] Avg: 1005.93 MB/s Min: 840.24 MB/s Max: 1171.62 MB/s
[LNet Rates of c]
[R] Avg: 7634 RPC/s Min: 7634 RPC/s Max: 7634 RPC/s
[W] Avg: 5634 RPC/s Min: 5634 RPC/s Max: 5634 RPC/s
[LNet Bandwidth of c]
[R] Avg: 1998.43 MB/s Min: 1998.43 MB/s Max: 1998.43 MB/s
[W] Avg: 1819.24 MB/s Min: 1819.24 MB/s Max: 1819.24 MB/s
[LNet Rates of s]
[R] Avg: 2818 RPC/s Min: 2137 RPC/s Max: 3499 RPC/s
[W] Avg: 3816 RPC/s Min: 2961 RPC/s Max: 4672 RPC/s
[LNet Bandwidth of s]
[R] Avg: 909.21 MB/s Min: 656.14 MB/s Max: 1162.28 MB/s
[W] Avg: 998.65 MB/s Min: 823.79 MB/s Max: 1173.52 MB/s
[LNet Rates of c]
[R] Avg: 7322 RPC/s Min: 7322 RPC/s Max: 7322 RPC/s
[W] Avg: 5409 RPC/s Min: 5409 RPC/s Max: 5409 RPC/s
[LNet Bandwidth of c]
[R] Avg: 1914.47 MB/s Min: 1914.47 MB/s Max: 1914.47 MB/s
[W] Avg: 1747.85 MB/s Min: 1747.85 MB/s Max: 1747.85 MB/s
[LNet Rates of s]
[R] Avg: 2704 RPC/s Min: 1897 RPC/s Max: 3510 RPC/s
[W] Avg: 3660 RPC/s Min: 2636 RPC/s Max: 4685 RPC/s
[LNet Bandwidth of s]
[R] Avg: 873.47 MB/s Min: 579.29 MB/s Max: 1167.64 MB/s
[W] Avg: 956.83 MB/s Min: 738.93 MB/s Max: 1174.73 MB/s
[LNet Rates of c]
[R] Avg: 7580 RPC/s Min: 7580 RPC/s Max: 7580 RPC/s
[W] Avg: 5594 RPC/s Min: 5594 RPC/s Max: 5594 RPC/s
[LNet Bandwidth of c]
[R] Avg: 1988.69 MB/s Min: 1988.69 MB/s Max: 1988.69 MB/s
[W] Avg: 1803.03 MB/s Min: 1803.03 MB/s Max: 1803.03 MB/s
[LNet Rates of s]
[R] Avg: 2796 RPC/s Min: 2112 RPC/s Max: 3480 RPC/s
[W] Avg: 3789 RPC/s Min: 2927 RPC/s Max: 4650 RPC/s
[LNet Bandwidth of s]
[R] Avg: 901.00 MB/s Min: 647.69 MB/s Max: 1154.30 MB/s
[W] Avg: 993.80 MB/s Min: 817.02 MB/s Max: 1170.58 MB/s
[LNet Rates of c]
[R] Avg: 8064 RPC/s Min: 8064 RPC/s Max: 8064 RPC/s
[W] Avg: 5957 RPC/s Min: 5957 RPC/s Max: 5957 RPC/s
[LNet Bandwidth of c]
[R] Avg: 2105.40 MB/s Min: 2105.40 MB/s Max: 2105.40 MB/s
[W] Avg: 1926.91 MB/s Min: 1926.91 MB/s Max: 1926.91 MB/s
[LNet Rates of s]
[R] Avg: 2973 RPC/s Min: 2468 RPC/s Max: 3479 RPC/s
[W] Avg: 4026 RPC/s Min: 3403 RPC/s Max: 4648 RPC/s
[LNet Bandwidth of s]
[R] Avg: 961.77 MB/s Min: 768.23 MB/s Max: 1155.32 MB/s
[W] Avg: 1050.98 MB/s Min: 932.80 MB/s Max: 1169.15 MB/s
[LNet Rates of c]
[R] Avg: 7601 RPC/s Min: 7601 RPC/s Max: 7601 RPC/s
[W] Avg: 5624 RPC/s Min: 5624 RPC/s Max: 5624 RPC/s
[LNet Bandwidth of c]
[R] Avg: 1977.67 MB/s Min: 1977.67 MB/s Max: 1977.67 MB/s
[W] Avg: 1824.20 MB/s Min: 1824.20 MB/s Max: 1824.20 MB/s
[LNet Rates of s]
[R] Avg: 2814 RPC/s Min: 2173 RPC/s Max: 3454 RPC/s
[W] Avg: 3802 RPC/s Min: 2993 RPC/s Max: 4610 RPC/s
[LNet Bandwidth of s]
[R] Avg: 912.18 MB/s Min: 676.17 MB/s Max: 1148.19 MB/s
[W] Avg: 988.94 MB/s Min: 820.77 MB/s Max: 1157.10 MB/s
No session exists
This was running until I terminated it in another window:

[bnh65367@cs04r-sc-serv-68 bin]$ sudo ./lnet-selftest-wc.sh -k 111 -r stop
c: Total 0 error nodes in c
s: Total 0 error nodes in s
1 batch in stopping
Batch is stopped
session is ended
[bnh65367@cs04r-sc-serv-68 bin]$

This was done using 2.3.59 on the client and 2.3.0 on the servers. The client
| Comment by Frederik Ferner (Inactive) [ 06/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The throughput in the previous test was good, though I've noticed that it drops to about 550MB/s if I use one client, one server and reduce concurrency to 1. I wonder if that is related to the single-stream performance that we experience? Is the client effectively only ever writing to one server at a time, or something similar?

[bnh65367@cs04r-sc-serv-68 bin]$ sudo ./lnet-selftest-wc.sh -k 111 -r start -s cs04r-sc-oss05-03-10g -C 1
CONCURRENCY=1
session is ended
SESSION: hh FEATURES: 0 TIMEOUT: 100000 FORCE: No
cs04r-sc-serv-68-10g are added to session
cs04r-sc-oss05-03-10g are added to session
Test was added successfully
Test was added successfully
b is running now
Batch: b Tests: 2 State: 177
ACTIVE BUSY DOWN UNKNOWN TOTAL
client 1 0 0 0 1
server 1 0 0 0 1
Test 1(brw) (loop: 1800000, concurrency: 1)
ACTIVE BUSY DOWN UNKNOWN TOTAL
client 1 0 0 0 1
server 1 0 0 0 1
Test 2(brw) (loop: 1800000, concurrency: 1)
ACTIVE BUSY DOWN UNKNOWN TOTAL
client 1 0 0 0 1
server 1 0 0 0 1
[LNet Rates of c]
[R] Avg: 2245 RPC/s Min: 2245 RPC/s Max: 2245 RPC/s
[W] Avg: 1688 RPC/s Min: 1688 RPC/s Max: 1688 RPC/s
[LNet Bandwidth of c]
[R] Avg: 557.79 MB/s Min: 557.79 MB/s Max: 557.79 MB/s
[W] Avg: 565.19 MB/s Min: 565.19 MB/s Max: 565.19 MB/s
[LNet Rates of s]
[R] Avg: 1688 RPC/s Min: 1688 RPC/s Max: 1688 RPC/s
[W] Avg: 2246 RPC/s Min: 2246 RPC/s Max: 2246 RPC/s
[LNet Bandwidth of s]
[R] Avg: 564.91 MB/s Min: 564.91 MB/s Max: 564.91 MB/s
[W] Avg: 557.47 MB/s Min: 557.47 MB/s Max: 557.47 MB/s
[LNet Rates of c]
[R] Avg: 2246 RPC/s Min: 2246 RPC/s Max: 2246 RPC/s
[W] Avg: 1689 RPC/s Min: 1689 RPC/s Max: 1689 RPC/s
[LNet Bandwidth of c]
[R] Avg: 556.52 MB/s Min: 556.52 MB/s Max: 556.52 MB/s
[W] Avg: 566.62 MB/s Min: 566.62 MB/s Max: 566.62 MB/s
[LNet Rates of s]
[R] Avg: 1690 RPC/s Min: 1690 RPC/s Max: 1690 RPC/s
[W] Avg: 2246 RPC/s Min: 2246 RPC/s Max: 2246 RPC/s
[LNet Bandwidth of s]
[R] Avg: 566.36 MB/s Min: 566.36 MB/s Max: 566.36 MB/s
[W] Avg: 556.22 MB/s Min: 556.22 MB/s Max: 556.22 MB/s
[LNet Rates of c]
[R] Avg: 2250 RPC/s Min: 2250 RPC/s Max: 2250 RPC/s
[W] Avg: 1690 RPC/s Min: 1690 RPC/s Max: 1690 RPC/s
[LNet Bandwidth of c]
[R] Avg: 559.59 MB/s Min: 559.59 MB/s Max: 559.59 MB/s
[W] Avg: 565.44 MB/s Min: 565.44 MB/s Max: 565.44 MB/s
[LNet Rates of s]
[R] Avg: 1691 RPC/s Min: 1691 RPC/s Max: 1691 RPC/s
[W] Avg: 2250 RPC/s Min: 2250 RPC/s Max: 2250 RPC/s
[LNet Bandwidth of s]
[R] Avg: 565.11 MB/s Min: 565.11 MB/s Max: 565.11 MB/s
[W] Avg: 559.31 MB/s Min: 559.31 MB/s Max: 559.31 MB/s
[LNet Rates of c]
[R] Avg: 2248 RPC/s Min: 2248 RPC/s Max: 2248 RPC/s
[W] Avg: 1688 RPC/s Min: 1688 RPC/s Max: 1688 RPC/s
[LNet Bandwidth of c]
[R] Avg: 560.44 MB/s Min: 560.44 MB/s Max: 560.44 MB/s
[W] Avg: 563.74 MB/s Min: 563.74 MB/s Max: 563.74 MB/s
[LNet Rates of s]
[R] Avg: 1687 RPC/s Min: 1687 RPC/s Max: 1687 RPC/s
[W] Avg: 2247 RPC/s Min: 2247 RPC/s Max: 2247 RPC/s
[LNet Bandwidth of s]
[R] Avg: 563.46 MB/s Min: 563.46 MB/s Max: 563.46 MB/s
[W] Avg: 560.16 MB/s Min: 560.16 MB/s Max: 560.16 MB/s
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 14/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
yes, it's because you set concurrency=1. it's like running a single thread. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 14/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
hi, could you print out the lctl dl -t from your client? -thanks | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Minh, not sure if I mentioned this, but I had to reduce my file system to only 20 OSTs on 2 OSSes, as I had to start investigating alternatives on the rest of the hardware. Here is the requested lctl dl -t output.

[bnh65367@cs04r-sc-serv-68 frederik1]$ lctl dl -t
| Comment by Minh Diep [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Frederik, I suggest we measure how much an OSS can deliver to one client. You can achieve this by lfs setstripe -c 10 -o 1 <ior dir>. Please try ior with -np 1 and let me know. Thanks | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Minh, 570MiB/s; see below (for just one iteration).

[bnh65367@cs04r-sc-serv-68 frederik1]$ mkdir single-oss
[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs setstripe -c 10 -o 1 single-oss/
[bnh65367@cs04r-sc-serv-68 frederik1]$ export IORTESTDIR=/mnt/lustre-test/frederik1/single-oss
[bnh65367@cs04r-sc-serv-68 frederik1]$ export NSLOTS=1
[bnh65367@cs04r-sc-serv-68 frederik1]$ $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -w -k -t1m -b 20g -i 1 -e
IOR-3.0.0: MPI Coordinated Test of Parallel I/O
Began: Fri Feb 15 16:37:50 2013
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/single-oss/ior_dat -w -k -t1m -b 20g -i 1 -e
Machine: Linux cs04r-sc-serv-68.diamond.ac.uk
Test 0 started: Fri Feb 15 16:37:50 2013
Summary:
api = POSIX
test filename = /mnt/lustre-test/frederik1/single-oss/ior_dat
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 1 (1 per node)
repetitions = 1
xfersize = 1 MiB
blocksize = 20 GiB
aggregate filesize = 20 GiB
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 570.96 20971520 1024.00 0.000590 35.87 0.000212 35.87 0
Max Write: 570.96 MiB/sec (598.69 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write 570.96 570.96 570.96 0.00 35.86953 0 1 1 1 0 0 1 0 0 1 21474836480 1048576 21474836480 POSIX 0
Finished: Fri Feb 15 16:38:26 2013
[bnh65367@cs04r-sc-serv-68 frederik1]$
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Also, the reason I tried the lnet_selftest with concurrency one is that my feeling is this might be close to what happens for single-process writes. Looking at the two throughput numbers (concurrency=1 lnet_selftest and single-process ior), they seem very close to each other. Also, the other day I did a test while watching /proc/sys/lnet/peers every 1/10th of a second, and there was only ever one of the two NIDs with anything reported as queued. Not sure if this is relevant or not...
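For reference, a quick way to watch the per-NID queue at that interval is something like this (a sketch):

# on the client, refresh every 0.1s while a write test is running
watch -n 0.1 cat /proc/sys/lnet/peers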
| Comment by Minh Diep [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Frederick, Thanks for the quick response. Could you try again with lfs setstripe -c 20? thanks | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here you go (this seems slower?):

[bnh65367@cs04r-sc-serv-68 frederik1]$ mkdir stripe-20-1
[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs setstripe -c 20 stripe-20-1
[bnh65367@cs04r-sc-serv-68 frederik1]$ export IORTESTDIR=/mnt/lustre-test/frederik1/stripe-20-1
[bnh65367@cs04r-sc-serv-68 frederik1]$ $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -w -k -t1m -b 20g -i 1 -e
IOR-3.0.0: MPI Coordinated Test of Parallel I/O
Began: Fri Feb 15 17:01:36 2013
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/stripe-20-1/ior_dat -w -k -t1m -b 20g -i 1 -e
Machine: Linux cs04r-sc-serv-68.diamond.ac.uk
Test 0 started: Fri Feb 15 17:01:36 2013
Summary:
api = POSIX
test filename = /mnt/lustre-test/frederik1/stripe-20-1/ior_dat
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 1 (1 per node)
repetitions = 1
xfersize = 1 MiB
blocksize = 20 GiB
aggregate filesize = 20 GiB
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 412.76 20971520 1024.00 0.000905 49.62 0.000229 49.62 0
Max Write: 412.76 MiB/sec (432.81 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write 412.76 412.76 412.76 0.00 49.61738 0 1 1 1 0 0 1 0 0 1 21474836480 1048576 21474836480 POSIX 0
Finished: Fri Feb 15 17:02:25 2013
[bnh65367@cs04r-sc-serv-68 frederik1]$
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
It is very strange to me that the write with two OSSes combined is worse than with a single OSS. I have tested in my lab and the write roughly doubles with two OSSes.
| Comment by Minh Diep [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We could do a few things to see: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Just to confirm, for 2) I assume it is -o 1 (and not a)? (And yes, it looks like we are using all 20 OSTs:

[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs getstripe /mnt/lustre-test/frederik1/stripe-20-1/ior_dat
/mnt/lustre-test/frederik1/stripe-20-1/ior_dat
lmm_stripe_count: 20
lmm_stripe_size: 1048576
lmm_layout_gen: 0
lmm_stripe_offset: 14
obdidx objid objid group
14 1380 0x564 0
6 1320 0x528 0
15 1380 0x564 0
7 1319 0x527 0
16 1380 0x564 0
8 1319 0x527 0
17 1380 0x564 0
9 1320 0x528 0
18 1380 0x564 0
0 1326 0x52e 0
19 1380 0x564 0
1 1323 0x52b 0
10 1380 0x564 0
2 1320 0x528 0
11 1380 0x564 0
3 1320 0x528 0
12 1476 0x5c4 0
4 1321 0x529 0
13 1380 0x564 0
5 1320 0x528 0
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
OK, going through the suggestions one step at a time (after playing with lfs setstripe -o a bit, I think I've got it). ior on just the second OSS:

[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs setstripe -c 10 -o 0xa single-oss-2/
[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs getstripe single-oss-2/
single-oss-2/
stripe_count: 10 stripe_size: 1048576 stripe_offset: 10
[bnh65367@cs04r-sc-serv-68 frederik1]$ export IORTESTDIR=/mnt/lustre-test/frederik1/single-oss-2/
[bnh65367@cs04r-sc-serv-68 frederik1]$ $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -w -k -t1m -b 20g -i 1 -e
IOR-3.0.0: MPI Coordinated Test of Parallel I/O
Began: Fri Feb 15 17:33:57 2013
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/single-oss-2//ior_dat -w -k -t1m -b 20g -i 1 -e
Machine: Linux cs04r-sc-serv-68.diamond.ac.uk
Test 0 started: Fri Feb 15 17:33:57 2013
Summary:
api = POSIX
test filename = /mnt/lustre-test/frederik1/single-oss-2//ior_dat
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 1 (1 per node)
repetitions = 1
xfersize = 1 MiB
blocksize = 20 GiB
aggregate filesize = 20 GiB
access bw(MiB/s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---------- --------- -------- -------- -------- -------- ----
write 468.96 20971520 1024.00 0.000589 43.67 0.000200 43.67 0
Max Write: 468.96 MiB/sec (491.74 MB/sec)
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write 468.96 468.96 468.96 0.00 43.67131 0 1 1 1 0 0 1 0 0 1 21474836480 1048576 21474836480 POSIX 0
Finished: Fri Feb 15 17:34:41 2013
[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs getstripe single-oss-2/ior_dat
single-oss-2/ior_dat
lmm_stripe_count: 10
lmm_stripe_size: 1048576
lmm_layout_gen: 0
lmm_stripe_offset: 10
obdidx objid objid group
10 1381 0x565 0
11 1381 0x565 0
12 1477 0x5c5 0
13 1381 0x565 0
14 1381 0x565 0
15 1381 0x565 0
16 1381 0x565 0
17 1381 0x565 0
18 1381 0x565 0
19 1381 0x565 0
rpc_stats for one OST:

[bnh65367@cs04r-sc-serv-68 frederik1]$ cat /proc/fs/lustre/osc/spfs1-OST0007-osc-ffff880829227800/rpc_stats
snapshot_time: 1360950242.989930 (secs.usecs)
read RPCs in flight: 0
write RPCs in flight: 0
pending write pages: 0
pending read pages: 0
read write
pages per rpc rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 0 0 0
64: 0 0 0 | 0 0 0
128: 0 0 0 | 0 0 0
256: 0 0 0 | 25589 100 100
read write
rpcs in flight rpcs % cum % | rpcs % cum %
0: 0 0 0 | 0 0 0
1: 0 0 0 | 6979 27 27
2: 0 0 0 | 3272 12 40
3: 0 0 0 | 4127 16 56
4: 0 0 0 | 644 2 58
5: 0 0 0 | 515 2 60
6: 0 0 0 | 561 2 62
7: 0 0 0 | 1378 5 68
8: 0 0 0 | 2442 9 77
9: 0 0 0 | 3931 15 93
10: 0 0 0 | 1725 6 99
11: 0 0 0 | 15 0 100
read write
offset rpcs % cum % | rpcs % cum %
0: 0 0 0 | 17 0 0
1: 0 0 0 | 0 0 0
2: 0 0 0 | 0 0 0
4: 0 0 0 | 0 0 0
8: 0 0 0 | 0 0 0
16: 0 0 0 | 0 0 0
32: 0 0 0 | 0 0 0
64: 0 0 0 | 0 0 0
128: 0 0 0 | 0 0 0
256: 0 0 0 | 17 0 0
512: 0 0 0 | 34 0 0
1024: 0 0 0 | 68 0 0
2048: 0 0 0 | 136 0 1
4096: 0 0 0 | 272 1 2
8192: 0 0 0 | 544 2 4
16384: 0 0 0 | 1088 4 8
32768: 0 0 0 | 2125 8 16
65536: 0 0 0 | 4096 16 32
131072: 0 0 0 | 7976 31 63
262144: 0 0 0 | 3072 12 75
524288: 0 0 0 | 4096 16 91
1048576: 0 0 0 | 2048 8 100
[bnh65367@cs04r-sc-serv-68 frederik1]$
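For reference, these counters can be reset between runs so that each rpc_stats snapshot covers only one test; writing to the proc file clears it (a sketch):

# on the client, before starting the next ior run
for f in /proc/fs/lustre/osc/spfs1-OST*-osc-*/rpc_stats; do echo 0 > $f; done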
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
And another test with all OSTs. Below is cut-down iostat output over a 10 second interval in the middle of the test on the first OSS, with OST details added. Note that this was fairly constant over the whole test. Note also that I was running iostat 10, so the results are in blocks.

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dm-3              0.00         0.00         0.00          0          0
dm-4              0.00         0.00         0.00          0          0
dm-5              0.00         0.00         0.00          0          0
dm-6             26.90         1.60     53467.20         16     534672   (ost13)
dm-7              0.00         0.00         0.00          0          0
dm-8             26.90         1.60     53468.00         16     534680   (ost18)
dm-9             26.80         0.80     53467.20          8     534672   (ost11)
dm-10             0.00         0.00         0.00          0          0
dm-11            26.80         1.60     53262.40         16     532624   (ost16)
dm-12            26.90         1.60     53467.20         16     534672   (ost19)
dm-13            26.90         1.60     53468.00         16     534680   (ost17)
dm-14            26.60         0.00     53262.40          0     532624   (ost14)
dm-15            26.90         1.60     53467.20         16     534672   (ost10)
dm-16            26.90         1.60     53467.20         16     534672   (ost15)
dm-17            26.80         1.60     53262.40         16     532624   (ost14)
dm-18             0.00         0.00         0.00          0          0
dm-19             0.00         0.00         0.00          0          0
dm-20             0.00         0.00         0.00          0          0
dm-21             0.00         0.00         0.00          0          0
dm-22             0.00         0.00         0.00          0          0
dm-23             0.00         0.00         0.00          0          0

Cut-down iostat output over a 10 second interval in the middle of the test on the second OSS:

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dm-3             26.60         0.00     53665.60          0     536656   (ost0)
dm-4             26.60         0.00     53467.20          0     534672   (ost5)
dm-5             26.70         0.00     53467.20          0     534672   (ost6)
dm-6             26.80         0.00     53672.00          0     536720   (ost1)
dm-7             26.60         0.00     53466.40          0     534664   (ost7)
dm-8             26.70         0.00     53467.20          0     534672   (ost8)
dm-9              1.10         0.00         9.60          0         96   (mdt)
dm-10            26.70         0.00     53467.20          0     534672   (ost9)
dm-11             0.00         0.00         0.00          0          0
dm-12             0.00         0.00         0.00          0          0
dm-13             0.00         0.00         0.00          0          0
dm-14            26.80         0.00     53672.00          0     536720   (ost3)
dm-15            26.80         0.00     53672.00          0     536720   (ost4)
dm-16             0.00         0.00         0.00          0          0
dm-17             0.00         0.00         0.00          0          0
dm-18            26.80         0.00     53672.00          0     536720   (ost2)
dm-19             0.00         0.00         0.00          0          0
dm-20             0.00         0.00         0.00          0          0
dm-21             0.00         0.00         0.00          0          0
dm-22             0.00         0.00         0.00          0          0
dm-23             0.00         0.00         0.00          0          0

IOR reported a throughput of about 520MiB/s. Traffic seems fairly balanced to me. Can I just compare exact Lustre versions? I'm still using Lustre 2.3.0 on the servers and 2.3.59 on the clients.
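For what it's worth, iostat can report in megabytes rather than blocks, which makes this kind of comparison with the IOR numbers easier (a sketch; -x and -m are standard sysstat iostat flags):

# extended per-device statistics in MB, 10 second intervals
iostat -x -m 10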
| Comment by Minh Diep [ 15/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I am using 2.3.0 on the server and latest lustre-master from yesterday. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 18/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Minh, just to confirm, on your test system you are running over 10GigE? (bonded links?) And do you get close to what I get on a single OSS or much less? What is the approximate throughput you get with two OSSes over 10GigE? Frederik | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 18/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
No, I am running over IB. My small setup is just 2 SATA drives on each OSS, but I was able to scale linearly up to 4 OSSes, which reached about 740MB/s. I am trying to reconfigure to try to achieve 1GB/s, either by adding more OSSes or more disks per OSS.
| Comment by Minh Diep [ 18/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
It puzzles me that your two OSSes combined perform slower than a single one. Something is not configured correctly here. If you still have the system, we should experiment with a run on each individual OST. 1. create an OST pool on the first OST of one OSS, setstripe on that pool, and run ior. Thanks
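A minimal sketch of the OST-pool approach suggested above (the pool name and test directory are illustrative; the pool commands run on the MGS, the setstripe on the client):

# on the MGS
lctl pool_new spfs1.testpool
lctl pool_add spfs1.testpool spfs1-OST0000
# on the client
mkdir /mnt/lustre-test/frederik1/pool-test
lfs setstripe -p testpool -c 1 /mnt/lustre-test/frederik1/pool-test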
| Comment by Minh Diep [ 18/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Is it possible to have remote access to your cluster? Please let me know | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Frederik Ferner (Inactive) [ 18/Feb/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
From what I've gathered from other sites, performance over IB seems different from 10GigE. Unfortunately we won't be able to change our infrastructure to IB as part of this project. Remote access to the test system should be possible, let's discuss details on that over private email, if you don't mind. Frederik | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Minh Diep [ 28/Jun/13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Frederick, Is there anything else that needs to be done on this ticket? |