[LU-2659] single client throughput for 10GigE Created: 21/Jan/13  Updated: 27/Aug/13  Resolved: 27/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Improvement Priority: Minor
Reporter: Frederik Ferner (Inactive) Assignee: Minh Diep
Resolution: Not a Bug Votes: 0
Labels: ptr

Attachments: File LU-2659-brw_stats-20130124.tar     Text File LU-2659-ior-results-20130124.txt     Text File ior-2.3.59-output.txt     Text File ior-large-xfersize.txt    
Rank (Obsolete): 6208

 Description   

(I'm not sure about the issue type for this ticket, please adjust as appropriate.)

As discussed with Peter Jones, we are trying to implement a file system where single clients can achieve >900MB/s write throughput over 10GigE connections. Ideally this would be a single 10GigE link per client, but 2x10GigE LACP bonding might be an option. The OSSes will initially have 4x 10GigE LACP bonded links, though for some initial testing we might start with fewer links.

The disk backend has now arrived and this is a sample obdfilter-survey result using all 41 OSTs across the 4 OSSes, without much tuning on the OSS nodes yet. The OSSes are all running Lustre 2.3.0 on RHEL6.

Sat Jan 19 15:49:23 GMT 2013 Obdfilter-survey for case=disk from cs04r-sc-oss05-03.diamond.ac.uk
ost 41 sz 687865856K rsz 1024K obj   41 thr   41 write 2975.14 [  40.00, 105.99] rewrite 2944.84 [  22.00, 118.99] read 8104.33 [  40.99, 231.98]
ost 41 sz 687865856K rsz 1024K obj   41 thr   82 write 5231.39 [  49.99, 167.98] rewrite 4984.58 [  29.98, 171.89] read 13807.08 [ 161.99, 514.92]
ost 41 sz 687865856K rsz 1024K obj   41 thr  164 write 9445.93 [  82.99, 293.98] rewrite 9722.32 [ 149.98, 324.96] read 17851.10 [ 191.97, 869.92]
ost 41 sz 687865856K rsz 1024K obj   41 thr  328 write 15872.41 [ 265.96, 533.94] rewrite 16682.58 [ 245.97, 526.97] read 19312.61 [ 184.98, 794.93]
ost 41 sz 687865856K rsz 1024K obj   41 thr  656 write 18704.47 [ 222.98, 651.94] rewrite 18733.29 [ 252.90, 634.83] read 21040.28 [ 260.98, 808.92]
ost 41 sz 687865856K rsz 1024K obj   41 thr 1312 write 18291.71 [ 161.99, 740.93] rewrite 18443.63 [  47.00, 704.91] read 20683.56 [ 178.99, 908.91]
ost 41 sz 687865856K rsz 1024K obj   41 thr 2624 write 18704.50 [  19.00, 684.92] rewrite 18583.81 [  25.00, 729.92] read 20400.08 [ 110.99, 982.88]
ost 41 sz 687865856K rsz 1024K obj   82 thr   82 write 5634.08 [  62.99, 176.98] rewrite 4640.45 [  55.00, 162.98] read 9459.26 [ 114.98, 320.99]
ost 41 sz 687865856K rsz 1024K obj   82 thr  164 write 9615.85 [  95.99, 308.98] rewrite 8329.19 [ 122.99, 275.99] read 13967.03 [ 150.99, 430.97]
ost 41 sz 687865856K rsz 1024K obj   82 thr  328 write 13846.63 [ 229.99, 461.97] rewrite 12576.55 [ 186.98, 390.97] read 18166.27 [ 130.99, 557.94]
ost 41 sz 687865856K rsz 1024K obj   82 thr  656 write 18558.35 [ 268.98, 624.93] rewrite 16821.93 [ 246.85, 542.95] read 19645.73 [ 235.85, 676.92]
ost 41 sz 687865856K rsz 1024K obj   82 thr 1312 write 18885.19 [ 117.99, 690.92] rewrite 16501.04 [ 115.99, 617.95] read 19255.26 [ 180.97, 832.89]
ost 41 sz 687865856K rsz 1024K obj   82 thr 2624 write 18991.31 [ 127.51, 784.92] rewrite 18111.05 [  31.00, 763.88] read 20333.42 [ 124.48, 997.82]
ost 41 sz 687865856K rsz 1024K obj  164 thr  164 write 7513.17 [  69.99, 236.95] rewrite 5611.77 [  65.00, 198.96] read 12950.03 [  80.99, 383.96]
ost 41 sz 687865856K rsz 1024K obj  164 thr  328 write 13191.77 [ 216.99, 361.98] rewrite 10104.73 [ 129.99, 313.98] read 18380.92 [ 149.98, 529.97]
ost 41 sz 687865856K rsz 1024K obj  164 thr  656 write 16442.83 [ 168.98, 494.91] rewrite 14155.27 [ 213.98, 452.97] read 19564.97 [ 238.85, 616.95]
ost 41 sz 687865856K rsz 1024K obj  164 thr 1312 write 18070.58 [ 152.96, 612.91] rewrite 15744.41 [  62.99, 540.96] read 18846.31 [ 160.99, 660.84]
ost 41 sz 687865856K rsz 1024K obj  164 thr 2624 write 18664.83 [ 138.97, 767.93] rewrite 16648.63 [  81.28, 603.93] read 19319.91 [  79.97, 864.90]
ost 41 sz 687865856K rsz 1024K obj  328 thr  328 write 9028.81 [  66.00, 277.97] rewrite 6807.19 [  42.99, 228.98] read 14799.75 [ 123.98, 491.92]
ost 41 sz 687865856K rsz 1024K obj  328 thr  656 write 14471.67 [ 155.98, 427.97] rewrite 11632.72 [ 130.99, 375.98] read 19137.29 [ 127.79, 595.92]
ost 41 sz 687865856K rsz 1024K obj  328 thr 1312 write 17084.20 [ 179.98, 533.95] rewrite 13810.96 [  64.00, 449.96] read 18405.80 [ 182.98, 616.95]
ost 41 sz 687865856K rsz 1024K obj  328 thr 2624 write 18583.14 [  24.99, 684.92] rewrite 15588.87 [  68.99, 579.93] read 18857.33 [ 160.98, 706.96]
ost 41 sz 687865856K rsz 1024K obj  656 thr  656 write 9861.09 [ 121.98, 312.96] rewrite 7540.60 [  70.00, 258.96] read 15160.96 [ 193.96, 483.94]
ost 41 sz 687865856K rsz 1024K obj  656 thr 1312 write 15021.83 [ 175.97, 450.95] rewrite 11641.17 [  97.99, 389.98] read 18470.04 [ 205.99, 597.91]
ost 41 sz 687865856K rsz 1024K obj  656 thr 2624 write 17202.58 [  84.98, 589.90] rewrite 14483.38 [ 143.98, 491.91] read 18475.50 [ 179.98, 631.94]
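(For reference, a disk-case obdfilter-survey like the one above is typically driven through environment variables; the following is only a sketch with assumed values and a placeholder target list, not the actual invocation used here:)

# sketch only - parameter values are assumptions, not the ones used for the table above
size=16384 nobjlo=1 nobjhi=16 thrlo=1 thrhi=64 case=disk \
    targets="<list of OST devices on this OSS>" sh obdfilter-survey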

We have not yet done any tests with clients (in fact the 10GigE network still needs to be configured) but I would like to ask if there is any reason why we should not achieve our goal with this storage hardware.

I will also update the ticket once we've done some tests with clients.



 Comments   
Comment by Peter Jones [ 21/Jan/13 ]

Minh is helping with this initiative

Comment by Frederik Ferner (Inactive) [ 22/Jan/13 ]

I've now done initial tests with a dual 10GigE client and I get about 440MB/s write throughput using both ior and dd as tests.

IOR test for a file with stripe count=2 below. I get the same result with any stripe count>1 that I've tried.

[bnh65367@cs04r-sc-serv-66 ~]$ $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat  -k -w -t1m -b 20g -i 3 -e
IOR-3.0.0: MPI Coordinated Test of Parallel I/O

Began: Tue Jan 22 18:28:17 2013
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/stripe-2/ior_dat -k -w -t1m -b 20g -i 3 -e
Machine: Linux cs04r-sc-serv-66.diamond.ac.uk

Test 0 started: Tue Jan 22 18:28:17 2013
Summary:
	api                = POSIX
	test filename      = /mnt/lustre-test/frederik1/stripe-2/ior_dat
	access             = single-shared-file
	ordering in a file = sequential offsets
	ordering inter file= no tasks offsets
	clients            = 1 (1 per node)
	repetitions        = 3
	xfersize           = 1 MiB
	blocksize          = 20 GiB
	aggregate filesize = 20 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     441.59     20971520   1024.00    0.002618   46.37      0.000736   46.38      0   
write     475.01     20971520   1024.00    0.003628   43.11      0.000720   43.12      1   
write     462.41     20971520   1024.00    0.003383   44.29      0.000516   44.29      2   

Max Write: 475.01 MiB/sec (498.08 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write         475.01     441.59     459.67      13.78   44.59420 0 1 1 3 0 0 1 0 0 1 21474836480 1048576 21474836480 POSIX 0

Finished: Tue Jan 22 18:30:59 2013
[bnh65367@cs04r-sc-serv-66 ~]$ 
[bnh65367@cs04r-sc-serv-66 ~]$ lfs getstripe /mnt/lustre-test/frederik1/stripe-2/ior_dat
/mnt/lustre-test/frederik1/stripe-2/ior_dat
lmm_stripe_count:   2
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  5
	obdidx		 objid		objid		 group
	     5	           176	         0xb0	             0
	    28	           176	         0xb0	             0

[bnh65367@cs04r-sc-serv-66 ~]$ 

I have verified using netperf that I can send at least 1100MB/s over the network to each of the OSSes, and if I send to all OSSes at the same time I can send 590MB/s to each.
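(For reference, a single-stream netperf check like this would look roughly as follows; the hostname is a placeholder and the exact options used are not recorded in the ticket:)

netperf -H <oss-10g-hostname> -t TCP_STREAM -l 60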

Comment by Minh Diep [ 23/Jan/13 ]

Hi,

Could you provide a little more detail about the OSS HW config: # cores, memory, type of disk?
Have you tried sgpdd-survey? How much (MB/s) does an individual drive deliver?

Thanks

Comment by Minh Diep [ 23/Jan/13 ]

I believe $NSLOTS=1 above; could you try running with 2, 4 and 8?

Comment by Frederik Ferner (Inactive) [ 23/Jan/13 ]

Hi Minh,

thanks for looking into this.

OSS HW: dual 6-core Intel Xeon E5-2630, 64GB RAM, OSTs are SRP LUNs 8+2 RAID6, DDN SFA12K.

The client for this test is the same hardware.

See below for a sample sgpdd-survey run, though I'm suspicious of the write performance for low crg counts; I suspect some write cache effects somewhere, possibly in the storage. Unfortunately I've not repeated this with larger sizes.

[bnh65367@cs04r-sc-oss05-01 ~]$ sudo rawdevs=/dev/raw/raw1 sgpdd-survey
Thu Jan 17 13:59:56 GMT 2013 sgpdd-survey on /dev/raw/raw1 from cs04r-sc-oss05-01.diamond.ac.uk
total_size  8388608K rsz 1024 crg     1 thr     1 write 1432.90 MB/s     1 x 1435.99 = 1435.99 MB/s read  259.10 MB/s     1 x 259.21 =  259.21 MB/s 
total_size  8388608K rsz 1024 crg     1 thr     2 write 1376.59 MB/s     1 x 1378.40 = 1378.40 MB/s read  200.59 MB/s     1 x 200.63 =  200.63 MB/s 
total_size  8388608K rsz 1024 crg     1 thr     4 write 1381.94 MB/s     1 x 1383.87 = 1383.87 MB/s read  272.67 MB/s     1 x 272.77 =  272.77 MB/s 
total_size  8388608K rsz 1024 crg     1 thr     8 write 1361.22 MB/s     1 x 1363.05 = 1363.05 MB/s read  283.31 MB/s     1 x 283.38 =  283.38 MB/s 
total_size  8388608K rsz 1024 crg     1 thr    16 write 1384.87 MB/s     1 x 1386.87 = 1386.87 MB/s read  374.51 MB/s     1 x 374.72 =  374.72 MB/s 
total_size  8388608K rsz 1024 crg     2 thr     2 write  955.70 MB/s     2 x 478.29 =  956.57 MB/s read  168.72 MB/s     2 x  84.37 =  168.74 MB/s 
total_size  8388608K rsz 1024 crg     2 thr     4 write 1021.93 MB/s     2 x 511.69 = 1023.39 MB/s read  198.71 MB/s     2 x  99.37 =  198.75 MB/s 
total_size  8388608K rsz 1024 crg     2 thr     8 write  970.46 MB/s     2 x 485.71 =  971.41 MB/s read  201.85 MB/s     2 x 100.96 =  201.91 MB/s 
total_size  8388608K rsz 1024 crg     2 thr    16 write 1057.11 MB/s     2 x 529.13 = 1058.25 MB/s read  234.28 MB/s     2 x 117.17 =  234.34 MB/s 
total_size  8388608K rsz 1024 crg     2 thr    32 write  960.45 MB/s     2 x 480.69 =  961.38 MB/s read  211.48 MB/s     2 x 105.77 =  211.54 MB/s 
total_size  8388608K rsz 1024 crg     4 thr     4 write  709.30 MB/s     4 x 177.45 =  709.80 MB/s read  326.63 MB/s     4 x  81.68 =  326.73 MB/s 
total_size  8388608K rsz 1024 crg     4 thr     8 write  700.98 MB/s     4 x 175.37 =  701.48 MB/s read  282.53 MB/s     4 x  70.67 =  282.67 MB/s 
total_size  8388608K rsz 1024 crg     4 thr    16 write  752.53 MB/s     4 x 188.28 =  753.14 MB/s read  308.87 MB/s     4 x  77.24 =  308.95 MB/s 
total_size  8388608K rsz 1024 crg     4 thr    32 write  696.21 MB/s     4 x 174.18 =  696.72 MB/s read  280.55 MB/s     4 x  70.16 =  280.65 MB/s 
total_size  8388608K rsz 1024 crg     4 thr    64 write  690.79 MB/s     4 x 172.82 =  691.30 MB/s read  263.21 MB/s     4 x  65.82 =  263.29 MB/s 
total_size  8388608K rsz 1024 crg     8 thr     8 write  501.77 MB/s     8 x  62.75 =  502.01 MB/s read  325.21 MB/s     8 x  40.67 =  325.39 MB/s 
total_size  8388608K rsz 1024 crg     8 thr    16 write  506.74 MB/s     8 x  63.37 =  506.97 MB/s read  320.07 MB/s     8 x  40.03 =  320.21 MB/s 
total_size  8388608K rsz 1024 crg     8 thr    32 write  485.71 MB/s     8 x  60.75 =  485.99 MB/s read  353.40 MB/s     8 x  44.20 =  353.62 MB/s 
total_size  8388608K rsz 1024 crg     8 thr    64 write  501.04 MB/s     8 x  62.67 =  501.33 MB/s read  255.57 MB/s     8 x  31.96 =  255.66 MB/s 
total_size  8388608K rsz 1024 crg     8 thr   128 write  525.23 MB/s     8 x  65.71 =  525.67 MB/s read  378.57 MB/s     8 x  47.34 =  378.72 MB/s 
total_size  8388608K rsz 1024 crg    16 thr    16 write  383.54 MB/s    16 x  23.98 =  383.76 MB/s read  381.47 MB/s    16 x  23.85 =  381.62 MB/s 
total_size  8388608K rsz 1024 crg    16 thr    32 write  401.78 MB/s    16 x  25.12 =  401.92 MB/s read  392.46 MB/s    16 x  24.54 =  392.61 MB/s 
total_size  8388608K rsz 1024 crg    16 thr    64 write  418.10 MB/s    16 x  26.15 =  418.40 MB/s read  304.52 MB/s    16 x  19.04 =  304.57 MB/s 
total_size  8388608K rsz 1024 crg    16 thr   128 write  405.86 MB/s    16 x  25.38 =  406.04 MB/s read  325.64 MB/s    16 x  20.37 =  325.93 MB/s 
total_size  8388608K rsz 1024 crg    16 thr   256 write  389.65 MB/s    16 x  24.37 =  389.86 MB/s read  318.94 MB/s    16 x  19.94 =  319.06 MB/s 
total_size  8388608K rsz 1024 crg    32 thr    32 write  365.67 MB/s    32 x  11.43 =  365.91 MB/s read  184.33 MB/s    32 x   5.76 =  184.33 MB/s 
total_size  8388608K rsz 1024 crg    32 thr    64 write  352.64 MB/s    32 x  11.02 =  352.78 MB/s read  192.22 MB/s    32 x   6.01 =  192.26 MB/s 
total_size  8388608K rsz 1024 crg    32 thr   128 write  348.70 MB/s    32 x  10.90 =  348.82 MB/s read  239.66 MB/s    32 x   7.50 =  239.87 MB/s 
total_size  8388608K rsz 1024 crg    32 thr   256 write  299.37 MB/s    32 x   9.36 =  299.38 MB/s read  248.02 MB/s    32 x   7.75 =  248.11 MB/s 
total_size  8388608K rsz 1024 crg    32 thr   512 write  299.98 MB/s    32 x   9.37 =  299.99 MB/s read  229.41 MB/s    32 x   7.17 =  229.49 MB/s 
total_size  8388608K rsz 1024 crg    64 thr    64 write  273.48 MB/s    64 x   4.27 =  273.44 MB/s read  157.11 MB/s    64 x   2.45 =  156.86 MB/s 
total_size  8388608K rsz 1024 crg    64 thr   128 write  334.12 MB/s    64 x   5.23 =  334.47 MB/s read  184.61 MB/s    64 x   2.89 =  184.94 MB/s 
total_size  8388608K rsz 1024 crg    64 thr   256 write  298.72 MB/s    64 x   4.67 =  299.07 MB/s read  192.36 MB/s    64 x   3.00 =  192.26 MB/s 
total_size  8388608K rsz 1024 crg    64 thr   512 write  313.37 MB/s    64 x   4.90 =  313.72 MB/s read  193.55 MB/s    64 x   3.02 =  193.48 MB/s 
total_size  8388608K rsz 1024 crg    64 thr  1024 write  317.25 MB/s    64 x   4.96 =  317.38 MB/s read  191.37 MB/s    64 x   2.99 =  191.65 MB/s 
total_size  8388608K rsz 1024 crg   128 thr   128 write  297.69 MB/s   128 x   2.33 =  297.85 MB/s read  219.01 MB/s   128 x   1.71 =  218.51 MB/s 
total_size  8388608K rsz 1024 crg   128 thr   256 write  305.70 MB/s   128 x   2.39 =  306.40 MB/s read  209.97 MB/s   128 x   1.64 =  209.96 MB/s 
total_size  8388608K rsz 1024 crg   128 thr   512 write  276.41 MB/s   128 x   2.16 =  277.10 MB/s read  162.79 MB/s   128 x   1.27 =  162.35 MB/s 
total_size  8388608K rsz 1024 crg   128 thr  1024 write  301.00 MB/s   128 x   2.36 =  301.51 MB/s read  216.29 MB/s   128 x   1.69 =  216.06 MB/s 
total_size  8388608K rsz 1024 crg   128 thr  2048 write  258.84 MB/s   128 x   2.02 =  258.79 MB/s read  208.24 MB/s   128 x   1.63 =  208.74 MB/s 
total_size  8388608K rsz 1024 crg   256 thr   256 write  257.61 MB/s   256 x   1.01 =  258.79 MB/s read  222.66 MB/s   256 x   0.87 =  222.17 MB/s 
total_size  8388608K rsz 1024 crg   256 thr   512 write  254.39 MB/s   256 x   0.99 =  253.91 MB/s read  213.17 MB/s   256 x   0.83 =  212.40 MB/s 
total_size  8388608K rsz 1024 crg   256 thr  1024 write  247.27 MB/s   256 x   0.97 =  249.02 MB/s read  217.71 MB/s   256 x   0.85 =  217.29 MB/s 
total_size  8388608K rsz 1024 crg   256 thr  2048 write  257.63 MB/s   256 x   1.02 =  261.23 MB/s read  216.82 MB/s   256 x   0.85 =  217.29 MB/s 
total_size  8388608K rsz 1024 crg   256 thr  4096 write  278.90 MB/s   256 x   1.09 =  278.32 MB/s read  217.55 MB/s   256 x   0.85 =  217.29 MB/s 

You are correct, the ior test was with NSLOTS=1; I've also done tests with higher numbers without seeing any improvement. Here is a sample for NSLOTS=2 (which is also striped over 20 OSTs); I'll also run with 4 and 8 and update the ticket when I have the output. (I do get higher throughput if I select file-per-process, but unfortunately that won't help us with this particular problem.)

[bnh65367@cs04r-sc-serv-66 ~]$ $MPIRUN ${MPIRUN_OPTS} -np 2 -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat  -k -t1m -b 20g -i 2 -e
IOR-3.0.0: MPI Coordinated Test of Parallel I/O

Began: Tue Jan 22 17:44:20 2013
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/stripe-20/ior_dat -k -w -t1m -b 20g -i 2 -e
Machine: Linux cs04r-sc-serv-66.diamond.ac.uk

Test 0 started: Tue Jan 22 17:44:20 2013
Summary:
	api                = POSIX
	test filename      = /mnt/lustre-test/frederik1/stripe-20/ior_dat
	access             = single-shared-file
	ordering in a file = sequential offsets
	ordering inter file= no tasks offsets
	clients            = 2 (2 per node)
	repetitions        = 2
	xfersize           = 1 MiB
	blocksize          = 20 GiB
	aggregate filesize = 40 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     407.56     20971520   1024.00    0.005184   100.50     0.001251   100.50     0   
write     404.99     20971520   1024.00    0.004892   101.13     0.001267   101.14     1   

Max Write: 407.56 MiB/sec (427.35 MB/sec)


Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write         407.56     404.99     406.27       1.28  100.81957 0 2 2 2 0 0 1 0 0 1 21474836480 1048576 42949672960 POSIX 0

Finished: Tue Jan 22 17:48:36 2013
[bnh65367@cs04r-sc-serv-66 ~]$
Comment by Minh Diep [ 23/Jan/13 ]

According to the obdfilter-survey above, you have 41 OSTs on that OSS. Is that true? Could you provide more info: how many OSTs per OSS, how many OSSes in total, and are they sharing the same storage? Thanks

Comment by Frederik Ferner (Inactive) [ 23/Jan/13 ]

We have 42 (8+2) LUNs in the SFA12K. For initial tests, and as we don't expect to test metadata performance, one of these LUNs is used as the MDT; this will be separate storage in any final file system.

We've currently got 41 OSTs. 4 OSSes are connected to the storage, each using directly connected dual FDR 56Gbit/s IB links (dual-port cards, so these two connections have a total bandwidth not much higher than 56Gbit/s).

Each OSS has access to 21 LUNs. Without fail-over each OSS serves 10 or 11 OSTs.

Everything is sharing the same SFA12K.

The obdfilter-survey test was using all OSTs spread out over all OSSes (so 10-11 OSTs per OSS).

Comment by Minh Diep [ 23/Jan/13 ]

Could you set stripe=-1 (all) and xfersize=4M and try ior again?
Then send me the brw_stats (/proc/fs/lustre/obdfilter/<OST>/brw_stats).
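(A sketch of what this might look like on the client; the directory path is an example and the ior options mirror the earlier runs:)

lfs setstripe -c -1 /mnt/lustre-test/frederik1/stripe-all
$MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile \
    /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/stripe-all/ior_dat -k -w -t4m -b 20g -i 3 -e

# on each OSS, collect the stats afterwards
for f in /proc/fs/lustre/obdfilter/*/brw_stats; do echo "== $f"; cat $f; done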

Comment by Frederik Ferner (Inactive) [ 24/Jan/13 ]

Minh,

I've now repeated the tests with xfersize=1M for NSLOTS=1,2,4,8 gathering
brw_stats for all servers before the test and between each round. I've also
repeated the same tests with xfersize=4M for NSLOTS=1,2,4,8.

The full output is attached along with a tar file of all brw_stats files.

In the output you'll see lines like this:

spfs1_brw_stats_20130124133533

These are the directory names containing all brw_stats files by server and OST
name collected at that point. Let me know if you want this in any other
format.

Here's a summary of the results (each tests with 2 iterations):

xfersize NSLOTS Mean throughput [MiB/s]
1M 1 451
1M 2 418
1M 4 312
1M 8 290
4M 1 467
4M 2 391
4M 4 331
4M 8 335

Cheers

Comment by Minh Diep [ 25/Jan/13 ]

Thanks for the results.

I think you are aware of LU-744.
Have you set up bonding on the 10GigE links on the client?

Comment by Minh Diep [ 25/Jan/13 ]

Could you also install the master version (tag 2.3.59) on the client and try again?

Comment by Frederik Ferner (Inactive) [ 25/Jan/13 ]

Yes, I'm aware of LU-744. I got lost trying to identify whether there are any patches I might want to apply. Are you thinking of any patches mentioned in that ticket that we should test?

The bonding on the client is LACP with these parameters: "mode=802.3ad xmit_hash_policy=2 miimon=100". I have verified using netperf that I can send about 1GB/s to one OSS using one stream, and using two streams I can send ~2GB/s if I pick a suitable combination of OSSes.
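(For context, a RHEL6 LACP bond plus LNET-over-bond configuration along these lines would typically look roughly like the sketch below; the file contents are assumptions, not the actual configuration of this client:)

# /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch)
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=static
BONDING_OPTS="mode=802.3ad xmit_hash_policy=2 miimon=100"

# /etc/modprobe.d/lustre.conf (sketch) - point LNET at the bonded interface
options lnet networks=tcp0(bond0)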

FWIW, I have also tested the file-per-process option in ior with multiple processes and I've seen 1.6GB/s write throughput for IIRC 10 or 12 processes.

Today I've also tried the latest master from git (commit 57373a2, a client I compiled myself for the kernel I'm using for these tests; should I try tag 2.3.59 specifically?) and the version of Lustre 1.8.8(.60) that we use on all our other clients. Summarised results are below.

With the 1.8 client and checksums on I got about 520MiB/s and with checksums off 620MiB/s (haven't recorded the ior output though).

With master on the client I got this:

xfersize NSLOTS Mean throughput[MiB/s]
1M 1 422
1M 2 404
4M 1 467
4M 2 393
4M 4 316
4M 8 292
Comment by Peter Jones [ 25/Jan/13 ]

Frederik

A number of fixes from LU-744 have landed on master, which is why Minh is interested in seeing the results from testing that. The tip of master is fine; 2.3.59 was just a suggestion.

Peter

Comment by Frederik Ferner (Inactive) [ 25/Jan/13 ]

Peter,

thanks for letting me know.

Should I upload the full ior output and/or brw_stats for my tests with master or is the summary enough?

Cheers,
Frederik

Comment by Minh Diep [ 25/Jan/13 ]

Just the ior output is fine. Please try both shared-file and file-per-process. Thanks
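(In ior, file-per-process is selected with the -F flag; a sketch based on the commands used earlier in this ticket:)

$MPIRUN ${MPIRUN_OPTS} -np 4 -machinefile ${TMPDIR}/hostfile \
    /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -k -w -t4m -b 20g -i 3 -e -F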

Comment by Frederik Ferner (Inactive) [ 25/Jan/13 ]

Minh,

I've attached the full output for a new ior run with master on the client, each test was done with 3 iterations.

The order of tests in this output was

  • xfersize=1M, single-shared-file for NSLOTS 1,2,4,8,
  • xfersize=4M, single-shared-file for NSLOTS=1,2,4,8,
  • xfersize=4M, file-per-process for NSLOTS=1,2,4,8
  • xfersize=1M, file-per-process for NSLOTS=1,2,4,8

As a summary these are the results:

Single shared file:

NSLOTS xfersize Mean throughput[MiB/s]
1 1M 438.56
2 1M 441.11
4 1M 290.91
8 1M 274.76
1 4M 466.49
2 4M 424.05
4 4M 346.02
8 4M 313.52

File per process:

NSLOTS xfersize Mean throughput[MiB/s]
1 4M 384.29
2 4M 711.36
4 4M 1129.83
8 4M 1476.87
1 1M 410.59
2 1M 698.79
4 1M 1083.53
8 1M 1513.72

I've noticed that, at least in the first test for a single file, the difference between the individual test results is relatively big (308MiB/s to 510MiB/s); I've not investigated this any further yet.

Frederik

Comment by Minh Diep [ 28/Jan/13 ]

I think it's worth trying to increase the xfersize to 8, 16, 32, 64M.
Is it also possible to add more OSSes?

Comment by Frederik Ferner (Inactive) [ 28/Jan/13 ]

I'm currently repeating the tests with larger xfersize as suggested.

It might be possible to add more OSS nodes, but I'm not too confident that this will help us much, especially as in my tests so far the number of OSTs didn't seem to make much difference even when going down to 2 OSTs (so only using 2 OSSes max). AFAICT the obdfilter-survey shows that the OSSes can push the data to the storage fast enough. We should have sufficient network bandwidth available, and according to our monitoring the OSS nodes are not busy.

On the subject of network bandwidth, do you have a good lnet_selftest script to verify what the LNET performance is for this system? Using netperf I was able to confirm that the basic TCP network performance is good.

Comment by Frederik Ferner (Inactive) [ 28/Jan/13 ]

Minh,

I've repeated the single-shared-file tests for larger xfersize as suggested; I've restricted the maximum number of slots, though, and am still running each test 3 times. The full ior output is
attached, summary below. It doesn't look like it helped, though.

xfersize NSLOTS Mean throughput [MiB/s]
8M 1 426.42
8M 2 337.41
8M 4 323.37
16M 1 399.02
16M 2 383.35
16M 4 287.67
32M 1 405.63
32M 2 405.16
32M 4 230.58
64M 1 421.64
64M 2 382.63
64M 4 215.39
Comment by Minh Diep [ 04/Feb/13 ]

Here is a sample script to run the brw test that I have used before. You can edit it to fit your environment.

#!/bin/sh
# Sample lnet_selftest brw wrapper - edit the NIDs below to fit your environment.

PATH=$PATH:/usr/sbin
SIZE=1M
USAGE="usage: $0 -s server_list -c client_list -k session_key -r start|stop -S size"

while getopts :s:c:k:r:S: opt_char
do
    case $opt_char in
        s) S=$OPTARG;;
        c) C=$OPTARG;;
        k) KEY=$OPTARG;;
        r) STATE=$OPTARG;;
        S) SIZE=$OPTARG;;
        :) echo "The -$OPTARG option requires an argument."
           exit 1;;
        \?) echo "-$OPTARG is not a valid option."
            echo "$USAGE"
            exit 1;;
    esac
done

# Hard-coded NIDs: these override anything passed via -s/-c.
# Remove or edit these two lines to fit your environment.
C=xxx@o2ib1
S=xxx@o2ib1

C_COUNT=`echo $C | wc -w`
S_COUNT=`echo $S | wc -w`

case "$STATE" in
start)
    # Clear the old session, if any, before starting a new one.
    export LST_SESSION=`lst show_session 2>/dev/null | awk '{print $5}'`
    [ "$LST_SESSION" != "" ] && lst end_session
    export LST_SESSION=$KEY
    lst new_session --timeo 100000 hh
    lst add_group c $C
    lst add_group s $S
    lst add_batch b
    # One read test and one write test, 32 concurrent RPCs each.
    lst add_test --batch b --loop 1800000 --concurrency 32 \
        --distribute $C_COUNT:$S_COUNT --from c \
        --to s brw read check=full size=$SIZE
    lst add_test --batch b --loop 1800000 --concurrency 32 \
        --distribute $C_COUNT:$S_COUNT --from c \
        --to s brw write check=full size=$SIZE
    lst run b
    sleep 5
    lst list_batch b
    echo ""
    lst stat --delay 20 c s
    ;;
stop)
    export LST_SESSION=$KEY
    lst show_error c s
    lst stop b
    lst end_session
    ;;
esac
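(Going by the USAGE string above, an invocation might look like the following; the script name, client NID and session key are placeholders, and note that the hard-coded C= and S= lines override the -s/-c options unless they are removed:)

./lnet-selftest.sh -s 172.23.66.29@tcp -c <client-nid>@tcp -k 111 -r start -S 1M
./lnet-selftest.sh -k 111 -r stop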

Comment by Frederik Ferner (Inactive) [ 06/Feb/13 ]

Thanks for the lnet_selftest script, though it was a bit hard to read with the formatting etc...

I've quickly run that now on one client and two servers, output below:

[bnh65367@cs04r-sc-serv-68 bin]$ sudo ./lnet-selftest-wc.sh -k 111 -r start
SESSION: hh FEATURES: 0 TIMEOUT: 100000 FORCE: No
cs04r-sc-serv-68-10g are added to session
cs04r-sc-oss05-03-10g are added to session
cs04r-sc-oss05-04-10g are added to session
Test was added successfully
Test was added successfully
b is running now
Batch: b Tests: 2 State: 177
        ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
client  1       0       0       0       1
server  2       0       0       0       2
        Test 1(brw) (loop: 1800000, concurrency: 32)
        ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
client  1       0       0       0       1
server  2       0       0       0       2
        Test 2(brw) (loop: 1800000, concurrency: 32)
        ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
client  1       0       0       0       1
server  2       0       0       0       2

[LNet Rates of c]
[R] Avg: 7704     RPC/s Min: 7704     RPC/s Max: 7704     RPC/s
[W] Avg: 5695     RPC/s Min: 5695     RPC/s Max: 5695     RPC/s
[LNet Bandwidth of c]
[R] Avg: 2011.80  MB/s  Min: 2011.80  MB/s  Max: 2011.80  MB/s
[W] Avg: 1841.79  MB/s  Min: 1841.79  MB/s  Max: 1841.79  MB/s
[LNet Rates of s]
[R] Avg: 2849     RPC/s Min: 2208     RPC/s Max: 3490     RPC/s
[W] Avg: 3853     RPC/s Min: 3045     RPC/s Max: 4661     RPC/s
[LNet Bandwidth of s]
[R] Avg: 921.26   MB/s  Min: 683.45   MB/s  Max: 1159.07  MB/s
[W] Avg: 1005.93  MB/s  Min: 840.24   MB/s  Max: 1171.62  MB/s
[LNet Rates of c]
[R] Avg: 7634     RPC/s Min: 7634     RPC/s Max: 7634     RPC/s
[W] Avg: 5634     RPC/s Min: 5634     RPC/s Max: 5634     RPC/s
[LNet Bandwidth of c]
[R] Avg: 1998.43  MB/s  Min: 1998.43  MB/s  Max: 1998.43  MB/s
[W] Avg: 1819.24  MB/s  Min: 1819.24  MB/s  Max: 1819.24  MB/s
[LNet Rates of s]
[R] Avg: 2818     RPC/s Min: 2137     RPC/s Max: 3499     RPC/s
[W] Avg: 3816     RPC/s Min: 2961     RPC/s Max: 4672     RPC/s
[LNet Bandwidth of s]
[R] Avg: 909.21   MB/s  Min: 656.14   MB/s  Max: 1162.28  MB/s
[W] Avg: 998.65   MB/s  Min: 823.79   MB/s  Max: 1173.52  MB/s
[LNet Rates of c]
[R] Avg: 7322     RPC/s Min: 7322     RPC/s Max: 7322     RPC/s
[W] Avg: 5409     RPC/s Min: 5409     RPC/s Max: 5409     RPC/s
[LNet Bandwidth of c]
[R] Avg: 1914.47  MB/s  Min: 1914.47  MB/s  Max: 1914.47  MB/s
[W] Avg: 1747.85  MB/s  Min: 1747.85  MB/s  Max: 1747.85  MB/s
[LNet Rates of s]
[R] Avg: 2704     RPC/s Min: 1897     RPC/s Max: 3510     RPC/s
[W] Avg: 3660     RPC/s Min: 2636     RPC/s Max: 4685     RPC/s
[LNet Bandwidth of s]
[R] Avg: 873.47   MB/s  Min: 579.29   MB/s  Max: 1167.64  MB/s
[W] Avg: 956.83   MB/s  Min: 738.93   MB/s  Max: 1174.73  MB/s
[LNet Rates of c]
[R] Avg: 7580     RPC/s Min: 7580     RPC/s Max: 7580     RPC/s
[W] Avg: 5594     RPC/s Min: 5594     RPC/s Max: 5594     RPC/s
[LNet Bandwidth of c]
[R] Avg: 1988.69  MB/s  Min: 1988.69  MB/s  Max: 1988.69  MB/s
[W] Avg: 1803.03  MB/s  Min: 1803.03  MB/s  Max: 1803.03  MB/s
[LNet Rates of s]
[R] Avg: 2796     RPC/s Min: 2112     RPC/s Max: 3480     RPC/s
[W] Avg: 3789     RPC/s Min: 2927     RPC/s Max: 4650     RPC/s
[LNet Bandwidth of s]
[R] Avg: 901.00   MB/s  Min: 647.69   MB/s  Max: 1154.30  MB/s
[W] Avg: 993.80   MB/s  Min: 817.02   MB/s  Max: 1170.58  MB/s
[LNet Rates of c]
[R] Avg: 8064     RPC/s Min: 8064     RPC/s Max: 8064     RPC/s
[W] Avg: 5957     RPC/s Min: 5957     RPC/s Max: 5957     RPC/s
[LNet Bandwidth of c]
[R] Avg: 2105.40  MB/s  Min: 2105.40  MB/s  Max: 2105.40  MB/s
[W] Avg: 1926.91  MB/s  Min: 1926.91  MB/s  Max: 1926.91  MB/s
[LNet Rates of s]
[R] Avg: 2973     RPC/s Min: 2468     RPC/s Max: 3479     RPC/s
[W] Avg: 4026     RPC/s Min: 3403     RPC/s Max: 4648     RPC/s
[LNet Bandwidth of s]
[R] Avg: 961.77   MB/s  Min: 768.23   MB/s  Max: 1155.32  MB/s
[W] Avg: 1050.98  MB/s  Min: 932.80   MB/s  Max: 1169.15  MB/s
[LNet Rates of c]
[R] Avg: 7601     RPC/s Min: 7601     RPC/s Max: 7601     RPC/s
[W] Avg: 5624     RPC/s Min: 5624     RPC/s Max: 5624     RPC/s
[LNet Bandwidth of c]
[R] Avg: 1977.67  MB/s  Min: 1977.67  MB/s  Max: 1977.67  MB/s
[W] Avg: 1824.20  MB/s  Min: 1824.20  MB/s  Max: 1824.20  MB/s
[LNet Rates of s]
[R] Avg: 2814     RPC/s Min: 2173     RPC/s Max: 3454     RPC/s
[W] Avg: 3802     RPC/s Min: 2993     RPC/s Max: 4610     RPC/s
[LNet Bandwidth of s]
[R] Avg: 912.18   MB/s  Min: 676.17   MB/s  Max: 1148.19  MB/s
[W] Avg: 988.94   MB/s  Min: 820.77   MB/s  Max: 1157.10  MB/s
No session exists

This was running until I terminated it in another window:

[bnh65367@cs04r-sc-serv-68 bin]$ sudo ./lnet-selftest-wc.sh -k 111 -r stop
c:
Total 0 error nodes in c
s:
Total 0 error nodes in s
1 batch in stopping
Batch is stopped
session is ended
[bnh65367@cs04r-sc-serv-68 bin]$

This was done using 2.3.59 on the client and 2.3.0 on the servers. The client
is the same hardware and network configuration as in the previous tests.

Comment by Frederik Ferner (Inactive) [ 06/Feb/13 ]

The throughput in the previous test was good, though I've noticed that the throughput seems to drop to about 550MB/s if I use one client, one server and reduce the concurrency to 1. I wonder if that is related to the single-stream performance that we experience. Is the client effectively only ever writing to one server at a time, or something similar?

[bnh65367@cs04r-sc-serv-68 bin]$ sudo ./lnet-selftest-wc.sh -k 111 -r start -s cs04r-sc-oss05-03-10g -C 1
CONCURRENCY=1
session is ended
SESSION: hh FEATURES: 0 TIMEOUT: 100000 FORCE: No
cs04r-sc-serv-68-10g are added to session
cs04r-sc-oss05-03-10g are added to session
Test was added successfully
Test was added successfully
b is running now
Batch: b Tests: 2 State: 177
        ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
client  1       0       0       0       1
server  1       0       0       0       1
        Test 1(brw) (loop: 1800000, concurrency: 1)
        ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
client  1       0       0       0       1
server  1       0       0       0       1
        Test 2(brw) (loop: 1800000, concurrency: 1)
        ACTIVE  BUSY    DOWN    UNKNOWN TOTAL
client  1       0       0       0       1
server  1       0       0       0       1

[LNet Rates of c]
[R] Avg: 2245     RPC/s Min: 2245     RPC/s Max: 2245     RPC/s
[W] Avg: 1688     RPC/s Min: 1688     RPC/s Max: 1688     RPC/s
[LNet Bandwidth of c]
[R] Avg: 557.79   MB/s  Min: 557.79   MB/s  Max: 557.79   MB/s
[W] Avg: 565.19   MB/s  Min: 565.19   MB/s  Max: 565.19   MB/s
[LNet Rates of s]
[R] Avg: 1688     RPC/s Min: 1688     RPC/s Max: 1688     RPC/s
[W] Avg: 2246     RPC/s Min: 2246     RPC/s Max: 2246     RPC/s
[LNet Bandwidth of s]
[R] Avg: 564.91   MB/s  Min: 564.91   MB/s  Max: 564.91   MB/s
[W] Avg: 557.47   MB/s  Min: 557.47   MB/s  Max: 557.47   MB/s
[LNet Rates of c]
[R] Avg: 2246     RPC/s Min: 2246     RPC/s Max: 2246     RPC/s
[W] Avg: 1689     RPC/s Min: 1689     RPC/s Max: 1689     RPC/s
[LNet Bandwidth of c]
[R] Avg: 556.52   MB/s  Min: 556.52   MB/s  Max: 556.52   MB/s
[W] Avg: 566.62   MB/s  Min: 566.62   MB/s  Max: 566.62   MB/s
[LNet Rates of s]
[R] Avg: 1690     RPC/s Min: 1690     RPC/s Max: 1690     RPC/s
[W] Avg: 2246     RPC/s Min: 2246     RPC/s Max: 2246     RPC/s
[LNet Bandwidth of s]
[R] Avg: 566.36   MB/s  Min: 566.36   MB/s  Max: 566.36   MB/s
[W] Avg: 556.22   MB/s  Min: 556.22   MB/s  Max: 556.22   MB/s
[LNet Rates of c]
[R] Avg: 2250     RPC/s Min: 2250     RPC/s Max: 2250     RPC/s
[W] Avg: 1690     RPC/s Min: 1690     RPC/s Max: 1690     RPC/s
[LNet Bandwidth of c]
[R] Avg: 559.59   MB/s  Min: 559.59   MB/s  Max: 559.59   MB/s
[W] Avg: 565.44   MB/s  Min: 565.44   MB/s  Max: 565.44   MB/s
[LNet Rates of s]
[R] Avg: 1691     RPC/s Min: 1691     RPC/s Max: 1691     RPC/s
[W] Avg: 2250     RPC/s Min: 2250     RPC/s Max: 2250     RPC/s
[LNet Bandwidth of s]
[R] Avg: 565.11   MB/s  Min: 565.11   MB/s  Max: 565.11   MB/s
[W] Avg: 559.31   MB/s  Min: 559.31   MB/s  Max: 559.31   MB/s
[LNet Rates of c]
[R] Avg: 2248     RPC/s Min: 2248     RPC/s Max: 2248     RPC/s
[W] Avg: 1688     RPC/s Min: 1688     RPC/s Max: 1688     RPC/s
[LNet Bandwidth of c]
[R] Avg: 560.44   MB/s  Min: 560.44   MB/s  Max: 560.44   MB/s
[W] Avg: 563.74   MB/s  Min: 563.74   MB/s  Max: 563.74   MB/s
[LNet Rates of s]
[R] Avg: 1687     RPC/s Min: 1687     RPC/s Max: 1687     RPC/s
[W] Avg: 2247     RPC/s Min: 2247     RPC/s Max: 2247     RPC/s
[LNet Bandwidth of s]
[R] Avg: 563.46   MB/s  Min: 563.46   MB/s  Max: 563.46   MB/s
[W] Avg: 560.16   MB/s  Min: 560.16   MB/s  Max: 560.16   MB/s
Comment by Minh Diep [ 14/Feb/13 ]

Yes, it's because you set concurrency=1; it's like running a single thread.

Comment by Minh Diep [ 14/Feb/13 ]

Hi, could you print out the lctl dl -t output from your client? Thanks

Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ]

Minh,

not sure if I mentioned this, but I had to reduce my file system to only 20 OSTs on 2 OSSes as I had to start investigating alternatives on the rest of the hardware. Here is the requested lctl dl -t output.

[bnh65367@cs04r-sc-serv-68 frederik1]$ lctl dl -t
0 UP mgc MGC172.23.66.29@tcp 59376037-38a3-b21b-689c-217c3f9bd463 5
1 UP lov spfs1-clilov-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 4
2 UP lmv spfs1-clilmv-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 4
3 UP mdc spfs1-MDT0000-mdc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
4 UP osc spfs1-OST0001-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
5 UP osc spfs1-OST0002-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
6 UP osc spfs1-OST0003-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
7 UP osc spfs1-OST0004-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
8 UP osc spfs1-OST0005-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
9 UP osc spfs1-OST0006-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
10 UP osc spfs1-OST0007-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
11 UP osc spfs1-OST0008-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
12 UP osc spfs1-OST0009-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
13 UP osc spfs1-OST0000-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.29@tcp
14 UP osc spfs1-OST000a-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
15 UP osc spfs1-OST000b-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
16 UP osc spfs1-OST000c-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
17 UP osc spfs1-OST000d-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
18 UP osc spfs1-OST000e-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
19 UP osc spfs1-OST000f-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
20 UP osc spfs1-OST0010-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
21 UP osc spfs1-OST0011-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
22 UP osc spfs1-OST0012-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
23 UP osc spfs1-OST0013-osc-ffff880829227800 ff8b265f-5e64-8652-4083-5811a31faa63 5 172.23.66.30@tcp
[bnh65367@cs04r-sc-serv-68 frederik1]$

Comment by Minh Diep [ 15/Feb/13 ]

Frederik,

I suggest we measure how much one OSS can deliver to one client. You can achieve this with lfs setstripe -c 10 -o 1 <ior dir>. Please try ior with -np 1 and let me know.

Thanks

Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ]

Minh,

570MiB/s see below (for just one iteration)

[bnh65367@cs04r-sc-serv-68 frederik1]$ mkdir single-oss
[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs setstripe -c 10 -o 1 single-oss/
[bnh65367@cs04r-sc-serv-68 frederik1]$ export IORTESTDIR=/mnt/lustre-test/frederik1/single-oss
[bnh65367@cs04r-sc-serv-68 frederik1]$ export NSLOTS=1
[bnh65367@cs04r-sc-serv-68 frederik1]$ $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -w -k -t1m -b 20g -i 1 -e
IOR-3.0.0: MPI Coordinated Test of Parallel I/O

Began: Fri Feb 15 16:37:50 2013
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/single-oss/ior_dat -w -k -t1m -b 20g -i 1 -e
Machine: Linux cs04r-sc-serv-68.diamond.ac.uk

Test 0 started: Fri Feb 15 16:37:50 2013
Summary:
        api                = POSIX
        test filename      = /mnt/lustre-test/frederik1/single-oss/ior_dat
        access             = single-shared-file
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 20 GiB
        aggregate filesize = 20 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     570.96     20971520   1024.00    0.000590   35.87      0.000212   35.87      0   

Max Write: 570.96 MiB/sec (598.69 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write         570.96     570.96     570.96       0.00   35.86953 0 1 1 1 0 0 1 0 0 1 21474836480 1048576 21474836480 POSIX 0

Finished: Fri Feb 15 16:38:26 2013
[bnh65367@cs04r-sc-serv-68 frederik1]$ 
Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ]

Also, the reason I tried the lnet selftest with concurrency 1 is that my feeling is this might be close to what happens for single-process writes. Looking at the two throughput numbers (concurrency=1 lnet selftest and single-process ior), these seem very close to each other.

Also, the other day I did a test while watching /proc/sys/lnet/peers every 1/10th of a second, and there was only ever one of the two NIDs with anything reported as queued. Not sure if this is relevant or not...
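(A simple way to reproduce that observation; a sketch, run on the client during a write test:)

while true; do cat /proc/sys/lnet/peers; sleep 0.1; done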

Comment by Minh Diep [ 15/Feb/13 ]

Hi Frederick,

Thanks for the quick response. Could you try again with lfs setstripe -c 20? thanks

Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ]

here you go (this seems slower?):

[bnh65367@cs04r-sc-serv-68 frederik1]$ mkdir stripe-20-1
[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs setstripe -c 20 stripe-20-1
[bnh65367@cs04r-sc-serv-68 frederik1]$ export IORTESTDIR=/mnt/lustre-test/frederik1/stripe-20-1
[bnh65367@cs04r-sc-serv-68 frederik1]$ $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -w -k -t1m -b 20g -i 1 -e
IOR-3.0.0: MPI Coordinated Test of Parallel I/O

Began: Fri Feb 15 17:01:36 2013
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/stripe-20-1/ior_dat -w -k -t1m -b 20g -i 1 -e
Machine: Linux cs04r-sc-serv-68.diamond.ac.uk

Test 0 started: Fri Feb 15 17:01:36 2013
Summary:
        api                = POSIX
        test filename      = /mnt/lustre-test/frederik1/stripe-20-1/ior_dat
        access             = single-shared-file
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 20 GiB
        aggregate filesize = 20 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     412.76     20971520   1024.00    0.000905   49.62      0.000229   49.62      0   

Max Write: 412.76 MiB/sec (432.81 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write         412.76     412.76     412.76       0.00   49.61738 0 1 1 1 0 0 1 0 0 1 21474836480 1048576 21474836480 POSIX 0

Finished: Fri Feb 15 17:02:25 2013
[bnh65367@cs04r-sc-serv-68 frederik1]$ 
Comment by Minh Diep [ 15/Feb/13 ]

It is very strange to me that the write with two OSSes combined is worse than with a single OSS. I have tested in my lab and the write roughly doubles with two OSSes.

Comment by Minh Diep [ 15/Feb/13 ]

We could do a few things to check:
1. verify that the file actually stripes across all the OSTs: lfs getstripe <file>.
2. verify that you get the same result on the other OSS: lfs setstripe -c 10 -o a
3. run iostat during the test to see how much each LUN is writing (see the sketch after this list).
4. print out /proc/fs/lustre/osc/<OST>/rpc_stats
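(For item 3, something along these lines on each OSS during the ior run would show per-LUN write rates; -k makes iostat report kB/s rather than blocks:)

iostat -dk 10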

Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ]

just to confirm

for 2) I assume it is -o 1 (and not a)?
for 3) iostat on the OSS or the client?

(And yes it looks like we are using all 20 OSTs:

[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs getstripe /mnt/lustre-test/frederik1/stripe-20-1/ior_dat
/mnt/lustre-test/frederik1/stripe-20-1/ior_dat
lmm_stripe_count:   20
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  14
        obdidx           objid          objid            group
            14            1380          0x564                0
             6            1320          0x528                0
            15            1380          0x564                0
             7            1319          0x527                0
            16            1380          0x564                0
             8            1319          0x527                0
            17            1380          0x564                0
             9            1320          0x528                0
            18            1380          0x564                0
             0            1326          0x52e                0
            19            1380          0x564                0
             1            1323          0x52b                0
            10            1380          0x564                0
             2            1320          0x528                0
            11            1380          0x564                0
             3            1320          0x528                0
            12            1476          0x5c4                0
             4            1321          0x529                0
            13            1380          0x564                0
             5            1320          0x528                0
Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ]

OK, going through the suggestions one step at a time (after playing with lfs setstripe -o for a bit, I think I've got it).

ior on just the second OSS:

[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs setstripe -c 10 -o 0xa single-oss-2/
[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs getstripe single-oss-2/
single-oss-2/
stripe_count:   10 stripe_size:    1048576 stripe_offset:  10
[bnh65367@cs04r-sc-serv-68 frederik1]$ export IORTESTDIR=/mnt/lustre-test/frederik1/single-oss-2/
[bnh65367@cs04r-sc-serv-68 frederik1]$ $MPIRUN ${MPIRUN_OPTS} -np $NSLOTS -machinefile ${TMPDIR}/hostfile /home/bnh65367/code/ior/src/ior -o ${IORTESTDIR}/ior_dat -w -k -t1m -b 20g -i 1 -e
IOR-3.0.0: MPI Coordinated Test of Parallel I/O

Began: Fri Feb 15 17:33:57 2013
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre-test/frederik1/single-oss-2//ior_dat -w -k -t1m -b 20g -i 1 -e
Machine: Linux cs04r-sc-serv-68.diamond.ac.uk

Test 0 started: Fri Feb 15 17:33:57 2013
Summary:
        api                = POSIX
        test filename      = /mnt/lustre-test/frederik1/single-oss-2//ior_dat
        access             = single-shared-file
        ordering in a file = sequential offsets
        ordering inter file= no tasks offsets
        clients            = 1 (1 per node)
        repetitions        = 1
        xfersize           = 1 MiB
        blocksize          = 20 GiB
        aggregate filesize = 20 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     468.96     20971520   1024.00    0.000589   43.67      0.000200   43.67      0

Max Write: 468.96 MiB/sec (491.74 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write         468.96     468.96     468.96       0.00   43.67131 0 1 1 1 0 0 1 0 0 1 21474836480 1048576 21474836480 POSIX 0

Finished: Fri Feb 15 17:34:41 2013
[bnh65367@cs04r-sc-serv-68 frederik1]$ lfs getstripe single-oss-2/ior_dat 
single-oss-2/ior_dat
lmm_stripe_count:   10
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  10
        obdidx           objid          objid            group
            10            1381          0x565                0
            11            1381          0x565                0
            12            1477          0x5c5                0
            13            1381          0x565                0
            14            1381          0x565                0
            15            1381          0x565                0
            16            1381          0x565                0
            17            1381          0x565                0
            18            1381          0x565                0
            19            1381          0x565                0

rpc_stats for one OST:

[bnh65367@cs04r-sc-serv-68 frederik1]$ cat /proc/fs/lustre/osc/spfs1-OST0007-osc-ffff880829227800/rpc_stats 
snapshot_time:         1360950242.989930 (secs.usecs)
read RPCs in flight:  0
write RPCs in flight: 0
pending write pages:  0
pending read pages:   0

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       0   0   0   |          0   0   0
2:                       0   0   0   |          0   0   0
4:                       0   0   0   |          0   0   0
8:                       0   0   0   |          0   0   0
16:                      0   0   0   |          0   0   0
32:                      0   0   0   |          0   0   0
64:                      0   0   0   |          0   0   0
128:                     0   0   0   |          0   0   0
256:                     0   0   0   |      25589 100 100

                        read                    write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0
1:                       0   0   0   |       6979  27  27
2:                       0   0   0   |       3272  12  40
3:                       0   0   0   |       4127  16  56
4:                       0   0   0   |        644   2  58
5:                       0   0   0   |        515   2  60
6:                       0   0   0   |        561   2  62
7:                       0   0   0   |       1378   5  68
8:                       0   0   0   |       2442   9  77
9:                       0   0   0   |       3931  15  93
10:                      0   0   0   |       1725   6  99
11:                      0   0   0   |         15   0 100

                        read                    write
offset                rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |         17   0   0
1:                       0   0   0   |          0   0   0
2:                       0   0   0   |          0   0   0
4:                       0   0   0   |          0   0   0
8:                       0   0   0   |          0   0   0
16:                      0   0   0   |          0   0   0
32:                      0   0   0   |          0   0   0
64:                      0   0   0   |          0   0   0
128:                     0   0   0   |          0   0   0
256:                     0   0   0   |         17   0   0
512:                     0   0   0   |         34   0   0
1024:                    0   0   0   |         68   0   0
2048:                    0   0   0   |        136   0   1
4096:                    0   0   0   |        272   1   2
8192:                    0   0   0   |        544   2   4
16384:                   0   0   0   |       1088   4   8
32768:                   0   0   0   |       2125   8  16
65536:                   0   0   0   |       4096  16  32
131072:                  0   0   0   |       7976  31  63
262144:                  0   0   0   |       3072  12  75
524288:                  0   0   0   |       4096  16  91
1048576:                 0   0   0   |       2048   8 100
[bnh65367@cs04r-sc-serv-68 frederik1]$ 
Comment by Frederik Ferner (Inactive) [ 15/Feb/13 ]

And another test with all OSTs:

Cut-down iostat over a 10-second interval in the middle of the test on the first OSS, with OST details added. Note that this was fairly constant over the whole test. Note also that I was running iostat 10, so the results are in blocks.

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dm-3              0.00         0.00         0.00          0          0
dm-4              0.00         0.00         0.00          0          0
dm-5              0.00         0.00         0.00          0          0
dm-6             26.90         1.60     53467.20         16     534672 (ost13)
dm-7              0.00         0.00         0.00          0          0
dm-8             26.90         1.60     53468.00         16     534680 (ost18)
dm-9             26.80         0.80     53467.20          8     534672 (ost11)
dm-10             0.00         0.00         0.00          0          0
dm-11            26.80         1.60     53262.40         16     532624 (ost16)
dm-12            26.90         1.60     53467.20         16     534672 (ost19)
dm-13            26.90         1.60     53468.00         16     534680 (ost17)
dm-14            26.60         0.00     53262.40          0     532624 (ost14)
dm-15            26.90         1.60     53467.20         16     534672 (ost10)
dm-16            26.90         1.60     53467.20         16     534672 (ost15)
dm-17            26.80         1.60     53262.40         16     532624 (ost14)
dm-18             0.00         0.00         0.00          0          0
dm-19             0.00         0.00         0.00          0          0
dm-20             0.00         0.00         0.00          0          0
dm-21             0.00         0.00         0.00          0          0
dm-22             0.00         0.00         0.00          0          0
dm-23             0.00         0.00         0.00          0          0

Cut-down iostat over a 10-second interval in the middle of the test on the second OSS:

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
dm-3             26.60         0.00     53665.60          0     536656 (ost0)
dm-4             26.60         0.00     53467.20          0     534672 (ost5)
dm-5             26.70         0.00     53467.20          0     534672 (ost6)
dm-6             26.80         0.00     53672.00          0     536720 (ost1)
dm-7             26.60         0.00     53466.40          0     534664 (ost7)
dm-8             26.70         0.00     53467.20          0     534672 (ost8)
dm-9              1.10         0.00         9.60          0         96 (mdt)
dm-10            26.70         0.00     53467.20          0     534672 (ost9)
dm-11             0.00         0.00         0.00          0          0
dm-12             0.00         0.00         0.00          0          0
dm-13             0.00         0.00         0.00          0          0
dm-14            26.80         0.00     53672.00          0     536720 (ost3)
dm-15            26.80         0.00     53672.00          0     536720 (ost4)
dm-16             0.00         0.00         0.00          0          0
dm-17             0.00         0.00         0.00          0          0
dm-18            26.80         0.00     53672.00          0     536720 (ost2)
dm-19             0.00         0.00         0.00          0          0
dm-20             0.00         0.00         0.00          0          0
dm-21             0.00         0.00         0.00          0          0
dm-22             0.00         0.00         0.00          0          0
dm-23             0.00         0.00         0.00          0          0

IOR reported a throughput of about 520MiB/s. Traffic seems fairly balanced to me.

Can we just compare exact Lustre versions? I'm still using Lustre 2.3.0 on the servers and 2.3.59 on the clients.

Comment by Minh Diep [ 15/Feb/13 ]

I am using 2.3.0 on the server and latest lustre-master from yesterday.

Comment by Frederik Ferner (Inactive) [ 18/Feb/13 ]

Minh,

Just to confirm: on your test system, are you running over 10GigE (bonded links)? And do you get close to what I get on a single OSS, or much less? What is the approximate throughput you get with two OSSes over 10GigE?

Frederik

Comment by Minh Diep [ 18/Feb/13 ]

No, I am running over IB. My small setup has just 2 SATA drives on each OSS, but I was able to scale linearly up to 4 OSSes, which reached about 740MB/s. I am trying to reconfigure to achieve 1GB/s, by either adding more OSSes or more disks per OSS.

Comment by Minh Diep [ 18/Feb/13 ]

It puzzles me that your two OSSes combined perform slower than a single one. Something is not configured correctly here. If you still have the system, we should experiment with a run on each individual OST.

1. Create an OST pool containing the first OST on one OSS, setstripe on that pool, and run ior (a sketch of the pool commands is below).
2. Create an OST pool containing the first OST on each of the two OSSes, setstripe -c 2 on that pool, and run ior.
3. Scale this up to all OSTs on all OSSes to see where we slow down.

Thanks
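(A sketch of the pool setup for step 1, using the spfs1 fsname and OST names from the lctl dl output above; the pool and directory names are examples:)

# on the MGS: create a pool containing the first OST of one OSS
lctl pool_new spfs1.oss1test
lctl pool_add spfs1.oss1test spfs1-OST0000
# on the client: stripe a test directory over that pool, then run ior against it
lfs setstripe -p oss1test -c 1 /mnt/lustre-test/frederik1/pool-test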

Comment by Minh Diep [ 18/Feb/13 ]

Is it possible to have remote access to your cluster? Please let me know

Comment by Frederik Ferner (Inactive) [ 18/Feb/13 ]

From what I've gathered from other sites, performance over IB seems different from 10GigE. Unfortunately we won't be able to change our infrastructure to IB as part of this project.

Remote access to the test system should be possible, let's discuss details on that over private email, if you don't mind.

Frederik

Comment by Minh Diep [ 28/Jun/13 ]

Frederick,

Is there anything else that needs to be done on this ticket?
