[LU-5754] Lustre 2.6 client performance running on 2.5 production system Created: 16/Oct/14  Updated: 08/Feb/18  Resolved: 08/Feb/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: Dave Bond (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None

Attachments: Text File collectl.txt     Text File rpc_stats.txt    
Rank (Obsolete): 16150

 Description   

After LAD this year the suggestion about our performance issues with single client, single stream data was to try out the 2.6 client. We are aiming for 900MB/s single stream, single client performance.
Could I ask if we were to use this in production would this be covered by our support contract?

I am seeing on average a 66% increase in performance from the 2.5 to the 2.6 client. This is giving us approximately 650MiB/s from a single client running IOR.

Run as below
/dls_sw/apps/openmpi/1.4.3/64/bin/mpirun -mca btl self,tcp,sm -np 1 /home/bnh65367/code/ior/src/ior -o /mnt/lustre03/testdir/dave/ior_dat -w -r -k -t4m -S -b 10G -i 1 -e -a POSIX

This agrees with dd used as a crude way to measure single stream performance, and with real world testing. Though iozone achieves much higher figures, approximately 900MB/s, the relationship is still an approximate 60% improvement.

I believe the IOR figures are closer to what we would see with one of our detectors, and I do not yet fully understand why iozone achieves so much better results.

iozone.x86_64 -i 0 -r 4M -s 10G -t 1

But I would like to know if you feel we can get IOR to run at 900MB/s. The testing I have done so far with real world tests and benchmarking consistently shows around 600MB/s. Andreas was of the opinion that I should be able to achieve 900MB/s with the 2.6 client.

Is there any tuning that you can think of that might benefit us? We can already see the file striped across all OSTs.



 Comments   
Comment by Peter Jones [ 16/Oct/14 ]

Dave

Yes this configuration would certainly be supported.

Jinshan

Is there any advice that you can provide to Dave?

Thanks

Peter

Comment by Jinshan Xiong (Inactive) [ 16/Oct/14 ]

Hi Dave,

In my latest test, single client, single thread write speed should be able to reach 1.3GB/s. Do you know how fast a single OST is in your configuration? If possible, I'd like to start with a single-striped file.

Please collect some stats while IOR and iozone are running:
1. CPU and memory usage information;
2. rpc_stats on the client side after the test is complete. Make sure rpc_stats is cleared with lctl set_param osc.*.rpc_stats=clear before the test starts.

Also make sure debug is turned off. You can also try turning off checksums and see how much it helps.
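
For reference, a minimal sketch of the client-side commands (standard lctl parameter names; checksums can be re-enabled afterwards by setting the parameter back to 1):

# before the test: disable debug logging and, optionally, data checksums
lctl set_param debug=0
lctl set_param osc.*.checksums=0
# clear the per-OSC RPC statistics
lctl set_param osc.*.rpc_stats=clear
# ... run IOR / iozone ...
# after the test: dump the RPC statistics for analysis
lctl get_param osc.*.rpc_stats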

I will do further analysis once I've got this information.

Jinshan

Comment by Andreas Dilger [ 16/Oct/14 ]

Also, what is the CPU/RAM on the client? Some operations like copying data from userspace to the kernel are CPU bound, so having a faster GHz CPU should improve performance of the single-threaded case. We haven't done much testing on this yet.

Comment by Dave Bond (Inactive) [ 20/Oct/14 ]

This is the performance from a stripe count of 1

[joe59240@cs04r-sc-serv-68 dave]$ sudo lfs setstripe -c 1 /mnt/lustre03/testdir/dave/
[joe59240@cs04r-sc-serv-68 dave]$ lfs getstripe /mnt/lustre03/testdir/dave/
/mnt/lustre03/testdir/dave/
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1
/mnt/lustre03/testdir/dave//dd-test
lmm_stripe_count:   30
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  4
    obdidx         objid         objid         group
         4          52196121        0x31c7319                 0
        26          53264792        0x32cc198                 0
         9          52897292        0x327260c                 0
        17          52485050        0x320dbba                 0
         5          52788704        0x3257de0                 0
        29          52833246        0x3262bde                 0
        14          52785759        0x325725f                 0
        16          52658853        0x32382a5                 0
         0          53020814        0x329088e                 0
        12          52817200        0x325ed30                 0
        18          52858751        0x3268f7f                 0
        24          53058169        0x3299a79                 0
         1          53232395        0x32c430b                 0
        13          52599697        0x3229b91                 0
        21          51907308        0x3180aec                 0
        23          52559058        0x321fcd2                 0
         2          52421528        0x31fe398                 0
         8          52819310        0x325f56e                 0
        20          53108899        0x32a60a3                 0
        28          53012365        0x328e78d                 0
        27          53149873        0x32b00b1                 0
        11          52740508        0x324c19c                 0
        15          53099667        0x32a3c93                 0
         3          53045067        0x329674b                 0
        10          52926727        0x3279907                 0
        22          52342894        0x31eb06e                 0
         6          51948916        0x318ad74                 0
        25          52317516        0x31e4d4c                 0
         7          52712325        0x3245385                 0
        19          52586950        0x32269c6                 0

/mnt/lustre03/testdir/dave//ior_dat
lmm_stripe_count:   30
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  29
    obdidx         objid         objid         group
        29          52921862        0x3278606                 0
        14          52874360        0x326cc78                 0
        16          52747467        0x324dccb                 0
         0          53109419        0x32a62ab                 0
        12          52905806        0x327474e                 0
        18          52947353        0x327e999                 0
        24          53146774        0x32af496                 0
         1          53321004        0x32d9d2c                 0
        13          52688306        0x323f5b2                 0
        21          51995910        0x3196506                 0
        23          52647663        0x32356ef                 0
         2          52510135        0x3213db7                 0
         8          52907923        0x3274f93                 0
        20          53197508        0x32bbac4                 0
        28          53100974        0x32a41ae                 0
        27          53238481        0x32c5ad1                 0
        11          52829113        0x3261bb9                 0
        15          53188281        0x32b96b9                 0
         3          53133679        0x32ac16f                 0
        10          53015333        0x328f325                 0
        22          52431496        0x3200a88                 0
         6          52037517        0x31a078d                 0
        25          52406118        0x31fa766                 0
         7          52800937        0x325ada9                 0
        19          52675557        0x323c3e5                 0
         4          52284727        0x31dcd37                 0
        26          53353395        0x32e1bb3                 0
         9          52985895        0x3288027                 0
        17          52573656        0x32235d8                 0
         5          52877308        0x326d7fc                 0

IOR test output:

[joe59240@cs04r-sc-serv-68 dave]$ /dls_sw/apps/openmpi/1.4.3/64/bin/mpirun -mca btl self,tcp,sm -np 1 /home/bnh65367/code/ior/src/ior -o /mnt/lustre03/testdir/dave/ior_dat -w -r -k -t1m -S -b 10G -i 1 -e -a POSIX
ior WARNING: strided datatype only available in MPIIO.  Using value of 0.
IOR-3.0.0: MPI Coordinated Test of Parallel I/O

Began: Mon Oct 20 11:34:33 2014
Command line used: /home/bnh65367/code/ior/src/ior -o /mnt/lustre03/testdir/dave/ior_dat -w -r -k -t1m -S -b 10G -i 1 -e -a POSIX
Machine: Linux cs04r-sc-serv-68.diamond.ac.uk

Test 0 started: Mon Oct 20 11:34:33 2014
Summary:
    api                = POSIX
    test filename      = /mnt/lustre03/testdir/dave/ior_dat
    access             = single-shared-file
    ordering in a file = sequential offsets
    ordering inter file= no tasks offsets
    clients            = 1 (1 per node)
    repetitions        = 1
    xfersize           = 1 MiB
    blocksize          = 10 GiB
    aggregate filesize = 10 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   --------   ----
write     281.06     10485760   1024.00    0.000305   36.43      0.000271   36.43      0  
read      878.49     10485760   1024.00    0.000173   11.66      0.000015   11.66      0  

Max Write: 281.06 MiB/sec (294.71 MB/sec)
Max Read:  878.49 MiB/sec (921.17 MB/sec)

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev    Mean(s) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize API RefNum
write         281.06     281.06     281.06       0.00   36.43340 0 1 1 1 0 0 1 0 0 1 10737418240 1048576 10737418240 POSIX 0
read          878.49     878.49     878.49       0.00   11.65634 0 1 1 1 0 0 1 0 0 1 10737418240 1048576 10737418240 POSIX 0

Finished: Mon Oct 20 11:35:21 2014
[joe59240@cs04r-sc-serv-68 dave]$

The same with iozone

[joe59240@cs04r-sc-serv-68 dave]$ sudo /mnt/lustre03/testdir/iozone.x86_64 -i 0 -r 4M -s 10G -t 1
    Iozone: Performance Test of File I/O
            Version $Revision: 3.283 $
        Compiled for 64 bit mode.
        Build: linux

    Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
                 Al Slater, Scott Rhine, Mike Wisner, Ken Goss
                 Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                 Randy Dunlap, Mark Montague, Dan Million,
                 Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
                 Erik Habbinga, Kris Strecker, Walter Wong.

    Run began: Mon Oct 20 11:37:40 2014

    Record Size 4096 KB
    File size set to 10485760 KB
    Command line used: /mnt/lustre03/testdir/iozone.x86_64 -i 0 -r 4M -s 10G -t 1
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
    Throughput test with 1 process
    Each process writes a 10485760 Kbyte file in 4096 Kbyte records

    Children see throughput for  1 initial writers     =  262971.44 KB/sec
    Parent sees throughput for  1 initial writers     =  261413.53 KB/sec
    Min throughput per process             =  262971.44 KB/sec
    Max throughput per process             =  262971.44 KB/sec
    Avg throughput per process             =  262971.44 KB/sec
    Min xfer                     = 10485760.00 KB

    Children see throughput for  1 rewriters     =  278931.97 KB/sec
    Parent sees throughput for  1 rewriters     =  275981.64 KB/sec
    Min throughput per process             =  278931.97 KB/sec
    Max throughput per process             =  278931.97 KB/sec
    Avg throughput per process             =  278931.97 KB/sec
    Min xfer                     = 10485760.00 KB



iozone test complete.

OSC stats:

lctl set_param osc.*.rpc_stats=clear
sudo less /proc/fs/lustre/osc/*/rpc_stats

snapshot_time:         1413802038.44872 (secs.usecs)
read RPCs in flight:  0
write RPCs in flight: 0
pending write pages:  0
pending read pages:   0

                        read                    write
pages per rpc         rpcs   % cum % |       rpcs   % cum %
1:                       0   0   0   |          0   0   0

                        read                    write
rpcs in flight        rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0

                        read                    write
offset                rpcs   % cum % |       rpcs   % cum %
0:                       0   0   0   |          0   0   0

This does not look right. Is this what you expected?

Collectl output during the IOR run detailed above

[joe59240@ws250 ~]$ ssh cs04r-sc-serv-68
Last login: Tue Oct 14 13:12:09 2014 from ws250.diamond.ac.uk
[joe59240@cs04r-sc-serv-68 ~]$
[joe59240@cs04r-sc-serv-68 ~]$
[joe59240@cs04r-sc-serv-68 ~]$
[joe59240@cs04r-sc-serv-68 ~]$ collectl
waiting for 1 second sample...
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0   587    841      0      0      0      0      9     50     11      57
   0   0   703    896      0      0      0      0      1      9      1       9
   3   3 19403   7052      0      0     44      3      0      3      1       3
   4   4 21818   7536      0      0      0      0    720   8211 208771   25457
   6   6 32816  11121      0      0      0      0    693   8223 219066   26867
   7   7 93271 138243      0      0      0      0   1340  15870 425733   52238
  13  13  307K 548008      0      0      0      0   1486  17641 471197   57778
  14  14  272K 470931      0      0      0      0   1127  13389 350671   42719
   5   5 27538   8991      0      0     64      2   2297  27343 716670   86940
   7   7 41288  12778      0      0    144     14   1247  14809 391712   48057
   7   6 32532  12251      0      0  19664   1633   1606  18808 503272   60941
   5   3 19356   8981      0      0    768    108   1754  17422 486652   58989
  11   9 47614  15398      0      0    220     20    478   3899 113687   13848
   3   3 17766   5328      0      0    308     65   1649  16127 469160   57100
   3   3 14610   5470      0      0      0      0   1370  15896 479407   57970
   4   4 18595   6688      0      0      0      0    115   1324  40295    4864
   3   3 13522   5252      0      0      0      0   1240  14202 449454   54485
   7   6 30853  11003      0      0     36      3    168   1934  60959    7372
   6   6 29337  10556      0      0     12      2   1320  15217 467725   56887
  10  10 51858  14404      0      0      0      0   1284  14771 456003   55435
   1   1  6909   2779      0      0      0      0   1836  21275 637351   77108
   6   5 25279   7865      0      0      0      0    620   7220 216076   26096
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   5   5 23565   7832      0      0      0      0   1160  13788 364409   45931
   5   5 26097   8397      0      0     12      3    791   9396 247501   29920
   7   7 34057  11116      0      0      0      0    984  11696 310894   37928
   2   2 11622   4540      0      0      0      0   1974  23490 614568   74517
   7   7 32042  10125      0      0      0      0    528   6261 167598   20500
   4   4 19701   6609      0      0      0      0    927  11016 292406   35490
   5   5 27505   8451      0      0     12      3   1161  13807 362674   44136
   5   5 20214   6601      0      0      0      0   1212  14407 379212   46230
   3   3 18082   6196      0      0     12      2   1124  13354 352720   42823
   9   9 57738  55049      0      0      0      0    584   6904 184015   22283
  11  11 68865  63861      0      0      0      0 471477  61694 143164   36098
  11  11 73120  64607      0      0     16      4 864545 104359   2977   35145
  11  11 74268  65426      0      0      0      0 887446 107283   3057   36073
  11  11 79167  65613      0      0      0      0 889100 107515   3075   36156
  11  11 81678  66742      0      0      0      0 907457 109708   3125   36879
  11  11 79362  64898      0      0      0      0 909729 109926   3133   36973
  11  11 70470  63635      0      0    280      5 903671 109274   3112   36729
  11  11 59786  61302      0      0      0      0 887830 107275   3058   36083
  11  11 61463  63516      0      0      0      0 849486 102583   2925   34523
  11  11 60779  62515      0      0     20      2 855909 102715   2949   34798
  11  10 59548  61130      0      0     76      7 866513 103751   2985   35231
   2   2 11908  12079      0      0      0      0 837783 100208   2886   34058
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0   577    858      0      0      0      0 448334  53644   1545   18230
Ouch!
[joe59240@cs04r-sc-serv-68 ~]$

IOZONE as detailed above
waiting for 1 second sample...
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0   604    849      0      0      0      0      2     15      2      15
   2   1 14781  11683      0      0      0      0      0      6      0       3
   3   2 18669   6212      0      0      0      0    104    464   8300    1266
   2   1 15306   5297      0      0      0      0    889  10178 287334   35605
   2   1 12186   4372      0      0      0      0    311   3428  70424    9940
   7   6 39800  11164      0      0      0      0    519   5806 137612   18353
   9   8 45085  13230      0      0     16      3    687   7710 202694   26120
   8   8 40120  12531      0      0      0      0   2071  23899 664566   82360
   8   7 38068  11933      0      0      0      0   1559  17932 496415   61911
   4   3 21977   7444      0      0      0      0   1464  16867 457643   57327
  10   9 48323  13548      0      0      0      0    725   8232 203054   26327
  11   9 54483  11768      0      0      0      0   1544  17777 481140   60172
   3   2 17697   6603      0      0     88     10   1936  22289 630640   77987
   8   7 38108  12013      0      0      0      0    573   6414 154170   20554
   6   6 31520   8285      0      0      0      0   1185  13405 354637   45058
   5   5 23790   7108      0      0      0      0   1379  16123 464958   56450
  11  11 52229  14343      0      0      0      0   1049  12194 358767   43469
   8   8 37343  12483      0      0     56      8   1929  22733 618107   74940
   8   8 39677  12878      0      0      0      0   1785  20927 601368   73346
   8   8 33446  10767      0      0      0      0   1722  20159 574514   70234
   1   1  6889   2622      0      0     12      2   1517  17917 488105   59922
   7   7 32176  10498      0      0      0      0    801   9522 254860   31226
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
  10  10 47869  13378      0      0      8      2    762   9037 241801   29691
   2   2 10315   3874      0      0      0      0   2244  26580 700204   85859
   6   6 30827  10660      0      0      0      0   1038  12425 324463   39769
  10  10 44931  13088      0      0      0      0    925  10994 289343   35644
   5   5 19976   6395      0      0      0      0   2016  23911 631048   77593
   5   5 24151   7682      0      0     12      3    883  10559 277271   34161
   1   1  7841   2976      0      0    316     26   1263  15052 392666   48183
   0   0   596    861      0      0      0      0    705   8315 218015   26460
   2   2 10681   3824      0      0      0      0      1      9      1       7
   2   2 14454   4887      0      0      0      0    159   1838  56364    6718
   2   2 10568   4015      0      0      0      0    716   8390 245314   29486
   1   1  5077   2339      0      0     12      2    387   4493 133290   16215
   1   1  5410   2372      0      0      0      0    267   3095  91959   11176
   5   5 25443   7448      0      0      0      0    274   3181  94027   11447

With a stripe across all OSTs

[joe59240@cs04r-sc-serv-68 dave]$ lfs getstripe /mnt/lustre03/testdir/dave/
/mnt/lustre03/testdir/dave/
stripe_count:   -1 stripe_size:    1048576 stripe_offset:  -1
/mnt/lustre03/testdir/dave//dd-test
lmm_stripe_count:   30
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  4
    obdidx         objid         objid         group
         4          52196121        0x31c7319                 0
        26          53264792        0x32cc198                 0
         9          52897292        0x327260c                 0
        17          52485050        0x320dbba                 0
         5          52788704        0x3257de0                 0
        29          52833246        0x3262bde                 0
        14          52785759        0x325725f                 0
        16          52658853        0x32382a5                 0
         0          53020814        0x329088e                 0
        12          52817200        0x325ed30                 0
        18          52858751        0x3268f7f                 0
        24          53058169        0x3299a79                 0
         1          53232395        0x32c430b                 0
        13          52599697        0x3229b91                 0
        21          51907308        0x3180aec                 0
        23          52559058        0x321fcd2                 0
         2          52421528        0x31fe398                 0
         8          52819310        0x325f56e                 0
        20          53108899        0x32a60a3                 0
        28          53012365        0x328e78d                 0
        27          53149873        0x32b00b1                 0
        11          52740508        0x324c19c                 0
        15          53099667        0x32a3c93                 0
         3          53045067        0x329674b                 0
        10          52926727        0x3279907                 0
        22          52342894        0x31eb06e                 0
         6          51948916        0x318ad74                 0
        25          52317516        0x31e4d4c                 0
         7          52712325        0x3245385                 0
        19          52586950        0x32269c6                 0

/mnt/lustre03/testdir/dave//ior_dat
lmm_stripe_count:   30
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  29
    obdidx         objid         objid         group
        29          53301435        0x32d50bb                 0
         1          53749983        0x33428df                 0
        13          53083696        0x329fe30                 0
        25          52799678        0x325a8be                 0
        26          53740721        0x33404b1                 0
         5          53271989        0x32cddb5                 0
         8          53317095        0x32d8de7                 0
        21          52409690        0x31fb55a                 0
         0          53502583        0x3306277                 0
        18          53357367        0x32e2b37                 0
         9          53380702        0x32e865e                 0
        22          52839422        0x32643fe                 0
        19          53082855        0x329fae7                 0
        14          53271785        0x32cdce9                 0
         4          52675556        0x323c3e4                 0
        24          53559173        0x3313f85                 0
         3          53541604        0x330fae4                 0
         2          52917827        0x3277643                 0
        23          53061600        0x329a7e0                 0
        20          53580372        0x3319254                 0
        17          52972922        0x3284d7a                 0
        28          53499676        0x330571c                 0
        12          53298632        0x32d45c8                 0
        10          53413001        0x32f0489                 0
        15          53601715        0x331e5b3                 0
         6          52454067        0x32062b3                 0
         7          53188708        0x32b9864                 0
        16          53136154        0x32acb1a                 0
        11          53217905        0x32c0a71                 0
        27          53634590        0x332661e                 0

[joe59240@cs04r-sc-serv-68 dave]$

IOR

[joe59240@cs04r-sc-serv-68 ~]$ collectl
waiting for 1 second sample...
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   0   0   562    836      0      0      0      0     28    260     40     334
   3   3 21068   7563      0      0      0      0      0      3      0       3
   8   8 44555  13754      0      0      0      0    294   2986  70214    8874
   8   8 44343  13682      0      0      0      0   2007  23763 639672   77159
   8   8 45285  13995      0      0      0      0   2039  24127 648836   78164
   8   8 45720  14187      0      0     12      2   2035  24058 652145   78585
   8   8 45833  14234      0      0      0      0   2073  24508 663974   80200
  10  10 46715  13982      0      0     76      2   2079  24595 667395   80519
  11  11 48859  13950      0      0      0      0   2080  24630 664216   80456
  11  11 49677  13997      0      0      0      0   2065  24445 659500   80140
  11  11 49067  14193      0      0      0      0   2067  24559 663404   80618
  11  11 50743  14452      0      0      0      0   2081  24507 664379   80749
  10  10 48587  14081      0      0      0      0   2074  24529 667448   80881
  11  11 48505  14009      0      0      0      0   2088  24710 669080   81231
  11  11 48426  13948      0      0    280      7   2044  24163 654650   79560
  10  10 44231  12839      0      0      0      0   2065  24471 659625   80075
  10  10 44696  12704      0      0      0      0   1936  22903 617964   76287
   9   9 38918  11698      0      0     80      7   1872  22160 595290   74359
   4   4  1719    869      0      0      0      0   1888  22392 600464   75115
   4   4  1781    838      0      0     84     19    414   4653 123005   15382
   0   0   943    895      0      0      0      0      0      4      0       1
Ouch!
[joe59240@cs04r-sc-serv-68 ~]$

IOZONE

[joe59240@cs04r-sc-serv-68 ~]$ collectl
waiting for 1 second sample...
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
   1   0 14184  26608      0      0      0      0     29    304    113     296
  10   8 43543  18637      0      0      0      0     17    177     73     170
  19  18 81349  21661      0      0      0      0    408   3841  83298   11154
  20  18 80810  21536      0      0     32      3   3189  37475 948788  119026
  19  18 81103  21786      0      0      0      0   3195  37515 951536  118957
  20  18 81443  21744      0      0      0      0   3206  37670 953980  119522
  20  18 81622  21652      0      0      0      0   3232  37957 955683  118725
  20  18 80380  21795      0      0      0      0   3209  37702 955052  119159
  19  18 75353  20554      0      0      0      0   3199  37532 959613  120015
  18  18 71471  18827      0      0     12      2   3071  35958 920918  114773
  19  19 74618  19167      0      0      0      0   2942  34714 934244  114394
  18  18 73930  19441      0      0      0      0   3008  35638 957373  117057
  13  13 52541  13865      0      0   1712     54   3046  36284 969794  118585
   0   0   594    858      0      0      0      0   3124  36967 979256  119706
   2   2 17548   5773      0      0      0      0     71    610  14240    1783
  12  12 79256  23327      0      0      0      0      0      5      0       3
  11  11 81920  24226      0      0      0      0   3207  38045   998K  124450
  12  11 82623  24198      0      0     12      2   3773  44736  1169K  145615
  11  11 82375  24033      0      0      0      0   3859  45852  1185K  147588
  12  11 82640  23988      0      0      0      0   3872  46075  1181K  147056
  12  12 85661  25074      0      0     76      7   3870  45983  1186K  147492
  12  12 86445  24981      0      0      0      0   3939  46864  1202K  149635
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut
  11  11 80767  23403      0      0      0      0   4071  48526  1231K  153274
   5   5 46652  12278      0      0    164     22   4018  48240  1217K  151549
   0   0   558    797      0      0      0      0   3330  39112 992456  248660
   0   0   594    866      0      0      0      0      1      9      0       5
   0   0   598    853      0      0      0      0      0      3      0       2
   0   0   780   1077      0      0     24      5      0      3      0       1
   0   0   609    814      0      0      0      0      4     12      3      11
   0   0   743    975      0      0     40      8      0      7      1       6
  54  54  192K 547874      0      0      0      0      1      7      1       6
  97  97  381K  1096K      0      0      0      0     13     26     10      27
  25  25  222K 506925      0      0      0      0      0      3      0       2
Ouch!
[joe59240@cs04r-sc-serv-68 ~]$

Mem and CPU info

[joe59240@cs04r-sc-serv-68 ~]$ cat /proc/meminfo
MemTotal:       65890040 kB
MemFree:        30684456 kB
Buffers:          367320 kB
Cached:         29115520 kB
SwapCached:            0 kB
Active:         10533044 kB
Inactive:       19325756 kB
Active(anon):     375532 kB
Inactive(anon):      708 kB
Active(file):   10157512 kB
Inactive(file): 19325048 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       2097144 kB
SwapFree:        2097144 kB
Dirty:                84 kB
Writeback:             0 kB
AnonPages:        376112 kB
Mapped:            42724 kB
Shmem:               208 kB
Slab:            4025664 kB
SReclaimable:     550248 kB
SUnreclaim:      3475416 kB
KernelStack:        7128 kB
PageTables:        10824 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    35042164 kB
Committed_AS:     767780 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      956128 kB
VmallocChunk:   34324855240 kB
HardwareCorrupted:     0 kB
AnonHugePages:    266240 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        4992 kB
DirectMap2M:     2013184 kB
DirectMap1G:    65011712 kB

CPU:

[joe59240@cs04r-sc-serv-68 ~]$ cat /proc/cpuinfo
...
processor    : 23
vendor_id    : GenuineIntel
cpu family    : 6
model        : 45
model name    : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
stepping    : 7
cpu MHz        : 2299.974
cache size    : 15360 KB
physical id    : 1
siblings    : 12
core id        : 5
cpu cores    : 6
apicid        : 43
initial apicid    : 43
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips    : 4599.34
clflush size    : 64
cache_alignment    : 64
address sizes    : 46 bits physical, 48 bits virtual
power management:

Check debug is off

[joe59240@cs04r-sc-serv-68 dave]$ lctl get_param debug
debug=ioctl neterror warning error emerg ha config console lfsck
[joe59240@cs04r-sc-serv-68 dave]$

I will have a go with checksums disabled, but this is not how I would want to run with real data.

Comment by Dave Bond (Inactive) [ 23/Oct/14 ]

Hello,

Could I have an update on this? We are wanting to update some production machines to 2.6, and before we do that I would love to prove it can go as fast as you say.

Comment by Jinshan Xiong (Inactive) [ 23/Oct/14 ]

Hi Dave,

I edited your comment so that it's easier to see. Please let me know if I happened to delete some important information.

Comment by Jinshan Xiong (Inactive) [ 23/Oct/14 ]

Hi Dave,

Thanks for the result. Unfortunately I didn't get much useful information. Can you please run the test again for me following these steps (a consolidated sketch of the commands is shown after the list):

Before the test starts:
1. lctl set_param debug=0
2. set the directory stripe count to 1; this is not needed if you're going to reuse the dave directory. I would suggest removing all files under that directory before the test starts
3. lctl set_param osc.*.rpc_stats=clear
4. start collectl as: collectl -scml | tee -a collectl.txt; this will generate a regular file collectl.txt, please attach that file here
5. monitor rpc_stats with: watch -n 1 "lctl get_param osc.*.rpc_stats | tee -a rpc_stats.txt"; this will generate a regular file rpc_stats.txt, please attach it here

6. Run iozone command: sudo /mnt/lustre03/testdir/iozone.x86_64 -i 0 -r 4M -s 10G -t 1 -w -f /mnt/lustre03/testdir/dave/iozone

7. once the above command is finished, please attach stats files and show me the result of: lfs getstripe /mnt/lustre03/testdir/dave/iozone
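
Put together, the client-side sequence might look roughly like this (a sketch only; the directory, file names and commands are the ones from the steps above):

lctl set_param debug=0
lfs setstripe -c 1 /mnt/lustre03/testdir/dave
lctl set_param osc.*.rpc_stats=clear
# in separate terminals, start the monitors and leave them running during the test:
collectl -scml | tee -a collectl.txt
watch -n 1 "lctl get_param osc.*.rpc_stats | tee -a rpc_stats.txt"
# step 6: the iozone run, as given above
sudo /mnt/lustre03/testdir/iozone.x86_64 -i 0 -r 4M -s 10G -t 1 -w -f /mnt/lustre03/testdir/dave/iozone
# afterwards, record the file layout:
lfs getstripe /mnt/lustre03/testdir/dave/iozone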

Thanks again.

I'd also like to know some detailed information about the OSS/OST and network configuration.
1. Single OST raw I/O performance and configuration. Have you ever run obdfilter-survey on the OSTs, and if so, what was the output? (A sketch of a typical invocation follows the list.)
2. Network type. I understand you're using Ethernet here; have you ever run lnet_selftest, and if so, what was the output?
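
For question 1, a single-OST obdfilter-survey run on an OSS might look like the sketch below (this uses the environment-variable interface described in the Lustre manual; the thread/object limits and the OST name are example values only):

# run on the OSS; exercises the OST storage directly, bypassing the client and network
nobjhi=2 thrhi=16 size=1024 case=disk targets="lustre03-OST0000" sh obdfilter-survey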

Jinshan

Comment by Dave Bond (Inactive) [ 24/Oct/14 ]
[joe59240@cs04r-sc-serv-68 ~]$ sudo lctl set_param debug=0
debug=0

New directory rather than emptying the existing one

[joe59240@cs04r-sc-serv-68 ~]$ sudo mkdir /mnt/lustre03/testdir/dave1
[joe59240@cs04r-sc-serv-68 ~]$ sudo lfs setstripe -c 1 /mnt/lustre03/testdir/dave/1
[joe59240@cs04r-sc-serv-68 ~]$ sudo lctl set_param osc.*.rpc_stats=clear
osc.lustre03-OST0000-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0001-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0002-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0003-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0004-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0005-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0006-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0007-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0008-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0009-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST000a-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST000b-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST000c-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST000d-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST000e-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST000f-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0010-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0011-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0012-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0013-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0014-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0015-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0016-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0017-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0018-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST0019-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST001a-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST001b-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST001c-osc-ffff880828d65400.rpc_stats=clear
osc.lustre03-OST001d-osc-ffff880828d65400.rpc_stats=clear
osc.play01-OST0000-osc-ffff8807825c1800.rpc_stats=clear
osc.play01-OST0001-osc-ffff8807825c1800.rpc_stats=clear
osc.play01-OST0002-osc-ffff8807825c1800.rpc_stats=clear
osc.play01-OST0003-osc-ffff8807825c1800.rpc_stats=clear
osc.play01-OST0004-osc-ffff8807825c1800.rpc_stats=clear
osc.play01-OST0005-osc-ffff8807825c1800.rpc_stats=clear
[joe59240@cs04r-sc-serv-68 ~]$  /mnt/lustre03/testdir/iozone.x86_64 -i 0 -r 4M -s 10G -t 1 -w -F /mnt/lustre03/testdir/dave1/iozone
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.283 $
		Compiled for 64 bit mode.
		Build: linux 

	Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
	             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
	             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	             Randy Dunlap, Mark Montague, Dan Million, 
	             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy,
	             Erik Habbinga, Kris Strecker, Walter Wong.

	Run began: Fri Oct 24 09:05:21 2014

	Record Size 4096 KB
	File size set to 10485760 KB
	Setting no_unlink
	Command line used: /mnt/lustre03/testdir/iozone.x86_64 -i 0 -r 4M -s 10G -t 1 -w -F /mnt/lustre03/testdir/dave1/iozone
	Output is in Kbytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 1024 Kbytes.
	Processor cache line size set to 32 bytes.
	File stride size set to 17 * record size.
	Throughput test with 1 process
	Each process writes a 10485760 Kbyte file in 4096 Kbyte records

	Children see throughput for  1 initial writers 	=  337159.12 KB/sec
	Parent sees throughput for  1 initial writers 	=  335492.23 KB/sec
	Min throughput per process 			=  337159.12 KB/sec 
	Max throughput per process 			=  337159.12 KB/sec
	Avg throughput per process 			=  337159.12 KB/sec
	Min xfer 					= 10485760.00 KB

	Children see throughput for  1 rewriters 	=  338427.53 KB/sec
	Parent sees throughput for  1 rewriters 	=  337059.64 KB/sec
	Min throughput per process 			=  338427.53 KB/sec 
	Max throughput per process 			=  338427.53 KB/sec
	Avg throughput per process 			=  338427.53 KB/sec
	Min xfer 					= 10485760.00 KB



iozone test complete.
[joe59240@cs04r-sc-serv-68 ~]$ 
Comment by Dave Bond (Inactive) [ 24/Oct/14 ]

Stats from the iozone test attached

Comment by Dave Bond (Inactive) [ 24/Oct/14 ]
[joe59240@cs04r-sc-serv-68 ~]$ sudo lfs getstripe /mnt/lustre03/testdir/dave1/iozone
/mnt/lustre03/testdir/dave1/iozone
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  24
	obdidx		 objid		 objid		 group
	    24	      53921714	    0x336c7b2	             0

[joe59240@cs04r-sc-serv-68 ~]$ 
Comment by Dave Bond (Inactive) [ 24/Oct/14 ]

For the last two questions I will get back to you; I will be talking with Frederik about that.

Regards
Dave

Comment by Dave Bond (Inactive) [ 24/Oct/14 ]
Fri Jun  3 09:17:39 BST 2011 Obdfilter-survey for case=disk from cs04r-sc-oss03-01.diamond.ac.uk
ost 30 sz 503316480K rsz 1024K obj   30 thr   30 write 1927.85 [  37.96,  70.94] rewrite 1913.93 [  23.97,  75.93] read 4890.09 [ 132.87, 186.82] 
ost 30 sz 503316480K rsz 1024K obj   30 thr   60 write 3563.12 [  65.94, 141.86] rewrite 3688.63 [  46.95, 136.86] read 8823.83 [ 264.74, 406.60] 
ost 30 sz 503316480K rsz 1024K obj   30 thr  120 write 6530.20 [ 126.75, 268.74] rewrite 6868.42 [ 104.91, 262.74] read 11557.70 [ 366.64, 523.96] 
ost 30 sz 503316480K rsz 1024K obj   30 thr  240 write 8533.28 [ 132.86, 363.64] rewrite 8665.87 [ 139.86, 398.61] read 11666.15 [ 377.25, 536.94] 
ost 30 sz 503316480K rsz 1024K obj   30 thr  480 write 8724.26 [ 110.90, 381.62] rewrite 8651.21 [  73.93, 497.51] read 11661.70 [ 369.63, 542.51] 
ost 30 sz 503316480K rsz 1024K obj   60 thr   60 write 3447.86 [  56.94, 139.87] rewrite 3343.69 [  67.93, 132.88] read 9385.09 [ 281.72, 362.28] 
ost 30 sz 503316480K rsz 1024K obj   60 thr  120 write 5872.07 [  84.92, 238.75] rewrite 5625.11 [ 122.88, 239.76] read 11584.47 [ 349.32, 534.47] 
ost 30 sz 503316480K rsz 1024K obj   60 thr  240 write 8098.56 [  97.91, 381.63] rewrite 7987.45 [  80.92, 356.64] read 11582.54 [ 306.70, 536.62] 
ost 30 sz 503316480K rsz 1024K obj   60 thr  480 write 8625.35 [ 136.88, 429.19] rewrite 8750.36 [ 106.90, 448.56] read 11316.42 [ 309.73, 642.37] 
ost 30 sz 503316480K rsz 1024K obj  120 thr  120 write 3600.37 [  59.94, 186.82] rewrite 3522.46 [  72.86, 153.72] read 4946.79 [ 146.72, 249.76] 
ost 30 sz 503316480K rsz 1024K obj  120 thr  240 write 6653.74 [  65.94, 321.70] rewrite 6498.17 [ 128.74, 295.73] read 5862.56 [ 128.87, 261.74] 
ost 30 sz 503316480K rsz 1024K obj  120 thr  480 write 8690.26 [ 116.89, 410.58] rewrite 8368.97 [ 129.74, 386.62] read 9619.68 [ 201.80, 435.70] 
ost 30 sz 503316480K rsz 1024K obj  240 thr  240 write 3582.52 [  72.93, 170.82] rewrite 3579.80 [  61.95, 164.68] read 4713.35 [ 114.89, 188.82] 
ost 30 sz 503316480K rsz 1024K obj  240 thr  480 write 6536.34 [ 111.89, 297.70] rewrite 6391.06 [  92.91, 272.73] read 5064.96 [ 145.86, 199.82] 
ost 30 sz 503316480K rsz 1024K obj  480 thr  480 write 3624.81 [  91.91, 200.80] rewrite 3604.54 [  70.99, 190.81] read 4713.24 [  97.90, 190.81] 
Comment by Frederik Ferner (Inactive) [ 24/Oct/14 ]

Note, the obdfilter-survey output Dave posted above was taken a few years ago when we first commissioned the hardware; since then we have upgraded the server hardware and also moved from Lustre 1.8/RHEL5 to Lustre 2.5/RHEL6. Unfortunately I don't think we have recorded single-OST obdfilter-survey output.

We have also recently added IB to our file system. The tests so far had been using dual 10GigE bonded links (LACP) on the client and the same on the OSSes; the MTU on the network is 8982. lnet selftest results for this network are below. We have repeated the tests with LNet over IB and the iozone performance hasn't changed; again, the lnet selftest output is below.

lnet selftest between all 4 OSSes in the file system as server and the client over ethernet/tcp:

[bnh65367@cs04r-sc-serv-68 ~]$ sudo /tmp/lnet-selftest-wc.sh -s "172.23.144.31@tcp 172.23.144.32@tcp 172.23.144.33@tcp 172.23.144.34@tcp" -c 172.23.134.68@tcp -k 1 -r start
CONCURRENCY=32
SESSION: hh FEATURES: 0 TIMEOUT: 100000 FORCE: No
172.23.134.68@tcp are added to session
172.23.144.31@tcp are added to session
172.23.144.32@tcp are added to session
172.23.144.33@tcp are added to session
172.23.144.34@tcp are added to session
Test was added successfully
Test was added successfully
b is running now
Batch: b Tests: 2 State: 177
	ACTIVE	BUSY	DOWN	UNKNOWN	TOTAL
client	1	0	0	0	1
server	4	0	0	0	4
	Test 1(brw) (loop: 1800000, concurrency: 32)
	ACTIVE	BUSY	DOWN	UNKNOWN	TOTAL
client	1	0	0	0	1
server	4	0	0	0	4
	Test 2(brw) (loop: 1800000, concurrency: 32)
	ACTIVE	BUSY	DOWN	UNKNOWN	TOTAL
client	1	0	0	0	1
server	4	0	0	0	4

[LNet Rates of c]
[R] Avg: 9156     RPC/s Min: 9156     RPC/s Max: 9156     RPC/s
[W] Avg: 6837     RPC/s Min: 6837     RPC/s Max: 6837     RPC/s
[LNet Bandwidth of c]
[R] Avg: 2320.97  MB/s  Min: 2320.97  MB/s  Max: 2320.97  MB/s
[W] Avg: 2257.24  MB/s  Min: 2257.24  MB/s  Max: 2257.24  MB/s
[LNet Rates of s]
[R] Avg: 1976     RPC/s Min: 1409     RPC/s Max: 2487     RPC/s
[W] Avg: 2556     RPC/s Min: 1752     RPC/s Max: 3293     RPC/s
[LNet Bandwidth of s]
[R] Avg: 568.28   MB/s  Min: 408.74   MB/s  Max: 708.79   MB/s
[W] Avg: 604.67   MB/s  Min: 365.93   MB/s  Max: 827.75   MB/s
[LNet Rates of c]
[R] Avg: 9195     RPC/s Min: 9195     RPC/s Max: 9195     RPC/s
[W] Avg: 6876     RPC/s Min: 6876     RPC/s Max: 6876     RPC/s
[LNet Bandwidth of c]
[R] Avg: 2321.21  MB/s  Min: 2321.21  MB/s  Max: 2321.21  MB/s
[W] Avg: 2276.78  MB/s  Min: 2276.78  MB/s  Max: 2276.78  MB/s
[LNet Rates of s]
[R] Avg: 2019     RPC/s Min: 1391     RPC/s Max: 2628     RPC/s
[W] Avg: 2600     RPC/s Min: 1715     RPC/s Max: 3471     RPC/s
[LNet Bandwidth of s]
[R] Avg: 570.68   MB/s  Min: 397.51   MB/s  Max: 755.95   MB/s
[W] Avg: 629.91   MB/s  Min: 366.12   MB/s  Max: 889.95   MB/s

lnet selftest for the same servers but now using IB/o2ib:

[bnh65367@cs04r-sc-serv-68 ~]$ sudo /tmp/lnet-selftest-wc.sh -s "10.144.144.31@o2ib 10.144.144.32@o2ib 10.144.144.33@o2ib 10.144.144.34@o2ib" -c 10.144.134.68@o2ib -k 1 -r start
CONCURRENCY=32
SESSION: hh FEATURES: 0 TIMEOUT: 100000 FORCE: No
10.144.134.68@o2ib are added to session
10.144.144.31@o2ib are added to session
10.144.144.32@o2ib are added to session
10.144.144.33@o2ib are added to session
10.144.144.34@o2ib are added to session
Test was added successfully
Test was added successfully
b is running now
Batch: b Tests: 2 State: 177
	ACTIVE	BUSY	DOWN	UNKNOWN	TOTAL
client	1	0	0	0	1
server	4	0	0	0	4
	Test 1(brw) (loop: 1800000, concurrency: 32)
	ACTIVE	BUSY	DOWN	UNKNOWN	TOTAL
client	1	0	0	0	1
server	4	0	0	0	4
	Test 2(brw) (loop: 1800000, concurrency: 32)
	ACTIVE	BUSY	DOWN	UNKNOWN	TOTAL
client	1	0	0	0	1
server	4	0	0	0	4

[LNet Rates of c]
[R] Avg: 19354    RPC/s Min: 19354    RPC/s Max: 19354    RPC/s
[W] Avg: 9678     RPC/s Min: 9678     RPC/s Max: 9678     RPC/s
[LNet Bandwidth of c]
[R] Avg: 4776.00  MB/s  Min: 4776.00  MB/s  Max: 4776.00  MB/s
[W] Avg: 4902.20  MB/s  Min: 4902.20  MB/s  Max: 4902.20  MB/s
[LNet Rates of s]
[R] Avg: 4430     RPC/s Min: 4335     RPC/s Max: 4546     RPC/s
[W] Avg: 5624     RPC/s Min: 5524     RPC/s Max: 5755     RPC/s
[LNet Bandwidth of s]
[R] Avg: 1225.72  MB/s  Min: 1214.89  MB/s  Max: 1239.21  MB/s
[W] Avg: 1525.16  MB/s  Min: 1480.58  MB/s  Max: 1563.76  MB/s
[LNet Rates of c]
[R] Avg: 19354    RPC/s Min: 19354    RPC/s Max: 19354    RPC/s
[W] Avg: 9677     RPC/s Min: 9677     RPC/s Max: 9677     RPC/s
[LNet Bandwidth of c]
[R] Avg: 4773.75  MB/s  Min: 4773.75  MB/s  Max: 4773.75  MB/s
[W] Avg: 4906.15  MB/s  Min: 4906.15  MB/s  Max: 4906.15  MB/s
[LNet Rates of s]
[R] Avg: 4479     RPC/s Min: 4350     RPC/s Max: 4640     RPC/s
[W] Avg: 5672     RPC/s Min: 5532     RPC/s Max: 5852     RPC/s
[LNet Bandwidth of s]
[R] Avg: 1226.72  MB/s  Min: 1219.92  MB/s  Max: 1241.57  MB/s
[W] Avg: 1520.53  MB/s  Min: 1474.58  MB/s  Max: 1571.82  MB/s
Comment by Jinshan Xiong (Inactive) [ 24/Oct/14 ]

Hi Dave,

Thanks for the testing.

From the obdfilter-survey result, each OST reaches maximum performance at 16 write threads. It may become better with more threads.

From the iozone result, single-stripe write performance was 337MB/s, which roughly matches the obdfilter-survey performance at 8 write threads. So I guess you have OSC max_rpcs_in_flight set to 8; try increasing it to 16 (lctl set_param osc.*.max_rpcs_in_flight=16) and see how it goes. Monitoring rpc_stats will clarify the case, so please do it as I said in step 5.
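
A minimal sketch of checking and raising that limit on the client (set_param changes are not persistent across remounts):

# show the current per-OSC limit
lctl get_param osc.*.max_rpcs_in_flight
# raise it to 16 for every OSC on this client
lctl set_param osc.*.max_rpcs_in_flight=16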

In the next step, please run lnet_selftest to verify that the network performance meets your expectation; then increase the stripe count to 2 and see how it goes.

Comment by Jinshan Xiong (Inactive) [ 24/Oct/14 ]

Hi Frederik Ferner,

The network seems good. The lower performance on the server side is because there are 4 servers and only 1 client. Have you ever tried the case of 1 client and 1 server node?

Comment by Andreas Dilger [ 24/Oct/14 ]

Note that setting only max_rpcs_in_flight doesn't necessarily help if peer_credits isn't also increased. See LU-3184 for more details.

Comment by Dave Bond (Inactive) [ 27/Oct/14 ]

As expected, max_rpcs_in_flight alone did not give any great improvement. Regarding peer credits, I see I currently have 8:

[joe59240@cs04r-sc-serv-68 ~]$ cat /proc/sys/lnet/peers
nid                      refs state  last   max   rtr   min    tx   min queue
172.23.144.1@tcp            1    NA    -1     8     8     8     8     7 0
10.144.144.1@o2ib           1    NA    -1     8     8     8     8     5 0
172.23.144.14@tcp           1    NA    -1     8     8     8     8     5 0
172.23.144.6@tcp            1    NA    -1     8     8     8     8     6 0
172.23.144.32@tcp           1    NA    -1     8     8     8     8   -56 0
10.144.144.32@o2ib          1    NA    -1     8     8     8     8   -56 0
10.144.134.68@o2ib          1    NA    -1     8     8     8     8     6 0
172.23.134.68@tcp           1    NA    -1     8     8     8     8     6 0
10.144.144.34@o2ib          1    NA    -1     8     8     8     8   -42 0
172.23.144.34@tcp           1    NA    -1     8     8     8     8   -57 0
172.23.144.5@tcp            1    NA    -1     8     8     8     8     7 0
172.23.144.18@tcp           1    NA    -1     8     8     8     8     5 0
10.144.144.31@o2ib          1    NA    -1     8     8     8     8   -48 0
172.23.144.31@tcp           1    NA    -1     8     8     8     8   -56 0
10.144.144.33@o2ib          1    NA    -1     8     8     8     8   -56 0
172.23.144.33@tcp           1    NA    -1     8     8     8     8   -56 0

Should this also be set to 16 to match the max_rpcs_in_flight setting?

Would this also be adjusting the max peers or do all fields have to match?

Comment by Dave Bond (Inactive) [ 27/Oct/14 ]

I would also appreciate a man page or an example of how this is set. For example, does it take effect immediately or do I need to perform any other steps? I cannot find this information in the 2.x manual.

Comment by Dave Bond (Inactive) [ 30/Oct/14 ]

*NUDGE* Could you please give a little more information on the peer credits before I proceed?

Comment by Jinshan Xiong (Inactive) [ 31/Oct/14 ]

Hi Dave, I have invited our LNet expert, Isaac, to take a look at this issue and I think he will provide some information so that we can proceed.

Comment by Jinshan Xiong (Inactive) [ 31/Oct/14 ]

From what I have seen so far, another thing we can do is to increase the stripe count to 2 and see if we can get any performance gains.
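
For example (using the dave1 test directory created earlier; new files written there inherit the layout):

lfs setstripe -c 2 /mnt/lustre03/testdir/dave1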

Comment by Isaac Huang (Inactive) [ 03/Nov/14 ]

I suppose the question was about client and servers connected directly by TCP (i.e. no lnet routers).

The /proc/sys/lnet/peers output showed that the queues for the servers grew quite deep at one point, which might or might not be caused by a lack of peer_credits (it could also be, e.g., transient network congestion). Try increasing peer_credits to match max_rpcs_in_flight. If the Dynamic LNet Config project hasn't yet enabled dynamic peer credits tuning, then it's a ksocklnd module option.

Comment by Jinshan Xiong (Inactive) [ 03/Nov/14 ]

Thanks Isaac.

Hi Dave,

To actually control peer_credits, please apply `options ksocklnd peer_credits=16' for ksocklnd to match the value of max_rpcs_in_flight.
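
LND module options like this are normally placed in a modprobe configuration file on each node and only take effect after the LNet/LND modules are reloaded (the file name below is a common convention, not a requirement):

# e.g. /etc/modprobe.d/lustre.conf on the client
options ksocklnd peer_credits=16
# unmount Lustre and reload the lustre/lnet modules (or reboot) for the option to apply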

Comment by Dave Bond (Inactive) [ 11/Nov/14 ]

After testing this on our test file system we felt ready to put this into production, where we could then provide some good performance metrics.

We had to roll back the change, as the majority of our cluster nodes failed to mount the file system:

Nov 10 13:06:43 cs04r-sc-com06-40 kernel: LNetError:
1317:0:(o2iblnd_cb.c:2619:kiblnd_rejected()) 10.144.144.1@o2ib rejected:
incompatible message queue depth 16, 8

The lnet.conf file is as below

options lnet networks=o2ib0(ib0),tcp0(bond0)
options ksocklnd peer_credits=16
options ko2iblnd peer_credits=16

So the suggestion above did work on many of the nodes but because of this issue we could not keep the change. Any thoughts as to how to overcome this?

Comment by Liang Zhen (Inactive) [ 13/Nov/14 ]

Hi Dave, I think you need to have the same credits on all nodes, so if you changed some credits to 16, then all the others have to be 16 as well.

Comment by Dave Bond (Inactive) [ 13/Nov/14 ]

As it has a max and min value, is it really the case that everything has to be set to 16? Given the disruption to a live file system, I would prefer it if this could be tuned per client. If it has to be set everywhere, could you please advise on the risk associated with this, as I would not want to introduce any oddities at this point in our run time.

Comment by Jinshan Xiong (Inactive) [ 13/Nov/14 ]

Hi Dave,

Is it possible for you to set up a test environment to verify the performance gains from this change? It won't need a lot of nodes - 1 client, 2 OSSes and 1 MDS would be enough for now.

Right now, we're still in the phase of identifying the problem. We may have more experiments to do down the road. Having a test environment will be really helpful and will accelerate progress.

Comment by Isaac Huang (Inactive) [ 17/Nov/14 ]

Dave,

For now all peer_credits must be the same (for each LND) everywhere. I thought you intended to increase peer_credits only for the TCP network, as the TCP peers showed deep tx queues. For TCP I see no risk in doubling the peer_credits. It's more complicated for o2iblnd, as it wouldn't suffice to increase peer_credits alone in order to increase the send window. It's OK to change peer_credits for TCP only.

There's a patch to enable per-client tuning for o2iblnd, but it's highly experimental:
http://review.whamcloud.com/#/c/11794/

Comment by Dave Bond (Inactive) [ 19/Nov/14 ]

Hello all,

This is looking to be more disruptive than expected for the production system we are running it on, as we would need to change peer credits on all clients. For stability reasons this does not feel like a good idea to us.

We are going to be using an alternative smaller file system, still running the same versions of lustre.

Over the next few days we will be benchmarking this to ensure that we are not already saturating the disks or servers. If this is successful, we will introduce the peer credits change and note any performance gain. I would hope that, if we are going to benefit, we will see a smaller increment in performance compared with the production file system.

Comment by Dave Bond (Inactive) [ 05/Dec/14 ]

Hello,

With the current configuration I am going from 8 to 16 for max_rpcs_in_flight and peer credits. Talking to others at MEW this year, numbers such as 64 have been mentioned as being in use.

Could you please explain how the 16 was derived? Why not, for example, double it again to 32?
Besides how it is derived, I am interested in the effect as well.

Comment by Liang Zhen (Inactive) [ 21/Dec/14 ]

Hi Dave, 16 is just a value we'd suggest trying; if it doesn't help then you might want to increase it further.
But we normally don't suggest a very big value, because the bigger it is, the more resources it consumes.

Comment by Jinshan Xiong (Inactive) [ 08/Feb/18 ]

close old tickets

Generated at Sat Feb 10 01:54:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.